Crawling BestCBooks with Python Scrapy

Goals and design

Goals

BestCBooks is a good website that offers plenty of PDF books about computer science.
I want to collect every book's download link and password, as the files are stored on Baidu Disk.


Design

  1. Since I want all the books, I first need to crawl every category link;
  2. From each category page, get the links to the individual book pages;
  3. From each book page, get the Baidu Disk link (on pan.baidu.com) and its password.

How to do it

Create a scrapy project

  1. Install scrapy:

    pip install scrapy
  2. Create the scrapy project:

    $ scrapy startproject tutorial

    New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
        /tmp/tutorial

    You can start your first spider with:
        cd tutorial
        scrapy genspider example example.com
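
    As the output suggests, the spider file can also be generated instead of written by hand; the rest of this post creates it manually, but the equivalent genspider call would be:

    $ cd /tmp/tutorial
    $ scrapy genspider bestcbooks bestcbooks.com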

Create the spider and items

  1. Create a spider named bestcbooks.py; its full path is tutorial/spiders/bestcbooks.py.
    The project structure looks like below:

    $ cd /tmp/tutorial ; tree
    .
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            ├── bestcbooks.py
            └── __init__.py

    2 directories, 7 files
  2. Edit the bestcbooks.py content as below (a note on how the headers and cookie jar get attached to requests follows after this list):

    #!/usr/bin/env python
    # encoding: utf-8


    from urlparse import urljoin  # Python 2 stdlib (urllib.parse on Python 3)
    from scrapy import Spider, Request
    from tutorial.items import CatagoryItems, BookPageItems, BookItems


    class BestcbooksSpider(Spider):
        name = 'bestcbooks'
        allowed_domains = ['bestcbooks.com']
        start_urls = [
            'http://bestcbooks.com/'
        ]

        # keep session cookies in a named cookie jar across requests
        meta = {'cookiejar': 1}

        # browser-like headers so the requests don't look like a bot
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip,deflate',
            'Accept-Language': 'en-US,en;q=0.8,zh;q=0.6,zh-CN;q=0.4,zh-TW;q=0.2',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.90 Safari/537.36'
        }
  3. Edit tutorial/items.py like below; it defines the item classes used to store the data we get:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    from scrapy import Item, Field


    class BestCBookItems(Item):
        # define the fields for your item here like:
        name = Field()
        url = Field()


    class CatagoryItems(Item):
        catagory = Field()
        url = Field()


    class BookPageItems(Item):
        name = Field()
        url = Field()


    class BookItems(Item):
        name = Field()
        link = Field()
        password = Field()
        orig_url = Field()
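
A note on the meta and headers attributes in the spider above: defining them as class attributes does not by itself attach them to outgoing requests. A minimal sketch of how they can be wired in by overriding start_requests (this override is my addition, not part of the original spider):

    def start_requests(self):
        """Attach the custom headers and cookie jar to the initial requests."""
        for url in self.start_urls:
            yield Request(url, headers=self.headers, meta=self.meta,
                          callback=self.parse)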

Crawl the category pages

  1. Analysis
    Use Google Chrome to inspect a category link and work out its xpath.
    The xpath for the category entries is: '//ul[@id="category-list"]/li' (see the scrapy shell example after this list to verify it interactively)

  2. Implement the code
    Then add a parse method (the default callback method for scrapy) to bestcbooks.py:

    def parse(self, response):
        """Parse the start page and follow every category link."""
        # dump the raw page to a local file for debugging
        filename = response.url.split('/')[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

        for sel in response.xpath('//ul[@id="category-list"]/li'):
            name = sel.xpath('a/text()').extract()[0]
            url = urljoin(response.url, sel.xpath('a/@href').extract()[0])

            # follow the category link, then record it as an item
            yield Request(url, callback=self.parse_catagory_page)

            item = CatagoryItems()
            item['url'] = url
            item['catagory'] = name
            yield item

The parse_catagory_page callback used to parse each category page is defined below.
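
Before hard-coding an xpath, you can verify it interactively with scrapy shell; a quick session for the category xpath (the later xpaths can be checked the same way):

    $ scrapy shell 'http://bestcbooks.com/'
    >>> response.xpath('//ul[@id="category-list"]/li/a/text()').extract()
    >>> response.xpath('//ul[@id="category-list"]/li/a/@href').extract()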

Crawl the book pages

  1. Analysis
    The xpath for the book links is: '//div[@class="categorywell"]/h4'

  2. Implement the code
    Let's implement the parse_catagory_page method like below:

    def parse_catagory_page(self, response):
        """Parse a category page and follow every book link."""
        for sel in response.xpath('//div[@class="categorywell"]/h4'):
            name = sel.xpath('a/text()').extract()[0]
            url = urljoin(response.url, sel.xpath('a/@href').extract()[0])

            # follow the book page, then record it as an item
            yield Request(url, callback=self.parse_book_page)

            item = BookPageItems()
            item['url'] = url
            item['name'] = name
            yield item

The parse_book_page callback is implemented below.
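
A design note: as written, the final book items are tied back to their category only via the intermediate items' URLs. If you wanted each book item to carry its category directly, Scrapy's usual pattern is to pass the value through the Request meta dict; a sketch (not part of the original spider):

    # in parse(), attach the category name to the request:
    yield Request(url, callback=self.parse_catagory_page,
                  meta={'catagory': name})

    # in parse_catagory_page(), read it back:
    catagory = response.meta['catagory']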

Crawl the book info

  1. Analysis
    The xpath for the book title is: '//h1[@class="entry-title"]/text()'; the Baidu Disk link and password sit in a blockquote element

  2. Implement the code
    Let’s implement the method parse_book_page:

    def parse_book_page(self, response):
        """Parse a book detail page: name, Baidu Disk link and password."""
        orig_url = response.url
        name = response.xpath('//h1[@class="entry-title"]/text()').extract_first()
        for sel in response.xpath('//blockquote'):
            link = sel.xpath('p/a/@href').extract()
            try:
                # heuristic: the password is the last 4 characters of the
                # last word in the blockquote text, e.g. '... 密码:xxxx'
                password = sel.xpath('p/text()').extract()[-1].split()[-1][-4:]
            except IndexError:
                password = ""

            item = BookItems()
            item['name'] = name
            item['link'] = link
            item['password'] = password
            item['orig_url'] = orig_url

            yield item
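
The password expression above is a heuristic: it takes the last whitespace-separated token of the blockquote's last text node and keeps its final 4 characters, which also handles the case where the password is glued to its label. A quick illustration (the sample string is my assumption about the page's wording):

    # hypothetical text node next to the pan.baidu.com link
    text = u'链接: http://pan.baidu.com/s/xxxx 密码:abcd'
    print(text.split()[-1])       # 密码:abcd
    print(text.split()[-1][-4:])  # abcd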

How to run

Start scrapy

I use this command to start scrapy; the result will be stored in items.json:

    scrapy crawl bestcbooks -o items.json

Check the result

items.json now holds the result: each book's download link and password.
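
A minimal sketch for inspecting the output (the field names come from BookItems above; this helper script is my addition):

    import json

    # scrapy's -o items.json feed exporter writes a JSON array
    with open('items.json') as f:
        books = json.load(f)

    for book in books:
        print('%s %s %s' % (book['name'], book['link'], book['password']))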
