Goals and design
Goals
BestCBooks is a good website that offers plenty of PDF books about computer science.
I want to get every book’s download link and password, as the books are stored on Baidu Disk.
Design
- As I want to get all the books, and the books are grouped by category, I first need to crawl all the category links;
- Get the book links from each category page;
- Get each book’s store link (on pan.baidu.com) and its password.
How to do it
Create a scrapy project
Install scrapy:
pip install scrapy
Create the scrapy project:
$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /tmp/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
Crawl the category page
Create a spider named bestcbooks.py; the full path is tutorial/spiders/bestcbooks.py.
The project structure looks like below:

$ cd /tmp/tutorial ; tree
.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── bestcbooks.py
        └── __init__.py

2 directories, 7 files

Edit the bestcbooks.py content as below:
#!/usr/bin/env python
# encoding: utf-8

from urlparse import urljoin

from scrapy import Spider, Request

from tutorial.items import CatagoryItems, BookPageItems, BookItems


class DmozSpider(Spider):
    name = 'bestcbooks'
    allowed_domains = ['bestcbooks.com']
    start_urls = [
        'http://bestcbooks.com/'
    ]
    meta = {'cookiejar': 1}
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip,deflate',
        'Accept-Language': 'en-US,en;q=0.8,zh;q=0.6,zh-CN;q=0.4,zh-TW;q=0.2',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.90 Safari/537.36'
    }
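Note that Scrapy will not pick up these meta and headers class attributes by itself; they have to be passed to each Request. A minimal sketch of a start_requests override that wires them in (this method is my addition, not part of the original spider):

    def start_requests(self):
        """Attach the custom headers and the cookiejar meta to the first requests"""
        for url in self.start_urls:
            yield Request(url, headers=self.headers, meta=self.meta,
                          callback=self.parse)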
Edit tutorial/items.py like below, which is used to store the data we get:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field


class BestCBookItems(Item):
    # define the fields for your item here like:
    name = Field()
    url = Field()


class CatagoryItems(Item):
    catagory = Field()
    url = Field()


class BookPageItems(Item):
    name = Field()
    url = Field()


class BookItems(Item):
    name = Field()
    link = Field()
    password = Field()
    orig_url = Field()
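A Scrapy Item behaves like a dict, so filling one looks like this (illustrative values, not from the real site):

item = BookItems()
item['name'] = 'Some Book'                    # hypothetical title
item['link'] = 'http://pan.baidu.com/s/xxxx'  # hypothetical Baidu Disk link
item['password'] = 'abcd'                     # hypothetical extraction password
print item['name']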
Crawl the category pages
Analysis
Use Google Chrome to inspect the category links and get the XPath.
So the XPath for them is: '//ul[@id="category-list"]/li'
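The XPath can be verified in scrapy shell before writing any spider code (the exact output depends on the live page):

$ scrapy shell 'http://bestcbooks.com/'
>>> response.xpath('//ul[@id="category-list"]/li/a/text()').extract()
>>> response.xpath('//ul[@id="category-list"]/li/a/@href').extract()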
Implement the code
Then add a parse method (the default callback in Scrapy) to bestcbooks.py:
def parse(self, response):
    """Parse the start page and follow every category link"""
    # dump the fetched page to a local file for debugging
    filename = response.url.split('/')[-2]
    with open(filename, 'wb') as f:
        f.write(response.body)
    for sel in response.xpath('//ul[@id="category-list"]/li'):
        name = sel.xpath('a/text()').extract()[0]
        url = urljoin(response.url, sel.xpath('a/@href').extract()[0])
        yield Request(url, callback=self.parse_catagory_page)
        item = CatagoryItems()
        item['url'] = url
        item['catagory'] = name
        yield item  # without this the category item is silently dropped
The parse_catagory_page method, which parses a category page, is defined below.
Crawl the book pages
Analysis
The book links’ XPath is: '//div[@class="categorywell"]/h4'
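The same scrapy shell check works on a category page (the URL below is a placeholder for any real category link):

$ scrapy shell 'http://bestcbooks.com/<category>'
>>> response.xpath('//div[@class="categorywell"]/h4/a/@href').extract()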
Implement the code
Let’s implement the parse_catagory_page method like below:

def parse_catagory_page(self, response):
    """Parse a category page and follow every book link"""
    for sel in response.xpath('//div[@class="categorywell"]/h4'):
        name = sel.xpath('a/text()').extract()[0]
        url = urljoin(response.url, sel.xpath('a/@href').extract()[0])
        yield Request(url, callback=self.parse_book_page)
        item = BookPageItems()
        item['url'] = url
        item['name'] = name
        yield item  # without this the book page item is silently dropped
The parse_book_page method is implemented below.
Crawl the book info
Analysis
The book detail’s XPath is: '//h1[@class="entry-title"]/text()'
Implement the code
Let’s implement the parse_book_page method:

def parse_book_page(self, response):
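The original snippet ends here, so the body below is only a sketch of how the method could look: the title XPath comes from the analysis above, while the selectors for the Baidu Disk link and the password are guesses that would need to be checked against the real page.

def parse_book_page(self, response):
    """Parse a book detail page into a BookItems (sketch; link/password selectors are assumptions)"""
    item = BookItems()
    item['name'] = response.xpath('//h1[@class="entry-title"]/text()').extract()[0]
    item['orig_url'] = response.url
    # ASSUMPTION: the store link is an <a> whose href points at pan.baidu.com
    links = response.xpath('//a[contains(@href, "pan.baidu.com")]/@href').extract()
    item['link'] = links[0] if links else None
    # ASSUMPTION: the password shows up in the page text as "密码: xxxx"
    passwords = response.xpath('//body//text()').re(u'密码[:：]\s*(\w+)')
    item['password'] = passwords[0] if passwords else None
    yield item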
How to run
Start scrapy
I use this command to start Scrapy; the result will be stored in items.json:

scrapy crawl bestcbooks -o items.json
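With -o, Scrapy serializes every yielded item into a JSON array; the entries below are made up purely to show the shape:

[
    {"catagory": "...", "url": "http://bestcbooks.com/..."},
    {"name": "...", "url": "http://bestcbooks.com/..."},
    {"name": "...", "link": "http://pan.baidu.com/s/...", "password": "...", "orig_url": "http://bestcbooks.com/..."}
]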
Check the result
The content of items.json holds each book’s download link and password.