Douban Crawl Tutorial 1

Introduction

Use requests to crawl Douban.

Preparation

Create a virtualenv for Python 3:

virtualenv -p /usr/local/bin/python3 venv

Note: Python 3 must be installed first, and the interpreter path may differ on your system.

Then activate it: source venv/bin/activate

Install requests

pip install requests
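
Before moving on, it can help to confirm that the environment is active and that the package is importable. The following is a minimal sketch using only the standard library; the helper names `in_virtualenv` and `is_installed` are hypothetical and not part of the original tutorial.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.prefix differ from sys.base_prefix instead.
    return hasattr(sys, "real_prefix") or sys.prefix != sys.base_prefix


def is_installed(package):
    """Return True when a top-level package can be found for import."""
    return importlib.util.find_spec(package) is not None


print("virtualenv active:", in_virtualenv())
print("requests installed:", is_installed("requests"))
```

If the second line prints False, the pip install step above most likely ran outside the activated environment.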

Tutorial

1. Get movies of 2016 on Douban

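import requests

r = requests.get('https://movie.douban.com/tag/2016')
print(r.text)         # response body decoded as text
print(type(r))        # <class 'requests.models.Response'>
print(r.status_code)  # HTTP status code
print(r.encoding)     # encoding requests uses to decode r.text
print(r.cookies)      # cookies set by the server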
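
The snippet above prints the response without checking whether the request succeeded. As a hedged sketch, the helper below sends a browser-like User-Agent, sets a timeout, and raises on error status codes. The name `fetch_html`, the header value, and the timeout are all assumptions for illustration; whether Douban needs a custom User-Agent when you run this is worth checking for yourself.

```python
# Hypothetical defaults: the header value is an assumption, not a requirement
# documented by the site.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; douban-tutorial-sketch)",
}


def fetch_html(session, url, timeout=10):
    """Fetch a page and return its text, raising on 4xx/5xx responses.

    `session` is expected to be a requests.Session, or any object with a
    compatible .get(url, headers=..., timeout=...) method.
    """
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    return response.text
```

A call would look like fetch_html(requests.Session(), 'https://movie.douban.com/tag/2016'). Passing the session in keeps the helper easy to exercise without network access.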

2. Get movie scores and comments

Use Beautiful Soup to parse the HTML.

Install Beautiful Soup with pip: pip install beautifulsoup4

import requests
from bs4 import BeautifulSoup


def main():
    r = requests.get('https://movie.douban.com/tag/2016')
    soup = BeautifulSoup(r.text, 'html.parser')
    article = soup.find_all('div', {'class': 'article'})[0]  # div for movies

    # remove the info boxes and the pagination bar so only movie entries remain
    for table in article.find_all('table', {'class': 'infobox'}):
        table.extract()

    for div in article.find_all('div', {'class': ['clearfix', 'paginator']}):
        div.extract()

    # get all movie links
    for link in article.find_all('a', {'class': 'nbg'}):
        print(link.get('href'))


if __name__ == '__main__':
    main()
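
The collected links may contain duplicates or non-movie URLs. A minimal sketch for filtering them follows, assuming movie pages use the same format as the default URL used later in this tutorial (https://movie.douban.com/subject/25980443/). The helpers `subject_id` and `unique_subject_links` are hypothetical names.

```python
import re

# ASSUMPTION: movie pages look like https://movie.douban.com/subject/<digits>/
SUBJECT_RE = re.compile(r"^https?://movie\.douban\.com/subject/(\d+)/?")


def subject_id(url):
    """Return the numeric movie id from a subject URL, or None if it does not match."""
    match = SUBJECT_RE.match(url or "")
    return match.group(1) if match else None


def unique_subject_links(hrefs):
    """Keep the first link seen for each movie id, preserving order."""
    seen = set()
    result = []
    for href in hrefs:
        sid = subject_id(href)
        if sid is not None and sid not in seen:
            seen.add(sid)
            result.append(href)
    return result
```

The hrefs printed by the loop above could be gathered into a list and passed through unique_subject_links before any further requests are made.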

Next, follow each movie link and print the link to its comments page:

import requests
from bs4 import BeautifulSoup


def main():
    r = requests.get('https://movie.douban.com/tag/2016')
    soup = BeautifulSoup(r.text, 'html.parser')
    article = soup.find_all('div', {'class': 'article'})[0]  # div for movies

    # remove the info boxes and the pagination bar so only movie entries remain
    for table in article.find_all('table', {'class': 'infobox'}):
        table.extract()

    for div in article.find_all('div', {'class': ['clearfix', 'paginator']}):
        div.extract()

    # get all movie links
    for link in article.find_all('a', {'class': 'nbg'}):
        scrawl_movie_info(link.get('href'))


def scrawl_movie_info(url='https://movie.douban.com/subject/25980443/'):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # link to the movie's full comments page
    link = soup.select('#comments-section > div.mod-hd > h2 > span > a')[0].get('href')
    print(link)  # link is already the href string, so print it directly


if __name__ == '__main__':
    main()
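
Because main() now issues one extra request per movie, it may be worth pausing between requests. The generator below is a small sketch of that idea. The name `throttled` and the one-second default delay are assumptions, not values taken from the original tutorial or from any documented site policy.

```python
import time


def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one by one, pausing between consecutive items.

    `sleep` is injectable so the pause can be replaced in tests.
    """
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # pause before every item except the first
        yield item
```

For example, the loop over movie links could iterate over throttled(links, delay_seconds=2) and call scrawl_movie_info on each item.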

Code to get the comments and scores from the first comments page:

import requests
from bs4 import BeautifulSoup


def main():
    r = requests.get('https://movie.douban.com/tag/2016')
    soup = BeautifulSoup(r.text, 'html.parser')
    article = soup.find_all('div', {'class': 'article'})[0]  # div for movies

    # remove the info boxes and the pagination bar so only movie entries remain
    for table in article.find_all('table', {'class': 'infobox'}):
        table.extract()

    for div in article.find_all('div', {'class': ['clearfix', 'paginator']}):
        div.extract()

    # get all movie links
    for link in article.find_all('a', {'class': 'nbg'}):
        scrawl_movie_info(link.get('href'))


def scrawl_movie_info(url='https://movie.douban.com/subject/25980443/'):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # link to the movie's full comments page
    link = soup.select('#comments-section > div.mod-hd > h2 > span > a')[0].get('href')
    scrawl_movie_comments(link)


def scrawl_movie_comments(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for item in soup.find_all('div', {'class': 'comment-item'}):
        desc, score = '', None  # reset per comment so stale values are never printed
        for comment in item.find_all('p'):
            desc = comment.text  # get comment

        for rating in item.find_all('span', {'class': 'rating'}):
            # drop the first 7 characters of the first class name, leaving the score
            score = rating['class'][0][7:]

        print(score, desc)


if __name__ == '__main__':
    main()
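
The slice [7:] in scrawl_movie_comments relies on the rating class always coming first and always having a 7-character prefix. A slightly more defensive sketch is shown below, assuming the class name has the form 'allstar40' (an assumption inferred from the slice length and the scores in the output, not confirmed by the original text). The helper name `parse_score` is hypothetical.

```python
import re

# ASSUMPTION: the rating is encoded in a class name such as 'allstar40'.
STAR_CLASS_RE = re.compile(r"^allstar(\d+)$")


def parse_score(class_names):
    """Return the integer score found in a list of CSS class names, or None."""
    for name in class_names:
        match = STAR_CLASS_RE.match(name)
        if match:
            return int(match.group(1))
    return None
```

Inside the loop, score = parse_score(rating['class']) would replace the slice and return None for comments that carry no rating.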

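Result demo (comments translated from the original Chinese)

20  The narrative pattern resembles HP's, with more political allusion, but the script, characters, editing, and pacing all have serious problems; I couldn't muster any interest and felt uncomfortable in all sorts of ways. The final climax is deliberately made to look like a superhero movie. Depp has already signed for the sequel, so I'd guess the headmaster will show up soon; it seems Warner has put all its effort into developing an HP universe. If the Best Actor winner keeps acting like this, he'll be the next Martin Freeman.

30 I cried a few times, but only because I'm a fan. The first half is old-fashioned, the second half runs out of steam, and the main plot is childish to the point of being dull; also, in 1926 wizards travel abroad by ship, and everyone at the Ministry of Magic wears ethnic costume?? But Auntie Rowling really knows how to begin and end a story: when the rain poured down at the end, I felt as if I too had been hit by a Memory Charm. This world has left me once again.

40 A Hufflepuff graduate really is a boring man, even with the bonus of good looks! Sisters, when you marry, find someone from a 985 or 211 school; we don't marry second-tier schools like Hufflepuff!

40 As soon as the score over the Warner (WB) logo starts, you're back in Harry Potter's wizarding world in a second. With this brand-new original story, J.K. Rowling opens up the magic and lets the audience dream their way back into a fantasy adventure: nostalgia and plot surprises keep coming, the fantastic beasts are too many to take in, and the Easter eggs and foreshadowing keep you grinning. The Niffler loves jewelry, the Bowtruckle has moods, the Demiguise turns into "the Flash", the Occamy can stretch and shrink... Completely satisfying; the top foreign commercial film of 2016.

30 The pacing is strange, with too many scattered threads; it's unclear whether it wants to be about a chance adventure, finding the beasts, or saving the world. Freckles' acting is a bit beneath a Best Actor winner; why, as in The Danish Girl, does he keep darting his eyes and never look at people directly? The visual effects are even better than Doctor Strange's, dazzling in every way. The most annoying part is that one second you're drooling over Colin Farrell, and the next you're being leered at by Depp.

30 What kind of magic turns Colin Farrel into JonnyDepp?!! Are you kidding me!! Give me back Uncle Colin! I'm going to appeal to the Magical Congress of the USA

30 The moment Johnny Depp appeared, my eye-roll lit up the whole cinema

40 Three and a half stars. Even if you're not a Potter fan, the fun creatures and wondrous magic are still very appealing. The story is ambitious and is setting up a big world. Freckles' performance is distinctive: he always seems to stand at an angle, somewhat elegant with a touch of playfulness, and his slightly clumsy manner at the start has a bit of Chaplin about it. Several of the creature designs are cute, and the light moments are properly light. Only after walking out of the cinema did I realize how wonderful the wizarding world is

20 I spent 30 dollars on a 4D ticket and felt cheated afterwards. The title says fantastic beasts, but the beasts are only cameos; the whole story is about the sorrows and grievances of a love-starved little boy... The pacing is maddening; the male lead is no longer playing Hawking, but his facial muscles still twitch at any moment, and the female lead cries from start to finish...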
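
The output above covers only the first page of comments. A hedged sketch for building the URLs of later pages follows. It assumes the comments page is paginated with `start` and `limit` query parameters and shows 20 comments per page; both are assumptions that should be checked against the real page's paginator links. The helper name `comments_page_url` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def comments_page_url(first_page_url, page, page_size=20):
    """Return the URL of a zero-based comments page.

    ASSUMPTION: pages are selected with `start` and `limit` query parameters.
    Existing query parameters on the first-page URL are preserved.
    """
    parts = urlsplit(first_page_url)
    query = dict(parse_qsl(parts.query))
    query["start"] = str(page * page_size)  # offset of the first comment on the page
    query["limit"] = str(page_size)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Each generated URL could then be passed to scrawl_movie_comments, stopping once a page yields no comment items.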