Scraping Douban Movie Reviews with Python and Generating a Word Cloud from Keywords

Background:

Python version: 3.7.4

IDE: PyCharm

Operating system: Windows (64-bit)

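Step 1: Obtain the login state

Scraping Douban reviews requires a logged-in user, so the first task is to get the login cookie. Open a browser (IE gathers all cookies into a single value, which makes copying easier; with other browsers you have to assemble the cookies yourself), log in to Douban, press F12, and copy the cookie and user-agent values from the request headers. Stay logged in; do not log out.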
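If the cookie is copied as one long string, it can help to split it into name/value pairs before use. The following is a minimal sketch rather than part of the original script: parse_cookie_header is a hypothetical helper name, and the cookie values are placeholders. The resulting dict can be passed to requests.get through its cookies= parameter instead of putting the raw string in the headers.

```python
def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # a value may itself contain '='
        cookies[name.strip()] = value.strip()  # a repeated name keeps the last value
    return cookies


# Placeholder values, not a real session.
raw_cookie = "bid=EXAMPLE; dbcl2=12345:EXAMPLE; ck=ABCD"
cookies = parse_cookie_header(raw_cookie)
print(cookies)
```

Note that a dict holds only one value per name, so if the browser sends the same cookie name twice, only the last one survives in this sketch.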

Step 2: Analyze the HTML

As a simple test, fetch one page of reviews for House M.D. (《豪斯医生》). Inspecting the HTML of that page shows that the content we need is in the second p tag under each td inside a tr.

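Step 3: Write the code

HTML parsing is done with BeautifulSoup. A simplified version of the Python code follows. (The output is UTF-8, so it is advisable to specify the character set explicitly.)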

#!/usr/bin/python
# -*- coding: utf-8 -*-
import io
import sys
import requests
from bs4 import BeautifulSoup

# Force UTF-8 output so Chinese text prints correctly on the Windows console.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

url = 'https://movie.douban.com/subject/1442129/collections?start=20'
# Cookie and User-Agent copied from the logged-in browser session (step 1).
headers = {
    'cookie': 'll=118172; bid=nO_yhRGdS8c; __utma=30149280.744941980.1587025849.1587025849.1587025849.1; __utmb=30149280.7.10.1587025849; __utmz=30149280.1587025849.1.1.utmcsr=so.com|utmccn=(referral)|utmcmd=referral|utmcct=/link; __utmt=1; push_noty_num=0; push_doumail_num=0; __utmv=30149280.18122; douban-profile-remind=1; __utmc=30149280; dbcl2=181229630:peNlRIftZSU; ck=0DBS; _vwo_uuid_v2=D6F0A378B72943607FFB8D0DE9AA9E4F2|e4b22c328b795c724132d4d5a5551615; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1587025959%2C%22https%3A%2F%2Fwww.douban.com%2Fsearch%3Fsource%3Dsuggest%26q%3D%25E9%2587%258D%25E7%2594%259F%22%5D; _pk_id.100001.4cf6=55b0d18436426829.1587025959.1.1587025959.1587025959.; _pk_ses.100001.4cf6=*; __utma=223695111.917770948.1587025959.1587025959.1587025959.1; __utmb=223695111.0.10.1587025959; __utmc=223695111; __utmz=223695111.1587025959.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/search; __yadk_uid=wBD152Qkg8CojaIRAPIB7nXOYiwGgYAj',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
}

response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'html.parser')
# The second <p> under each tr > td holds the review text.
print(soup.select("tr > td > p:nth-of-type(2)"))

The scraped reviews look like this (a rule can be added to strip the p tags):

[<p>看之前:不就是个医疗剧能拍出什么花??
看之后:为什么一个医疗剧可以拍出这么多花??</p>, <p>高中时期的下饭剧</p>]
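One way to drop the p tags is to read each tag's text with get_text and then save the plain strings for the word cloud step. This is a sketch under stated assumptions: the HTML fragment is hand-written to mimic the tr > td > p layout rather than taken from Douban, and save_reviews and the temporary output path are hypothetical names chosen for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the tr > td > p layout; not real Douban HTML.
html = """
<table>
  <tr><td><p>user info</p><p>高中时期的下饭剧</p></td></tr>
  <tr><td><p>user info</p><p> 看之前 </p></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tags = soup.select("tr > td > p:nth-of-type(2)")
# get_text(strip=True) returns the text without the surrounding <p> markup.
reviews = [tag.get_text(strip=True) for tag in tags]
print(reviews)


def save_reviews(reviews, path):
    """Write one review per line as UTF-8 text."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(reviews))


# Demo path in the system temp directory; replace with your own location.
out_path = os.path.join(tempfile.gettempdir(), "haosiyisheng_demo.txt")
save_reviews(reviews, out_path)
```

Writing the file as UTF-8 matches the encoding that the word cloud script uses when it reads the reviews back.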

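Step 4: Turn the scraped reviews into a word cloud

The main modules used are jieba, wordcloud, and image, all of which can be installed with pip (the code imports PIL, which the Pillow package provides). The files the code relies on are listed first, followed by the code itself.

Location of the scraped review data: F:\\python\\install_3_7_4\\txt\\haosiyisheng.txt

Location of a still from House M.D. found online: F:\\python\\install_3_7_4\\txt\\haosiyisheng.png

Location of the font used for the word cloud: C:/Windows/Fonts/msyh.ttc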

#!/usr/bin/python
# -*- coding: utf-8 -*-
from PIL import Image
from wordcloud import WordCloud
import numpy as np
import jieba
import matplotlib.pyplot as plt


def get_word_cloud():
    path_txt = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.txt"
    path_img = "F:\\python\\install_3_7_4\\txt\\haosiyisheng.png"
    # Read the scraped reviews as UTF-8 text; the file is closed automatically.
    with open(path_txt, 'r', encoding='UTF-8') as f:
        text = f.read()
    # The still image serves as the mask that shapes the cloud.
    background_image = np.array(Image.open(path_img))
    # jieba segments the Chinese text; WordCloud expects space-separated tokens.
    cut_text = " ".join(jieba.cut(text))

    wordcloud = WordCloud(
        font_path="C:/Windows/Fonts/msyh.ttc",
        background_color="white",
        mask=background_image
    ).generate(cut_text)

    # Save first, then display; plt.show() blocks until the window is closed.
    wordcloud.to_file(r"haosiyisheng_result.png")
    fig, ax = plt.subplots()
    ax.imshow(wordcloud)
    ax.axis("off")
    plt.show()


if __name__ == '__main__':
    get_word_cloud()
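The script above feeds every segmented token to the word cloud, so common function words can dominate the picture. A possible refinement, sketched here and not part of the original script, is to count keywords first and drop stop words. The stop-word set, the keyword_frequencies helper, and the hand-written token list are all assumptions for illustration; in practice the tokens would come from jieba.cut.

```python
from collections import Counter

# Hypothetical stop-word set; a real list would be much longer.
STOP_WORDS = {"的", "了", "是", "个", "就"}


def keyword_frequencies(tokens, stop_words=STOP_WORDS, min_length=2):
    """Count tokens, skipping stop words, blanks, and tokens shorter than min_length."""
    counts = Counter()
    for token in tokens:
        token = token.strip()
        if len(token) < min_length or token in stop_words:
            continue
        counts[token] += 1
    return dict(counts)


# Tokens written by hand to stand in for the output of jieba.cut.
tokens = ["医疗", "剧", "的", "医疗", "下饭", "剧", " "]
freqs = keyword_frequencies(tokens)
print(freqs)
```

A word-to-count dict like this can be handed to WordCloud.generate_from_frequencies in place of the generate call on the joined text.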

[Figure: the final word cloud]

Step 5: Errors encountered while coding, and their solutions

1. Fixing ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

Running pip install <module> often fails with this error:

ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

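You can try switching to a different pip index. Mirrors in China:

http://pypi.douban.com/ Douban
http://pypi.hustunique.com/ Huazhong University of Science and Technology
http://pypi.sdutlinux.org/ Shandong University of Technology
http://pypi.mirrors.ustc.edu.cn/ University of Science and Technology of China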

The simplest approach is to name the index directly on the command line. For example, to use the Douban mirror:

pip install -i https://pypi.douban.com/simple <package to install>
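When several packages need the same mirror, the command can also be assembled from Python. The following is only a sketch: pip_install_command is a hypothetical helper, and the 60-second value for pip's --timeout option is an arbitrary choice, not something the original setup used.

```python
import sys


def pip_install_command(package, index_url="https://pypi.douban.com/simple", timeout=60):
    """Build a 'python -m pip install' command that uses a mirror and a longer timeout."""
    return [
        sys.executable, "-m", "pip", "install",
        "--timeout", str(timeout),  # socket timeout in seconds
        "-i", index_url,            # index to install from
        package,
    ]


cmd = pip_install_command("jieba")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Calling pip through sys.executable helps ensure the package lands in the same interpreter that runs the scripts.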

2. Installing wordcloud

Installing wordcloud ran into a small snag. The approach that worked is as follows.

First, open: https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud

Download the wordcloud build that matches your Python version. My local Python is 3.7 (tag cp37), so I downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl.
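The tags in the wheel filename (cp37 and win32 here) have to match the interpreter, not just the operating system: a 32-bit Python on 64-bit Windows still needs the win32 build. A quick way to check both is sketched below. The wheel_tags helper is a hypothetical name, and the mapping from bitness to a platform tag is a simplification that only covers Windows.

```python
import struct
import sys


def wheel_tags():
    """Return the CPython tag (e.g. 'cp37') and the interpreter's pointer size in bits."""
    py_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    bits = struct.calcsize("P") * 8  # 32 or 64, depending on the Python build
    return py_tag, bits


py_tag, bits = wheel_tags()
# Simplified mapping that applies to Windows wheels only.
platform_tag = "win32" if bits == 32 else "win_amd64"
print(py_tag, platform_tag)
```

If the tags do not match, pip refuses to install the file and reports that the wheel is not supported on this platform.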

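Install the wheel module, which is used here for installing .whl files (recent pip versions can install .whl files on their own, so this step may be unnecessary):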

pip install wheel

Copy the downloaded wordcloud-1.6.0-cp37-cp37m-win32.whl file into the /Scripts directory of the Python installation, then run the following from that directory:

$ pip install wordcloud-1.6.0-cp37-cp37m-win32.whl
Processing f:\python\install_3_7_4\scripts\wordcloud-1.6.0-cp37-cp37m-win32.whl
Requirement already satisfied: pillow in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (7.1.1)
Requirement already satisfied: numpy>=1.6.1 in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (1.18.2)
Requirement already satisfied: matplotlib in f:\python\install_3_7_4\lib\site-packages (from wordcloud==1.6.0) (3.2.1)
Requirement already satisfied: kiwisolver>=1.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.4.7)
Requirement already satisfied: cycler>=0.10 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in f:\python\install_3_7_4\lib\site-packages (from matplotlib->wordcloud==1.6.0) (2.8.1)
Requirement already satisfied: six in f:\python\install_3_7_4\lib\site-packages (from cycler>=0.10->matplotlib->wordcloud==1.6.0) (1.14.0)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.6.0

3. Use pip list to view the installed modules

$ pip list
Package         Version
--------------- ----------
asgiref         3.2.7
beautifulsoup4  4.9.0
bs4             0.0.1
certifi         2020.4.5.1
chardet         3.0.4
cycler          0.10.0
Django          3.0.5
idna            2.9
image           1.5.30
jieba           0.39
kiwisolver      1.2.0
matplotlib      3.2.1
numpy           1.18.2
Pillow          7.1.1
pip             19.2.3
pyparsing       2.4.7
python-dateutil 2.8.1
pytz            2019.3
requests        2.23.0
setuptools      40.8.0
six             1.14.0
soupsieve       2.0
sqlparse        0.3.1
urllib3         1.25.8
wheel           0.34.2
wordcloud       1.
