Scraping Douban reviews with a Python crawler and turning them into a word cloud

Original post by bigsai (WeChat official account), 2022-08-24 14:15:37


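© Copyright belongs to the author: this is an original work by bigsai, published on the 51CTO blog. Contact the author for permission to repost; unauthorized reposting may be pursued legally.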

Preface

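A while ago my school had a project to build a movie ticketing system. I built the system with Spring Boot and used Python to scrape some basic information about the movies. Later it occurred to me that turning the reviews into word clouds would make for a pretty cool display, so I am writing up the process here.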

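It is not particularly sophisticated technology, but it is a word cloud I made myself, so I can't help being a little excited.
Libraries used:
Crawler: requests for fetching pages, BeautifulSoup (bs4, with the lxml parser) for parsing, pymysql for storing to the database.
Word cloud generation: wordcloud (word clouds), jieba (Chinese word segmentation), matplotlib (displaying the image).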

Database

ciyun

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for ciyun
-- ----------------------------
DROP TABLE IF EXISTS `ciyun`;
CREATE TABLE `ciyun` (
  `moviename` varchar(255) DEFAULT NULL,
  `id` int(11) DEFAULT NULL,
  `text` varchar(8000) CHARACTER SET utf8mb4 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

movie

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for movie
-- ----------------------------
DROP TABLE IF EXISTS `movie`;
CREATE TABLE `movie` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `type` varchar(255) DEFAULT NULL,
  `time_long` int(20) DEFAULT NULL,
  `description` varchar(1500) DEFAULT NULL,
  `img` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=205 DEFAULT CHARSET=utf8mb4;

The crawler:

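Writing a crawler is mostly a matter of analyzing how the site's pages and links are related and structured. What I want to scrape is the short comments.

Open Douban Movies, select the horror (恐怖) category, and inspect the attributes of each movie listed there.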


Scrolling down the page also triggers AJAX requests that return more data as plain, unencrypted JSON.

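Because the paging lives in the query string, further pages of the listing can be requested by changing page_start. A minimal sketch, assuming the parameters mean what their names suggest (they are copied from the URL used in this post's script; build_list_url is a hypothetical helper, not part of any Douban API):

```python
from urllib.parse import urlencode

BASE = "https://movie.douban.com/j/search_subjects"


def build_list_url(tag, page_start, page_limit=20, sort="time"):
    """Build one listing URL; parameter names are copied from the post's URL."""
    query = {
        "type": "movie",
        "tag": tag,  # urlencode percent-encodes the Chinese tag as UTF-8
        "sort": sort,
        "page_limit": page_limit,
        "page_start": page_start,
    }
    return BASE + "?" + urlencode(query)


# First three pages of the "动作" (action) tag, 20 movies per page
urls = [build_list_url("动作", start) for start in range(0, 60, 20)]
for u in urls:
    print(u)
```

Each URL can then be fetched with requests.get(u, headers=header) and decoded with .json(), as the full script does for the first page.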

Follow one of the links from this listing to a movie's page and you will find that its URL follows a regular pattern. From there, open the comments page.

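The regular URL pattern means the comments pages for any movie can be generated from its id alone. A small sketch, using the pattern that appears in this post's gettext script; the helper names comment_url and comment_urls are made up for illustration:

```python
def comment_url(movie_id, start, limit=20):
    """Comments-page URL for one movie; the pattern is copied from the post's script."""
    return ("https://movie.douban.com/subject/" + str(movie_id)
            + "/comments?start=" + str(start)
            + "&limit=" + str(limit) + "&sort=new_score&status=P")


def comment_urls(movie_id, pages=2, limit=20):
    """URLs for the first `pages` pages of short comments."""
    return [comment_url(movie_id, page * limit, limit) for page in range(pages)]


for u in comment_urls(30228425):
    print(u)
```

Raising pages collects more comments per movie, at the cost of more requests; whether the site serves deeper pages to anonymous visitors is not something this sketch checks.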

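Looking at the comments, you will find that every one of them sits in an element with the class short, so you can collect plenty of comments this way. Since the comments are only one part of what gets scraped, I won't go through the rest of the crawler in detail. Here is the core function that parses a comments page: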

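import requests
from bs4 import BeautifulSoup

def gettext(url):
    req = requests.get(url)
    res = req.text
    soup = BeautifulSoup(res, 'lxml')
    # Every short comment sits in an element with class "short"
    commit = soup.select(".short")
    text = ''
    for team in commit:
        text += team.text + ' '
    return text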

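Concatenate all the text for the same movie, then save it to the database first. (For a project with several stages I prefer to implement each stage as a separate step, which is more robust.)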
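Comment text routinely contains quotes, which break an SQL string assembled with % formatting. A small sketch of the safer parameterized form, using the standard library's sqlite3 as a stand-in for MySQL so it runs without a server (an assumption for illustration only; sqlite3 uses ? placeholders where pymysql uses %s, and the table is a simplified copy of ciyun):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Simplified copy of the ciyun table from this post
cur.execute("create table ciyun (moviename text, id integer, text text)")
cur.execute("insert into ciyun (moviename, id) values (?, ?)", ("demo", 30228425))

# Text with both kinds of quotes, which would break a hand-built SQL string
comments = "It's great! He said \"wow\" 太好看了"
cur.execute("update ciyun set text = ? where id = ?", (comments, 30228425))
conn.commit()

cur.execute("select text from ciyun where id = ?", (30228425,))
stored = cur.fetchone()[0]
print(stored == comments)
```

With pymysql the same update would be written as cur.execute("update ciyun set text=%s where id=%s", (comments, movie_id)), letting the driver do the escaping.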

Word cloud generation:

This is the text data in my database; each text value is very, very long.

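As a rough sketch of how the libraries listed at the top are commonly combined for this step (not necessarily the author's code): segment the text with jieba, count the words, and hand the frequencies to wordcloud. It assumes jieba, wordcloud and matplotlib are installed; the stop-word set, the output file name and the function names are hypothetical, and font_path must point to a font with Chinese glyphs or the words render as empty boxes.

```python
from collections import Counter


def count_words(tokens, stopwords=frozenset(), min_len=2):
    """Count tokens, dropping stop words and tokens shorter than min_len."""
    kept = (t.strip() for t in tokens)
    return Counter(t for t in kept if len(t) >= min_len and t not in stopwords)


def make_cloud(text, font_path, out_file="ciyun.png"):
    """Segment Chinese text with jieba and render a word cloud image."""
    # Imported here so count_words stays usable without the third-party packages
    import jieba
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # Hypothetical stop words; tune them to the comments being processed
    freqs = count_words(jieba.lcut(text), stopwords={"电影", "一个"})
    wc = WordCloud(font_path=font_path, width=800, height=600,
                   background_color="white").generate_from_frequencies(freqs)
    wc.to_file(out_file)          # save the image
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()                    # and display it


# count_words on its own, with an already-segmented sample
print(count_words(["恐怖", "的", "电影", "恐怖", " "], stopwords={"电影"}))
```

A call would look like make_cloud(text_from_db, font_path="simhei.ttf"), where text_from_db is one text value read from the ciyun table and the font file name is only an example.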

The complete core code:

douban

import time

import pymysql
import requests
from bs4 import BeautifulSoup

db = pymysql.connect(host="localhost", user="root",
                     password="123456", database="project", port=3306)
# Get a cursor for executing SQL statements
cur = db.cursor()
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

# AJAX endpoint behind the listing page; the tag is the URL-encoded form of "动作" (action)
url1 = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E5%8A%A8%E4%BD%9C&sort=time&page_limit=20&page_start=0"
req = requests.get(url=url1, headers=header)
res = req.json()
a = res['subjects']


def judmovie(url, name, imgurl, movie_id):
    """Fetch one movie page, then store its details and an empty ciyun row."""
    req = requests.get(url, headers=header)
    res = req.text
    soup = BeautifulSoup(res, "lxml")
    timelong = soup.find(attrs={'property': 'v:runtime'}).text
    timelong = str(timelong).replace('分钟', '')  # strip the "minutes" suffix
    introduction = soup.find(attrs={'property': 'v:summary'}).text
    introduction = str(introduction).replace(' ', '')
    print(timelong, introduction)
    # Run these inserts once; re-running the script adds duplicate rows.
    # Parameterized queries let the driver escape quotes in titles and summaries.
    sql = "insert into movie(name,type,time_long,description,img) values(%s,'action',%s,%s,%s)"
    try:
        cur.execute(sql, (name, int(timelong), introduction, imgurl))
        db.commit()
    except Exception as e:
        print(e)
        db.rollback()
    sql2 = "insert into ciyun(moviename,id) values(%s,%s)"
    try:
        cur.execute(sql2, (name, movie_id))
        db.commit()
    except Exception as e:
        print(e)
        db.rollback()


for team in a:
    # print(team)
    movie_id = team['id']
    img = team['cover']
    url = team['url']
    name = str(team['title']).replace(' ', '')
    print(movie_id, name, img, url)
    try:
        judmovie(url, name, img, movie_id)
        time.sleep(1)  # pause between requests
    except Exception as e:
        print(e)

# judmovie("https://movie.douban.com/subject/30228425/?tag=%E6%81%90%E6%80%96&from=gaia", 'fds', 'jj', '30228425')

gettext:

import time

import pymysql
import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

db = pymysql.connect(host="localhost", user="root",
                     password="123456", database="project", port=3306)
# Get a cursor for executing SQL statements
cur = db.cursor()


def gettext(url):
    """Return all short comments on one comments page, joined by spaces."""
    req = requests.get(url, headers=header)
    res = req.text
    soup = BeautifulSoup(res, 'lxml')
    commit = soup.select(".short")
    text = ''
    for team in commit:
        text += team.text + ' '
    return text


# Name the columns explicitly: with "select *" index 2 would be the text column, not the id
sql = "select moviename, id from ciyun"
cur.execute(sql)
valuelist = cur.fetchall()
for value in valuelist:
    time.sleep(0.4)  # pause between movies
    try:
        name = value[0]
        movie_id = value[1]
        # First two pages of short comments, 20 per page
        url1 = "https://movie.douban.com/subject/" + str(movie_id) + "/comments?start=0&limit=20&sort=new_score&status=P"
        url2 = "https://movie.douban.com/subject/" + str(movie_id) + "/comments?start=20&limit=20&sort=new_score&status=P"
        tex1 = gettext(url1)
        tex2 = gettext(url2)
        tex1 += tex2
        print(1, tex1)
        # Parameterized update: comment text often contains quotes
        sql = "update ciyun set text=%s where id=%s"
        cur.execute(sql, (tex1, movie_id))
        db.commit()
    except Exception as e:
        print(e)

# Manual check of the parser on a single movie
print(gettext("https://movie.douban.com/subject/30228425/comments?status=P"))