0. Key points

requests: sending requests

re: parsing page data

json: extracting JSON data

csv: saving data in tabular form

1. Third-party libraries

requests >>> pip install requests

2. Development environment

    Version: Python 3.8

    Editor: PyCharm 2021.2

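3. Module installation problems

Press Win + R, type cmd, then run the install command pip install <module name>. (If installation is slow, you can switch to a mirror in China.)

 - How to install a third-party Python module:

     1. Press Win + R, type cmd, click OK, then type the install command pip install <module name> (for example, pip install requests) and press Enter.

     2. In PyCharm, click Terminal and type the install command there.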

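 - Why installation fails:

     - Failure 1: pip is not recognized as an internal command

         Fix: set the environment variables, that is, add the Python installation directory and its Scripts subdirectory to PATH.

     - Failure 2: a wall of red error output (Read timed out)

         Fix: the network connection timed out, so switch to a mirror:

             Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple

             Alibaba Cloud: https://mirrors.aliyun.com/pypi/simple/

             University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/

             Huazhong University of Science and Technology: https://pypi.hustunique.com/

             Shandong University of Technology: https://pypi.sdutlinux.org/

             Douban: https://pypi.douban.com/simple/

             Example: pip3 install -i https://pypi.doubanio.com/simple/ <module name>

     - Failure 3: cmd reports that the module is already installed, or that it installed successfully, but PyCharm still cannot import it

         Fix: you may have several Python installations (only one of Anaconda or Python is needed), so uninstall one of them;

                 or the Python interpreter in PyCharm is not configured correctly.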
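
The mirror switch described above can also be scripted. The following is only a minimal sketch: the MIRRORS dictionary and the build_pip_command helper are names invented for this illustration, and the two URLs are copied from the list above. Running pip through sys.executable -m pip is a common way to make sure the package lands in the interpreter that is actually running the script, which also sidesteps Failure 1 and Failure 3.

```python
import sys

# Mirror index URLs copied from the list above; the keys are labels chosen for this sketch.
MIRRORS = {
    "tsinghua": "https://pypi.tuna.tsinghua.edu.cn/simple",
    "aliyun": "https://mirrors.aliyun.com/pypi/simple/",
}


def build_pip_command(package, mirror="tsinghua"):
    """Return the argument list that installs `package` from the chosen mirror."""
    # sys.executable is the interpreter running this script, so pip installs into it.
    return [sys.executable, "-m", "pip", "install", "-i", MIRRORS[mirror], package]


command = build_pip_command("requests")
print(" ".join(command))
# To actually run the installation:
#   import subprocess
#   subprocess.run(command, check=True)
```

The command is only printed here; uncomment the last two lines to run it.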

4. Configuring the Python interpreter in PyCharm

1. Go to File >>> Settings >>> Project >>> Python Interpreter

2. Click the gear icon and choose Add

3. Add the path of your Python installation
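
To check which interpreter PyCharm is really using, a short script run from inside PyCharm can help. This is a rough sketch using only the standard library; describe_environment is a hypothetical helper name. If the printed interpreter path differs from the one where pip installed the module, that usually explains the import error.

```python
import importlib.util
import sys


def describe_environment(module_name):
    """Report the running interpreter and whether `module_name` can be imported from it."""
    return {
        "interpreter": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # find_spec returns None when a top-level module cannot be found.
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


info = describe_environment("requests")
print(info)
```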

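5. Installing plugins in PyCharm

1. Go to File >>> Settings >>> Plugins

2. Click Marketplace and type the name of the plugin you want, for example "translation" for a translation plugin or "Chinese" for the Chinese language pack

3. Select the plugin and click Install

4. After installation, PyCharm offers to restart; confirm, and the plugin takes effect after the restart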

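6. Basic approach to web scraping

Python (simple, efficient, rich in modules and features)

   Scraper program

       Collects data from the internet in bulk (text, images, audio, video)

   Principle:

       Act like a browser and send requests to the server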
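
What "acting like a browser" means in practice is mostly sending the headers a browser would send. The sketch below builds such a request without sending it. It uses the standard library's urllib.request instead of requests so that it runs with no extra installation; the Referer value and the make_browser_like_request helper are assumptions made for this illustration.

```python
from urllib.request import Request

# Headers a browser would normally send. The Referer value is a placeholder for this sketch.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
    ),
    "Referer": "https://s.taobao.com/",
}


def make_browser_like_request(url):
    """Build (but do not send) a GET request that carries browser-style headers."""
    return Request(url, headers=BROWSER_HEADERS)


request = make_browser_like_request("https://s.taobao.com/search?q=test")
print(request.get_method(), request.full_url)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

With requests, the same dictionary would be passed as the headers argument of requests.get.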

How to build a scraper for a given case:

   Find the data source

    https://s.taobao.com/search?ie=utf8&initiative_id=staobaoz_20220803&stats_click=search_radio_all%3A1&js=1&imgfile=&q=%E6%B3%B3%E8%A1%A3%E5%A5%B3%E5%A4%8F%E5%AD%A3&suggest=0_2&_input_charset=utf-8&wq=%E6%B3%B3%E8%A1%A3&suggest_query=%E6%B3%B3%E8%A1%A3&source=suggest
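
The q parameter in the URL above is the percent-encoded search keyword 泳衣女夏季. As a rough illustration of how such a URL could be assembled for any keyword and page, the sketch below keeps only a few of the query parameters; whether the site accepts this reduced set is an assumption, and build_search_url and ITEMS_PER_PAGE are names made up here. The offset formula mirrors the s={page * 44} used in the full code.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://s.taobao.com/search"
ITEMS_PER_PAGE = 44  # the full code steps the `s` offset by 44 per page


def build_search_url(keyword, page):
    """Build a search URL; urlencode percent-encodes the keyword as UTF-8."""
    params = {"q": keyword, "ie": "utf8", "s": page * ITEMS_PER_PAGE}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("泳衣女夏季", 1)
print(url)

# Decoding the query string recovers the original keyword and offset.
query = parse_qs(urlsplit(url).query)
print(query["q"], query["s"])
```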

7. Full code

import requests     # sends HTTP requests; third-party, must be installed
import re
import json
import csv

# Uncomment to write the CSV header row (only needed once):
# with open('taobao.csv', mode='a', encoding='utf-8', newline='') as f:
#     csv_writer = csv.writer(f)
#     csv_writer.writerow(['raw_title', 'pic_url', 'detail_url', 'view_price', 'item_loc', 'view_sales', 'nick'])

# Disguise the request as coming from a browser
headers = {
'cookie': 'cna=s/5FG78j/FUCAa8APiecOvNg; lgc=tb668512329; tracknick=tb668512329; thw=cn; enc=5QzxAFeTLCIaj4DdlClUUmCfmppq0mVmYnRM4MnjLLB4RjqMpvuUixwqmjkBvCn0Jgo9mK5a7GX5bTUVvYOjcKlG6Dcyihb49SfHSHh4p5w%3D; t=c1a7661aebc8b0eee31b756f0feeff62; _tb_token_=f17333878dd31; _m_h5_tk=4121cfdc611986d82be69f74d3c29f02_1659536900036; _m_h5_tk_enc=cfe86496d903c8670edb6df8d9008465; xlly_s=1; cookie2=17f624d84070bbd6d85563a647087846; _samesite_flag_=true; sgcookie=E100mMZhcjay7BJ0U6dkbhG6C500Ca%2FFJHGrQDTTkuu7sIBT4Vvt6geS1GV5dolt%2FZ14wi031qNjkp543s5U%2BulN9GYdFqm8S3V%2FxQ%2FyrrbqnGQ%3D; unb=2210627905944; uc3=lg2=Vq8l%2BKCLz3%2F65A%3D%3D&id2=UUpgRsItw%2BrsB7dvyw%3D%3D&nk2=F5RDKmf768KMcHQ%3D&vt3=F8dCv4GzuqQxFlJO5FQ%3D; csg=d213a0bb; cancelledSubSites=empty; cookie17=UUpgRsItw%2BrsB7dvyw%3D%3D; dnk=tb668512329; skt=7f611b532a6d5d3b; existShop=MTY1OTUyOTE2NA%3D%3D; uc4=nk4=0%40FY4I6earzOZXUhcMjuCdOoW0PkQqMw%3D%3D&id4=0%40U2gqyZJ81Yv14cp6ZGKPzfdzKRn7Ce%2F%2F; _cc_=VFC%2FuZ9ajQ%3D%3D; _l_g_=Ug%3D%3D; sg=94f; _nk_=tb668512329; cookie1=WvY2bcMyBjwC2%2FESfKPhqaOXs%2FXPxaxugpcVR2PVSmM%3D; v=0; mt=ci=0_1; uc1=cookie15=W5iHLLyFOGW7aA%3D%3D&cookie14=UoexOzkHHad4ew%3D%3D&cookie16=U%2BGCWk%2F74Mx5tgzv3dWpnhjPaQ%3D%3D&cookie21=U%2BGCWk%2F7pY%2FF&existShop=false&pas=0; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com; JSESSIONID=6C253D0599A8D872843E8F57D3E9FBC4; tfstk=cPacBgTZHoojq0FU_rgfSRGdRUIRZlYZKPz7zsNvv9-Y8zzPixxyY4MOZxXC3h1..; l=eBrY7YtILf1CVXMtBOfwlurza77tJIRfguPzaNbMiOCP_75p5KhGW6xNhxL9CnGVn6rXR35Wn1oBBSYikyUBhJpKPJLCgsDLIdTh.; isg=BDg4Vvj3rnHVaMLWcAwzn0AICebKoZwrBlMZpXKpvHMmjdl3GrJQu5ivRYU93VQD; x5sec=7b227365617263686170703b32223a226236666434373934313735396466356131393239396366306264393430326536434a7a33715a6347454d486e74766657374e507174774561447a49794d5441324d6a63354d4455354e4451374d6a436e68594b652f502f2f2f2f384251414d3d227d',
'referer': 'https://s.taobao.com/search?q=%E9%BB%91%E4%B8%9D%E6%83%85%E8%B6%A3&suggest=0_4&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.jianhua.201856-taobao-item.2&ie=utf8&initiative_id=tbindexz_20170306&_input_charset=utf-8&wq=%E9%BB%91%E4%B8%9D&suggest_query=%E9%BB%91%E4%B8%9D&source=suggest',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
for page in range(61, 71):
    print(f'---- Scraping page {page} ----')
    url = f'https://s.taobao.com/search?ie=utf8&initiative_id=staobaoz_20220803&stats_click=search_radio_all%3A1&js=1&imgfile=&q=%E6%B3%B3%E8%A1%A3%E5%A5%B3%E5%A4%8F%E5%AD%A3&suggest=0_2&_input_charset=utf-8&wq=%E6%B3%B3%E8%A1%A3&suggest_query=%E6%B3%B3%E8%A1%A3&source=suggest&bcoffset=3&ntoffset=3&p4ppushleft=2%2C48&s={page * 44}'
    # 1. Send the request
    response = requests.get(url=url, headers=headers)
    # 2. Get the response body
    html_data = response.text
    # 3. Parse the data (pull out the fields we want)
    # JSON is the format used to pass data between the front end and the back end:
    #   front end: the web pages the user sees
    #   back end: the functionality and the data transfer
    # A JSON object looks like {"": "", "": ""}
    # 'g_page_config = (.*);' is the pattern: what to match
    # html_data is the text to search in
    json_str = re.findall('g_page_config = (.*);', html_data)[0]
    # Convert the JSON string into a Python dict
    json_dict = json.loads(json_str)
    auctions = json_dict['mods']['itemlist']['data']['auctions']
    for auction in auctions:
        try:
            raw_title = auction['raw_title']
            pic_url = auction['pic_url']
            detail_url = auction['detail_url']
            view_price = auction['view_price']
            item_loc = auction['item_loc']
            view_sales = auction['view_sales']
            nick = auction['nick']
            print(raw_title, pic_url, detail_url, view_price, item_loc, view_sales, nick)
            # 4. Save the data
            with open('taobao.csv', mode='a', encoding='utf-8', newline='') as f:
                csv_writer = csv.writer(f)
                csv_writer.writerow([raw_title, pic_url, detail_url, view_price, item_loc, view_sales, nick])
        except KeyError:
            # Skip entries that lack one of the expected fields
            continue
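
The parsing and saving steps can be tried offline, without a cookie or a network connection. The sketch below runs the same regular expression and json.loads call against a small made-up page and writes the result to an in-memory CSV. SAMPLE_HTML, its values, the reduced FIELDS list, and both helper functions are invented for this illustration; only the g_page_config pattern and the mods/itemlist/data/auctions path come from the code above.

```python
import csv
import io
import json
import re

# A made-up page fragment. As in the pattern above, the JSON sits on a single line.
SAMPLE_HTML = '''
<script>
    g_page_config = {"mods": {"itemlist": {"data": {"auctions": [{"raw_title": "Sample item", "view_price": "59.00", "nick": "sample shop"}]}}}};
    g_srp_loadCss();
</script>
'''

FIELDS = ["raw_title", "view_price", "nick"]  # a subset of the fields, to keep the sketch short


def extract_auctions(html):
    """Return the list of auction dicts embedded in the page, or [] if the marker is absent."""
    match = re.search(r"g_page_config = (.*);", html)
    if match is None:
        # For example a login or verification page that has no g_page_config.
        return []
    config = json.loads(match.group(1))
    return config["mods"]["itemlist"]["data"]["auctions"]


def write_rows(auctions, file_obj):
    """Write a header row once, then one CSV row per auction."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDS)
    writer.writeheader()
    for auction in auctions:
        # Missing fields become empty cells instead of raising KeyError.
        writer.writerow({field: auction.get(field, "") for field in FIELDS})


buffer = io.StringIO()
write_rows(extract_auctions(SAMPLE_HTML), buffer)
print(buffer.getvalue())
```

Opening the output once and writing the header before the loop avoids reopening the file for every row; for a real file, pass open('taobao.csv', mode='w', encoding='utf-8', newline='') in place of the StringIO buffer.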
