Author: voiceless_Li
Homepage: My homepage
Learning focus: network security, Python
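Author's note: Today is day two, and I'm back again.
Daily reminder: Stay clear-headed so that you can walk life's road steadily, unafraid of wind and rain.
Today's topic: batch image scraping with Python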
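Today I'm sharing how to download images from the web in bulk. It is done with Python, and it counts as a kind of web-scraping technique.
Search engine: Google
Environment: Python
Tool: PyCharm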
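1. Confirm the target website (in this post, Baidu image search at image.baidu.com)
2. Find the image addresses
Inspecting the page shows the addresses of these images. Next, look under Fetch/XHR in the browser's developer tools (this is where the JSON responses are listed) for the matching JSON file. With this many images, it is usually enough to check the few largest responses: one of them holds the image addresses, stored under the key middleURL.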
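Approach: we have now found the JSON file that holds the image addresses. Its content is stored as a dictionary, and we also know which key the image addresses sit under. So we can request this JSON file directly with the requests module, use the jsonsearch module to pull out the image addresses, and then download and save each one. That gets us all of the images.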
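To make the lookup step concrete, here is a minimal sketch of what "find every value stored under one key in nested JSON" amounts to, using only the standard library. This is not how jsonsearch is implemented; the function name `find_all_values` and the sample response are made up for illustration, and the real response is much larger.

```python
import json


def find_all_values(node, key):
    """Collect every value stored under `key` anywhere in a nested JSON object."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the key may also appear deeper inside the value.
            found.extend(find_all_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all_values(item, key))
    return found


# Hypothetical, heavily shortened response (assumed shape, not real data).
sample = json.loads("""
{
  "data": [
    {"middleURL": "https://example.com/a.jpg", "width": 500},
    {"middleURL": "https://example.com/b.jpg", "width": 640},
    {}
  ]
}
""")

urls = find_all_values(sample, "middleURL")
print(urls)
```

Entries that lack the key simply contribute nothing, which is why the last, empty item in the sample does not show up in the result.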
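3. Send a request to the site and get the JSON data
Collect the url, cookies, headers, and params values and put them into the script. There is a shortcut: copy the request from the browser's developer tools (curlconverter works from a cURL command, so this is typically the "Copy as cURL" option), paste it into https://curlconverter.com/, and the site generates the request data directly.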
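As a rough illustration of what the params dictionary does, the sketch below builds the final request URL by hand with the standard library. It uses only a small subset of the params from this post, and the name `BASE_URL` is introduced here for illustration; the requests library performs an equivalent encoding when it is given `params=`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://image.baidu.com/search/acjson'

# A small subset of the params used in this post; the real request carries more.
params = {
    'tn': 'resultjson_com',
    'word': '美食',
    'pn': '30',
    'rn': '30',
}

# urlencode percent-encodes non-ASCII values as UTF-8 by default.
full_url = f'{BASE_URL}?{urlencode(params)}'
print(full_url)

# Round trip: parsing the query string gives the original values back.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded['word'])
```

The encoded search word matches the `word=` value visible in the Referer header of the copied request, which is a quick way to confirm that the params were copied correctly.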
The resulting request code is included in the full listing at the end of this post.
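4. Search the data
Use jsonsearch to find the values stored under middleURL. Some of these values may be empty, and we don't need those, so an if/else filters them out and leaves only the addresses we want. For each remaining address, send a request to get the image's binary data and write it to a file. This produces a .jpg file, which is the downloaded image.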
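The filter-and-save step can be sketched on its own, without touching the network. In the sketch below, `save_images` and `fake_fetch` are hypothetical names, and the download is passed in as a function so the example runs offline. It also checks `startswith('https')`, which is slightly stricter than the `'https' in ...` test used in the full listing.

```python
import tempfile
from pathlib import Path


def save_images(urls, fetch, out_dir):
    """Save each usable URL as 1.jpg, 2.jpg, ... and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Skip empty or missing entries and anything that is not an https URL.
        if not isinstance(url, str) or not url.startswith('https'):
            continue
        path = out_dir / f'{len(saved) + 1}.jpg'
        path.write_bytes(fetch(url))  # fetch(url) must return the raw image bytes
        saved.append(path)
    return saved


def fake_fetch(url):
    # Stand-in for a real download so the sketch runs offline.
    return b'fake image bytes for ' + url.encode()


candidates = ['https://example.com/a.jpg', '', None, 'https://example.com/b.jpg']
with tempfile.TemporaryDirectory() as tmp:
    written = save_images(candidates, fake_fetch, tmp)
    names = [p.name for p in written]
    print(names)
```

With a real download, the `fetch` argument could be something like `lambda u: requests.get(u).content`, which is the call the full listing makes for each image.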
Finally, run the script and the images are downloaded.
The complete code:
```python
import requests
from jsonsearch import JsonSearch


class Photo:
    def __init__(self):
        # Cookies, headers, and params as generated by curlconverter from the
        # browser request; replace the cookies with those of your own session.
        self.cookies = {
            'BIDUPSID': '81F4DBA266ECC64B245BE58DDF72E8BF',
            'PSTM': '1711373180',
            'BAIDUID': '81F4DBA266ECC64B245BE58DDF72E8BF:FG=1',
            'BDUSS': 'VQd3RrOEZlMjc1UVl6dFFUelRES3ZENmtjcTBnV3YtUkVyLXdCNjB-Sm1QaTVtSVFBQUFBJCQAAAAAAAAAAAEAAADnoRyd0MS7-syr1tgzAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGaxBmZmsQZmOF',
            'BDUSS_BFESS': 'VQd3RrOEZlMjc1UVl6dFFUelRES3ZENmtjcTBnV3YtUkVyLXdCNjB-Sm1QaTVtSVFBQUFBJCQAAAAAAAAAAAEAAADnoRyd0MS7-syr1tgzAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGaxBmZmsQZmOF',
            'H_WISE_SIDS': '40304_40446_40080_60139',
            'H_PS_PSSID': '40304_40446_40080_60139',
            'BDORZ': 'B490B5EBF6F3CD402E515D22BCDA1598',
            'BA_HECTOR': '2kag20050h20ak2k00a5a50462sthn1j2v7jt1s',
            'BAIDUID_BFESS': '81F4DBA266ECC64B245BE58DDF72E8BF:FG=1',
            'ZFY': 'gJX2nzpxgjcPT39tOWqgMoLLqZbYZmEmBShwIlhJhH8:C',
            'indexPageSugList': '%5B%22%E7%BE%8E%E9%A3%9F%22%2C%22%E5%9F%8E%E4%B8%AD%E4%B9%8B%E5%9F%8E%22%2C%22%E7%88%AC%E8%99%AB%E5%A4%A7%E6%95%B0%E6%8D%AE%E6%8A%93%E5%8F%96%22%2C%22%E5%93%88%E8%8B%8FX2D%E7%9B%B8%E6%9C%BA%22%2C%22%E7%9B%B8%E6%9C%BA%22%2C%22%E8%80%B3%E6%9C%BA%22%2C%22%E5%85%85%E7%94%B5%E5%AE%9D%22%2C%22%E7%94%B5%E8%84%91%22%2C%22%E5%B9%B3%E6%9D%BF%22%5D',
            'BDRCVFR[X_XKQks0S63]': 'mk3SLVN4HKm',
            'userFrom': 'www.baidu.com',
            'firstShowTip': '1',
            'BDRCVFR[feWj1Vr5u3D]': 'I67x6TjHwwYf0',
            'PSINO': '6',
            'delPer': '0',
            'BDRCVFR[dG2JNJb_ajR]': 'mk3SLVN4HKm',
            'ab_sr': '1.0.1_ODY4MmZkMTkxMGNjNjY2NDg5YTBmZjAxNzYzYTBiODQyMmE3OTViZTU3NzNmYjQ2ZWVmNmJjMDMwMmY1NTNiYTczOTc5NTU3OGExNjUzODk4N2RkYjdiOGRlYmU0YTE4YTM0YWRlNzIxOGEwOGExNjA1YWZkMzVjNjIyMjYxOGNjNjMzZGZhNjc1NDI2MjhhYTc2ZmJiYjMzOGVmMTFhNjEzYmIzZThhNmRmNDk2ZjU0MzcyODRhYzYyMTA1MWJkOWNhZTQ2OWJkZWYzM2NmNjExZGJiNWE2YzFkZWIyY2I=',
            'BDRCVFR[-pGxjrCMryR]': 'mk3SLVN4HKm',
        }
        self.headers = {
            'Accept': 'text/plain, */*; q=0.01',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            # The Cookie header is sent through self.cookies above, so it is not repeated here.
            'Referer': 'https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&dyTabStr=MCwzLDEsMiw2LDQsNSw4LDcsOQ%3D%3D&word=%E7%BE%8E%E9%A3%9F',
            'Sec-Fetch-Dest': 'empty',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Site': 'same-origin',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest',
            'sec-ch-ua': '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Windows"',
        }
        self.params = {
            'tn': 'resultjson_com',
            'logid': '7862347861228774977',
            'ipn': 'rj',
            'ct': '201326592',
            'is': '',
            'fp': 'result',
            'fr': '',
            'word': '美食',
            'queryWord': '美食',
            'cl': '2',
            'lm': '-1',
            'ie': 'utf-8',
            'oe': 'utf-8',
            'adpicid': '',
            'st': '',
            'z': '',
            'ic': '',
            'hd': '',
            'latest': '',
            'copyright': '',
            's': '',
            'se': '',
            'tab': '',
            'width': '',
            'height': '',
            'face': '',
            'istype': '',
            'qc': '',
            'nc': '1',
            'expermode': '',
            'nojc': '',
            'isAsync': '',
            'pn': '30',
            'rn': '30',
            'gsm': '1e',
            '1714436075065': '',
        }

    def get_url(self):
        # Request the JSON endpoint that lists the image results.
        response = requests.get(
            'https://image.baidu.com/search/acjson',
            params=self.params,
            cookies=self.cookies,
            headers=self.headers,
        )
        return response.json()  # return the parsed JSON data

    def parse_data(self, data):
        searcher = JsonSearch(data, mode='j')  # data is the JSON object to search
        urls = searcher.search_all_value(key='middleURL')
        num = 1
        for url in urls:
            # Some middleURL values may be empty; skip anything without an https address.
            if not url or 'https' not in url:
                continue
            content = requests.get(url).content  # request the image and get its binary data
            with open(f'{num}.jpg', 'wb') as f:  # save the binary data as an image file
                f.write(content)
            num += 1


if __name__ == '__main__':
    photo = Photo()
    photo.parse_data(photo.get_url())
```