Python crawler requests: the requests library

Reposted

云端筑梦者 2023-11-24 12:05:16


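requests is a widely used HTTP request module written in Python. It makes fetching web pages straightforward and is a good HTTP library to start with when learning to write Python crawlers.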

Installing requests

Here I install requests through PyCharm. Open File -> Settings, find the Project section (Project: pycharm in my case), and click the plus sign on the right.


In the dialog that opens, type requests, select it in the list, and click the Install Package button below to install it.


Once the installation finishes, requests appears in the package list on the previous screen.


Using requests

Methods provided by requests

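Method	Description
requests.get()	Send a GET request to the server
requests.post()	Send a POST request to the server
requests.put()	Send a PUT request to the server
requests.head()	Fetch only the response headers
requests.patch()	Send a PATCH request to submit a partial modification
requests.delete()	Send a DELETE request to the server
requests.request()	Construct and send a request using any of the methods above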

get

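get(url, params=None, **kwargs)

url: the address of the page to fetch.
params: extra parameters for the URL query string, as a dict or bytes; optional.
**kwargs: parameters that control the request.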
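
To see what params does without touching the network, a request can be built and prepared locally. This is a minimal sketch: the parameter names and values (wd, pn) are made up for illustration, and requests.Request(...).prepare() is used only to inspect the final URL, not to send anything.

```python
from urllib.parse import parse_qs, urlsplit

import requests

# Illustrative query parameters (assumed values, not from the original post).
params = {"wd": "爬虫", "pn": 10}

# Build and prepare the request locally; nothing is sent over the network.
prepared = requests.Request("GET", "https://httpbin.org/get", params=params).prepare()

# requests URL-encodes the dict and appends it to the URL as the query string.
print(prepared.url)

# Decoding the query string recovers the original values (as lists of strings).
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```

Non-ASCII values such as the Chinese keyword are percent-encoded as UTF-8 in the URL, which is why the printed address looks different from the dict that was passed in.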

post

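post(url, data=None, json=None, **kwargs)

url: the address to send the request to.
data: the content to send in the request body.
json: content to send in the request body, serialized as JSON.
**kwargs: parameters that control the request.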
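
The difference between data and json is easiest to see by preparing two requests locally and comparing them. The sketch below makes some assumptions: the https://httpbin.org/post URL and the sample payload are placeholders chosen for illustration, and nothing is actually sent.

```python
import json

import requests

url = "https://httpbin.org/post"          # assumed echo endpoint, not from the original post
payload = {"name": "alice", "age": 30}    # made-up sample data

# data=: the dict is form-encoded, like an HTML form submission.
form_req = requests.Request("POST", url, data=payload).prepare()
print(form_req.headers["Content-Type"], form_req.body)

# json=: the dict is serialized to JSON and the Content-Type header is set to match.
json_req = requests.Request("POST", url, json=payload).prepare()
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

Which one to use depends on what the server expects: HTML form handlers usually want data, while JSON APIs usually want json.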

request

request(method, url, **kwargs)

method: the HTTP method to use.
url: the address to request.
**kwargs: parameters that control the request, described in the following table.

Name	Description
data	(optional) Dictionary, list of tuples, bytes, or file-like object to send in the body of the request.
json	(optional) A JSON-serializable Python object to send in the body of the request.
headers	(optional) Dictionary of HTTP headers to send with the request.
cookies	(optional) Dict or CookieJar object to send with the request.
files	(optional) Dictionary of `'name': file-like-object` pairs (or `{'name': file-tuple}`) for multipart encoding upload. `file-tuple` can be a 2-tuple `('filename', fileobj)`, a 3-tuple `('filename', fileobj, 'content_type')`, or a 4-tuple `('filename', fileobj, 'content_type', custom_headers)`, where `'content_type'` is a string defining the content type of the given file and `custom_headers` is a dict-like object containing additional headers to add for the file.
auth	(optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
timeout	(optional) How many seconds to wait for the server to send data before giving up, as a float or a `(connect timeout, read timeout)` tuple.
allow_redirects	(optional) Boolean. Enable or disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to `True`.
proxies	(optional) Dictionary mapping protocol to the URL of the proxy.
verify	(optional) Either a boolean, in which case it controls whether the server's TLS certificate is verified, or a string, in which case it must be a path to a CA bundle to use. Defaults to `True`.
stream	(optional) If `False`, the response content is downloaded immediately.
cert	(optional) If a string, the path to an SSL client cert file (.pem). If a tuple, a `('cert', 'key')` pair.
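
Several of these options are typically combined in one call. The sketch below wraps requests.request() in a small helper; the helper name fetch, the default timeout values, and the User-Agent string are all hypothetical choices made for illustration.

```python
import requests


def fetch(url, **kwargs):
    """Send a GET request with a few common options; return the Response, or None on failure."""
    options = {
        "timeout": (3.05, 10),      # (connect timeout, read timeout) in seconds
        "allow_redirects": True,    # follow redirects (this is also the default)
        "headers": {"User-Agent": "example-crawler/0.1"},  # hypothetical UA string
    }
    options.update(kwargs)          # caller-supplied options override the defaults
    try:
        resp = requests.request("GET", url, **options)
        resp.raise_for_status()     # raise HTTPError for 4xx/5xx status codes
        return resp
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, connection errors, invalid URLs, and HTTP errors.
        print(f"request failed: {exc}")
        return None


# Example call (needs network access):
# resp = fetch("https://httpbin.org/get", timeout=5)
```

Setting a timeout explicitly matters in crawlers because requests does not time out by default, so a stalled server can otherwise block the program indefinitely.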

Response

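The methods above return a Response object, which has the following attributes and methods:

Attribute/method	Description
status_code	HTTP status code returned by the server
text	Response body as a string, decoded by requests using the encoding it determined
content	Response body as raw bytes
encoding	Encoding that requests guesses for the response body; text is decoded using it
json()	Parses the response body as JSON and returns the result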
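
The relationship between content, encoding, and text can be shown offline with a hand-built Response. This is only an illustrative sketch: assigning the private _content attribute is a shortcut to avoid a network call, and the JSON body is made up. In real code the Response comes back from requests.get() and similar calls.

```python
import requests

# Build a Response by hand so the example runs without network access.
# Setting the private _content attribute is for illustration only.
resp = requests.Response()
resp.status_code = 200
resp._content = '{"city": "北京"}'.encode("utf-8")

# With the correct encoding, text is the decoded body and json() parses it.
resp.encoding = "utf-8"
raw_bytes = resp.content      # raw bytes, independent of encoding
text_utf8 = resp.text         # bytes decoded using resp.encoding
data = resp.json()            # body parsed as JSON
print(resp.status_code, text_utf8, data)

# With the wrong encoding, text is garbled while content stays the same.
resp.encoding = "latin-1"
garbled = resp.text
print(garbled)
```

This is why garbled pages are usually fixed either by setting resp.encoding before reading resp.text, or by decoding resp.content directly.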

Examples

Example 1
First, send a GET request to https://httpbin.org/get

import requests

# Send a GET request and print the response body as text
resp = requests.get('https://httpbin.org/get')
print(resp.text)

The output, shown as a screenshot in the original post, is the JSON document that httpbin echoes back.

Example 2

Use the get method to search Baidu for the keyword 新型冠状病毒 (novel coronavirus)

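import requests

# Browser-like User-Agent so the request resembles one sent by Chrome
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
}

# Query-string parameters: wd is the search keyword
params = {
    "wd": "新型冠状病毒"
}
url = "https://www.baidu.com/s"
resp = requests.get(url, params=params, headers=headers)
print(resp.url)
# Save the page; write() returns the number of characters written
with open("index.html", "w", encoding="utf-8") as f:
    count = f.write(resp.content.decode("utf-8"))
    print(count)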

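The result is shown as a screenshot in the original post.

Copying the link from that output into a browser produces the page shown in the original post's next screenshot: Baidu has detected the automated request.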

Once a Cookie is added to the headers parameter, the request goes through.

import requests

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
    "Cookie": "BIDUPSID=41BB40B30F5505BBAFB383EC2890356C; PSTM=1582981521; BAIDUID=41BB40B30F5505BB1B849015D5D91E79:FG=1; BD_UPN=12314753; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDUSS=hpSllZUVQ5U1JnZkVzdVBJMVp1MmZhUGd0OG9KVTYwTnJvY1dpTVlpeVpyN0plRVFBQUFBJCQAAAAAAAAAAAEAAACXS39dw7u07V-088nxvN21vQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJkii16ZIoteeE; H_PS_PSSID=1464_31124_21090_31187_30903_31271_31228_30823_31085_26350_31164_22159; sugstore=1; H_PS_645EC=6b05Ik9MKkAEZv4IFslxBo9KEklxQ1OO4z81AbF42RTzgtIiyvevJEQieyM; BDSVRTM=0; WWW_ST=1586850915148"
}

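# Query-string parameters: wd is the search keyword
params = {
    "wd": "新型冠状病毒"
}
url = "https://www.baidu.com/s"
resp = requests.get(url, params=params, headers=headers)
print(resp.url)
# Save the page; write() returns the number of characters written
with open("index.html", "w", encoding="utf-8") as f:
    count = f.write(resp.content.decode("utf-8"))
    print(count)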

Opening the file that was just written shows the same page as an ordinary Baidu search.

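
As an alternative to pasting the whole cookie string into headers, the cookies parameter from the table above accepts a dict. The sketch below is one possible way to do that: the helper parse_cookie_header is hypothetical, the cookie values are placeholders rather than a real Baidu session, and the request is only prepared locally so the generated Cookie header can be inspected.

```python
import requests


def parse_cookie_header(raw):
    """Turn a 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Placeholder values for illustration, not a real session.
raw_cookie = "BAIDUID=EXAMPLE123; BD_UPN=12314753; sugstore=1"
cookies = parse_cookie_header(raw_cookie)

# Prepare (but do not send) the request to see the Cookie header requests builds.
prepared = requests.Request(
    "GET",
    "https://www.baidu.com/s",
    params={"wd": "python"},
    cookies=cookies,
).prepare()
print(prepared.headers["Cookie"])
```

Keeping cookies in a dict makes it easier to update a single value or to load them from a file instead of hard-coding a session string in the script.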