Requesting web pages with Python urllib2 and GET requests with Python urllib

Reposted

mob64ca13fc220d 2023-09-04 10:29:22


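When writing a Python crawler, we need to simulate network requests. The two main libraries for this are requests and Python's built-in urllib. requests is generally recommended: it is a higher-level wrapper built on top of urllib3 (a third-party library, separate from the built-in urllib). The main difference in how the two are used:

requests can build and send common GET and POST requests in a single call, whereas with urllib you usually construct the GET or POST request object first and then send it.

GET request: the request data is placed directly in the URL, as a query string.

POST request: the data is placed in the request body (the data argument), not in the URL. Parameters appended to the URL are not part of the submitted form data, so a server that only reads the body will not see them.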
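The difference between the two methods can be seen without sending anything. The following is a minimal sketch that only builds urllib request objects and inspects them; the parameter values are sample data, and the httpbin.org URLs are the test endpoints used later in this article.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request

params = {"name": "germey", "age": "22"}  # sample values for illustration
query = urlencode(params)

# GET: the parameters travel in the URL's query string, and there is no body.
get_request = Request("http://httpbin.org/get?" + query)

# POST: the same parameters travel in the body, which must be bytes.
post_request = Request("http://httpbin.org/post", data=query.encode("utf-8"))

for req in (get_request, post_request):
    print(req.get_method(), repr(urlsplit(req.full_url).query), req.data)
```

Nothing is sent over the network here. urllib chooses POST automatically as soon as a data argument is supplied.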

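Using urllib

Python 2 had two libraries for sending requests, urllib and urllib2; Python 3 merges them into a single urllib package. urllib is Python's built-in HTTP request library, so nothing extra needs to be installed. It supports headers, cookies, proxies, and similar features, but it is more cumbersome to use, because you have to handle encoding and decoding yourself on every request.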
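For code that has to run on both versions, one common idiom is to try the Python 3 import first and fall back to the Python 2 names. This is only a sketch of that idiom, not something the rest of this article depends on:

```python
try:
    # Python 3: everything lives in the merged urllib package
    from urllib.request import Request, urlopen
    from urllib.parse import urlencode
except ImportError:
    # Python 2: the same names were split between urllib2 and urllib
    from urllib2 import Request, urlopen
    from urllib import urlencode

# The imported names behave the same way on both versions.
print(urlencode({"q": "python"}))
```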

GET requests:

1. The simplest GET request for a web page, with no headers, cookies, or proxy:

import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Running this prints the page's HTML source.

2. A basic request to Baidu looks like this:

import urllib.request
# Imitate a browser by sending a User-Agent header
header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
request = urllib.request.Request('http://www.baidu.com', headers=header)
response = urllib.request.urlopen(request)
html = response.read().decode()
print(html)

3. Some addresses are assembled from a base URL plus query parameters, such as the Baidu Tieba search address used in the code below.

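In this case you need to supply the query data (or assemble the URL by hand). To request such a page, put the query fields in a dictionary and URL-encode it into a string that the server can recognize. Encoding is done with urlencode() from urllib.parse, which turns key: value pairs into a "key=value" string; decoding can be done with urllib.parse.unquote(). (Note that in Python 3 the function is urllib.parse.urlencode(), not urllib.urlencode() as in Python 2.)

The code: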

import urllib.request as urllib2
import urllib.parse
url = "http://tieba.baidu.com/f"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}
formdata = {
    "ie": "utf-8",
    "kw": "江苏科技大学",
    "fr": "search"
}
data = urllib.parse.urlencode(formdata)  # convert the dict to a URL-encoded query string
newurl = url + '?' + data
print(data)
request = urllib2.Request(newurl, headers=headers)
response = urllib2.urlopen(request)
html = response.read().decode("utf-8")
print(html)
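The encoding and decoding step can also be checked on its own, without sending a request. The sketch below reuses the same form fields; parse_qs() is not mentioned above and is included only as one convenient way to turn a query string back into a dictionary.

```python
from urllib.parse import parse_qs, unquote, urlencode

formdata = {"ie": "utf-8", "kw": "江苏科技大学", "fr": "search"}

# Encode: non-ASCII characters become %XX escapes of their UTF-8 bytes.
query = urlencode(formdata)
print(query)

# Decode: unquote() reverses the %XX escapes in the string.
print(unquote(query))

# parse_qs() splits and decodes in one step; each value comes back as a list.
decoded = parse_qs(query)
print(decoded)
```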

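POST requests:

POST and GET differ in how the parameters are passed. GET varies the request by assembling different URLs, while POST puts the data into the request body, simulating a form submission.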

import urllib.request
import urllib.parse
url = "http://tieba.baidu.com/f"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}
formdata = {
    "ie": "utf-8",
    "kw": "江苏科技大学",
    "fr": "search"
}
# URL-encode the form, then convert it to bytes, which urlopen() requires for a POST body
data = urllib.parse.urlencode(formdata).encode('utf-8')
print(data)
request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request)
html = response.read().decode("utf-8")
print(html)

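You can print the type of the response with type(): print(type(response)). The result is <class 'http.client.HTTPResponse'>, an HTTPResponse object. It mainly provides the methods read(), readinto(), getheader(name), getheaders(), and fileno(), and the attributes msg, version, status, reason, debuglevel, and closed.

The read() method returns the content of the page as bytes.

The status attribute holds the status code of the response: for example, 200 means the request succeeded and 404 means the page was not found.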
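To try these attributes without depending on an external site, the sketch below starts a throwaway HTTP server on the local machine and requests it with urllib. The HelloHandler class and its "hello" body are made up for this illustration; an empty ProxyHandler is used so that any proxy configured in the environment is bypassed.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = "hello".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default request log


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# An empty ProxyHandler sends the request directly to the local server.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url) as response:
    status = response.status
    reason = response.reason
    content_type = response.getheader("Content-Type")
    text = response.read().decode("utf-8")  # read() returns bytes

server.shutdown()
server.server_close()
print(status, reason, content_type, text)
```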

Setting a proxy:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener
# Map each URL scheme to the proxy that should handle it
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

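Using requests

The sections above covered urllib, which is inconvenient in several ways. Here we look at requests, a more powerful library. requests is built on top of urllib3 and covers the vast majority of what urllib can do, such as cookie management and proxies.

Example: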

import requests
r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

Calling get() here works much like urlopen(): it returns a Response object. The script then prints the type of the Response, the status code, the type of the response body, the body text, and the cookies.

requests provides the other request methods in the same way:

import requests
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')

GET requests

1. A basic GET request

import requests
response = requests.get('http://httpbin.org/get')
print(response.text)

2. A GET request with parameters (passing name and age)

import requests
response = requests.get("http://httpbin.org/get?name=germey&age=22")
print(response.text)

Or pass the parameters through the params argument:

import requests
data = {
    'name': 'germey',
    'age': 22
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)
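Both forms end up requesting the same URL. One way to confirm this without sending anything is to prepare the request and look at the URL that requests would use; this sketch assumes the requests package is installed.

```python
import requests

data = {"name": "germey", "age": 22}

# prepare() builds the final request (URL, headers, body) but does not send it.
prepared = requests.Request("GET", "http://httpbin.org/get", params=data).prepare()
print(prepared.url)
```

Letting requests build the query string also takes care of URL-encoding any special characters in the values.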

3. Parsing JSON

Parse the response body as JSON:

import requests
import json
response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))

Output:

<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4'}, 'origin': '183.64.61.29', 'url': 'http://httpbin.org/get'}
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4'}, 'origin': '183.64.61.29', 'url': 'http://httpbin.org/get'}
<class 'dict'>
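As the identical output suggests, response.json() gives the same result here as calling json.loads() on response.text. The sketch below shows that step in isolation with the standard json module; the text variable is an assumed sample body, shortened from the output above.

```python
import json

# Assumed sample body, shaped like the httpbin.org output shown above.
text = '{"args": {}, "origin": "183.64.61.29", "url": "http://httpbin.org/get"}'

parsed = json.loads(text)
print(type(text), type(parsed))
print(parsed["url"])

# A body that is not JSON (an HTML error page, for example) raises an error,
# so it can be worth guarding the call.
try:
    json.loads("<html></html>")
except json.JSONDecodeError as exc:
    print("not JSON:", exc.msg)
```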

4. Fetching binary data

import requests
response = requests.get("https://github.com/favicon.ico")
print(type(response.text), type(response.content))
print(response.text)
print(response.content)

response.content is of type bytes, so its output is not shown here.
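Binary content such as an icon is usually saved to a file rather than printed. The following is a minimal sketch of that step; the content variable is a placeholder standing in for response.content, not real icon data, and the file is written to a temporary directory.

```python
import os
import tempfile

# Placeholder bytes standing in for response.content (assumption, not a real icon).
content = b"\x00\x00\x01\x00placeholder"

path = os.path.join(tempfile.mkdtemp(), "favicon.ico")

# Open in binary mode ("wb") so the bytes are written unchanged.
with open(path, "wb") as f:
    f.write(content)

with open(path, "rb") as f:
    saved = f.read()
print(type(saved), len(saved))
```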

5. Adding headers

import requests
headers = {
 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)

POST requests

import requests
data = {'name': 'germey', 'age': '22'}
response = requests.post("http://httpbin.org/post", data=data)
print(response.text)
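To see where the data argument ends up, the request can be prepared without being sent. This sketch assumes the requests package is installed and uses the same sample values as above.

```python
import requests

data = {"name": "germey", "age": "22"}

# prepare() builds the final request but does not send it.
prepared = requests.Request("POST", "http://httpbin.org/post", data=data).prepare()

print(prepared.url)                      # the URL carries no query string
print(prepared.body)                     # the form fields are in the body
print(prepared.headers["Content-Type"])  # set automatically for form data
```

Unlike urllib, requests encodes the dictionary for you, so there is no manual urlencode() or encode() step.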

Proxy settings:

import requests
proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

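Summary: this covers the basic usage of two commonly used Python request libraries, requests and urllib. For more advanced usage, consult each library's documentation.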

