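urllib2 is a powerful HTTP client library for Python, built as a further layer on top of httplib, and it is very widely used.
Newer options such as requests and httplib2 exist, but urllib2 still benefits from a large installed base and a wealth of material online.
Below are some tips I collected while learning it.
1. The usual form: urllib2.Request(url, data, headers)
Here url is the address you want to request, data is the body to submit when using POST, and headers is a dictionary of HTTP request headers.
Note that Request must be written with a capital R.
Here is the start of the Request class from the urllib2 source, which makes the signature clear: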
class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False):
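Also note that in HTTP both POST and GET travel over TCP. As far as urllib2 is concerned, the difference is simply that a POST carries a request body and a GET does not.
An excerpt from the source: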
    def get_method(self):
        if self.has_data():
            return "POST"
        else:
            return "GET"
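The rule in get_method can be checked without touching the network. The following is a minimal sketch: the URL and the form field are made up for illustration, and the try/except import is only there so the snippet also runs on Python 3, where urllib2 was merged into urllib.request.

```python
try:
    import urllib2                      # Python 2
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent
    from urllib.parse import urlencode

URL = 'http://example.com/login'        # placeholder URL, never contacted

# No data: urllib2 will send a GET.
get_request = urllib2.Request(URL)

# With data: urllib2 will send a POST. The body must be bytes on Python 3.
body = urlencode({'user': 'demo'}).encode('ascii')
post_request = urllib2.Request(URL, body)

get_verb = get_request.get_method()
post_verb = post_request.get_method()
print(get_verb)
print(post_verb)
```

Nothing is sent here; the Request object only decides the verb from whether data was supplied.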
Anyway:
The urllib2.py source file is in C:\Python27\Lib, where C:\Python27 is my Python 2.7 installation directory.
# Test 1: basic request (debug switch not enabled)
import urllib2

request = urllib2.Request('https://www.hicloud.com/others/login.action')
response = urllib2.urlopen(request)
headdata = response.info()

print 'headdata\n==========\n'
print headdata
bodydata = response.read()
print 'bodydata\n==========\n'
print bodydata.decode('utf-8')
print 'content type\n=========\n'
print headdata["Content-Type"], headdata.get('Content-Type')
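The output follows these notes. A few things in it are worth noticing:
1. Content-Type: text/html means the body is an HTML document sent as text. If the request were for an image, the value would be something like Content-Type: image/png, meaning a PNG image.
2. Content-Length: 227 means the body is 227 bytes long.
3. Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: __bsi=16631850667377372470_00_41_R_N_2_0303_C02F_N_I_I_0; expires=Sat, 14-May-16 15:36:40 GMT; domain=www.baidu.com; path=/
Each Set-Cookie header sets one cookie; the two shown here are BD_NOT_HTTPS and __bsi.
The path, domain and expires attributes deserve a mention: path is the URL path the cookie applies to, domain is the site it applies to, and expires is when it stops being valid.
As you will see later with cookielib, the path is matched automatically. For details, look up a reference on how cookies work.
4. The headers object behaves like a dictionary, so values can be read with dict["keyname"].
For example, headdata["Content-Type"] returns "text/html", and so does headdata.get('Content-Type').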
headdata
==========

Server: bfe/1.0.8.14
Date: Sun, 15 May 2016 06:27:23 GMT
Content-Type: text/html
Content-Length: 227
Connection: close
Last-Modified: Thu, 09 Oct 2014 10:47:57 GMT
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=8175A285DB3111973624C549E1B74B04; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1463293643; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
P3P: CP=" OTI DSP COR IVA OUR IND COM "
X-UA-Compatible: IE=Edge,chrome=1
Pragma: no-cache
Cache-control: no-cache
Accept-Ranges: bytes
Set-Cookie: __bsi=16743375626368856491_00_103_N_N_3_0303_C02F_N_N_N_0; expires=Sun, 15-May-16 06:27:28 GMT; domain=www.baidu.com; path=/

bodydata
==========

<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
content type
=========

text/html text/html
2. Turning on the urllib2 debug switch (debuglevel)
import urllib2

# HTTP debugging; debuglevel defaults to 0, which prints no log
httpHandler = urllib2.HTTPHandler(debuglevel=1)
# HTTPS debugging
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
request = urllib2.Request('https://www.hicloud.com/others/login.action')
response = urllib2.urlopen(request)
Again, an excerpt from the urllib2 source:
class HTTPSHandler(AbstractHTTPHandler):
    def __init__(self, debuglevel=0, context=None):
        AbstractHTTPHandler.__init__(self, debuglevel)
        self._context = context
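3. Automatic redirect support
urllib2 follows redirects automatically: if the server answers with a redirect, urllib2 requests the target URL itself and returns the final result.
A comment from the urllib2 source: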
The HTTPRedirectHandler automatically deals with HTTP 301, 302, 303 and 307
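In other words, when the server returns a redirect status code such as 301, 302, 303 or 307, urllib2 automatically issues a new request to the redirect target.
# Test 3: redirects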
import urllib2
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
request=urllib2.Request('http://www.hicloud.com/others/login.action')
response=urllib2.urlopen(request)
headdata=response.info()
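The debug output follows these notes:
1. First request: GET /others/login.action. The site returns a 302 status code with the redirect target:
https://hwid1.vmall.com/casserver/logout?service=https://www.hicloud.com:443/logout
2. Second request: GET
https://hwid1.vmall.com/casserver/logout?service=https://www.hicloud.com:443/logout. The site returns 302 again and redirects further.
Note that the code only requests login.action; the second request is triggered automatically by urllib2.
3. To tell whether a redirect happened, call response.geturl() to get the URL of the last request actually made and compare it with the original URL.
The source lists these members of the response object: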
- info(): return a mimetools.Message object for the headers
- geturl(): return the original request URL
- code: HTTP status code
send: 'GET /others/login.action HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.hicloud.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: openresty
header: Date: Sun, 15 May 2016 06:31:07 GMT
header: Content-Length: 0
header: Connection: close
header: Set-Cookie: JSESSIONID=FBC076F023D84EADB07CDBD4B4A50D02; Path=/; Secure; HttpOnly
header: X-Frame-Options: SAMEORIGIN
header: Location: https://hwid1.vmall.com/casserver/logout?service=https://www.hicloud.com:443/logout
send: 'GET /casserver/logout?service=https://www.hicloud.com:443/logout HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: hwid1.vmall.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Cache-Control: no-store
header: Pragma: no-cache
header: Expires: Thu, 01 Jan 1970 00:00:00 GMT
header: Location: https://www.hicloud.com:443/logout
header: Content-Length: 0
header: Date: Sun, 15 May 2016 06:31:07 GMT
header: Connection: close
header: Server: myWebServer
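The geturl() comparison from note 3 can be tried against a throwaway local server instead of a real site. This is only a sketch under stated assumptions: the RedirectingHandler class and the /start and /final paths are invented for the demo, and the try/except import lets the same code run on Python 3, where urllib2 became urllib.request.

```python
import threading
try:
    import urllib2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
except ImportError:
    import urllib.request as urllib2
    from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectingHandler(BaseHTTPRequestHandler):
    """Toy server: /start answers 302 -> /final, any other path answers 200."""

    def do_GET(self):
        if self.path == '/start':
            self.send_response(302)
            self.send_header('Location', '/final')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'done'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), RedirectingHandler)  # port 0: any free port
worker = threading.Thread(target=server.serve_forever)
worker.daemon = True
worker.start()

start_url = 'http://127.0.0.1:%d/start' % server.server_address[1]
# An empty ProxyHandler keeps system proxy settings out of the local test.
opener = urllib2.build_opener(urllib2.ProxyHandler({}))
response = opener.open(start_url)
final_url = response.geturl()           # URL of the last request made
was_redirected = final_url != start_url
response.close()
server.shutdown()
server.server_close()

print(final_url)
print(was_redirected)
```

The code asks only for /start, yet geturl() reports the /final URL, which is how the redirect is detected.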
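4. Cookie support
cookielib manages a site's cookies automatically: when you make several requests to the same site, the cookies returned by one response are sent along with the next request.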
# Test 4: cookies
import urllib2
import cookielib

cookie = cookielib.CookieJar()
handler = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
request = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(request)
print '=' * 80
print response.info()
print '=' * 80
for item in cookie:
    print item.name, item.value
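The output follows these notes:
1. cookielib manages cookies automatically: the next request carries the cookies returned by the previous response. As you probably know, many sites rely on cookies to decide whether you are logged in.
2. cookielib matches cookie paths automatically. Suppose there are two cookies with the same name:
sessionid=1 path=/login
sessionid=2 path=/info
If the next request is for /login/login.action, only sessionid=1 is sent.
3. Given these two points, once cookielib is installed you rarely need to think about cookies when accessing sites that check them.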
================================================================================
Date: Sun, 15 May 2016 06:38:43 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: Close
Vary: Accept-Encoding
Set-Cookie: BAIDUID=C6E051C620B5D4FACEFF6C8580108AD1:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=C6E051C620B5D4FACEFF6C8580108AD1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1463294323; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=18881_1443_18280_19570_19805_19559_19808_19843_19860_15926_11783_16915; path=/; domain=.baidu.com
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Cache-Control: private
Cxy_all: baidu+09d233841a4ab58847e9ffbc5fadb3f6
Expires: Sun, 15 May 2016 06:37:45 GMT
X-Powered-By: HPHP
Server: BWS/1.1
X-UA-Compatible: IE=Edge,chrome=1
BDPAGETYPE: 1
BDQID: 0x8c4280bb0043a8c8
BDUSERID: 0
================================================================================
BAIDUID C6E051C620B5D4FACEFF6C8580108AD1:FG=1
BIDUPSID C6E051C620B5D4FACEFF6C8580108AD1
H_PS_PSSID 18881_1443_18280_19570_19805_19559_19808_19843_19860_15926_11783_16915
PSTM 1463294323
BDSVRTM 0
BD_HOME 0
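The path matching described in note 2 can be reproduced offline by putting two hand-built cookies into a CookieJar. This is a sketch, not how cookies normally arrive: the make_cookie helper and the example.com domain are assumptions made for illustration, and the try/except import maps cookielib to http.cookiejar on Python 3.

```python
try:
    import urllib2
    import cookielib
except ImportError:
    import urllib.request as urllib2
    import http.cookiejar as cookielib


def make_cookie(name, value, path):
    """Build a session cookie for example.com by hand (normally a server sets it)."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='example.com', domain_specified=False, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={})


jar = cookielib.CookieJar()
jar.set_cookie(make_cookie('sessionid', '1', '/login'))
jar.set_cookie(make_cookie('sessionid', '2', '/info'))

request = urllib2.Request('http://example.com/login/login.action')
# HTTPCookieProcessor makes this same call just before a request is sent.
jar.add_cookie_header(request)
cookie_header = request.get_header('Cookie')
print(cookie_header)
```

Only the cookie whose path is a prefix of the request path ends up in the Cookie header; the /info cookie stays in the jar.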
5. Custom User-Agent
Many sites check the HTTP request headers, so you need to set a custom User-Agent to pose as a browser.
# Test 5: custom User-Agent
import urllib2

loginHeaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
    'Referer': 'https://www.baidu.com'
}
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
request = urllib2.Request('http://www.suning.com.cn', headers=loginHeaders)
response = urllib2.urlopen(request)
page = response.read()
print page
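Here is the debug output:
1. The Referer and User-Agent values in the request are now the custom ones, so posing as a browser is mostly a solved problem.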
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.suning.com.cn\r\nReferer: https://www.baidu.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Connection: close
header: Transfer-Encoding: chunked
header: Expires: Thu, 19 Nov 1981 08:52:00 GMT
header: Date: Sun, 15 May 2016 06:39:13 GMT
header: Content-Type: text/html; charset=utf-8
header: Server: nginx/1.2.9
header: Vary: Accept-Encoding
header: X-Powered-By: ThinkPHP
header: Set-Cookie: PHPSESSID=2c42hllfi5o2boop80u0i7rdq3; path=/
header: Cache-Control: private
header: Pragma: no-cache
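The custom headers can also be inspected on the Request object before anything is sent. A small sketch follows; the URL and the shortened User-Agent string are placeholders, and the try/except import is only for Python 3 compatibility.

```python
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2

custom_headers = {
    'User-Agent': 'Mozilla/5.0 (example browser string)',
    'Referer': 'https://www.baidu.com',
}
request = urllib2.Request('http://example.com/', headers=custom_headers)

# Request stores header names via str.capitalize(), so the key kept on the
# object is 'User-agent', not 'User-Agent'.
user_agent = request.get_header('User-agent')
referer = request.get_header('Referer')
has_mixed_case_key = request.has_header('User-Agent')
print(user_agent)
print(referer)
print(has_mixed_case_key)
```

The lookup is case-sensitive on the stored key, which is easy to trip over; on the wire the name still appears as User-Agent, as the send: line in the debug output shows.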
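6. Using cookies, debugging and a custom User-Agent together
Note the call urllib2.build_opener(cookieHandler, httpHandler, httpsHandler).
The handlers can be passed in any order, and you can even add a proxy handler. To see why this works, let's look at the source again (it's fine if you can't follow all of it; focus on the two commented spots).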
def build_opener(*handlers):
    # handlers can be thought of as an array of everything passed in
    """Create an opener object from a list of handlers.

    The opener will use several default handlers, including support
    for HTTP, FTP and when applicable, HTTPS.

    If any of the handlers passed as arguments are subclasses of the
    default handlers, the default handlers will not be used.
    """
    import types
    def isclass(obj):
        return isinstance(obj, (types.ClassType, type))

    opener = OpenerDirector()
    default_classes = [ProxyHandler, UnknownHandler, HTTPHandler,
                       HTTPDefaultErrorHandler, HTTPRedirectHandler,
                       FTPHandler, FileHandler, HTTPErrorProcessor]
    if hasattr(httplib, 'HTTPS'):
        default_classes.append(HTTPSHandler)
    skip = set()
    for klass in default_classes:
        # walk through the array of handlers passed in
        for check in handlers:
            if isclass(check):
                if issubclass(check, klass):
                    skip.add(klass)
            elif isinstance(check, klass):
                skip.add(klass)
# Test 6: debug, cookies and a custom User-Agent together
import urllib2
import cookielib

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
cookie = cookielib.CookieJar()
cookieHandler = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(cookieHandler, httpHandler, httpsHandler)
urllib2.install_opener(opener)

loginHeaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.0 Chrome/30.0.1599.101 Safari/537.36',
    'Referer': 'https://www.baidu.com'
}
request = urllib2.Request('http://www.suning.com.cn', headers=loginHeaders)
response = urllib2.urlopen(request)
page = response.read()
print response.info()
print page
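The claim that the order of handlers passed to build_opener does not matter, and that a passed-in handler replaces the default of the same class, can be checked without any network access. This is a rough sketch: handler_names is a helper made up for the demo, and the try/except import maps the modules to their Python 3 names.

```python
try:
    import urllib2
    import cookielib
except ImportError:
    import urllib.request as urllib2
    import http.cookiejar as cookielib

http_handler = urllib2.HTTPHandler(debuglevel=1)
cookie_handler = urllib2.HTTPCookieProcessor(cookielib.CookieJar())

# Same handlers, opposite argument order.
opener_a = urllib2.build_opener(cookie_handler, http_handler)
opener_b = urllib2.build_opener(http_handler, cookie_handler)


def handler_names(opener):
    # opener.handlers is the list that OpenerDirector.add_handler fills in
    return sorted(h.__class__.__name__ for h in opener.handlers)


# The default HTTPHandler is skipped because we passed our own instance.
http_handlers = [h for h in opener_a.handlers
                 if isinstance(h, urllib2.HTTPHandler)]
same_handler_set = handler_names(opener_a) == handler_names(opener_b)
print(len(http_handlers))
print(http_handlers[0] is http_handler)
print(same_handler_set)
```

Both openers end up with the same set of handler classes, and the only HTTPHandler present is the debug-enabled instance that was passed in.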
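7. A detailed walkthrough of the urllib2 source
Finally, here is an article that explains the urllib2 source. The author added plenty of comments, but it is still a bit hard for beginners, especially those who have not yet got a firm grasp of HTTP.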
http://xw2423.byr.edu.cn/blog/archives/794