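1. Getting to know web crawlers
(1) Advantages of Python for crawling: rapid development, cross-platform, interpreted, and a wide choice of crawler frameworks
(2) Types of web crawler: general-purpose, focused, incremental, and deep-web crawlers
Pros and cons of each type:
(1) General-purpose web crawler
Pros: broad coverage, large number of pages
Cons: pages are refreshed slowly
(2) Focused web crawler
Pros: crawls selectively; fewer pages, so it is fast
(3) Incremental web crawler
Pros: fetches only data that has been updated or changed
(4) Deep-web crawler
Pros: crawls content that is reachable through form submission
Basic principles of a web crawler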
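As a rough illustration of the basic principle, a crawler can be thought of as a loop: take a URL from a queue, fetch the page, extract its links, and queue the ones not seen yet. The sketch below is a minimal version under stated assumptions, not a production crawler: `crawl`, `LinkParser`, the `fetch` callable, and the `example.test` URLs are all hypothetical names introduced here, and an in-memory dictionary stands in for the network so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` is a hypothetical callable: url -> HTML text."""
    queue = deque([start_url])
    seen = {start_url}          # URLs already queued, to avoid revisiting
    visited = []                # URLs in the order they were fetched
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# An in-memory "site" (assumed data) so the sketch needs no network access.
fake_site = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}
order = crawl("http://example.test/", lambda url: fake_site.get(url, ""))
print(order)
```

In a real crawler, `fetch` would wrap one of the request libraries covered in the next section; a focused crawler would additionally filter which links are queued.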
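2. Network requests in Python
Code (using the standard-library urllib module):
import urllib.parse    # URL-encoding helpers
import urllib.request  # opens URLs and sends requests
# URL-encode the form data with urlencode, then convert it to UTF-8 bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# Open the target URL; passing data makes this a POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()  # read the response body
print(html)  # print what was read
Running the program prints the following to the console: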
D:\python\config-pathclass\python.exe D:/python/project/src/request/demo01.py
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.8"\n }, \n "json": null, \n "origin": "120.229.158.147, 120.229.158.147", \n "url": "https://httpbin.org/post"\n}\n'
Process finished with exit code 0
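To see what is actually sent as the POST body, the encoding step can be run on its own. This is a small offline sketch using only `urllib.parse`; the variable names are illustrative.

```python
import urllib.parse

form = {'word': 'hello'}
encoded = urllib.parse.urlencode(form)   # form fields as a query string
data = encoded.encode('utf8')            # urlopen expects bytes for POST data

print(data)        # b'word=hello'
print(len(data))   # 10, which matches "Content-Length": "10" in the output above

# Round trip: parse_qs turns the body back into a dict of lists.
decoded = urllib.parse.parse_qs(encoded)
print(decoded)     # {'word': ['hello']}
```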
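The urllib3 module
Steps:
First, install urllib3 from a cmd window:
pip install urllib3
Next, the code:
import urllib3
# Create a PoolManager, which handles the details of connection pooling and thread safety
http = urllib3.PoolManager()
# Send a request to the page to be crawled
response = http.request('GET','https://www.baidu.com/')
print(response.data)  # print the raw bytes that were read
Running the program prints the following to the console: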
D:\python\config-pathclass\python.exe D:/python/project/src/request/demo02.py
b'<!DOCTYPE html><!--STATUS OK-->\r\n<html>\r\n<head>\r\n\t<meta http-equiv="content-type" content="text/html;charset=utf-8">\r\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\r\n\t<link rel="dns-prefetch" href="//s1.bdstatic.com"/>\r\n\t<link rel="dns-prefetch" href="//t1.baidu.com"/>\r\n\t<link rel="dns-prefetch" href="//t2.baidu.com"/>\r\n\t<link rel="dns-prefetch" href="//t3.baidu.com"/>\r\n\t<link rel="dns-prefetch" href="//t10.baidu.com"/>\r\n\t<link rel="dns-prefetch" href="//t11.baidu.com"/>\r\n\t<link rel="dns-prefetch" href="//t12.baidu.com"/>\r\n\t<link rel="dns-prefetch" href="//b1.bdstatic.com"/>\r\n\t<title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>\r\n\t<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/home/css/index.css" rel="stylesheet" type="text/css" />\r\n\t<!--[if lte IE 8]><style index="index" >#content{height:480px\\9}#m{top:260px\\9}</style><![endif]-->\r\n\t<!--[if IE 8]><style index="index" >#u1 a.mnav,#u1 a.mnav:visited{font-family:simsun}</style><![endif]-->\r\n\t<script>var hashMatch = document.location.href.match(/#+(.*wd=[^&].+)/);if (hashMatch && hashMatch[0] && hashMatch[1]) {document.location.replace("http://"+location.host+"/s?"+hashMatch[1]);}var ns_c = function(){};</script>\r\n\t<script>function h(obj){obj.style.behavior=\'url(#default#homepage)\';var a = obj.setHomePage(\'//www.baidu.com/\');}</script>\r\n\t<noscript><meta http-equiv="refresh" content="0; url=/baidu.html?from=noscript"/></noscript>\r\n\t<script>window._ASYNC_START=new Date().getTime();</script>\r\n</head>\r\n<body link="#0000cc"><div id="wrapper" style="display:none;"><div id="u"><a href="//www.baidu.com/gaoji/preferences.html" οnmοusedοwn="return user_c({\'fm\':\'set\',\'tab\':\'setting\',\'login\':\'0\'})">\xe6\x90\x9c\xe7\xb4\xa2\xe8\xae\xbe\xe7\xbd\xae</a>|<a id="btop" href="/" οnmοusedοwn="return 
user_c({\'fm\':\'set\',\'tab\':\'index\',\'login\':\'0\'})">\xe7\x99\xbe\xe5\xba\xa6\xe9\xa6\x96\xe9\xa1\xb5</a>|<a id="lb" href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" οnclick="return false;" οnmοusedοwn="return user_c({\'fm\':\'set\',\'tab\':\'login\'})">\xe7\x99\xbb\xe5\xbd\x95</a><a href="https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" οnmοusedοwn="return user_c({\'fm\':\'set\',\'tab\':\'reg\'})" target="_blank" class="reg">\xe6\xb3\xa8\xe5\x86\x8c</a></div><div id="head"><div class="s_nav"><a href="/" class="s_logo" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'logo\'})"><img src="//www.baidu.com/img/baidu_jgylogo3.gif" width="117" height="38" border="0" alt="\xe5\x88\xb0\xe7\x99\xbe\xe5\xba\xa6\xe9\xa6\x96\xe9\xa1\xb5" title="\xe5\x88\xb0\xe7\x99\xbe\xe5\xba\xa6\xe9\xa6\x96\xe9\xa1\xb5"></a><div class="s_tab" id="s_tab"><a href="http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=" wdfield="word" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'news\'})">\xe6\x96\xb0\xe9\x97\xbb</a> <b>\xe7\xbd\x91\xe9\xa1\xb5</b> <a href="http://tieba.baidu.com/f?kw=&fr=wwwt" wdfield="kw" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'tieba\'})">\xe8\xb4\xb4\xe5\x90\xa7</a> <a href="http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt" wdfield="word" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'zhidao\'})">\xe7\x9f\xa5\xe9\x81\x93</a> <a href="http://music.baidu.com/search?fr=ps&key=" wdfield="key" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'music\'})">\xe9\x9f\xb3\xe4\xb9\x90</a> <a href="http://image.baidu.com/i?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&word=" wdfield="word" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'pic\'})">\xe5\x9b\xbe\xe7\x89\x87</a> <a href="http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=" wdfield="word" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'video\'})">\xe8\xa7\x86\xe9\xa2\x91</a> <a 
href="http://map.baidu.com/m?word=&fr=ps01000" wdfield="word" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'map\'})">\xe5\x9c\xb0\xe5\x9b\xbe</a> <a href="http://wenku.baidu.com/search?word=&lm=0&od=0" wdfield="word" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'wenku\'})">\xe6\x96\x87\xe5\xba\x93</a> <a href="//www.baidu.com/more/" οnmοusedοwn="return c({\'fm\':\'tab\',\'tab\':\'more\'})">\xe6\x9b\xb4\xe5\xa4\x9a\xc2\xbb</a></div></div><form id="form" name="f" action="/s" class="fm" ><input type="hidden" name="ie" value="utf-8"><input type="hidden" name="f" value="8"><input type="hidden" name="rsv_bp" value="1"><span class="bg s_ipt_wr"><input name="wd" id="kw" class="s_ipt" value="" maxlength="100"></span><span class="bg s_btn_wr"><input type="submit" id="su" value="\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b" class="bg s_btn" οnmοusedοwn="this.className=\'bg s_btn s_btn_h\'" οnmοuseοut="this.className=\'bg s_btn\'"></span><span class="tools"><span id="mHolder"><div id="mCon"><span>\xe8\xbe\x93\xe5\x85\xa5\xe6\xb3\x95</span></div><ul id="mMenu"><li><a href="javascript:;" name="ime_hw">\xe6\x89\x8b\xe5\x86\x99</a></li><li><a href="javascript:;" name="ime_py">\xe6\x8b\xbc\xe9\x9f\xb3</a></li><li class="ln"></li><li><a href="javascript:;" name="ime_cl">\xe5\x85\xb3\xe9\x97\xad</a></li></ul></span><span class="shouji"><strong>\xe6\x8e\xa8\xe8\x8d\x90 : </strong><a href="http://w.x.baidu.com/go/mini/8/10000020" οnmοusedοwn="return ns_c({\'fm\':\'behs\',\'tab\':\'bdbrowser\'})">\xe7\x99\xbe\xe5\xba\xa6\xe6\xb5\x8f\xe8\xa7\x88\xe5\x99\xa8\xef\xbc\x8c\xe6\x89\x93\xe5\xbc\x80\xe7\xbd\x91\xe9\xa1\xb5\xe5\xbf\xab2\xe7\xa7\x92\xef\xbc\x81</a></span></span></form></div><div id="content"><div id="u1"><a href="http://news.baidu.com" name="tj_trnews" class="mnav">\xe6\x96\xb0\xe9\x97\xbb</a><a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a><a href="http://map.baidu.com" name="tj_trmap" class="mnav">\xe5\x9c\xb0\xe5\x9b\xbe</a><a 
href="http://v.baidu.com" name="tj_trvideo" class="mnav">\xe8\xa7\x86\xe9\xa2\x91</a><a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">\xe8\xb4\xb4\xe5\x90\xa7</a><a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" name="tj_login" id="lb" οnclick="return false;">\xe7\x99\xbb\xe5\xbd\x95</a><a href="//www.baidu.com/gaoji/preferences.html" name="tj_settingicon" id="pf">\xe8\xae\xbe\xe7\xbd\xae</a><a href="//www.baidu.com/more/" name="tj_briicon" id="bri">\xe6\x9b\xb4\xe5\xa4\x9a\xe4\xba\xa7\xe5\x93\x81</a></div><div id="m"><p id="lg"><img src="//www.baidu.com/img/bd_logo.png" width="270" height="129"></p><p id="nv"><a href="http://news.baidu.com">\xe6\x96\xb0 \xe9\x97\xbb</a>\xe3\x80\x80<b>\xe7\xbd\x91 \xe9\xa1\xb5</b>\xe3\x80\x80<a href="http://tieba.baidu.com">\xe8\xb4\xb4 \xe5\x90\xa7</a>\xe3\x80\x80<a href="http://zhidao.baidu.com">\xe7\x9f\xa5 \xe9\x81\x93</a>\xe3\x80\x80<a href="http://music.baidu.com">\xe9\x9f\xb3 \xe4\xb9\x90</a>\xe3\x80\x80<a href="http://image.baidu.com">\xe5\x9b\xbe \xe7\x89\x87</a>\xe3\x80\x80<a href="http://v.baidu.com">\xe8\xa7\x86 \xe9\xa2\x91</a>\xe3\x80\x80<a href="http://map.baidu.com">\xe5\x9c\xb0 \xe5\x9b\xbe</a></p><div id="fm"><form id="form1" name="f1" action="/s" class="fm"><span class="bg s_ipt_wr"><input type="text" name="wd" id="kw1" maxlength="100" class="s_ipt"></span><input type="hidden" name="rsv_bp" value="0"><input type=hidden name=ch value=""><input type=hidden name=tn value="baidu"><input type=hidden name=bar value=""><input type="hidden" name="rsv_spt" value="3"><input type="hidden" name="ie" value="utf-8"><span class="bg s_btn_wr"><input type="submit" value="\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b" id="su1" class="bg s_btn" οnmοusedοwn="this.className=\'bg s_btn s_btn_h\'" οnmοuseοut="this.className=\'bg s_btn\'"></span></form><span class="tools"><span id="mHolder1"><div id="mCon1"><span>\xe8\xbe\x93\xe5\x85\xa5\xe6\xb3\x95</span></div></span></span><ul 
id="mMenu1"><div class="mMenu1-tip-arrow"><em></em><ins></ins></div><li><a href="javascript:;" name="ime_hw">\xe6\x89\x8b\xe5\x86\x99</a></li><li><a href="javascript:;" name="ime_py">\xe6\x8b\xbc\xe9\x9f\xb3</a></li><li class="ln"></li><li><a href="javascript:;" name="ime_cl">\xe5\x85\xb3\xe9\x97\xad</a></li></ul></div><p id="lk"><a href="http://baike.baidu.com">\xe7\x99\xbe\xe7\xa7\x91</a>\xe3\x80\x80<a href="http://wenku.baidu.com">\xe6\x96\x87\xe5\xba\x93</a>\xe3\x80\x80<a href="http://www.hao123.com">hao123</a><span> | <a href="//www.baidu.com/more/">\xe6\x9b\xb4\xe5\xa4\x9a>></a></span></p><p id="lm"></p></div></div><div id="ftCon"><div id="ftConw"><p id="lh"><a id="seth" onClick="h(this)" href="/" οnmοusedοwn="return ns_c({\'fm\':\'behs\',\'tab\':\'homepage\',\'pos\':0})">\xe6\x8a\x8a\xe7\x99\xbe\xe5\xba\xa6\xe8\xae\xbe\xe4\xb8\xba\xe4\xb8\xbb\xe9\xa1\xb5</a><a id="setf" href="//www.baidu.com/cache/sethelp/index.html" οnmοusedοwn="return ns_c({\'fm\':\'behs\',\'tab\':\'favorites\',\'pos\':0})" target="_blank">\xe6\x8a\x8a\xe7\x99\xbe\xe5\xba\xa6\xe8\xae\xbe\xe4\xb8\xba\xe4\xb8\xbb\xe9\xa1\xb5</a><a οnmοusedοwn="return ns_c({\'fm\':\'behs\',\'tab\':\'tj_about\'})" href="http://home.baidu.com">\xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6</a><a οnmοusedοwn="return ns_c({\'fm\':\'behs\',\'tab\':\'tj_about_en\'})" href="http://ir.baidu.com">About Baidu</a></p><p id="cp">©2018 Baidu <a href="/duty/" name="tj_duty">\xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb</a> \xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7 <img src="http://s1.bdstatic.com/r/www/cache/static/global/img/gs_237f015b.gif"></p></div></div><div id="wrapper_wrapper"></div></div><div class="c-tips-container" id="c-tips-container"></div>\r\n<script>window.__async_strategy=2;</script>\r\n<script>var bds={se:{},su:{urdata:[],urSendClick:function(){}},util:{},use:{},comm : {domain:"http://www.baidu.com",ubsurl : 
"http://sclick.baidu.com/w.gif",tn:"baidu",queryEnc:"",queryId:"",inter:"",templateName:"baidu",sugHost : "http://suggestion.baidu.com/su",query : "",qid : "",cid : "",sid : "",indexSid : "",stoken : "",serverTime : "",user : "",username : "",loginAction : [],useFavo : "",pinyin : "",favoOn : "",curResultNum:"",rightResultExist:false,protectNum:0,zxlNum:0,pageNum:1,pageSize:10,newindex:0,async:1,maxPreloadThread:5,maxPreloadTimes:10,preloadMouseMoveDistance:5,switchAddMask:false,isDebug:false,ishome : 1},_base64:{domain : "http://b1.bdstatic.com/",b64Exp : -1,pdc : 0}};var name,navigate,al_arr=[];var selfOpen = window.open;eval("var open = selfOpen;");var isIE=navigator.userAgent.indexOf("MSIE")!=-1&&!window.opera;var E = bds.ecom= {};bds.se.mon = {\'loadedItems\':[],\'load\':function(){},\'srvt\':-1};try {bds.se.mon.srvt = parseInt(document.cookie.match(new RegExp("(^| )BDSVRTM=([^;]*)(;|$)"))[2]);document.cookie="BDSVRTM=;expires=Sat, 01 Jan 2000 00:00:00 GMT"; }catch(e){}</script>\r\n<script>if(!location.hash.match(/[^a-zA-Z0-9]wd=/)){document.getElementById("ftCon").style.display=\'block\';document.getElementById("u1").style.display=\'block\';document.getElementById("content").style.display=\'block\';document.getElementById("wrapper").style.display=\'block\';setTimeout(function(){try{document.getElementById("kw1").focus();document.getElementById("kw1").parentNode.className += \' iptfocus\';}catch(e){}},0);}</script>\r\n<script type="text/javascript" src="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/jquery/jquery-1.10.2.min_f2fb5194.js"></script>\r\n<script>(function(){var index_content = $(\'#content\');var index_foot= $(\'#ftCon\');var index_css= $(\'head [index]\');var index_u= $(\'#u1\');var result_u= $(\'#u\');var 
wrapper=$("#wrapper");window.index_on=function(){index_css.insertAfter("meta:eq(0)");result_common_css.remove();result_aladdin_css.remove();result_sug_css.remove();index_content.show();index_foot.show();index_u.show();result_u.hide();wrapper.show();if(bds.su&&bds.su.U&&bds.su.U.homeInit){bds.su.U.homeInit();}setTimeout(function(){try{$(\'#kw1\').get(0).focus();window.sugIndex.start();}catch(e){}},0);if(typeof initIndex==\'function\'){initIndex();}};window.index_off=function(){index_css.remove();index_content.hide();index_foot.hide();index_u.hide();result_u.show();result_aladdin_css.insertAfter("meta:eq(0)");result_common_css.insertAfter("meta:eq(0)");result_sug_css.insertAfter("meta:eq(0)");wrapper.show();};})();</script>\r\n<script>window.__switch_add_mask=1;</script>\r\n<script type="text/javascript" src="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/global/js/instant_search_newi_redirect1_20bf4036.js"></script>\r\n<script>initPreload();$("#u,#u1").delegate("#lb",\'click\',function(){try{bds.se.login.open();}catch(e){}});if(navigator.cookieEnabled){document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";}</script>\r\n<script>$(function(){for(i=0;i<3;i++){u($($(\'.s_ipt_wr\')[i]),$($(\'.s_ipt\')[i]),$($(\'.s_btn_wr\')[i]),$($(\'.s_btn\')[i]));}function u(iptwr,ipt,btnwr,btn){if(iptwr && ipt){iptwr.on(\'mouseover\',function(){iptwr.addClass(\'ipthover\');}).on(\'mouseout\',function(){iptwr.removeClass(\'ipthover\');}).on(\'click\',function(){ipt.focus();});ipt.on(\'focus\',function(){iptwr.addClass(\'iptfocus\');}).on(\'blur\',function(){iptwr.removeClass(\'iptfocus\');}).on(\'render\',function(e){var $s = iptwr.parent().find(\'.bdsug\');var l = $s.find(\'li\').length;if(l>=5){$s.addClass(\'bdsugbg\');}else{$s.removeClass(\'bdsugbg\');}});}if(btnwr && btn){btnwr.on(\'mouseover\',function(){btn.addClass(\'btnhover\');}).on(\'mouseout\',function(){btn.removeClass(\'btnhover\');});}}});</script>\r\n<script type="text/javascript" 
src="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/home/js/bri_7f1fa703.js"></script>\r\n<script>(function(){var _init=false;window.initIndex=function(){if(_init){return;}_init=true;var w=window,d=document,n=navigator,k=d.f1.wd,a=d.getElementById("nv").getElementsByTagName("a"),isIE=n.userAgent.indexOf("MSIE")!=-1&&!window.opera;(function(){if(/q=([^&]+)/.test(location.search)){k.value=decodeURIComponent(RegExp["\\x241"])}})();(function(){var u = G("u1").getElementsByTagName("a"), nv = G("nv").getElementsByTagName("a"), lk = G("lk").getElementsByTagName("a"), un = "";var tj_nv = ["news","tieba","zhidao","mp3","img","video","map"];var tj_lk = ["baike","wenku","hao123","more"];un = bds.comm.user == "" ? "" : bds.comm.user;function _addTJ(obj){addEV(obj, "mousedown", function(e){var e = e || window.event;var target = e.target || e.srcElement;if(target.name){ns_c({\'fm\':\'behs\',\'tab\':target.name,\'un\':encodeURIComponent(un)});}});}for(var i = 0; i < u.length; i++){_addTJ(u[i]);}for(var i = 0; i < nv.length; i++){nv[i].name = \'tj_\' + tj_nv[i];}for(var i = 0; i < lk.length; i++){lk[i].name = \'tj_\' + tj_lk[i];}})();(function() {var links = {\'tj_news\': [\'word\', \'http://news.baidu.com/ns?tn=news&cl=2&rn=20&ct=1&ie=utf-8\'],\'tj_tieba\': [\'kw\', \'http://tieba.baidu.com/f?ie=utf-8\'],\'tj_zhidao\': [\'word\', \'http://zhidao.baidu.com/search?pn=0&rn=10&lm=0\'],\'tj_mp3\': [\'key\', \'http://music.baidu.com/search?fr=ps&ie=utf-8\'],\'tj_img\': [\'word\', \'http://image.baidu.com/i?ct=201326592&cl=2&nc=1&lm=-1&st=-1&tn=baiduimage&istype=2&fm=&pv=&z=0&ie=utf-8\'],\'tj_video\': [\'word\', \'http://video.baidu.com/v?ct=301989888&s=25&ie=utf-8\'],\'tj_map\': [\'wd\', \'http://map.baidu.com/?newmap=1&ie=utf-8&s=s\'],\'tj_baike\': [\'word\', \'http://baike.baidu.com/search/word?pic=1&sug=1&enc=utf8\'],\'tj_wenku\': [\'word\', \'http://wenku.baidu.com/search?ie=utf-8\']};var domArr = [G(\'nv\'), G(\'lk\'),G(\'cp\')],kw = G(\'kw1\');for (var i = 0, l 
= domArr.length; i < l; i++) {domArr[i].onmousedown = function(e) {e = e || window.event;var target = e.target || e.srcElement,name = target.getAttribute(\'name\'),items = links[name],reg = new RegExp(\'^\\\\s+|\\\\s+\\x24\'),key = kw.value.replace(reg, \'\');if (items) {if (key.length > 0) {var wd = items[0], url = items[1],url = url + ( name === \'tj_map\' ? encodeURIComponent(\'&\' + wd + \'=\' + key) : ( ( url.indexOf(\'?\') > 0 ? \'&\' : \'?\' ) + wd + \'=\' + encodeURIComponent(key) ) );target.href = url;} else {target.href = target.href.match(new RegExp(\'^http:\\/\\/.+\\.baidu\\.com\'))[0];}}name && ns_c({\'fm\': \'behs\',\'tab\': name,\'query\': encodeURIComponent(key),\'un\': encodeURIComponent(bds.comm.user || \'\') });};}})();};if(window.pageState==0){initIndex();}})();document.cookie = \'IS_STATIC=1;expires=\' + new Date(new Date().getTime() + 10*60*1000).toGMTString();</script>\r\n</body></html>\r\n'
Process finished with exit code 0
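To print the content as decoded text instead of raw bytes:
print(response.data.decode())  # decode the bytes (UTF-8 by default) and print the text
Running the program prints the following to the console: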
D:\python\config-pathclass\python.exe D:/python/project/src/request/demo02.py
<!DOCTYPE html><!--STATUS OK-->
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<link rel="dns-prefetch" href="//s1.bdstatic.com"/>
<link rel="dns-prefetch" href="//t1.baidu.com"/>
<link rel="dns-prefetch" href="//t2.baidu.com"/>
<link rel="dns-prefetch" href="//t3.baidu.com"/>
<link rel="dns-prefetch" href="//t10.baidu.com"/>
<link rel="dns-prefetch" href="//t11.baidu.com"/>
<link rel="dns-prefetch" href="//t12.baidu.com"/>
<link rel="dns-prefetch" href="//b1.bdstatic.com"/>
<title>百度一下,你就知道</title>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/home/css/index.css" rel="stylesheet" type="text/css" />
<!--[if lte IE 8]><style index="index" >#content{height:480px\9}#m{top:260px\9}</style><![endif]-->
<!--[if IE 8]><style index="index" >#u1 a.mnav,#u1 a.mnav:visited{font-family:simsun}</style><![endif]-->
<script>var hashMatch = document.location.href.match(/#+(.*wd=[^&].+)/);if (hashMatch && hashMatch[0] && hashMatch[1]) {document.location.replace("http://"+location.host+"/s?"+hashMatch[1]);}var ns_c = function(){};</script>
<script>function h(obj){obj.style.behavior='url(#default#homepage)';var a = obj.setHomePage('//www.baidu.com/');}</script>
<noscript><meta http-equiv="refresh" content="0; url=/baidu.html?from=noscript"/></noscript>
<script>window._ASYNC_START=new Date().getTime();</script>
</head>
<body link="#0000cc"><div id="wrapper" style="display:none;"><div id="u"><a href="//www.baidu.com/gaoji/preferences.html" onmousedown="return user_c({'fm':'set','tab':'setting','login':'0'})">搜索设置</a>|<a id="btop" href="/" onmousedown="return user_c({'fm':'set','tab':'index','login':'0'})">百度首页</a>|<a id="lb" href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" onclick="return false;" onmousedown="return user_c({'fm':'set','tab':'login'})">登录</a><a href="https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" onmousedown="return user_c({'fm':'set','tab':'reg'})" target="_blank" class="reg">注册</a></div><div id="head"><div class="s_nav"><a href="/" class="s_logo" onmousedown="return c({'fm':'tab','tab':'logo'})"><img src="//www.baidu.com/img/baidu_jgylogo3.gif" width="117" height="38" border="0" alt="到百度首页" title="到百度首页"></a><div class="s_tab" id="s_tab"><a href="http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=" wdfield="word" onmousedown="return c({'fm':'tab','tab':'news'})">新闻</a> <b>网页</b> <a href="http://tieba.baidu.com/f?kw=&fr=wwwt" wdfield="kw" οnmοusedοwn="return c({'fm':'tab','tab':'tieba'})">贴吧</a> <a href="http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt" wdfield="word" οnmοusedοwn="return c({'fm':'tab','tab':'zhidao'})">知道</a> <a href="http://music.baidu.com/search?fr=ps&key=" wdfield="key" οnmοusedοwn="return c({'fm':'tab','tab':'music'})">音乐</a> <a href="http://image.baidu.com/i?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&word=" wdfield="word" οnmοusedοwn="return c({'fm':'tab','tab':'pic'})">图片</a> <a href="http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=" wdfield="word" οnmοusedοwn="return c({'fm':'tab','tab':'video'})">视频</a> <a href="http://map.baidu.com/m?word=&fr=ps01000" wdfield="word" οnmοusedοwn="return c({'fm':'tab','tab':'map'})">地图</a> <a href="http://wenku.baidu.com/search?word=&lm=0&od=0" wdfield="word" οnmοusedοwn="return 
c({'fm':'tab','tab':'wenku'})">文库</a> <a href="//www.baidu.com/more/" οnmοusedοwn="return c({'fm':'tab','tab':'more'})">更多»</a></div></div><form id="form" name="f" action="/s" class="fm" ><input type="hidden" name="ie" value="utf-8"><input type="hidden" name="f" value="8"><input type="hidden" name="rsv_bp" value="1"><span class="bg s_ipt_wr"><input name="wd" id="kw" class="s_ipt" value="" maxlength="100"></span><span class="bg s_btn_wr"><input type="submit" id="su" value="百度一下" class="bg s_btn" οnmοusedοwn="this.className='bg s_btn s_btn_h'" οnmοuseοut="this.className='bg s_btn'"></span><span class="tools"><span id="mHolder"><div id="mCon"><span>输入法</span></div><ul id="mMenu"><li><a href="javascript:;" name="ime_hw">手写</a></li><li><a href="javascript:;" name="ime_py">拼音</a></li><li class="ln"></li><li><a href="javascript:;" name="ime_cl">关闭</a></li></ul></span><span class="shouji"><strong>推荐 : </strong><a href="http://w.x.baidu.com/go/mini/8/10000020" οnmοusedοwn="return ns_c({'fm':'behs','tab':'bdbrowser'})">百度浏览器,打开网页快2秒!</a></span></span></form></div><div id="content"><div id="u1"><a href="http://news.baidu.com" name="tj_trnews" class="mnav">新闻</a><a href="http://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a><a href="http://map.baidu.com" name="tj_trmap" class="mnav">地图</a><a href="http://v.baidu.com" name="tj_trvideo" class="mnav">视频</a><a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">贴吧</a><a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F" name="tj_login" id="lb" οnclick="return false;">登录</a><a href="//www.baidu.com/gaoji/preferences.html" name="tj_settingicon" id="pf">设置</a><a href="//www.baidu.com/more/" name="tj_briicon" id="bri">更多产品</a></div><div id="m"><p id="lg"><img src="//www.baidu.com/img/bd_logo.png" width="270" height="129"></p><p id="nv"><a href="http://news.baidu.com">新 闻</a> <b>网 页</b> <a href="http://tieba.baidu.com">贴 吧</a> <a href="http://zhidao.baidu.com">知 道</a> <a 
href="http://music.baidu.com">音 乐</a> <a href="http://image.baidu.com">图 片</a> <a href="http://v.baidu.com">视 频</a> <a href="http://map.baidu.com">地 图</a></p><div id="fm"><form id="form1" name="f1" action="/s" class="fm"><span class="bg s_ipt_wr"><input type="text" name="wd" id="kw1" maxlength="100" class="s_ipt"></span><input type="hidden" name="rsv_bp" value="0"><input type=hidden name=ch value=""><input type=hidden name=tn value="baidu"><input type=hidden name=bar value=""><input type="hidden" name="rsv_spt" value="3"><input type="hidden" name="ie" value="utf-8"><span class="bg s_btn_wr"><input type="submit" value="百度一下" id="su1" class="bg s_btn" οnmοusedοwn="this.className='bg s_btn s_btn_h'" οnmοuseοut="this.className='bg s_btn'"></span></form><span class="tools"><span id="mHolder1"><div id="mCon1"><span>输入法</span></div></span></span><ul id="mMenu1"><div class="mMenu1-tip-arrow"><em></em><ins></ins></div><li><a href="javascript:;" name="ime_hw">手写</a></li><li><a href="javascript:;" name="ime_py">拼音</a></li><li class="ln"></li><li><a href="javascript:;" name="ime_cl">关闭</a></li></ul></div><p id="lk"><a href="http://baike.baidu.com">百科</a> <a href="http://wenku.baidu.com">文库</a> <a href="http://www.hao123.com">hao123</a><span> | <a href="//www.baidu.com/more/">更多>></a></span></p><p id="lm"></p></div></div><div id="ftCon"><div id="ftConw"><p id="lh"><a id="seth" onClick="h(this)" href="/" οnmοusedοwn="return ns_c({'fm':'behs','tab':'homepage','pos':0})">把百度设为主页</a><a id="setf" href="//www.baidu.com/cache/sethelp/index.html" οnmοusedοwn="return ns_c({'fm':'behs','tab':'favorites','pos':0})" target="_blank">把百度设为主页</a><a οnmοusedοwn="return ns_c({'fm':'behs','tab':'tj_about'})" href="http://home.baidu.com">关于百度</a><a οnmοusedοwn="return ns_c({'fm':'behs','tab':'tj_about_en'})" href="http://ir.baidu.com">About Baidu</a></p><p id="cp">©2018 Baidu <a href="/duty/" name="tj_duty">使用百度前必读</a> 京ICP证030173号 <img 
src="http://s1.bdstatic.com/r/www/cache/static/global/img/gs_237f015b.gif"></p></div></div><div id="wrapper_wrapper"></div></div><div class="c-tips-container" id="c-tips-container"></div>
<script>window.__async_strategy=2;</script>
<script>var bds={se:{},su:{urdata:[],urSendClick:function(){}},util:{},use:{},comm : {domain:"http://www.baidu.com",ubsurl : "http://sclick.baidu.com/w.gif",tn:"baidu",queryEnc:"",queryId:"",inter:"",templateName:"baidu",sugHost : "http://suggestion.baidu.com/su",query : "",qid : "",cid : "",sid : "",indexSid : "",stoken : "",serverTime : "",user : "",username : "",loginAction : [],useFavo : "",pinyin : "",favoOn : "",curResultNum:"",rightResultExist:false,protectNum:0,zxlNum:0,pageNum:1,pageSize:10,newindex:0,async:1,maxPreloadThread:5,maxPreloadTimes:10,preloadMouseMoveDistance:5,switchAddMask:false,isDebug:false,ishome : 1},_base64:{domain : "http://b1.bdstatic.com/",b64Exp : -1,pdc : 0}};var name,navigate,al_arr=[];var selfOpen = window.open;eval("var open = selfOpen;");var isIE=navigator.userAgent.indexOf("MSIE")!=-1&&!window.opera;var E = bds.ecom= {};bds.se.mon = {'loadedItems':[],'load':function(){},'srvt':-1};try {bds.se.mon.srvt = parseInt(document.cookie.match(new RegExp("(^| )BDSVRTM=([^;]*)(;|$)"))[2]);document.cookie="BDSVRTM=;expires=Sat, 01 Jan 2000 00:00:00 GMT"; }catch(e){}</script>
<script>if(!location.hash.match(/[^a-zA-Z0-9]wd=/)){document.getElementById("ftCon").style.display='block';document.getElementById("u1").style.display='block';document.getElementById("content").style.display='block';document.getElementById("wrapper").style.display='block';setTimeout(function(){try{document.getElementById("kw1").focus();document.getElementById("kw1").parentNode.className += ' iptfocus';}catch(e){}},0);}</script>
<script type="text/javascript" src="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/jquery/jquery-1.10.2.min_f2fb5194.js"></script>
<script>(function(){var index_content = $('#content');var index_foot= $('#ftCon');var index_css= $('head [index]');var index_u= $('#u1');var result_u= $('#u');var wrapper=$("#wrapper");window.index_on=function(){index_css.insertAfter("meta:eq(0)");result_common_css.remove();result_aladdin_css.remove();result_sug_css.remove();index_content.show();index_foot.show();index_u.show();result_u.hide();wrapper.show();if(bds.su&&bds.su.U&&bds.su.U.homeInit){bds.su.U.homeInit();}setTimeout(function(){try{$('#kw1').get(0).focus();window.sugIndex.start();}catch(e){}},0);if(typeof initIndex=='function'){initIndex();}};window.index_off=function(){index_css.remove();index_content.hide();index_foot.hide();index_u.hide();result_u.show();result_aladdin_css.insertAfter("meta:eq(0)");result_common_css.insertAfter("meta:eq(0)");result_sug_css.insertAfter("meta:eq(0)");wrapper.show();};})();</script>
<script>window.__switch_add_mask=1;</script>
<script type="text/javascript" src="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/global/js/instant_search_newi_redirect1_20bf4036.js"></script>
<script>initPreload();$("#u,#u1").delegate("#lb",'click',function(){try{bds.se.login.open();}catch(e){}});if(navigator.cookieEnabled){document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";}</script>
<script>$(function(){for(i=0;i<3;i++){u($($('.s_ipt_wr')[i]),$($('.s_ipt')[i]),$($('.s_btn_wr')[i]),$($('.s_btn')[i]));}function u(iptwr,ipt,btnwr,btn){if(iptwr && ipt){iptwr.on('mouseover',function(){iptwr.addClass('ipthover');}).on('mouseout',function(){iptwr.removeClass('ipthover');}).on('click',function(){ipt.focus();});ipt.on('focus',function(){iptwr.addClass('iptfocus');}).on('blur',function(){iptwr.removeClass('iptfocus');}).on('render',function(e){var $s = iptwr.parent().find('.bdsug');var l = $s.find('li').length;if(l>=5){$s.addClass('bdsugbg');}else{$s.removeClass('bdsugbg');}});}if(btnwr && btn){btnwr.on('mouseover',function(){btn.addClass('btnhover');}).on('mouseout',function(){btn.removeClass('btnhover');});}}});</script>
<script type="text/javascript" src="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/home/js/bri_7f1fa703.js"></script>
<script>(function(){var _init=false;window.initIndex=function(){if(_init){return;}_init=true;var w=window,d=document,n=navigator,k=d.f1.wd,a=d.getElementById("nv").getElementsByTagName("a"),isIE=n.userAgent.indexOf("MSIE")!=-1&&!window.opera;(function(){if(/q=([^&]+)/.test(location.search)){k.value=decodeURIComponent(RegExp["\x241"])}})();(function(){var u = G("u1").getElementsByTagName("a"), nv = G("nv").getElementsByTagName("a"), lk = G("lk").getElementsByTagName("a"), un = "";var tj_nv = ["news","tieba","zhidao","mp3","img","video","map"];var tj_lk = ["baike","wenku","hao123","more"];un = bds.comm.user == "" ? "" : bds.comm.user;function _addTJ(obj){addEV(obj, "mousedown", function(e){var e = e || window.event;var target = e.target || e.srcElement;if(target.name){ns_c({'fm':'behs','tab':target.name,'un':encodeURIComponent(un)});}});}for(var i = 0; i < u.length; i++){_addTJ(u[i]);}for(var i = 0; i < nv.length; i++){nv[i].name = 'tj_' + tj_nv[i];}for(var i = 0; i < lk.length; i++){lk[i].name = 'tj_' + tj_lk[i];}})();(function() {var links = {'tj_news': ['word', 'http://news.baidu.com/ns?tn=news&cl=2&rn=20&ct=1&ie=utf-8'],'tj_tieba': ['kw', 'http://tieba.baidu.com/f?ie=utf-8'],'tj_zhidao': ['word', 'http://zhidao.baidu.com/search?pn=0&rn=10&lm=0'],'tj_mp3': ['key', 'http://music.baidu.com/search?fr=ps&ie=utf-8'],'tj_img': ['word', 'http://image.baidu.com/i?ct=201326592&cl=2&nc=1&lm=-1&st=-1&tn=baiduimage&istype=2&fm=&pv=&z=0&ie=utf-8'],'tj_video': ['word', 'http://video.baidu.com/v?ct=301989888&s=25&ie=utf-8'],'tj_map': ['wd', 'http://map.baidu.com/?newmap=1&ie=utf-8&s=s'],'tj_baike': ['word', 'http://baike.baidu.com/search/word?pic=1&sug=1&enc=utf8'],'tj_wenku': ['word', 'http://wenku.baidu.com/search?ie=utf-8']};var domArr = [G('nv'), G('lk'),G('cp')],kw = G('kw1');for (var i = 0, l = domArr.length; i < l; i++) {domArr[i].onmousedown = function(e) {e = e || window.event;var target = e.target || e.srcElement,name = target.getAttribute('name'),items = 
links[name],reg = new RegExp('^\\s+|\\s+\x24'),key = kw.value.replace(reg, '');if (items) {if (key.length > 0) {var wd = items[0], url = items[1],url = url + ( name === 'tj_map' ? encodeURIComponent('&' + wd + '=' + key) : ( ( url.indexOf('?') > 0 ? '&' : '?' ) + wd + '=' + encodeURIComponent(key) ) );target.href = url;} else {target.href = target.href.match(new RegExp('^http:\/\/.+\.baidu\.com'))[0];}}name && ns_c({'fm': 'behs','tab': name,'query': encodeURIComponent(key),'un': encodeURIComponent(bds.comm.user || '') });};}})();};if(window.pageState==0){initIndex();}})();document.cookie = 'IS_STATIC=1;expires=' + new Date(new Date().getTime() + 10*60*1000).toGMTString();</script>
</body></html>
Process finished with exit code 0
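Tip: press Ctrl+F in the console to open the search box and search the output, for example for "百度" (Baidu).
We can also open the page in a browser and search there:
To change the GET request into a POST request:
Code:
import urllib3
# Create a PoolManager, which handles the details of connection pooling and thread safety
http = urllib3.PoolManager()
# Send a POST request; the fields dict is sent as the form data
response = http.request('POST', 'http://httpbin.org/post', fields={'word': 'hello'})
print(response.data.decode())  # decode the response bytes and print the text
Running the program prints the following to the console: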
D:\python\config-pathclass\python.exe D:/python/project/src/request/demo02.py
{
"args": {},
"data": "",
"files": {},
"form": {
"word": "hello"
},
"headers": {
"Accept-Encoding": "identity",
"Content-Length": "128",
"Content-Type": "multipart/form-data; boundary=a2950c6846999c6261f759a5d1eea77c",
"Host": "httpbin.org"
},
"json": null,
"origin": "120.229.158.147, 120.229.158.147",
"url": "https://httpbin.org/post"
}
Process finished with exit code 0
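The Content-Type in the output above is multipart/form-data with a Content-Length of 128, whereas the earlier urllib example sent application/x-www-form-urlencoded with a Content-Length of 10. The sketch below uses only the standard library and builds both bodies by hand, purely for illustration; it suggests where the two lengths come from. The helper names urlencoded_body and multipart_body are made up for this example, and urllib3's real encoder handles many cases (files, non-ASCII names) that this ignores.

```python
from urllib.parse import urlencode


def urlencoded_body(fields):
    """Encode form fields as application/x-www-form-urlencoded bytes."""
    return urlencode(fields).encode('utf-8')


def multipart_body(fields, boundary):
    """Encode plain text form fields as a multipart/form-data body.

    Illustrative only: each part is a boundary line, one header line,
    a blank line, then the value. Lines are separated by CRLF.
    """
    lines = []
    for name, value in fields.items():
        lines.append('--' + boundary)
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append('')          # blank line between part headers and value
        lines.append(value)
    lines.append('--' + boundary + '--')  # closing boundary
    lines.append('')              # body ends with a final CRLF
    return '\r\n'.join(lines).encode('utf-8')


fields = {'word': 'hello'}
# Boundary copied from the Content-Type header in the output above
boundary = 'a2950c6846999c6261f759a5d1eea77c'

print(urlencoded_body(fields), len(urlencoded_body(fields)))
print(len(multipart_body(fields, boundary)))
```

If the URL-encoded form is preferred with urllib3, its request method accepts encode_multipart=False alongside fields, according to the urllib3 documentation for request_encode_body.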
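The requests module
Features: chunked file uploads, connection timeouts, automatic decompression, automatic content decoding, chunked requests
Website: http://www.python-requests.org/en/master/
Steps:
First, install requests from a cmd window:
pip install requests
Code:
import requests  # import the module
response = requests.get('http://www.baidu.com')
print('状态码', response.status_code)  # print the status code
print('请求url', response.url)  # print the request URL
print('头部信息', response.headers)  # print the response headers
print('cookie信息', response.cookies)  # print the cookies
print('文本源码,', response.text)  # print the page source as text
print('字节流源码', response.content)  # print the page source as bytes
Running the program prints the following to the console: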
D:\python\config-pathclass\python.exe D:/python/project/src/request/demo03.py
状态码 200
请求url http://www.baidu.com/
头部信息 {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 17 Dec 2019 08:03:22 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:12 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
cookie信息 <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
文本源码, <!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ–°é—»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产å“</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å
³äºŽç™¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度å‰å¿
读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>æ„è§å馈</a> 京ICPè¯030173å· <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
字节流源码 b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>\xe6\x96\xb0\xe9\x97\xbb</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>\xe5\x9c\xb0\xe5\x9b\xbe</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>\xe8\xa7\x86\xe9\xa2\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>\xe8\xb4\xb4\xe5\x90\xa7</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>\xe7\x99\xbb\xe5\xbd\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">\xe7\x99\xbb\xe5\xbd\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">\xe6\x9b\xb4\xe5\xa4\x9a\xe4\xba\xa7\xe5\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>\xe5\x85\xb3\xe4\xba\x8e\xe7\x99\xbe\xe5\xba\xa6</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>\xe4\xbd\xbf\xe7\x94\xa8\xe7\x99\xbe\xe5\xba\xa6\xe5\x89\x8d\xe5\xbf\x85\xe8\xaf\xbb</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>\xe6\x84\x8f\xe8\xa7\x81\xe5\x8f\x8d\xe9\xa6\x88</a> \xe4\xba\xacICP\xe8\xaf\x81030173\xe5\x8f\xb7 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
Process finished with exit code 0
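To choose the request method (GET or POST), call the matching function; for a POST request:
The code is as follows:
import requests
data = {'word': 'hello'}  # form parameters
# Send a POST request with the form data
response = requests.post('http://httpbin.org/post', data=data)
print('状态码', response.status_code)  # print the status code
print('字节流源码', response.content)  # print the response body as bytes
Running the program prints the following to the console: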
D:\python\config-pathclass\python.exe D:/python/project/src/request/demo03.py
状态码 200
字节流源码 b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.22.0"\n }, \n "json": null, \n "origin": "120.229.158.147, 120.229.158.147", \n "url": "https://httpbin.org/post"\n}\n'
Process finished with exit code 0
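2. Handling request headers
Test site for network requests: http://www.whatismyip.com
Code:
import requests  # import the requests module
url = 'http://www.whatismyip.com'  # URL to request
response = requests.get(url)  # send the request
print(response.content.decode('utf-8'))  # decode the response body as UTF-8 and print it
Running the program prints the following to the console:
Open the site in a browser to inspect the page:
Code:
import requests  # import the requests module
url = 'http://www.whatismyip.com'  # URL to request
# Send a browser User-Agent header so the request looks like it comes from Chrome
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
response = requests.get(url, headers=headers)  # send the request with the custom headers
print(response.content.decode('utf-8'))  # decode the response body as UTF-8 and print it
Running the program prints the following to the console:
To verify the result:
Copy some text from the page of the test site:
Then press Ctrl+F in the running console and search for the copied text: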
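The httpbin output earlier shows requests identifying itself as python-requests/2.22.0 by default, which is why a browser User-Agent is supplied here. As a minimal offline sketch, the standard library's urllib.request.Request (swapped in for requests so that nothing is sent over the network) can be used to inspect which User-Agent header a request object carries. The constant names URL and BROWSER_UA are illustrative.

```python
import urllib.request

URL = 'http://www.whatismyip.com'
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/77.0.3865.120 Safari/537.36')

# A request object built without custom headers carries no User-Agent yet;
# urllib only adds its own default when the request is actually opened.
default_request = urllib.request.Request(URL)

# A request object built with a browser User-Agent
browser_request = urllib.request.Request(URL, headers={'User-Agent': BROWSER_UA})

# urllib stores header names in capitalized form, hence 'User-agent'
print(default_request.has_header('User-agent'))
print(browser_request.get_header('User-agent'))
```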
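3. Network timeouts
Example 1:
import requests
# Send the request 49 times (range(1, 50) yields 1 through 49)
for a in range(1, 50):
    try:  # catch any exception raised by the request
        # Set the timeout to 0.1 seconds
        response = requests.get('https://www.baidu.com/', timeout=0.1)
        print(response.status_code)  # print the status code
    except Exception as e:
        print('异常' + str(e))  # print the exception message ('异常' means "exception")
Running the program prints the following to the console: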
D:\python\config-pathclass\python.exe D:/python/project/src/request/demo05.py
200
200
200
200
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
200
200
200
200
200
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
200
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
200
200
200
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x000001D10926C580>, 'Connection to www.baidu.com timed out. (connect timeout=0.1)'))
200
200
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
200
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
200
200
200
200
异常HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.1)
200
Process finished with exit code 0
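Example 2:
import requests
# Import three exception classes from requests.exceptions
from requests.exceptions import ReadTimeout, HTTPError, RequestException
# Send the request 50 times
for a in range(0, 50):
    try:
        # Set the timeout to 0.1 seconds
        response = requests.get('https://www.baidu.com/', timeout=0.1)
        print(response.status_code)  # print the status code
    except ReadTimeout:  # the server did not send data within the timeout
        print('timeout')
    except HTTPError:  # HTTP error (raised by response.raise_for_status(), which this loop does not call)
        print('httperror')
    except RequestException:  # any other requests error, such as a connect timeout
        print('reqerror')
Running the program prints the following to the console: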
D:\python\config-pathclass\python.exe D:/python/project/src/request/demo06.py
200
200
timeout
200
200
200
200
200
200
200
200
timeout
200
200
timeout
200
200
200
timeout
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
timeout
timeout
timeout
200
200
200
200
200
200
timeout
reqerror
timeout
timeout
timeout
timeout
Process finished with exit code 0
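Both examples simply report a timeout and move on to the next request. A common follow-up is to retry the failed request a few times before giving up. The sketch below is one way to do that under stated assumptions: fetch_with_retries is a hypothetical helper written for this example (not part of requests), and it takes any zero-argument callable so that it can be tried without network access.

```python
import time


def fetch_with_retries(fetch, retries=3, delay=0.0, retry_on=(Exception,)):
    """Call fetch() up to `retries` times and return its first successful result.

    fetch    -- zero-argument callable that performs the request
    retries  -- maximum number of attempts (must be at least 1)
    delay    -- seconds to wait between attempts
    retry_on -- tuple of exception classes that trigger another attempt
    """
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except retry_on as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    raise last_error  # every attempt failed; re-raise the last error


# Simulated request that times out twice and then succeeds
attempts = {'count': 0}


def flaky_request():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise TimeoutError('simulated read timeout')
    return 200


print(fetch_with_retries(flaky_request, retries=3, retry_on=(TimeoutError,)))
```

With requests, the callable could be lambda: requests.get('https://www.baidu.com/', timeout=0.1) and retry_on could be (requests.exceptions.Timeout,), which covers both read and connect timeouts.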
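4. Proxy servers
Home page of a free proxy IP list: http://www.xicidaili.com (shown in the figure):
Code:
import requests
# Proxy IP addresses and ports, keyed by the scheme of the requested URL
proxy = {'http': '223.198.18.140:9999',
         'https': '223.198.1.145:9999'}
# Send the request through the proxy
response = requests.get('http://www.baidu.com/', proxies=proxy)
print(response.content.decode('utf-8'))  # decode the response body as UTF-8 and print the page source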
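The proxy addresses above are written as bare ip:port pairs. The requests documentation writes proxy URLs with an explicit scheme (for example http://10.10.1.10:3128), so a small helper can normalize addresses copied from a proxy list. The sketch below is illustrative: build_proxies is a hypothetical name, and it assumes the proxies themselves speak plain HTTP, which may not hold for every entry on a free list.

```python
def build_proxies(http_proxy, https_proxy=None):
    """Build a proxies dict in the form accepted by the proxies= argument.

    Addresses without a scheme get 'http://' prepended (assumption: the
    proxy server itself is reached over plain HTTP). If no separate HTTPS
    proxy is given, the HTTP proxy is used for both schemes.
    """
    def with_scheme(address):
        return address if '://' in address else 'http://' + address

    return {
        'http': with_scheme(http_proxy),
        'https': with_scheme(https_proxy or http_proxy),
    }


# Addresses taken from the example above
proxy = build_proxies('223.198.18.140:9999', '223.198.1.145:9999')
print(proxy)
```

The resulting dict can be passed as proxies=proxy in the requests.get call shown above. Free proxies go offline frequently, so combining this with a timeout is advisable.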