
Wealth is the most impartial thing between heaven and earth, merely lent to human hands: rain batters the fading blossoms, wind sweeps the drifting clouds. It simply turns in cycles, the poor becoming rich and the rich becoming poor. Where is the family that stays wealthy for a hundred generations?




Preface




This reproduces a walkthrough I found online, adding my own understanding and notes on the small problems I ran into along the way, as a way to consolidate the basics. Link to the original writeup:

https://bbs.ichunqiu.com/thread-40908-1-1.html




The target



[Screenshots: the directory page listing every city in the country, and a city page listing all of its universities]

As shown above, the directory lists every city in the country, and each city's link leads to information on all of that city's universities. The goal of this exercise is to collect the university names under every city and export them to files, one file per city, named after the city. The final result looks like this:

[Screenshot: the exported .txt files, one per city]




The process




01

Inspecting the elements

Press F12 to open the developer tools and inspect the source. The whole table sits under a <table> tag, each row of information under a <tr> tag, and each cell under a <td> tag; the school name we want is in the second <td> of each <tr>.

[Screenshot: developer tools showing the table markup]
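
To make that structure concrete, here is a minimal, self-contained sketch; the HTML string is a hand-made stand-in for the real page (an assumption, not copied markup), showing how the second <td> of each <tr height="29"> row carries the name we want:

from bs4 import BeautifulSoup

# Hand-made stand-in for the page's table (assumption: the real table has
# more columns, but the same <tr height="29"> / <td> shape).
html = """
<table>
  <tr height="29"><td>1</td><td>Peking University</td><td>Beijing</td></tr>
  <tr height="29"><td colspan="7">Beijing</td></tr>
  <tr height="29"><td>2</td><td>Tsinghua University</td><td>Beijing</td></tr>
</table>
"""

soup = BeautifulSoup(html, "lxml")
for row in soup.find_all(name="tr", attrs={"height": "29"}):
    cells = row.find_all(name="td")
    if len(cells) > 1:
        print(cells[1].string)  # the second <td>: the school name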




02




Fetching the page source




① Python source




#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import lxml
from bs4 import BeautifulSoup
import io, sys
# from bs4 import BeautifulSoup as bf  # rename the imported BeautifulSoup class to bf

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # force the output encoding to gb18030

def School():
    url = "https://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html"
    r = requests.get(url=url)
    soup = BeautifulSoup(r.content, "lxml")  # parse the page content with BeautifulSoup into a document object
    print(soup)

if __name__ == '__main__':
    School()




② Result




[Screenshot: the raw HTML of the page printed to the console]




One statement in the source above sets Python's default output encoding to gb18030:




sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')




Note that without this statement to set Python's default encoding, the script throws an error like the following:




UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 4400: illegal multibyte sequence




For this (class of) problem: (1) UnicodeEncodeError -> the failure happens while encoding Unicode; (2) 'gbk' codec can't encode character -> it happens while encoding Unicode characters to GBK, so the most likely cause is that the Unicode string contains characters that GBK cannot represent; (3) print() is itself limited and cannot emit every Unicode character, because Python's default console encoding here is not 'utf-8'; changing the default encoding to 'gb18030' (a superset of GBK that covers all of Unicode) resolves it.
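
A minimal reproduction of the problem and the fix (this assumes a console whose default encoding is GBK, e.g. a Chinese-locale Windows terminal; on a UTF-8 terminal the unwrapped print succeeds anyway):

import io, sys

# '\xa0' is a non-breaking space with no GBK encoding, so on a GBK console
# print('\xa0') raises the UnicodeEncodeError shown above.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
print('now safe:', '\xa0')  # gb18030 can encode every Unicode character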




03




Extracting the key information from a single page




Now that we can fetch the page source, the next step is to narrow the scope of what we grab. Based on the HTML analysis above, use find_all() to search for <tr> tags to locate each row of the table, and search for <td> tags to locate each cell.

① Python source




#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import lxml
import io, sys
from bs4 import BeautifulSoup
# from bs4 import BeautifulSoup as bs

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # force the output encoding to gb18030

def school():
    url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")  # parse the page content into a BeautifulSoup document object
    content = soup.find_all(name="tr", attrs={"height": "29"})  # find every <tr> tag with the attribute height="29"
    for content1 in content:
        soup_content = BeautifulSoup(str(content1), "lxml")
        soup_content1 = soup_content.find_all(name="td")  # find every <td> tag in the row
        print(soup_content1[1])  # take the second <td> tag

if __name__ == '__main__':
    school()

② Result




The output shows an error, list index out of range, but there is still one line of output before it. In other words, the error below appears when the loop reaches the second <tr> tag.

[Screenshot: console output ending in IndexError: list index out of range]




③ Inspecting the elements again




Opening the developer tools and inspecting the elements again shows that, apart from the second <tr> tag (the second row), every <tr> tag contains 7 <td> tags (7 cells), so the error occurs when the for loop reaches the second <tr>. The fix here is Python's exception handling: catch the error and keep running.

[Screenshot: developer tools showing the second row's single <td colspan="7"> cell]
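
A small sketch of why that row breaks the loop (the one-cell row below is a hand-made stand-in mirroring the page's city-name row, not copied markup):

from bs4 import BeautifulSoup

# The city-name row spans the whole table with a single cell, so the row
# has only one <td>, and indexing [1] fails.
row_html = '<table><tr height="29"><td colspan="7">Beijing</td></tr></table>'
cells = BeautifulSoup(row_html, "lxml").find_all(name="td")
try:
    print(cells[1])  # raises IndexError: list index out of range
except IndexError:
    pass             # skip this row and keep looping, as in the fix below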




④ Corrected Python source




#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import lxml
import io, sys
from bs4 import BeautifulSoup
# from bs4 import BeautifulSoup as bs

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # force the output encoding to gb18030

def school():
    url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")  # parse the page content into a BeautifulSoup document object
    content = soup.find_all(name="tr", attrs={"height": "29"})  # find every <tr> tag with the attribute height="29"
    for content1 in content:
        try:
            soup_content = BeautifulSoup(str(content1), "lxml")
            soup_content1 = soup_content.find_all(name="td")  # find every <td> tag in the row
            print(soup_content1[1])  # take the second <td> tag
        except IndexError:
            pass

if __name__ == '__main__':
    school()




⑤ Result




[Screenshot: the school-name <td> tags printed for every row]




04




Extracting the key information from every page




The above extracts the school names for a single city; next we need to extract the school names for every city. Comparing the links of the city pages shows that only the number at the end of the file name differs, so a for loop can be used to switch between pages.




https://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-2.html
https://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-3.html




The bottom of the page also shows that the city pages to extract range from 2 to 32.




[Screenshot: the pagination links at the bottom of the page]
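
A quick way to sanity-check the URL pattern before fetching anything is to print the generated links (a throwaway sketch, not part of the final script):

# range(2, 34, 1) yields 2, 3, ..., 33
for i in range(2, 34, 1):
    print("https://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % str(i))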




① Python source




#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import lxml
import io, sys
from bs4 import BeautifulSoup

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # force the output encoding to gb18030

def school():
    for i in range(2, 34, 1):
        url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % (str(i))
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "lxml")  # parse the page content into a BeautifulSoup document object
        content = soup.find_all(name="tr", attrs={"height": "29"})  # find every <tr> tag with the attribute height="29"
        for content1 in content:
            try:
                soup_content = BeautifulSoup(str(content1), "lxml")
                soup_content1 = soup_content.find_all(name="td")  # find every <td> tag in the row
                print(soup_content1[1].string)  # get the string inside the second <td> tag
            except IndexError:
                pass

if __name__ == '__main__':
    school()




② Result




The result: every university in every city is printed.




[Screenshot: every city's university names printed to the console]




05




Saving the scraped data locally




Building on the above, save each city's university names into a txt file named after that city, and use the find_all method to search for the <td colspan="7"> tag to locate the city name in the second row.




① Python source




Because the list index out of range error appears again while iterating over the cities, another layer of exception handling is added inside the outer for loop.




#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import lxml
import io, sys
from bs4 import BeautifulSoup

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # force the output encoding to gb18030

def school():
    for i in range(2, 34, 1):
        try:
            url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % (str(i))
            r = requests.get(url)
            soup = BeautifulSoup(r.content, "lxml")  # parse the page content into a BeautifulSoup document object
            filename = soup.find_all(name="td", attrs={"colspan": "7"})[0].string  # get the string of the first <td colspan="7"> tag: the city name
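
The loop body continues by saving what it finds. A minimal sketch of that save logic, reconstructed from this step's description (one txt file per city, named after the city, one school name per line), might look like the following; the exact file-writing details are an assumption, not the original code:

            content = soup.find_all(name="tr", attrs={"height": "29"})  # find every <tr> tag with the attribute height="29"
            # assumption: one file per city, named after the city; utf-8 chosen arbitrarily
            with open("%s.txt" % filename, "w", encoding="utf-8") as f:
                for content1 in content:
                    try:
                        soup_content = BeautifulSoup(str(content1), "lxml")
                        soup_content1 = soup_content.find_all(name="td")  # find every <td> tag in the row
                        f.write(str(soup_content1[1].string) + "\n")  # one school name per line
                    except IndexError:
                        pass
        except IndexError:
            pass  # the outer exception handling described above: skip pages where the city-name lookup fails

if __name__ == '__main__':
    school()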