Python Study Notes: Introduction to Python Web Scraping, 17-9 BeautifulSoup4

Aimmon 2024-11-26 15:44:42 Category: Python

# CSS Selectors: BeautifulSoup4

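- These notes use BeautifulSoup4
- Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
- Comparison of the common tools for extracting information:
  - Regular expressions: very fast, hard to use, no installation needed (part of the standard library)
  - BeautifulSoup: slow, easy to use, easy to install
  - lxml: fairly fast, easy to use, moderately easy to install
- Example v33.py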

from bs4 import  BeautifulSoup
from urllib import  request

url = "http://www.baidu.com"
rsp = request.urlopen(url)
content = rsp.read()
print(content)  # bytes
soup = BeautifulSoup(content,'lxml')

print(type(soup)) #<class 'bs4.BeautifulSoup'>
print(soup)
print("*"*20)
con = soup.prettify()
print(type(con))  #<class 'str'>
print(con)
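
The example above needs network access and the lxml parser. As a minimal offline sketch of the same steps, the snippet below parses a small inline HTML string instead. The HTML string is made up for illustration, and "html.parser" (the parser bundled with Python) is swapped in for lxml so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, standing in for the downloaded page
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python, so no extra parser package is required
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the document as an indented str
pretty = soup.prettify()

print(type(soup))
print(type(pretty))
print(pretty)
```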

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:
- Tag
- NavigableString
- BeautifulSoup
- Comment
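
To make the four kinds concrete, here is a small sketch that parses a made-up one-line snippet (an assumption for illustration, not part of the original examples) and checks the type of each node.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet: one tag holding a text node and a comment
html = '<p class="intro">Hello<!-- a hidden note --></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                    # the <p> element is a Tag
text_node = p.contents[0]     # "Hello" is a NavigableString
comment_node = p.contents[1]  # the comment is a Comment

print(type(soup).__name__)
print(type(p).__name__)
print(type(text_node).__name__)
print(type(comment_node).__name__)

# Printing a Comment shows only its content, without the <!-- --> markers
print(comment_node)
```

Comment is a subclass of NavigableString, so code that needs to tell ordinary text apart from comments should check for Comment first.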

- Tag
  - Corresponds to a tag in the HTML
  - Can be accessed as soup.tag_name, for example soup.title
  - Two important attributes of a tag:
    - name
    - attrs
  - Example a34

from urllib import  request
from bs4 import  BeautifulSoup
import  re

url = "http://www.baidu.com"
req = request.urlopen(url)
content = req.read()

soup = BeautifulSoup(content,'lxml')

print("**"*20)

tags = soup.find_all(re.compile("^me"))
print(type(tags)) #<class 'bs4.element.ResultSet'>
for tag in tags:
    print(tag)      #<meta content="always" name="referrer"/>

"""
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<meta content="#2932e1" name="theme-color"/>
<meta content="0; url=/baidu.html?from=noscript" http-equiv="refresh"/>
"""
"""
>>>tag.attrs
{'http-equiv': 'content-type', 'content': 'text/html;charset=utf-8'}
{'http-equiv': 'X-UA-Compatible', 'content': 'IE=Edge'}
{'content': 'always', 'name': 'referrer'}
{'name': 'theme-color', 'content': '#2932e1'}
{'http-equiv': 'refresh', 'content': '0; url=/baidu.html?from=noscript'}
"""
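
The name and attrs attributes can also be tried without a network connection. The following is a minimal sketch on a made-up HTML fragment, using the built-in "html.parser" as an assumption in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the <meta> tags printed above
html = '<head><meta content="always" name="referrer"/><title>Demo</title></head>'
soup = BeautifulSoup(html, "html.parser")

meta = soup.meta  # soup.tag_name returns the first tag with that name

print(meta.name)            # the tag's name
print(meta.attrs)           # all attributes as a dict
print(meta["content"])      # a single attribute, dict-style
print(meta.get("charset"))  # .get() returns None for a missing attribute
```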

- NavigableString
  - Corresponds to the text content inside a tag

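- BeautifulSoup
  - Represents the content of a whole document; most of the time it can be treated like a Tag object
  - Usually referred to by the variable name soup
- Comment
  - A special type of NavigableString object
  - When printed, its content does not include the comment markers
- Traversing the document tree
  - contents: the tag's child nodes, returned as a list
  - children: the child nodes, returned as an iterator
  - descendants: all descendant nodes (children, grandchildren, and so on)
  - string: the text of a tag that contains a single string; None when the tag has several children
  - Example 34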
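
The traversal attributes listed above can be sketched on a small made-up fragment. The HTML and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with two levels of nesting
html = "<div><p>one</p><p>two <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# contents: direct children as a list
child_names = [child.name for child in div.contents]

# children: the same direct children, but as an iterator
children_list = list(div.children)

# descendants: every nested node; keep only the Tag objects here
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]

first_p, second_p = div.contents
print(child_names)
print(descendant_tags)
print(first_p.string)   # the single string inside the first <p>
print(second_p.string)  # None, because the second <p> has two children
```
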
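- Searching the document tree
  - find_all(name, attrs, recursive, text, **kwargs)
    - name: what to match tag names against; it accepts
      - a string
      - a regular expression
      - a list
    - keyword arguments: can be used to match attributes
    - text: matches the text content of tags (named string in Beautiful Soup 4.4.0 and later)
  - Example 34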
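
As a rough illustration of these parameters, the sketch below runs find_all against a made-up document. The HTML is hypothetical, and string= is used in place of text= because that is the argument's name from Beautiful Soup 4.4.0 onward.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document for illustration
html = """
<html><head><meta name="referrer" content="always"/><title>Demo</title></head>
<body><a id="home" href="/">Home</a><a class="nav" href="/about">About</a><p>Home</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_string = soup.find_all("a")               # name as a string
by_regex = soup.find_all(re.compile("^me"))  # name as a regular expression
by_list = soup.find_all(["title", "p"])      # name as a list of tag names
by_keyword = soup.find_all(id="home")        # keyword argument matches an attribute
by_class = soup.find_all("a", class_="nav")  # class is reserved in Python, so use class_
by_attrs = soup.find_all(attrs={"name": "referrer"})  # attrs dict for awkward attribute names
by_text = soup.find_all(string="Home")       # matches strings, not tags

# recursive=False looks only at direct children, so <title> inside <head> is not found
direct_only = soup.html.find_all("title", recursive=False)

print([tag.name for tag in by_regex])
print([tag.name for tag in by_list])
print(len(by_text))
print(direct_only)
```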

- CSS selectors
  - Use soup.select, which returns a list
  - By tag name: soup.select("title")
  - By class name: soup.select(".content")
  - By id: soup.select("#name_id")
  - Combined: soup.select("div #input_content")
  - By attribute: soup.select("img[class='photo']")
  - Get a tag's text: tag.get_text()
  - Example 35

from urllib import  request
from bs4 import  BeautifulSoup

url = "http://www.baidu.com"
req = request.urlopen(url)
content = req.read()

soup = BeautifulSoup(content,'lxml')

# print(soup.prettify())
print(soup.select("title")[0].text)  # 百度一下，你就知道 (the Baidu page title)

print(soup.select("#wrapper")[0])

print(soup.select("meta[content='always']")[0])
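
Since the Baidu page can change at any time, here is an offline sketch of the same selector forms on a made-up document. The HTML, ids, and class names are assumptions chosen to mirror the selectors listed above.

```python
from bs4 import BeautifulSoup

# Hypothetical document containing the ids and classes used by the selectors above
html = """
<html><head><title>Demo</title></head>
<body>
<div id="wrapper"><input id="input_content" type="text"/>
<p class="content">First</p><p class="content">Second</p>
<img class="photo" src="a.png"/></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.select("title")[0].get_text()                  # by tag name
by_class = [p.get_text() for p in soup.select(".content")]  # by class name
by_id = soup.select("#wrapper")                             # by id
combined = soup.select("div #input_content")                # descendant combination
by_attr = soup.select("img[class='photo']")                 # by attribute value
first = soup.select_one(".content")                         # first match only, or None

print(title)
print(by_class)
print(by_attr[0]["src"])
```

select always returns a list, so indexing with [0] raises IndexError when nothing matches; select_one returns None in that case instead.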