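XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well for searching HTML documents.
Official documentation: https://www.w3.org/TR/xpath/
Common XPath rules: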
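nodename    selects the child nodes of the current node that have this name
/    selects direct children of the current node (at the start of an expression, starts from the document root)
//    selects descendants of the current node at any depth (at the start of an expression, searches the whole document)
.    selects the current node
..    selects the parent of the current node
@    selects attributes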
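The rules above can be tried out on a tiny document. The following is a minimal sketch, assuming lxml is installed; the `SAMPLE` markup and the variable names are made up for illustration only.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
SAMPLE = (
    '<div id="box"><ul>'
    '<li class="a"><a href="x.html">X</a></li>'
    '<li class="b">Y</li>'
    '</ul></div>'
)

root = etree.HTML(SAMPLE)  # lxml wraps the fragment in <html><body>

all_li = root.xpath('//li')          # //  : li elements anywhere in the document
direct_li = root.xpath('//ul/li')    # /   : direct li children of each ul
ul = root.xpath('//ul')[0]
named = ul.xpath('li')               # nodename : children of ul named li
first_a = root.xpath('//li/a')[0]
current = first_a.xpath('.')         # .   : the current node itself
parent = first_a.xpath('..')         # ..  : the parent of the current node
hrefs = root.xpath('//a/@href')      # @   : attribute values

print(len(all_li), len(direct_li), len(named))
print(current[0].tag, parent[0].tag, hrefs)
```

Calling `xpath()` on an element (here `ul` and `first_a`) evaluates relative expressions against that element, which is why `li`, `.` and `..` behave as the table describes.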
A first example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import etree


def test1():
    content = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
    # etree.HTML() parses the string and returns the root element, which supports XPath queries.
    # The HTML parser also repairs the markup: it closes the last, unclosed <li>
    # and wraps the fragment in <html> and <body>.
    html = etree.HTML(content)
    # tostring() returns bytes, so decode before printing
    result = etree.tostring(html)
    print(result.decode('utf-8'))


def test2():
    # Parse an HTML file from disk instead of a string
    html = etree.parse('test.html', etree.HTMLParser())
    result = etree.tostring(html)
    print(result.decode('utf-8'))
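The examples that follow read `test.html`, whose contents are not shown here. The sketch below assumes that file holds the same markup as in `test1()` and writes it to disk so the later functions have something to parse; `SAMPLE_HTML` and `write_sample` are hypothetical names introduced for this illustration.

```python
from pathlib import Path

from lxml import etree

# Assumption: test.html contains the same markup used in test1().
SAMPLE_HTML = '''<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''


def write_sample(path='test.html'):
    """Write the sample markup to `path` and return the path."""
    Path(path).write_text(SAMPLE_HTML, encoding='utf-8')
    return path


path = write_sample()
tree = etree.parse(path, etree.HTMLParser())
li_count = len(tree.xpath('//li'))
last_text = tree.xpath('//li[last()]/a/text()')
print(li_count, last_text)
```

If your own `test.html` differs, the printed results of the later examples will differ accordingly.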


# Detailed walkthrough ----------------------------------------
# All nodes: an XPath expression starting with // selects every matching node in the document
def demo1():
    html = etree.parse('test.html', etree.HTMLParser())
    # * matches nodes of any name
    result = html.xpath('//*')
    print(result)


# Only the li nodes
# To select every li node, write // followed by the node name and pass the expression to xpath()
def demo2():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//li')
    print(result)


# Child nodes: select the a nodes that are direct children of li nodes
def demo3():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//li/a')
    print(result)


# Example: select every a node that is a descendant of a ul node
def demo4():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//ul//a')
    print(result)


# Parent node: use .. to reach the parent
# Example: get the class attribute of the parent of the a node whose href is link4.html
def demo5():
    # Approach 1: the .. shorthand
    # html = etree.parse('test.html', etree.HTMLParser())
    # result = html.xpath("//a[@href='link4.html']/../@class")
    # Approach 2: the parent:: axis
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
    print(result)
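

# Attribute matching: use @ inside a predicate to filter nodes by attribute
def demo6():
    # Note: the previous example read an attribute value with /@class;
    # here [@class='...'] filters nodes and returns the matching elements.
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath("//li[@class='item-0']")
    print(result)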
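

# Getting text: the XPath text() node test returns the text inside a node
# Example: get the text under the li nodes
def demo7():
    html = etree.parse('test.html', etree.HTMLParser())
    # /text() directly on li returns only the text that is a direct child of li,
    # not the text inside the nested a node:
    # result = html.xpath('//li[@class="item-0"]/text()')
    # print(result)
    # To get the content of the a nodes instead:
    # Method 1: step into a first, then take its text
    result = html.xpath('//li[@class="item-0"]/a/text()')
    print(result)
    # Method 2: //text() returns all descendant text,
    # which may include whitespace-only nodes such as newlines
    result = html.xpath('//li[@class="item-0"]//text()')
    print(result)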
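The difference between the two methods shows up most clearly when a node has stray whitespace. Here is a small sketch of one way to clean the result of `//text()`, using inline markup rather than `test.html`; the variable names are illustrative, and `normalize-space()` is a standard XPath 1.0 function.

```python
from lxml import etree

html = etree.HTML('''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
''')

# //text() returns every descendant text node, whitespace included
raw = html.xpath('//li[@class="item-0"]//text()')

# Drop whitespace-only entries and trim the rest in Python
cleaned = [t.strip() for t in raw if t.strip()]

# normalize-space() returns the trimmed string value of a single node
last_text = html.xpath('normalize-space(//li[@class="item-0"][last()])')

print(raw)
print(cleaned)
print(last_text)
```

Method 1 gives clean results when the structure is known in advance; method 2 plus a cleanup step is more convenient when the text may sit at varying depths.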


# Getting attributes
# Example: get the href attribute of every a node under the li nodes
def demo8():
    html = etree.parse('test.html', etree.HTMLParser())
    result = html.xpath('//li/a/@href')
    print(result)
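

# Matching an attribute with multiple values
# When an attribute holds several values, [@class="li"] no longer matches; use contains() instead
# Syntax: contains(@attribute_name, value), which is a substring test
def demo9():
    text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
    html = etree.HTML(text)
    result = html.xpath('//li[contains(@class,"li")]/a/text()')
    print(result)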
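Because `contains()` tests for a substring, it also matches class names that merely include the text, such as `list-item`. The sketch below contrasts it with a common XPath 1.0 idiom that matches a whole class token; the markup is made up for illustration.

```python
from lxml import etree

html = etree.HTML('''
<ul>
<li class="li li-first"><a href="a.html">first item</a></li>
<li class="list-item"><a href="b.html">other item</a></li>
</ul>
''')

# Substring match: "list-item" also contains "li"
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Token match: pad the class list with spaces and look for " li "
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```

The longer expression is only needed when class names in the page can overlap like this; for the single-element example above, plain `contains()` is enough.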


# Matching on multiple attributes
# When a node can only be identified by several attributes together, join the conditions with the and operator
def demo10():
    text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
    html = etree.HTML(text)
    result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
    print(result)


# Selecting by position
def demo11():
    text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
    html = etree.HTML(text)
    # The first li node; XPath positions start at 1, not 0
    result = html.xpath('//li[1]/a/text()')
    print(result)
    # The last li node
    result = html.xpath('//li[last()]/a/text()')
    print(result)
    # li nodes whose position is less than 3, i.e. the first two
    result = html.xpath('//li[position()<3]/a/text()')
    print(result)
    # last() is the last node, so last()-2 is the third from the end
    result = html.xpath('//li[last()-2]/a/text()')
    print(result)


# Selecting along node axes
def demo12():
    text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
    html = etree.HTML(text)
    # ancestor::* selects all ancestors of the first li
    result = html.xpath('//li[1]/ancestor::*')
    print(result)
    # ancestor::div keeps only the ancestors named div
    result = html.xpath('//li[1]/ancestor::div')
    print(result)
    # attribute::* returns all attribute values of the node
    result = html.xpath('//li[1]/attribute::*')
    print(result)
    # child:: selects direct children, here the a node whose href is link1.html
    result = html.xpath('//li[1]/child::a[@href="link1.html"]')
    print(result)
    # descendant:: selects descendants at any depth, here only the span nodes
    result = html.xpath('//li[1]/descendant::span')
    print(result)
    # following:: selects every node after the current node in document order;
    # [2] keeps only the second of them
    result = html.xpath('//li[1]/following::*[2]')
    print(result)
    # following-sibling:: selects all later siblings of the current node
    result = html.xpath('//li[1]/following-sibling::*')
    print(result)
    # For more on axes, see: http://www.w3school.com.cn/xpath/xpath_axes.asp
XPath operators: besides and, XPath provides many other operators, such as or and mod. They are summarized here:
http://www.w3school.com.cn/xpath/xpath_operators.asp
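A few of these operators in action, as a minimal sketch on the same five-item list used earlier; the variable names are illustrative only.

```python
from lxml import etree

html = etree.HTML('''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
''')

# or: either condition may hold
either = html.xpath('//li[@class="item-0" or @class="item-inactive"]/a/text()')

# mod: remainder of a division, here used to pick odd positions
odd = html.xpath('//li[position() mod 2 = 1]/a/text()')

# >=: numeric comparison on the position
tail = html.xpath('//li[position() >= 4]/a/text()')

# |: union of two node sets, returned in document order
ends = html.xpath('//li[1]/a/@href | //li[last()]/a/@href')

print(either)
print(odd)
print(tail)
print(ends)
```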
That covers XPath for now; more content will be added later.