Python 3 Web Crawler Development in Practice PDF (Cui Qingcai); Web Scraping with Python, 2nd Edition

Reposted

编程小匠人之魂 2024-08-31 21:32:58

Tags: python, html, Python. Category: Python, back-end development

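Chapter 7: Reading Documents

Document encoding

A document's encoding tells a program how to read the document.
The encoding can usually be inferred from the file extension.
Every document is ultimately encoded as 0s and 1s; the encoding algorithm defines things such as "how many bits per character" or "how many bits represent each pixel's color value".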
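As a rough illustration of "how many bits per character", the sketch below encodes a few characters with Python's built-in codecs and counts the bits each one occupies. The sample characters and the helper name bits_per_char are assumptions chosen for illustration.

```python
def bits_per_char(char, encoding):
    """Return how many bits `char` occupies when encoded with `encoding`."""
    return len(char.encode(encoding)) * 8


# ASCII letters fit in one byte under UTF-8; other characters need more.
for char in ['a', 'é', '中']:
    print(char, bits_per_char(char, 'utf-8'))

# UTF-32 uses four bytes for every character (the "-le" variant omits the BOM).
print(bits_per_char('a', 'utf-32-le'))
```

UTF-8 is variable-width, which is why the same codec reports different sizes for different characters.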

Plain text

from urllib.request import urlopen

textPage = urlopen('http://www.pythonscraping.com/'
                   'pages/warandpeace/chapter1.txt')
print(textPage.read())

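A brief introduction to text encoding types

Websites usually declare their encoding
<meta charset="utf-8" />

ASCII
UTF-8 (Universal Character Set - Transformation Format 8 bit)

The Unicode Consortium

ISO standards: a separate encoding for each language (used by roughly 9% of websites)

ISO-8859-1: designed for the Latin alphabet
ISO-8859-2: German and similar languages
ISO-8859-9: Turkish
ISO-8859-15: French and similar languages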
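To see why the declared encoding matters, here is a small sketch that decodes the same byte under two ISO-8859 encodings and then decodes Latin-1 bytes with the wrong codec. The byte value and variable names are assumptions chosen for illustration.

```python
raw = bytes([0xA4])  # a single byte, value 164

# The same byte decodes to different characters under different ISO-8859 parts.
as_latin1 = raw.decode('iso-8859-1')   # generic currency sign
as_latin9 = raw.decode('iso-8859-15')  # euro sign
print(as_latin1, as_latin9)

# Decoding with the wrong codec can fail; errors='replace' substitutes
# U+FFFD for bytes that are not valid in the target encoding.
broken = 'é'.encode('iso-8859-1').decode('utf-8', errors='replace')
print(repr(broken))
```

When a page does not declare its encoding, guessing wrong produces exactly this kind of silent substitution.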

Explicitly decoding the bytes as UTF-8

from urllib.request import urlopen

textPage = urlopen('http://www.pythonscraping.com/'
                   'pages/warandpeace/chapter1-ru.txt')
print(str(textPage.read(), 'utf-8'))

Encoding a document as UTF-8 with BeautifulSoup and Python 3.x

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id': 'mw-content-text'}).get_text()
content = bytes(content, 'UTF-8')
content = content.decode('UTF-8')

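CSV

Python standard library
https://docs.python.org/3/library/csv.html

Python's csv library is geared mainly toward local files, which assumes the CSV file is stored on your machine. There are three ways around this:

Download the CSV file by hand, then point Python at its location
Write a Python program that downloads the file, reads it, and then deletes the source file
Read the file from the web straight into a string, then wrap it in a StringIO object so that it behaves like a file

The third approach keeps the CSV in memory: read it into a string, wrap it in a StringIO object, and let Python treat it as a file, which avoids saving it first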

from urllib.request import urlopen
from io import StringIO
import csv

data = urlopen('http://pythonscraping.com/files/MontyPythonAlbums.csv')\
    .read().decode('ascii', 'ignore')
dataFile = StringIO(data)
csvReader = csv.reader(dataFile)

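for row in csvReader:
    # print(row)
    print('The album "' + row[0] + '" was released in ' + str(row[1]))

# Creating, processing, and printing a DictReader takes slightly longer
# than a plain csv.reader
# csvReader has already consumed dataFile, so rewind before reading again
dataFile.seek(0)
dictReader = csv.DictReader(dataFile)
print(dictReader.fieldnames)

for row in dictReader:
    print(row)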
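The in-memory approach can be tried without a network connection. The sketch below is a minimal example that assumes a made-up CSV string in place of the downloaded file; the column names and rows are hypothetical.

```python
import csv
from io import StringIO

# Hypothetical CSV text standing in for a downloaded file.
data = 'Name,Year\nAlbum One,1970\nAlbum Two,1971\n'
dataFile = StringIO(data)

rows = list(csv.reader(dataFile))
print(rows)

# A reader consumes the file object, so rewind before reading it again.
dataFile.seek(0)
records = list(csv.DictReader(dataFile))
print(records[0]['Name'], records[0]['Year'])
```

csv.reader yields each row as a list (header included), while csv.DictReader uses the first row as field names and yields one dictionary per remaining row.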

PDF

PDF (Portable Document Format)

The example uses the PDFMiner3K library

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open


def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    process_pdf(rsrcmgr, device, pdfFile)
    device.close()

    content = retstr.getvalue()
    retstr.close()
    return content


# pdfFile = open('../chapter1.pdf', 'rb')
pdfFile = urlopen('http://pythonscraping.com/'
                  'pages/warandpeace/chapter1.pdf')
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()

Microsoft Word and .docx

The .doc format is a binary file format, whereas .docx is based on the Office Open XML standard (a ZIP archive of XML files)

from zipfile import ZipFile
from urllib.request import urlopen
from io import BytesIO
from bs4 import BeautifulSoup

wordFile = urlopen('http://pythonscraping.com/pages/AWordDocument.docx').read()
wordFile = BytesIO(wordFile)
document = ZipFile(wordFile)
xml_content = document.read('word/document.xml')

wordObj = BeautifulSoup(xml_content.decode('utf-8'), 'xml')
textStrings = wordObj.find_all('w:t')

for textElem in textStrings:
    style = textElem.parent.parent.find('w:pStyle')
    if style is not None and style['w:val'] == 'Title':
        print('Title is: {}'.format(textElem.text))
    else:
        print(textElem.text)
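Because a .docx file is a ZIP archive containing XML, the same read path can be exercised offline with the standard library alone. This is a minimal sketch: the XML below is a heavily simplified, hypothetical stand-in for word/document.xml (not a complete Word document), and it uses xml.etree.ElementTree in place of BeautifulSoup.

```python
from io import BytesIO
from zipfile import ZipFile
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hypothetical, heavily simplified document.xml (not a complete .docx).
xml_body = (
    '<w:document xmlns:w="{ns}"><w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>World</w:t></w:r></w:p>'
    '</w:body></w:document>'
).format(ns=W_NS)

# Build the archive in memory, the same shape urlopen(...).read() would give.
buffer = BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', xml_body)

# Read it back the way the scraper does: unzip, then parse the XML.
document = ZipFile(BytesIO(buffer.getvalue()))
root = ET.fromstring(document.read('word/document.xml'))
texts = [node.text for node in root.iter('{%s}t' % W_NS)]
print(texts)
```

ElementTree addresses namespaced tags as {namespace}tag, which is why the w:t elements are looked up through W_NS.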

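Chapter 8: Cleaning Data

Cleaning in code

Linguistics has a model called the n-gram: a sequence of n consecutive words in text or speech.
In natural language analysis, splitting text into n-grams, or looking for commonly used phrases, makes it easy to break a sentence into fragments.
This section focuses on obtaining well-formed n-grams; Chapter 9 then uses 2-grams and 3-grams for text summarization and analysis.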
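As a quick illustration of the definition, here is a minimal sketch that lists the 2-grams and 3-grams of one short sentence. The function name ngrams and the sample sentence are assumptions for illustration.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text."""
    words = text.split()
    return [words[i:i + n] for i in range(len(words) - n + 1)]


sentence = 'the quick brown fox'
print(ngrams(sentence, 2))
print(ngrams(sentence, 3))
```

A text of k words yields k - n + 1 n-grams, and none at all when it has fewer than n words.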

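First pass

The getNgrams function splits the input string into a sequence of words, then appends to the output one n-word list (a 2-gram in this example) starting at each word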

from urllib.request import urlopen
from bs4 import BeautifulSoup


def getNgrams(content, n):
    content = content.split(' ')
    output = []

    for i in range(len(content) - n + 1):
        output.append(content[i:i + n])

    return output


html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id': 'mw-content-text'}).get_text()
ngrams = getNgrams(content, 2)
print(ngrams)
print('2-grams count is: ' + str(len(ngrams)))

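Second pass

Use a regular expression to remove escape characters (such as \n) and citation markers such as [1], then filter out non-ASCII Unicode characters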

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


def getNgrams(content, n):
    # Replace newlines and citation markers such as [12] with spaces
    content = re.sub(r'\n|\[\d+\]', ' ', content)
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')
    content = content.split(' ')
    content = [word for word in content if word != '']
    output = []

    for i in range(len(content) - n + 1):
        output.append(content[i:i + n])

    return output


html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id': 'mw-content-text'}).get_text()
ngrams = getNgrams(content, 2)
print(ngrams)
print('2-grams count is: ' + str(len(ngrams)))
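When stripping citation markers with a regular expression, it matters whether the brackets are escaped or used to form a character class. The sketch below compares the two on a made-up sentence; the sample text and variable names are assumptions.

```python
import re

sample = 'Python 3 was released in 2008.[12]\nIt is widely used.'

# Escaped brackets match a whole citation marker such as [12].
citation_only = re.sub(r'\n|\[\d+\]', ' ', sample)

# A character class matches one character at a time, so every digit,
# bracket, and plus sign is replaced as well.
char_class = re.sub(r'\n|[\[\d+\]]', ' ', sample)

print(citation_only)
print(char_class)
```

The first pattern keeps "3" and "2008" while dropping "[12]"; the second also erases every digit in the text, which changes the resulting n-grams.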

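Third pass

cleanInput()
Removes all newlines and citation markers
Introduces the notion of a "sentence", splitting on a period followed by a space, to avoid meaningless pairs such as ['management', 'It'] that span two sentences
cleanSentence()
Splits a sentence into words and strips punctuation and whitespace
Drops single-character words other than I and a
The n-gram creation code moves into getNgramsFromSentence(), which getNgrams() calls once per sentence
This guarantees that no n-gram is created across a sentence boundary
string.punctuation and string.whitespace supply all the characters Python treats as (ASCII) punctuation and whitespace
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.whitespace
' \t\n\r\x0b\x0c'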

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string


def cleanSentence(sentence):
    sentence = sentence.split(' ')
    sentence = [word.strip(string.punctuation + string.whitespace)
                for word in sentence]
    sentence = [word for word in sentence if len(word) > 1
                or (word.lower() == 'a' or word.lower() == 'i')]

    return sentence


def cleanInput(content):
    # Replace newlines and citation markers such as [12] with spaces
    content = re.sub(r'\n|\[\d+\]', ' ', content)
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')
    sentences = content.split('. ')

    return [cleanSentence(sentence) for sentence in sentences]


def getNgramsFromSentence(content, n):
    output = []

    for i in range(len(content) - n + 1):
        output.append(content[i:i + n])

    return output


def getNgrams(content, n):
    content = cleanInput(content)
    ngrams = []

    for sentence in content:
        ngrams.extend(getNgramsFromSentence(sentence, n))

    return ngrams


html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id': 'mw-content-text'}).get_text()
ngrams = getNgrams(content, 2)
print(ngrams)
print('2-grams count is: ' + str(len(ngrams)))

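Data normalization

Normalization ensures that strings which are linguistically or logically equivalent (such as "Python" and "python") are treated as the same item after cleaning

The example tallies the cleaned n-grams with a Counter object from the collections library
A plain dictionary would also work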

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
from collections import Counter


def cleanSentence(sentence):
    sentence = sentence.split(' ')
    sentence = [word.strip(string.punctuation + string.whitespace)
                for word in sentence]
    sentence = [word for word in sentence if len(word) > 1
                or (word.lower() == 'a' or word.lower() == 'i')]

    return sentence


def cleanInput(content):
    content = content.upper()
    # Replace newlines and citation markers such as [12] with spaces
    content = re.sub(r'\n|\[\d+\]', ' ', content)
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')
    sentences = content.split('. ')

    return [cleanSentence(sentence) for sentence in sentences]


def getNgramsFromSentence(content, n):
    output = []

    for i in range(len(content) - n + 1):
        output.append(content[i:i + n])

    return output


def getNgrams(content, n):
    content = cleanInput(content)
    ngrams = Counter()

    for sentence in content:
        newNgrams = [' '.join(ngram) for ngram in
                     getNgramsFromSentence(sentence, n)]
        ngrams.update(newNgrams)

    return ngrams


html = urlopen('http://en.wikipedia.org/wiki/Python_(programming_language)')
bs = BeautifulSoup(html, 'html.parser')
content = bs.find('div', {'id': 'mw-content-text'}).get_text()
ngrams = getNgrams(content, 2)
print(ngrams)
print('2-grams count is: ' + str(len(ngrams)))
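To make the Counter-versus-dictionary remark concrete, the following sketch tallies a small, made-up list of already normalized 2-grams both ways and checks that the results agree. The sample strings are hypothetical.

```python
from collections import Counter

ngrams = ['PYTHON SOFTWARE', 'SOFTWARE FOUNDATION', 'PYTHON SOFTWARE']

# Counter tallies repeated items in one call.
counted = Counter(ngrams)

# The same tally with a plain dictionary.
tally = {}
for ngram in ngrams:
    tally[ngram] = tally.get(ngram, 0) + 1

print(counted.most_common(1))
print(tally == dict(counted))
```

Counter also provides most_common(), which is convenient for listing the most frequent n-grams without sorting by hand.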

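Cleaning after storage (to be filled in when needed)

OpenRefine

Chapter 9: Natural Language Processing

Summarizing data

Filter out n-grams that contain "meaningless" common words, such as "of the" and "in the", using a list of frequently used words
The example does this with isCommon()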
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
from collections import Counter


def isCommon(ngram):
    commonWords = ['THE', 'BE', 'AND', 'OF', 'A', 'IN', 'TO', 'HAVE', 'IT', 'I', 'THAT', 'FOR', 'YOU', 'HE', 'WITH', 'ON', 'DO', 'SAY', 'THIS', 'THEY', 'IS', 'AN', 'AT', 'BUT', 'WE', 'HIS', 'FROM', 'THAT', 'NOT', 'BY', 'SHE', 'OR', 'AS', 'WHAT', 'GO', 'THEIR', 'CAN', 'WHO', 'GET', 'IF', 'WOULD', 'HER', 'ALL', 'MY', 'MAKE', 'ABOUT', 'KNOW', 'WILL', 'AS', 'UP', 'ONE', 'TIME', 'HAS', 'BEEN', 'THERE', 'YEAR', 'SO', 'THINK', 'WHEN', 'WHICH', 'THEM', 'SOME', 'ME', 'PEOPLE', 'TAKE', 'OUT', 'INTO', 'JUST', 'SEE', 'HIM', 'YOUR', 'COME', 'COULD', 'NOW', 'THAN', 'LIKE', 'OTHER', 'HOW', 'THEN', 'ITS', 'OUR', 'TWO', 'MORE', 'THESE', 'WANT', 'WAY', 'LOOK', 'FIRST', 'ALSO', 'NEW', 'BECAUSE', 'DAY', 'MORE', 'USE', 'NO', 'MAN', 'FIND', 'HERE', 'THING', 'GIVE', 'MANY', 'WELL']
    for word in ngram:
        if word in commonWords:
            return True
    return False


def cleanSentence(sentence):
    sentence = sentence.split(' ')
    sentence = [word.strip(string.punctuation + string.whitespace) for word in sentence]
    sentence = [word for word in sentence if len(word) > 1 or (word.lower() == 'a' or word.lower() == 'i')]
    return sentence


def cleanInput(content):
    content = content.upper()
    content = re.sub('\n', ' ', content)
    content = bytes(content, 'UTF-8')
    content = content.decode('ascii', 'ignore')
    sentences = content.split('. ')
    return [cleanSentence(sentence) for sentence in sentences]


def getNgramsFromSentence(content, n):
    output = []
    for i in range(len(content)-n+1):
        if not isCommon(content[i:i+n]):
            output.append(content[i:i+n])
    return output


def getNgrams(content, n):
    content = cleanInput(content)
    ngrams = Counter()
    ngrams_list = []
    for sentence in content:
        newNgrams = [' '.join(ngram) for ngram in getNgramsFromSentence(sentence, n)]
        ngrams_list.extend(newNgrams)
        ngrams.update(newNgrams)
    return (ngrams)


content = str(
    urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read(),
    'utf-8')
ngrams = getNgrams(content, 3)
print(ngrams)

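Markov models

Markov text generators

Entertainment
Generating realistic-looking spam to fool detection systems

Properties of a Markov model

The probabilities on all edges leaving a node must sum to exactly 100%
Only the current node's state influences the next node's state
Some nodes can be harder to reach than others (the probabilities pointing to them sum to less than 100%)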
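These three properties can be checked on a tiny transition table. The sketch below assumes a hypothetical three-state weather model; the state names and probabilities are made up for illustration.

```python
import random

# Hypothetical weather model: each row lists the probabilities of the
# next state given the current one.
transitions = {
    'sunny':  {'sunny': 0.7, 'cloudy': 0.2, 'rainy': 0.1},
    'cloudy': {'sunny': 0.3, 'cloudy': 0.4, 'rainy': 0.3},
    'rainy':  {'sunny': 0.2, 'cloudy': 0.3, 'rainy': 0.5},
}

# Outgoing probabilities from every node sum to 1 (that is, 100%).
outgoing = {state: sum(row.values()) for state, row in transitions.items()}

# Incoming probabilities need not sum to 1; lower totals mean the node
# is harder to reach.
incoming = {state: sum(row[state] for row in transitions.values())
            for state in transitions}


def next_state(current):
    """Pick the next state using only the current state's row."""
    row = transitions[current]
    return random.choices(list(row), weights=list(row.values()))[0]


state = 'sunny'
walk = [state]
for _ in range(5):
    state = next_state(state)
    walk.append(state)

print(outgoing)
print(incoming)
print(walk)
```

next_state() looks only at the current state's row, which is the "memoryless" property from the list above.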

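Example

buildWordDict() cleans and normalizes the text and builds a nested dictionary of word-to-next-word counts
retrieveRandomWord() picks a random word, weighted by the word frequencies in the dictionary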

from urllib.request import urlopen
from random import randint


def wordListSum(wordList):
    sum = 0
    for word, value in wordList.items():
        sum += value
    return sum


def retrieveRandomWord(wordList):
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word


def buildWordDict(text):
    # Remove newlines and quotes
    text = text.replace('\n', ' ')
    text = text.replace('"', '')

    # Make sure each punctuation mark is treated as its own "word",
    # so it is kept in the Markov chain rather than dropped
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, ' {} '.format(symbol))

    words = text.split(' ')
    # Filter out empty words
    words = [word for word in words if word != '']

    wordDict = {}
    for i in range(1, len(words)):
        if words[i - 1] not in wordDict:
            # Create a new dictionary for this word
            wordDict[words[i - 1]] = {}
        if words[i] not in wordDict[words[i - 1]]:
            wordDict[words[i - 1]][words[i]] = 0
        wordDict[words[i - 1]][words[i]] += 1
    return wordDict


text = str(urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt')
           .read(), 'utf-8')
wordDict = buildWordDict(text)

# Generate a Markov chain of length 100
length = 100
chain = ['I']
for i in range(0, length):
    newWord = retrieveRandomWord(wordDict[chain[-1]])
    chain.append(newWord)

print(' '.join(chain))
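The same idea can be tried offline on a single sentence. This is a minimal sketch, not the book's implementation: it swaps the manual weighted draw for random.choices from the standard library, and the function names and sample text are assumptions.

```python
import random


def build_word_dict(text):
    """Map each word to a dict of {following word: count}."""
    words = text.split()
    word_dict = {}
    for previous, current in zip(words, words[1:]):
        followers = word_dict.setdefault(previous, {})
        followers[current] = followers.get(current, 0) + 1
    return word_dict


def pick_next(followers):
    """Choose a follower at random, weighted by how often it occurred."""
    return random.choices(list(followers), weights=list(followers.values()))[0]


word_dict = build_word_dict('the cat sat on the mat and the cat ran')
print(word_dict['the'])

chain = ['the']
# Stop early if the last word never had a follower in the text.
while len(chain) < 8 and chain[-1] in word_dict:
    chain.append(pick_next(word_dict[chain[-1]]))
print(' '.join(chain))
```

The guard on chain[-1] matters for short texts: a word that only appears at the very end has no entry in the dictionary, so looking it up directly would raise a KeyError.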

Six Degrees of Wikipedia

BFS (breadth-first search)
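A breadth-first search finds the shortest chain of links between two pages because it explores every page one click away before any page two clicks away. The sketch below assumes a small, hypothetical in-memory link graph instead of live Wikipedia pages.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
links = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D', 'E'],
    'D': ['F'],
    'E': ['F'],
    'F': [],
}


def shortest_path(graph, start, target):
    """Breadth-first search; returns the shortest list of pages or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        for neighbor in graph.get(page, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path(links, 'A', 'F'))
```

The visited set keeps the search from revisiting pages, which is essential on a real link graph where cycles are common.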

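Natural Language Toolkit (to be read later)

Natural Language Toolkit (NLTK)

Installation and setup

Statistical analysis with NLTK

Part-of-speech analysis with NLTK

Chapter 10: Crawling Through Forms and Logins

Python Requests library (go straight to the Requests documentation)

The Requests library excels at handling complex HTTP requests, cookies, headers, and the like
Docs: https://docs.python-requests.org/zh_CN/latest/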

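Chapter 11: Scraping JavaScript

Ajax and dynamic HTML

Ajax (Asynchronous JavaScript and XML)
Dynamic HTML (DHTML)

Executing JavaScript in Python with Selenium

Selenium was originally developed for automated website testing
Today it is also widely used when an accurate snapshot of a page is needed, because the site runs directly in a real browser

Selenium can make a browser load a site automatically, retrieve the data you need, take screenshots, or assert that certain actions happened on the site

Selenium does not ship with a browser; it must be integrated with a third-party browser to run

PhantomJS is a headless browser

It loads the site into memory and executes the page's JavaScript, but never renders a graphical interface for the user
PhantomJS must be downloaded from its official site and installed before use

The Selenium library is an API called on a WebDriver object

The book's example uses PhantomJS, which has since been deprecated

PhantomJS download: https://phantomjs.org/download.html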

from selenium import webdriver
import time

driver = webdriver.PhantomJS(executable_path='[Path]\\PhantomJS.exe')
driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
time.sleep(3)
print(driver.find_element_by_id('content').text)
driver.close()

Switching to headless Chrome

Check your Chrome version: chrome://version/
ChromeDriver download: http://npm.taobao.org/mirrors/chromedriver/

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

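# Written for Selenium 3; newer Selenium 4 releases drop executable_path
# and the find_element_by_* helpers in favor of find_element(By.ID, ...)
chrome_options = Options()
chrome_options.headless = True
driver = webdriver.Chrome(executable_path='[Path]\\chromedriver.exe', options=chrome_options)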
driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
time.sleep(3)
print(driver.find_element_by_id('content').text)
driver.close()

Selenium selectors

Official documentation: https://www.selenium.dev/documentation/en/

Selecting a single element

driver.find_element_by_css_selector('#content')
driver.find_element_by_tag_name('div')

Selecting multiple elements

driver.find_elements_by_css_selector('#content')
driver.find_elements_by_tag_name('div')

Parsing the page content with BeautifulSoup

pageSource = driver.page_source
bs = BeautifulSoup(pageSource, 'html.parser')
print(bs.find(id='content').get_text())

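Implicit waits

(The book uses the term "implicit wait" here; Selenium's own documentation calls a WebDriverWait an explicit wait and reserves "implicit wait" for driver.implicitly_wait().)

WebDriverWait
expected_conditions

Define the expected condition that ends the wait (in the example, the presence of the element with ID 'loadedButton')
The element is specified with a locator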

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.headless = True
driver = webdriver.Chrome(executable_path='[Path]\\chromedriver.exe', options=options)
driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'loadedButton'))
    )
finally:
    print(driver.find_element_by_id('content').text)
    driver.close()

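Selenium locators

A locator is an abstract query language, expressed with the By object

Basic locator usage (the locator is passed as a single tuple)

expected_conditions.presence_of_element_located((By.ID, 'loadedButton'))

Locators can also be passed to WebDriver's find_element() to create selectors; the two calls below are equivalent

print(driver.find_element(By.ID, 'content').text)
print(driver.find_element_by_id('content').text)

Selection strategies available through the By object

ID: by the HTML id attribute
CLASS_NAME: by the HTML class attribute
CSS_SELECTOR: by CSS class, id, or tag name, written as .className, #idName, or tagName
LINK_TEXT: finds HTML <a> tags by the text they contain
PARTIAL_LINK_TEXT: like LINK_TEXT, but matches on part of the link text
NAME: finds HTML tags by their name attribute
TAG_NAME: finds HTML tags by their tag name
XPATH: selects matching elements with an XPath expression

XPath syntax

BeautifulSoup does not support XPath
Libraries such as Selenium and Scrapy do
Four key concepts in XPath syntax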

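Root nodes versus non-root nodes

/div selects the div node only if it is the root of the document
//div selects every div node in the document (including non-root ones)

Attribute selection

//@href selects every node that has an href attribute
//a[@href='http://google.com'] selects every link in the document that points to Google

Selecting nodes by position

//a[3] selects the third link in the document
//table[last()] selects the last table in the document
//a[position() < 3] selects the first 2 links in the document

Asterisks (*) match any set of characters or nodes and can be used in a variety of situations

//table/tr/* selects all children of tr tags in all tables
//div[@*] selects all div tags that have any attribute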
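Several of these expressions can be tried without a browser. The sketch below uses xml.etree.ElementTree from the standard library, which supports only a limited subset of XPath (no position() comparisons and no //@href, for example), so it is an approximation rather than a full XPath engine. The page fragment is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment.
page = ET.fromstring(
    '<html><body>'
    '<div id="nav"><a href="http://google.com">G</a><a>no href</a></div>'
    '<div><a href="/about">About</a></div>'
    '<table><tr><td>1</td><td>2</td></tr></table>'
    '<table><tr><td>3</td></tr></table>'
    '</body></html>'
)

all_divs = page.findall('.//div')                        # like //div
with_href = page.findall('.//a[@href]')                  # links that have href
google = page.findall(".//a[@href='http://google.com']")  # by attribute value
last_table = page.findall('.//table[last()]')            # like //table[last()]
cells = page.findall('.//table/tr/*')                    # like //table/tr/*

print(len(all_divs), len(with_href), len(google), len(last_table), len(cells))
```

ElementTree paths are relative to the element they are called on, hence the leading "." before "//".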

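Handling redirects

Server-side redirects

Handled by Python's urllib library, which follows them automatically

Client-side redirects

Page jumps performed by JavaScript running in the browser, as opposed to redirects the server performs before it sends the page content
They require a tool that can execute JavaScript, such as Selenium

The main problem: how to tell that a page has finished redirecting

"Watch" an element in the DOM when the page first loads, then keep polling it until Selenium throws a StaleElementReferenceException, which means the element is no longer attached to the page's DOM and the page has redirected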

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import StaleElementReferenceException
import time


def waitForLoad(driver):
    elem = driver.find_element_by_tag_name('html')
    count = 0
    while True:
        count += 1
        if count > 20:
            print('Timing out after 10 seconds and returning')
            return

        time.sleep(.5)
        try:
            elem == driver.find_element_by_tag_name('html')
        except StaleElementReferenceException:
            return


options = Options()
options.headless = True
driver = webdriver.Chrome(executable_path='[Path]\\chromedriver.exe', options=options)
driver.get('http://pythonscraping.com/pages/javascript/redirectDemo1.html')
waitForLoad(driver)
print(driver.page_source)

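Alternatively, loop over the current page's URL until it changes or matches the specific URL you are looking for. The example below takes a related approach and waits until specific text appears in the page body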

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = Options()
options.headless = True
driver = webdriver.Chrome(executable_path='[Path]\\chromedriver.exe', options=options)
driver.get('http://pythonscraping.com/pages/javascript/redirectDemo1.html')
try:
    bodyElement = WebDriverWait(driver, 15).until(EC.presence_of_element_located(
        (By.XPATH, '//body[contains(text(), "This is the page you are looking for!")]')
    ))
    print(bodyElement.text)
except TimeoutException:
    print('Did not find the element')

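Chapter 12: Crawling Through APIs

API overview

API responses are usually formatted as JSON or XML

HTTP methods and APIs

GET

http://[URL]?[parameter=xxx]
By definition, a GET request makes no changes to the information in the server's database

Nothing is stored and nothing is modified; information is only read

POST

Sends form contents or other submitted information to a back-end program on the web server

PUT

Rarely used when interacting with websites, but used from time to time in APIs
A PUT request updates an object or piece of information

DELETE

A DELETE request deletes an object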
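The four methods can be compared by building, but not sending, requests with urllib from the standard library. In this sketch the endpoint http://example.com/api/items and its parameters are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

base = 'http://example.com/api/items'   # hypothetical endpoint

# GET: parameters travel in the URL and nothing is modified.
get_req = Request(base + '?' + urlencode({'id': 42}))

# POST: the body carries form data to the back end.
post_req = Request(base, data=urlencode({'name': 'spam'}).encode('ascii'))

# PUT and DELETE have to be named explicitly.
put_req = Request(base + '/42', data=b'{"name": "eggs"}', method='PUT')
delete_req = Request(base + '/42', method='DELETE')

for req in (get_req, post_req, put_req, delete_req):
    print(req.get_method(), req.full_url)
```

urllib infers GET when there is no body and POST when data is supplied; any other method needs the method argument. Passing one of these objects to urlopen() would actually send it.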

Parsing JSON

Python's JSON-parsing library

import json

jsonString = '{"arrayOfNums":[{"num": 0}, {"num": 1}, {"num": 2}],' \
             '"arrayOfFruits":[{"fruit": "apple"}, {"fruit": "banana"}, {"fruit": "pear"}]}'
jsonObj = json.loads(jsonString)

print(jsonObj.get('arrayOfNums'))
print(jsonObj.get('arrayOfNums')[1])
print(jsonObj.get('arrayOfNums')[1].get('num') +
      jsonObj.get('arrayOfNums')[2].get('num'))
print(jsonObj.get('arrayOfFruits')[2].get('fruit'))

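Undocumented APIs

The duties of a web server for a dynamic website

Handle GET requests from users requesting a page of the site
Retrieve from the database the data that appears on that page
Format the data into the HTML template for the page
Send the formatted HTML to the user

As JavaScript frameworks have become ubiquitous, many of the HTML-creation tasks once handled by the server have moved to the browser

Because the whole content management system has essentially moved to the browser side, the content size and number of HTTP requests for a site have ballooned
With Selenium, all the "extras" the user does not need get loaded as well

Because servers no longer format the data into HTML, they often act as a thin wrapper around the database itself

This thin wrapper simply extracts data from the database and returns it to the page through an API

Finding and documenting undocumented APIs

Typical features of API calls

They often contain JSON or XML. You can filter the request list with the search/filter field
With GET requests, the URL contains the parameter values passed to them

This is useful when looking for an API call that returns search results or loads data for a specific page
Simply filter the results by the search term, page ID, or other identifying information you used

They are usually of type XHR

Each API call can be identified and documented by paying attention to the following fields

HTTP method used
Inputs

Path parameters
Headers (including cookies)
Body content (for PUT and POST calls)

Outputs

Response headers (including cookies set)
Response body type
Response body fields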
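One way to keep such notes consistent is a small record type with one attribute per field in the list above. This is only a sketch: the class name ApiCall, the helper record_call, and the example URL are all hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit, parse_qs


@dataclass
class ApiCall:
    """One observed API call, with the fields worth writing down."""
    method: str
    path: str
    query: dict = field(default_factory=dict)
    request_headers: dict = field(default_factory=dict)
    body: str = ''
    response_type: str = ''
    response_fields: list = field(default_factory=list)


def record_call(method, url, response_type, response_fields):
    """Split a captured URL into its path and query parameters."""
    parts = urlsplit(url)
    return ApiCall(method=method, path=parts.path,
                   query=parse_qs(parts.query),
                   response_type=response_type,
                   response_fields=response_fields)


# Hypothetical call as it might appear in a browser's network tab.
call = record_call('GET', 'http://example.com/api/search?q=python&page=2',
                   'application/json', ['results', 'total'])
print(call.path, call.query)
```

parse_qs returns each parameter as a list because a query string may repeat the same key.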

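Finding and documenting APIs automatically (to be filled in when needed)

https://github.com/REMitchell/apiscraper

Combining APIs with other data sources

Combine two or more data sources in a novel way
Use an API as a tool to look at scraped data from a new perspective

Recursively collect the IP addresses of anonymous Wikipedia editors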

from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
import random
import re
import time

# Seed from system entropy; random.seed() rejects datetime objects
# on Python 3.11 and later
random.seed()


def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id': 'bodyContent'}).findAll('a',
                                                         href=re.compile('^(/wiki/)((?!:).)*$'))


def getHistoryIPs(pageUrl):
    # Edit-history page URLs have the form:
    # http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
    pageUrl = pageUrl.replace('/wiki/', '')
    historyUrl = 'http://en.wikipedia.org/w/index.php?title={}&action=history'.format(pageUrl)
    print('history url is: {}'.format(historyUrl))
    html = urlopen(historyUrl)
    bs = BeautifulSoup(html, 'html.parser')
    # Find the links whose class attribute is "mw-anonuserlink";
    # they show an IP address in place of a username
    ipAddresses = bs.findAll('a', {'class': 'mw-anonuserlink'})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList


# The service no longer returns valid JSON, so this function cannot be used
def getCountry(ipAddress):
    try:
        response = urlopen(
            'http://freegeoip.net/json/{}'.format(ipAddress)).read().decode('utf-8')
    except HTTPError:
        return None
    time.sleep(3)
    responseJson = json.loads(response)
    return responseJson.get('country_code')


links = getLinks('/wiki/Python_(programming_language)')

while len(links) > 0:
    for link in links:
        print('-' * 20)
        historyIPs = getHistoryIPs(link.attrs['href'])
        for historyIP in historyIPs:
            print(historyIP)

    newLink = links[random.randint(0, len(links) - 1)].attrs['href']
    links = getLinks(newLink)

# while len(links) > 0:
#     for link in links:
#         print('-' * 20)
#         historyIPs = getHistoryIPs(link.attrs["href"])
#         for historyIP in historyIPs:
#             country = getCountry(historyIP)
#             if country is not None:
#                 print('{} is from {}'.format(historyIP, country))
#
#     newLink = links[random.randint(0, len(links) - 1)].attrs['href']
#     links = getLinks(newLink)

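Chapter 13: Image Processing and Text Recognition (to be filled in when needed)

Converting images into text is called optical character recognition (OCR)

Overview of OCR libraries

Pillow handles the first step, cleaning and filtering the image
Tesseract then tries to match the shapes in the image against the text stored in its library

Pillow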

from PIL import Image, ImageFilter

kitten = Image.open('kitten.jpg')
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save('kitten_blurred.jpg')
blurryKitten.show()

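Tesseract

Tesseract is highly accurate and flexible. It can be trained to recognize any font, and it can recognize any Unicode character

Install Tesseract
Install pytesseract

NumPy

Used for training Tesseract to recognize images, and for simple math tasks (such as weighted averages)

Processing well-formatted text

Adjusting images automatically

Scraping text from images on websites

Reading CAPTCHAs and training Tesseract

Retrieving CAPTCHAs and submitting solutions

Chapter 14: Avoiding Scraping Traps

Looking like a human

Adjusting your headers

Test which browser properties are visible to the server: https://www.whatismybrowser.com/

Handling cookies with JavaScript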

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(
    executable_path='drivers/chromedriver', 
    chrome_options=chrome_options)
driver.get('http://pythonscraping.com')
driver.implicitly_wait(1)

savedCookies = driver.get_cookies()
print(savedCookies)

driver2 = webdriver.Chrome(
    executable_path='drivers/chromedriver',
    chrome_options=chrome_options)

driver2.get('http://pythonscraping.com')
driver2.delete_all_cookies()
for cookie in savedCookies:
    driver2.add_cookie(cookie)

driver2.get('http://pythonscraping.com')
driver2.implicitly_wait(1)
print(driver2.get_cookies())

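Common form security features

Hidden input field values

In an HTML form, a "hidden" field makes its value visible to the browser but invisible to the user (unless they view the page source)

Using hidden fields to block scrapers

A field on the form page can be populated with a random value generated by the server

The best way around this: first scrape the random value generated on the form page, then submit it along with the form to the processing page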
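The scrape-then-submit step can be sketched with the standard library alone. The form HTML, the field names token and email, and the class name HiddenFieldParser are hypothetical; in practice the HTML would be fetched from the form page first and the payload posted to the processing page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden':
            self.hidden[attrs.get('name')] = attrs.get('value', '')


# Hypothetical form page; in practice this is the HTML fetched first.
form_html = '''
<form action="/process" method="post">
  <input type="hidden" name="token" value="a1b2c3" />
  <input type="text" name="email" />
</form>
'''

parser = HiddenFieldParser()
parser.feed(form_html)

# Send the scraped token back together with the visible fields.
payload = dict(parser.hidden, email='user@example.com')
print(urlencode(payload))
```

Because the token changes on every page load, the form page has to be fetched again before each submission.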

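"Honeypot": the form contains a hidden field with an innocuous name. The server disregards any value submitted in a hidden field, so a bot that fills one in gives itself away

Avoiding honeypots

Three different ways of hiding elements

<a href="http://xxx" style="display:none"> xxx </a>
<input type="hidden" name="xxx" />
<style> .customHidden { position:absolute; right:500000px; } </style>

Use Selenium's is_displayed() to determine whether an element is visible on the page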

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window
chrome_options = Options()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(
    executable_path='drivers/chromedriver',
    chrome_options=chrome_options)
driver.get('http://pythonscraping.com/pages/itsatrap.html')
links = driver.find_elements_by_tag_name('a')
for link in links:
    if not link.is_displayed():
        print('The link {} is a trap'.format(link.get_attribute('href')))

fields = driver.find_elements_by_tag_name('input')
for field in fields:
    if not field.is_displayed():
        print('Do not change value of {}'.format(field.get_attribute('name')))

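Chapter 15: Testing Your Website with Scrapers (to be read later)

Chapter 16: Web Crawling in Parallel (to be read later)

Chapter 17: Scraping Remotely (to be read later)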
