1. Install beautifulsoup4: in cmd, run pip install beautifulsoup4
Python ships with urllib, a built-in module for working with network connections; BeautifulSoup is used to parse the HTML.

Verify the installation succeeded:
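One way to check from a script (a minimal sketch using only the standard library, so it works whether or not the install succeeded):

```python
# Check whether beautifulsoup4 (imported as "bs4") is importable,
# without raising if it is missing.
import importlib.util

spec = importlib.util.find_spec("bs4")
if spec is not None:
    print("beautifulsoup4 is installed")
else:
    print("beautifulsoup4 is NOT installed; run: pip install beautifulsoup4")
```

Alternatively, `pip show beautifulsoup4` prints the installed version from the command line.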


2. PyCharm configuration

(Screenshots of the PyCharm configuration steps.)

3. The code:

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        # Download the page and parse it with the built-in html.parser.
        r = urllib.request.urlopen(self.site)
        html = r.read()
        sp = BeautifulSoup(html, "html.parser")
        # Print every link whose URL contains "html".
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)


news = "http://news.baidu.com/"
Scraper(news).scrape()
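Since fetching the live page depends on the network, the same link-extraction logic can be tried offline on a small in-memory HTML snippet (hypothetical example data):

```python
from bs4 import BeautifulSoup

# A tiny in-memory page standing in for the downloaded HTML (made-up data).
sample_html = """
<html><body>
  <a href="https://example.com/story1.html">Story 1</a>
  <a>no href here</a>
  <a href="/about">About</a>
  <a href="https://example.com/story2.html">Story 2</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
links = []
for tag in soup.find_all("a"):
    url = tag.get("href")
    # Skip anchors without an href, and keep only URLs containing "html",
    # mirroring the filter used in the scraper above.
    if url and "html" in url:
        links.append(url)

print(links)  # ['https://example.com/story1.html', 'https://example.com/story2.html']
```

This makes it easy to see that anchors with no href attribute (where tag.get("href") returns None) are skipped rather than crashing the loop.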


4. Running it prints the links scraped from Baidu News.


5. How do we save the scraped links to a file?

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        response = urllib.request.urlopen(self.site)
        html = response.read()
        soup = BeautifulSoup(html, 'html.parser')
        # Write each matching link to output.txt, one URL per line.
        with open("output.txt", "w") as f:
            for tag in soup.find_all('a'):
                url = tag.get('href')
                if url and 'html' in url:
                    print("\n" + url)
                    f.write(url + "\n")


Scraper('http://news.baidu.com/').scrape()