Scraping GitHub Repository Data with Python: A GitHub Crawler

Reposted

编程小达人之心 2023-11-20 22:30:55

Tags: python, crawler, github, api, request. Category: Python, backend development

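Use the API to find the ten most-starred repositories on GitHub, then log in with POST requests and star them

1. Use the API to find the ten most-starred repositories on GitHub

Use the API that GitHub provides to fetch the ten Python repositories with the most stars

GitHub exposes a public REST API that returns structured JSON, which is far easier to process than scraped HTML. (The full list of endpoints is in the API documentation on GitHub's official site.)

The URL I will use is https://api.github.com/search/repositories?q=language:python&sort=stars. The value after 'language:' is the language to search for, and the value after 'sort=' is the sort key. With the default page size, this URL returns information on the 30 Python repositories with the most stars.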
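The query string can also be assembled from its parts instead of being typed by hand. The sketch below is only an illustration: `build_search_url` is a hypothetical helper that does not appear in the original code, and it relies only on the standard library. `per_page` is the search API's page-size parameter; requesting 10 results avoids downloading the default 30.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://api.github.com/search/repositories'


def build_search_url(language, sort='stars', per_page=30):
    """Build the repository search URL for one language (hypothetical helper)."""
    query = {
        'q': 'language:%s' % language,
        'sort': sort,
        'per_page': per_page,
    }
    # urlencode percent-escapes reserved characters such as ':' and '+'.
    return SEARCH_ENDPOINT + '?' + urlencode(query)


print(build_search_url('python', per_page=10))
```

Escaping matters for languages such as 'c++', where a raw '+' in a query string would otherwise be read as a space.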

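Libraries used

import requests

Fetch the data with a GET request

response = requests.get('https://api.github.com/search/repositories?q=language:python&sort=stars')
# Check whether the request succeeded; on success the status code is 200
if response.status_code != 200:
    print('error: fail to request')

If you open the link above in a browser, you get an unusual page: no interface, just plain text.

Look closely and you will see that the text has the same structure as a Python dictionary: it is JSON. The top level has three keys; the 'items' key holds a list of dictionaries, each with the details of one Python repository.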
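To make that structure concrete without touching the network, here is a minimal sketch that parses a hand-written sample. The three top-level keys match the search response, but the repository names and star counts are made up, and `top_full_names` is a hypothetical helper, not part of the original program.

```python
import json

# Made-up sample in the shape of a search response; real responses
# carry many more fields per repository.
SAMPLE = '''
{
  "total_count": 3,
  "incomplete_results": false,
  "items": [
    {"name": "repo-a", "full_name": "alice/repo-a", "stargazers_count": 300},
    {"name": "repo-b", "full_name": "bob/repo-b", "stargazers_count": 200},
    {"name": "repo-c", "full_name": "carol/repo-c", "stargazers_count": 100}
  ]
}
'''


def top_full_names(payload, n):
    """Return the 'owner/name' strings of the first n entries under 'items'."""
    items = payload.get('items', [])
    # Slicing never raises, even when fewer than n items came back.
    return [repo['full_name'] for repo in items[:n]]


data = json.loads(SAMPLE)
print(sorted(data.keys()))
print(top_full_names(data, 2))
```

Slicing with `items[:n]` rather than indexing `items[i]` means a short result list yields fewer names instead of an IndexError.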


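So we can extract the fields we need directly

# response.json() parses the JSON body into a dict
j = response.json()
# 'items' holds the full details of the top thirty repositories
items = j['items']

# Collect the top ten
message = []
for i, pro in enumerate(items[:10]):
    message.append(pro['full_name'])  # 'owner/name' of the repository
    # Print each one in turn
    print('top%d:' % (i + 1), pro['name'])  # repository name only

Output: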

top1: awesome-python
top2: system-design-primer
top3: models
top4: public-apis
top5: youtube-dl
top6: flask
top7: thefuck
top8: httpie
top9: django
top10: awesome-machine-learning

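Complete code:

def find_top10(language):
    '''
    Print and return the ten most popular repositories.
    language: the language to search for; this program passes 'python'.
    Returns a list of 'owner/name' strings for the top ten repositories,
    or None if the request fails.
    '''
    # Search through the API that GitHub provides
    response = requests.get(
        "https://api.github.com/search/repositories?q=language:%s&sort=stars" % language)

    # Check whether the request succeeded
    if response.status_code != 200:
        print('error! in find_top10(): fail to request')
        return None

    # response.json() parses the JSON body into a dict
    j = response.json()
    # 'items' holds the full details of the top thirty repositories
    items = j['items']

    # Collect the top ten
    message = []
    for i, pro in enumerate(items[:10]):
        message.append(pro['full_name'])
        # Print each one in turn
        print('top%d:' % (i + 1), pro['name'])
    return message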

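2. Log in to GitHub with a POST request

First, let's work out what the POST needs. We have to know which URL to send it to, and which fields (data) to submit.

Open the GitHub login page, try logging in, and watch the request in Chrome (right-click -> Inspect -> Network).

This reveals the URL the login form posts to and the fields (data) it submits.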


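The data includes a field called authenticity_token (Zhihu has one too). It is an anti-CSRF token: the server embeds it in the form so that it can reject submissions that do not send it back. The key question is where to find its value.

The authenticity_token value is usually hidden in the page that precedes the request

Sure enough, the HTML of the login page contains a hidden authenticity_token field (remember this trick).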
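The idea of pulling a hidden field out of a page can be tried offline first. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup, purely so that it runs without extra packages. The HTML is a made-up fragment, not GitHub's real login page, and `TokenFinder` and `find_token` are hypothetical names.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Record the value of the first <input name="authenticity_token">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (self.token is None and tag == 'input'
                and attrs.get('name') == 'authenticity_token'):
            self.token = attrs.get('value')


def find_token(html):
    """Return the hidden token found in html, or None if there is none."""
    finder = TokenFinder()
    finder.feed(html)
    return finder.token


# Made-up fragment shaped like a login form.
SAMPLE_HTML = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
</form>
'''
print(find_token(SAMPLE_HTML))
```

Returning None when the field is missing lets the caller report a clear error, rather than failing on a `.get('value')` call against a missing element.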


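Now log in with what we found. (The authenticity_token must be fetched by the crawler itself; a value copied from the browser will not work, since the token is tied to the session cookies it was issued with.)

Use a GET request to fetch the login page, then take the cookies and the authenticity_token from it

import requests
from bs4 import BeautifulSoup

r1 = requests.get('https://github.com/login')

soup = BeautifulSoup(r1.text, features='lxml')
# Get the authenticity_token
att = soup.find(name='input', attrs={'name': 'authenticity_token'}).get('value')
# Get the cookies
cookie = r1.cookies.get_dict()

Build the data from these values and log in

# username and password hold your GitHub credentials
r2 = requests.post(
    'https://github.com/session',
    data={
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': att,
        'login': username,
        'password': password
    },
    cookies=cookie
)
# The cookies returned after login let later requests act as the logged-in user
login_cookies = r2.cookies.get_dict()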

Complete code:

def login(username, password):
    '''
    Log in to GitHub.
    Takes a username and password, logs in, and returns the cookies
    from the login response (None if a request fails).
    '''
    # GET the login page to obtain the authenticity_token and cookies that login needs
    r1 = requests.get('https://github.com/login')

    if r1.status_code != 200:
        print('error! in login(): fail to request')
        return None

    soup = BeautifulSoup(r1.text, features='lxml')
    # Get the authenticity_token
    att = soup.find(name='input', attrs={
        'name': 'authenticity_token'}).get('value')
    # Get the cookies
    cookie = r1.cookies.get_dict()
    # Log in to GitHub with the values collected above
    r2 = requests.post(
        'https://github.com/session',
        data={
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': att,
            'login': username,
            'password': password
        },
        cookies=cookie
    )

    if r2.status_code != 200:
        print('error! in login(): fail to request')
        return None

    print('login succeeded')
    return r2.cookies.get_dict()

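3. Star the ten most-starred Python repositories with POST requests

We have found the ten repositories and logged in to GitHub successfully.

So what remains is to find out how to star a repository.

I opened a repository at random, 'thefuck'; the Star button is in the top-right corner.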


As with the login, I clicked the button and inspected the request in Chrome, which revealed the URL to POST to and the data to submit.


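The familiar authenticity_token again! It is a nuisance, but we already know how to find it.

Sure enough, the HTML of the repository page contains a hidden authenticity_token.

So we need to GET this page and pull the value out of its HTML.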


Fetch the authenticity_token with a GET request, passing in the cookies obtained after login

get2 = requests.get(
    'https://github.com/nvbn/thefuck',
    cookies=cookie
)
# Update the cookies: keep the login cookies and add any that this response sets
cookie = {**cookie, **get2.cookies.get_dict()}

# Get the authenticity_token
soup2 = BeautifulSoup(get2.text, features='lxml')
# On inspection, the token for starring sits inside the form whose class is 'unstarred js-social-form'
form = soup2.find('form', {'class': 'unstarred js-social-form'})
att2 = form.find(name='input', attrs={
    'name': 'authenticity_token'}).get('value')
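One detail of the cookie update deserves care. In requests, `response.cookies` holds only the cookies that one response set, so assigning it over the login cookies can drop values the server did not send again. A minimal sketch of a safer update follows; `merge_cookies` is a hypothetical helper and the cookie names and values are made up.

```python
def merge_cookies(current, fresh):
    """Return a new dict: everything in current, overridden by fresh."""
    merged = dict(current)
    merged.update(fresh)
    return merged


# Made-up cookie names and values, for illustration only.
login_cookies = {'session': 'aaa', 'logged_in': 'yes'}
page_cookies = {'session': 'bbb'}

print(merge_cookies(login_cookies, page_cookies))
```

The refreshed 'session' value wins, while 'logged_in', which the second response never mentioned, is kept.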

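Submit the POST request to star the repository; once it goes through, the repository is starred

# Submit the POST request; once it is made, the repository is starred
r = requests.post(
    'https://github.com/nvbn/thefuck/star',
    data={
        'utf8': '✓',
        'authenticity_token': att2,
        'context': 'repository'
    },
    cookies=cookie
)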

Complete code:

def get_stared(cookie, name):
    '''
    Use the login cookies to click Star on a project.

    I analysed the HTML of a repository page: the Star button comes with an
    'authenticity_token' and a URL to POST to
    (https://github.com/owner/repo/star). Fetching the 'authenticity_token'
    and POSTing to that URL with the login cookies has the same effect as
    clicking the button.

    So the function only needs the cookies and the 'owner/repo' name.
    '''
    # GET the repository page to obtain the authenticity_token for the star POST
    get2 = requests.get(
        # Build the URL from the 'owner/repo' name passed in
        'https://github.com/%s' % name,
        cookies=cookie
    )
    # Check the request
    if get2.status_code != 200:
        print('error! in get_stared(): get: fail to request')
        return None
    # Keep the login cookies and add any that this response sets
    cookie = {**cookie, **get2.cookies.get_dict()}

    # Get the authenticity_token
    soup2 = BeautifulSoup(get2.text, features='lxml')
    form = soup2.find('form', {'class': 'unstarred js-social-form'})
    att2 = form.find(name='input', attrs={
        'name': 'authenticity_token'}).get('value')

    # Submit the POST request; once it is made, the repository is starred
    r = requests.post(
        'https://github.com/%s/star' % name,
        data={
            'utf8': '✓',
            'authenticity_token': att2,
            'context': 'repository'
        },
        cookies=cookie
    )
    # Check the request. This case is unusual: in my testing, a 400 response
    # meant the star had succeeded
    if r.status_code != 400:
        print('error! in get_stared(): post: fail to request')
        return None

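Full code: with the functions and statements needed to run it

'''
1. Log in to GitHub with a crawler
2. Search for the ten most popular Python repositories (ranked by star count)
3. Star them

Name: 周开颜
Student ID: 2017141461386

Tested: all features work
'''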

import requests
from bs4 import BeautifulSoup

def login(username, password):
    '''
    Log in to GitHub.
    Takes a username and password, logs in, and returns the cookies
    from the login response.
    '''
    # GET the login page to obtain the authenticity_token and cookies that login needs
    r1 = requests.get('https://github.com/login')
    soup = BeautifulSoup(r1.text, features='lxml')

    # Get the authenticity_token
    att = soup.find(name='input', attrs={
        'name': 'authenticity_token'}).get('value')
    # Get the cookies
    cks = r1.cookies.get_dict()

    # Log in to GitHub with the values collected above
    r2 = requests.post(
        'https://github.com/session',
        data={
            'commit': 'Sign in',
            'utf8': '✓',
            'authenticity_token': att,
            'login': username,
            'password': password
        },
        cookies=cks
    )
    print("login succeeded!")
    return r2.cookies.get_dict()


def find_top10(language):
    '''
    Print and return the ten most popular repositories.
    language: the language to search for; this program passes 'python'.
    Returns a list of 'owner/name' strings for the top ten repositories,
    or None if the request fails.
    '''
    # Search through the API that GitHub provides
    response = requests.get(
        "https://api.github.com/search/repositories?q=language:%s&sort=stars" % language)

    # Check whether the request succeeded
    if response.status_code != 200:
        print('error! in find_top10(): fail to request')
        return None

    # response.json() parses the JSON body into a dict
    j = response.json()
    # 'items' holds the full details of the top thirty repositories
    items = j['items']

    # Collect the top ten
    message = []
    for i, pro in enumerate(items[:10]):
        message.append(pro['full_name'])
        # Print each one in turn
        print('top%d:' % (i + 1), pro['name'])
    return message


def get_stared(cookie, name):
    '''
    Use the login cookies to click Star on a project.

    I analysed the HTML of a repository page: the Star button comes with an
    'authenticity_token' and a URL to POST to
    (https://github.com/owner/repo/star). Fetching the 'authenticity_token'
    and POSTing to that URL with the login cookies has the same effect as
    clicking the button.

    So the function only needs the cookies and the 'owner/repo' name.
    '''
    # GET the repository page to obtain the authenticity_token for the star POST
    get2 = requests.get(
        # Build the URL from the 'owner/repo' name passed in
        'https://github.com/%s' % name,
        cookies=cookie
    )
    # Check the request
    if get2.status_code != 200:
        print('error! in get_stared(): get: fail to request')
        return None
    # Keep the login cookies and add any that this response sets
    cookie = {**cookie, **get2.cookies.get_dict()}

    # Get the authenticity_token
    soup2 = BeautifulSoup(get2.text, features='lxml')
    form = soup2.find('form', {'class': 'unstarred js-social-form'})
    att2 = form.find(name='input', attrs={
        'name': 'authenticity_token'}).get('value')

    # Submit the POST request; once it is made, the repository is starred
    r = requests.post(
        'https://github.com/%s/star' % name,
        data={
            'utf8': '✓',
            'authenticity_token': att2,
            'context': 'repository'
        },
        cookies=cookie
    )
    # Check the request. This case is unusual: in my testing, a 400 response
    # meant the star had succeeded
    if r.status_code != 400:
        print('error! in get_stared(): post: fail to request')
        return None
    print(name, 'starred')


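# Main program
if __name__ == '__main__':
    # Log in
    user = input('please input your github id\n')
    ps = input('please input your github password\n')
    coks = login(user, ps)

    # Get the ten most popular Python repositories
    # (find_top10 returns None on failure, so fall back to an empty list)
    messages = find_top10('python') or []

    # Star each of the ten repositories
    for m in messages:
        if isinstance(m, str):
            get_stared(coks, m)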
