
An In-depth Twitter Scraping Tutorial

Overview

Sometimes only scraping text and using simple search queries on tweets is not enough. After receiving a fair amount of inquiries about topics not fully covered in my basic Twitter scraping article, I decided it was worth writing a follow-up article that explains the many attributes and methods offered by Tweepy and GetOldTweets3 to further refine one’s search queries for tweets. Hopefully, this article will cover any lingering questions or provide new ideas. Like my previous article, I’ll be using Tweepy, a Python library for accessing Twitter’s API, and Dmitry Mottl’s GetOldTweets3, a Python library to scrape data without requiring API keys.


This tutorial is meant to be a straightforward article that jumps right into the coding, showcasing the data available in the various objects from each library, and the methods to further refine one’s queries when scraping tweets.


Before Continuing / Prerequisites

This article assumes you have familiarity with Python and either Tweepy or GetOldTweets3. If this is your first time scraping with these libraries or you need a refresher, I recommend reading my previous article that covers the basics and setting up.


Before we start, a quick overview of the two libraries I’ll be using in this article.


Tweepy:


  • Can interact with Twitter’s API and create tweets, favorite, etc.
  • Requires signing up to receive credentials
  • Access to deeper tweet information and user information
  • Has limitations with regard to scraping large numbers of tweets and accessing tweets older than a week using the standard search API

GetOldTweets3


  • Only used for scraping tweets
  • No signup required
  • Limited access to user information
  • No restrictions for scraping a large number of tweets or accessing older tweets

Those are the major characteristics of the two libraries; I provide a more detailed comparison in my previous article if you’re interested. If you find yourself wanting Tweepy-level access to information on large amounts of data, I recommend looking into Twitter’s Premium/Enterprise Search APIs. This article does a great job of showcasing a library called searchtweets that allows you to use these APIs with Python. If you find Twitter’s Premium/Enterprise APIs too expensive, there’s a workaround where you can use Tweepy and GetOldTweets3 together, which I discuss at the end of this article.


This article covers different topic niches, and because of that I don’t recommend reading it start to end unless you absolutely want a deep understanding of both Tweepy and GetOldTweets3. Instead, I’ll include a linked table of contents so you can decide which information is relevant to you and jump to the respective section. Also, as you will see, most of my functions are set to scrape 150 tweets. I recommend testing your queries on smaller samples first, then scraping your desired tweet amount, as large queries may take time to finish.


Quick note: if you’re scraping private accounts or previously private accounts, you may find that you cannot access the tweets or may be limited in the number of tweets you can scrape.


If you want to check out my code beforehand, or have something to follow along with while reading each section, I have created an article notebook that follows the code examples presented in this article, and a companion notebook that provides many more examples to better flesh out the code. You can access my Jupyter Notebooks for this tutorial on my GitHub here.


Without further ado let’s jump into the scraping!



Table of Contents

Scraping More With Tweepy

  • Getting More Information From Tweets, e.g. favorites, retweets
  • Getting User Information From Tweets, e.g. follower count, tweet counts
  • Scraping With Advanced Queries, e.g. scraping from specific locations, or in specific languages
  • Putting It All Together

Scraping More With GetOldTweets3

  • Getting More Information From Tweets, e.g. favorites, retweets
  • Getting User Information From Tweets (limited)
  • Scraping With Advanced Queries, e.g. scraping from specific locations, or in specific languages
  • Putting It All Together

How to Use Tweepy With GetOldTweets3

Scraping More With Tweepy

If you want code to follow along with this section, I have an article Jupyter Notebook here. If you want more code examples and easy-to-use functions, I’ve created a companion Jupyter Notebook here.


Getting Started

Before you can utilize Tweepy it requires authorization. I won’t go into detail about setting this up since I’ve covered it before in my previous article.


import tweepy
import pandas as pd

consumer_key = "XXXXX"
consumer_secret = "XXXXX"
access_token = "XXXXX"
access_token_secret = "XXXXXX"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

Getting More Information From Tweets

The picture provided below, while not an exhaustive list, covers a majority of the information available in Tweepy’s tweet object. If you’re curious about what other data is available, the full list of information is shown here on Twitter’s developer site. As you’ll see in this section and in the user information section, I use string versions of IDs instead of integers in order to preserve data integrity, since storing the integer value may lead to data loss when accessing it at a later date.


List of tweet information available with Tweepy

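To see why storing IDs as strings matters (a quick illustration, not from the original article): a 64-bit float can only represent integers exactly up to 2**53, and tweet IDs are far larger, so any tool that coerces the ID column to float — spreadsheets and some CSV readers commonly do — will silently corrupt it:

```python
tweet_id = 1285363852851511301  # a full-length tweet ID

# Round-tripping through float loses the low digits
assert int(float(tweet_id)) != tweet_id

# The string form survives any round trip intact
assert int(str(tweet_id)) == tweet_id
```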

For the most part, every tweet.attribute will have some value and is easily accessible. However, the tweet.coordinates and tweet.place attributes require some extracting because they’re data dictionaries that contain other data. Also, because they can be null, it’s important to first check whether a tweet has the information at all. The functions below check whether the attribute has data and extract the relevant information. This code assumes you’re following my column naming convention, as we’ll see in the next section. If you change how you name Tweet Coordinates or Place Info, you’ll need to adjust the code below accordingly.


# Function created to extract coordinates from tweet if it has coordinate info
# Tweets tend to have null here, so it's important to run the check
# Make sure to run this cell as it is used in a lot of different functions below
def extract_coordinates(row):
    if row['Tweet Coordinates']:
        return row['Tweet Coordinates']['coordinates']
    else:
        return None

# Function created to extract place such as city, state or country from tweet if it has place info
# Tweets tend to have null here, so it's important to run the check
# Make sure to run this cell as it is used in a lot of different functions below
def extract_place(row):
    if row['Place Info']:
        return row['Place Info'].full_name
    else:
        return None

Now we can move forward and extract useful information from the tweet.coordinates and tweet.place attributes. Let’s jump into actual scraping.


Example of a search query pulling all tweet related information from a user’s tweets:


username = 'random'
max_tweets = 150

tweets = tweepy.Cursor(api.user_timeline,id=username).items(max_tweets)

# Pulling information from tweets iterable object
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.user.screen_name, tweet.coordinates, tweet.place, tweet.retweet_count, tweet.favorite_count, tweet.lang, tweet.source, tweet.in_reply_to_status_id_str, tweet.in_reply_to_user_id_str, tweet.is_quote_status] for tweet in tweets]

# Creation of dataframe from tweets_list
# Add or remove columns as you remove tweet information
tweets_df = pd.DataFrame(tweets_list,columns=['Tweet Text', 'Tweet Datetime', 'Tweet Id', 'Twitter @ Name', 'Tweet Coordinates', 'Place Info', 'Retweets', 'Favorites', 'Language', 'Source', 'Replied Tweet Id', 'Replied Tweet User Id Str', 'Quote Status Bool'])

# Checks if there are coordinates attached to tweets, if so extracts them
tweets_df['Tweet Coordinates'] = tweets_df.apply(extract_coordinates,axis=1)

# Checks if there is place information available, if so extracts them
tweets_df['Place Info'] = tweets_df.apply(extract_place,axis=1)

The above query pulls 150 tweets from Twitter user @random and utilizes the extract_coordinates and extract_place functions to check whether location information exists and extract it if so. As shown above, there is a lot of information available in the tweet objects. To modify the information, you just need to add or remove whatever tweet.attribute you want in the list comprehension shown in tweets_list. In doing so, it’s also important to adjust the column names when creating the dataframe tweets_df.

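One attribute worth calling out is tweet.entities (part of the standard v1.1 tweet payload, though not included in the query above): a dictionary holding hashtags, mentions, and URLs. A small helper in the same spirit as extract_coordinates — the helper name and 'Entities' column are my own choices — might look like:

```python
# Hypothetical helper: pull the hashtag text out of a tweet's entities dict
def extract_hashtags(entities):
    return [tag['text'] for tag in entities.get('hashtags', [])]

# e.g. add tweet.entities to the list comprehension, name that column
# 'Entities', then:
# tweets_df['Hashtags'] = tweets_df['Entities'].apply(extract_hashtags)
```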

Getting User Information From Tweets

User information is where Tweepy excels compared to GetOldTweets3. The picture provided below, while not an exhaustive list, covers a majority of information available in Tweepy’s user object. If you’re curious about what other data is available, the full list of information is available here on Twitter’s developer site.


List of user information available with Tweepy


Example of a search query pulling all user-related information from tweets:


text_query = 'Coronavirus'
max_tweets = 150

# Creation of query method using parameters
tweets = tweepy.Cursor(api.search,q=text_query).items(max_tweets)

# Pulling information from tweets iterable object
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.user.name, tweet.user.screen_name, tweet.user.id_str, tweet.user.location, tweet.user.url, tweet.user.description, tweet.user.verified, tweet.user.followers_count, tweet.user.friends_count, tweet.user.favourites_count, tweet.user.statuses_count, tweet.user.listed_count, tweet.user.created_at, tweet.user.profile_image_url_https, tweet.user.default_profile, tweet.user.default_profile_image] for tweet in tweets]

# Creation of dataframe from tweets_list
# Did not include column names to simplify code 
tweets_df = pd.DataFrame(tweets_list)

The above query searches for 150 recent tweets that contain the word coronavirus. This is similar to the code snippet shown earlier: to modify the information available, add or remove any tweet.user.attribute in the list comprehension shown in tweets_list.


Scraping With Advanced Queries

Tweepy provides several different methods to refine your queries. The picture below shows a list of search parameters accessible through Tweepy. Again, while this is not an exhaustive list, it covers the majority. If you want more information regarding the search API, it is available here on Twitter’s developer site. As you might notice with geocode, Tweepy does not have a method that takes in string versions of city names and searches around them. You’ll have to enter a specific latitude/longitude pair to restrict querying by geographic location.


List of search parameters available with Tweepy

Example of a search query using advanced queries:


# Example may no longer show tweets if until_date falls outside 
# of 7-day period from when you run cell
coordinates = '19.402833,-99.141051,50mi'
language = 'es'
result_type = 'recent'
until_date = '2020-08-10'
max_tweets = 150

# Creation of query method using parameters
tweets = tweepy.Cursor(api.search, geocode=coordinates, lang=language, result_type = result_type, until = until_date, count = 100).items(max_tweets)

# List comprehension pulling chosen tweet information from tweets iterable object
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.favorite_count, tweet.user.screen_name, tweet.user.id_str, tweet.user.location, tweet.user.url, tweet.user.verified, tweet.user.followers_count, tweet.user.friends_count, tweet.user.statuses_count, tweet.user.default_profile_image, 
tweet.lang] for tweet in tweets]

# Creation of dataframe from tweets_list
# Did not include column names to simplify code 
tweets_df = pd.DataFrame(tweets_list)

The above query pulls 150 recent tweets from Mexico City in Spanish, with the latest date being August 10th, 2020. This code snippet is a little different from the other two shown before. In order to refine search parameters, you add the different parameters shown in the picture above to tweepy.Cursor(geocode=coordinates, lang=language, etc.), passing each a variable or hardcoding it. That way you can refine your search by location, language, or whatever else you want to do.


Putting It All Together

Great, we’ve seen a lot of separate things, but why does this matter? Whether you want to scrape tweets from a specific user while searching for keywords, or search for tweets containing the keyword Coronavirus within a 50-mile radius of Las Vegas, NV (Lat 36.169786, Long -115.139858), your tweet scraping is only limited by your imagination and the attributes and methods available in Tweepy. Below I’ll show you how easy it is to pick and choose the methods and information you want by building the query mentioned above.


Example of a query pulling tweet and user information with an advanced query:


text_query = 'Coronavirus'
coordinates = '36.169786,-115.139858,50mi'
max_tweets = 150

# Creation of query method using parameters
tweets = tweepy.Cursor(api.search, q = text_query, geocode = coordinates, count = 100).items(max_tweets)

# Pulling information from tweets iterable object
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.text, tweet.created_at, tweet.id_str, tweet.favorite_count, tweet.user.screen_name, tweet.user.id_str, tweet.user.location, tweet.user.followers_count, tweet.coordinates, tweet.place] for tweet in tweets]

# Creation of dataframe from tweets_list
# Did not include column names to simplify code
tweets_df = pd.DataFrame(tweets_list)

As you can see above, just modify the tweepy.Cursor(api.search, ...) call with search parameters to further refine your search by location, top_tweets, etc. If you want to modify the information you receive from tweets, just add or remove a tweet.chosen_attribute in the tweets_list list comprehension I created below the query method.


Scraping More With GetOldTweets3

GetOldTweets3 only requires a pip install. After importing the library you should be able to use it right away. If you want code to follow along with this section, I have an article Jupyter Notebook here. If you want more code examples and easy-to-use functions, I’ve created a companion Jupyter Notebook here.


Getting More Information From Tweets

The picture provided below is an exhaustive list of information available to you through GetOldTweets3's tweet object. As you can see there is a fair amount of information that a single tweet contains, and accessing it is not hard at all.


List of tweet information available with GetOldTweets3

Example of a search query pulling all tweet related information from a user’s tweets:


username = 'jack'
count = 150

# Creation of tweetCriteria query object with methods to specify further
tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
.setMaxTweets(count)

# Creation of tweets iterable containing all queried tweet data
tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# List comprehension pulling chosen tweet information from tweets
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites, tweet.replies, tweet.date, tweet.formatted_date, tweet.hashtags, tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]

# Creation of dataframe from tweets_list
# Did not include column names to simplify code 
tweets_df = pd.DataFrame(tweets_list)

The above query pulls 150 recent tweets from twitter user @jack. As shown above, to modify the information available you just add or remove whatever tweet.attribute you want to the list comprehension shown in tweets_list.


Getting User Information From Tweets

GetOldTweets3 is limited in the user information available compared to Tweepy. This library’s tweet object only contains a tweet author’s username and user_id. If you want more user information than is available, I recommend either using Tweepy for all of your scraping or using Tweepy methods with GetOldTweets3 in order to utilize both libraries to their strengths. I showcase some workarounds for the latter in the “How to Use Tweepy With GetOldTweets3” section below.


Scraping With Advanced Queries

The picture provided below is an exhaustive list of methods available to refine your queries through GetOldTweets3.


List of search methods available with GetOldTweets3

Example of a search query using advanced queries:


username = "BarackObama"
text_query = "Hello"
since_date = "2011-01-01"
until_date = "2016-12-20"
count = 150

# Creation of tweetCriteria query object with methods to specify further
tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
.setQuerySearch(text_query).setSince(since_date)\
.setUntil(until_date).setMaxTweets(count)

# Creation of tweets iterable containing all queried tweet data
tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# List comprehension pulling chosen tweet information from tweets
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.text, tweet.retweets, tweet.favorites,tweet.replies,tweet.date] for tweet in tweets]

# Creation of dataframe from tweets list
# Did not include column names to simplify code 
tweets_df = pd.DataFrame(tweets_list)

The above query tries to pull 150 tweets from Barack Obama that say Hello between January 1st, 2011 and December 20th, 2016. As shown in the above code snippet, if you want to use any of the methods to create more specific queries, all you need to do is add them at the end of TweetCriteria(). For example, I can further refine the search in the above code by adding .setNear("Washington, D.C.") after .setMaxTweets() if I want to query tweets made around that area.


Putting It All Together

Great, we can pull tweet information, or pull tweets from 2016. Why does it matter? Whether you want to scrape all available information from a specific user while searching for keywords, or search for top tweets in Washington, D.C. that contain the keyword Coronavirus between August 5th, 2020 and August 10th, 2020, your tweet scraping is only limited by your imagination and the attributes and methods available in this package. Below I’ll show you how easy it is to pick and choose the methods and information you want by creating the query mentioned above.


Example of a query pulling tweet and user information with an advanced query:


text_query = 'Coronavirus'
since_date = '2020-08-05'
until_date = '2020-08-10'
location = 'Washington, D.C.'
top_tweets = True
count = 150

# Creation of tweetCriteria query object with methods to specify further
tweetCriteria = got.manager.TweetCriteria()\
.setQuerySearch(text_query).setSince(since_date)\
.setUntil(until_date).setNear(location).setTopTweets(top_tweets)\
.setMaxTweets(count)

# Creation of tweets iterable containing all queried tweet data
tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# List comprehension pulling chosen tweet information from tweets
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.to, tweet.text, tweet.retweets, tweet.favorites, tweet.replies, tweet.date, tweet.mentions, tweet.urls, tweet.permalink,] for tweet in tweets]

# Creation of dataframe from tweets list
# Add or remove columns as you remove tweet information
tweets_df = pd.DataFrame(tweets_list)

As this last example shows, refining your queries or modifying the tweet data available to you is rather simple. You refine your queries by chaining setter methods onto the end of the tweetCriteria object and passing in values. You remove or add information from your tweets by modifying the tweet attributes listed in tweets_list.


How to Use Tweepy With GetOldTweets3

What happens if you use GetOldTweets3 but want access to user information? Or you used Tweepy but need access to older tweets? Thankfully, there’s a workaround: you can use Tweepy’s methods to access more tweet or user information, allowing you to use both libraries together.


It’s important to remember that Tweepy’s API requires you to sign up to receive credentials; if you need to do so, my previous article covers that. Also, these methods are still subject to Tweepy’s request limitations, so I don’t recommend using this workaround on datasets larger than 20k tweets unless you don’t mind letting your computer run for a couple of hours. I’ve personally used these two libraries together on a dataset of 5k tweets, and it took around 1–2 hours to finish running everything. If you don’t have the time to wait or don’t mind paying, it’s worth looking into Twitter’s Premium/Enterprise APIs. This article covers searchtweets, a library in Python that allows access to the Premium/Enterprise APIs, if you decide to go that route.

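If the per-tweet request volume is your bottleneck, one way to cut down on requests (a sketch, under the assumption that you already have a list of tweet IDs scraped with GetOldTweets3 and an authenticated api object) is Tweepy’s batch lookup endpoint, api.statuses_lookup, which accepts up to 100 IDs per request instead of one:

```python
def chunks(ids, size=100):
    # statuses_lookup accepts at most 100 ids per request
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# Hypothetical usage, assuming an authenticated `api` and a
# `tweet_ids` list scraped with GetOldTweets3:
# all_statuses = []
# for batch in chunks(tweet_ids):
#     all_statuses.extend(api.statuses_lookup(batch))
```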

If you want code to follow along with this section I have a Jupyter Notebook available here.


Credentials


Before you can utilize Tweepy it requires authorization. I won’t go into detail about setting this up since I’ve covered it before in my previous article.


# Fill in your credentials before running
consumer_key = "XXXXX"
consumer_secret = "XXXXX"
access_token = "XXXXX"
access_token_secret = "XXXXXX"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

Preparation


Tweepy’s API offers two methods that allow a GetOldTweets3 user to access more information.


api.get_status() takes in a tweet id and returns all information associated with that tweet


api.get_user() takes either a user id or username and returns all information associated with that user


So let’s make sure we have this data available and scrape for tweet.id, tweet.author_id, and tweet.username using GetOldTweets3.


text_query = 'Hello'
since_date = "2020-7-20"
until_date = "2020-7-21"
count = 150

# Creation of tweetCriteria query object with methods to specify further
tweetCriteria = got.manager.TweetCriteria()\
.setQuerySearch(text_query).setSince(since_date)\
.setUntil(until_date).setMaxTweets(count)

# Creation of tweets iterable containing all queried tweet data
tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# List comprehension pulling chosen tweet information from tweets
# Add or remove tweet information you want in the below list comprehension
tweets_list = [[tweet.id, tweet.author_id, tweet.username, tweet.text, tweet.retweets, tweet.favorites, tweet.replies, tweet.date] for tweet in tweets]

# Creation of dataframe from tweets list
# Add or remove columns as you remove tweet information
tweets_df = pd.DataFrame(tweets_list, columns = ['Tweet Id', 'Tweet User Id', 'Tweet User', 'Text', 'Retweets', 'Favorites', 'Replies', 'Datetime'])

Functions


Alright perfect, we’ve got our hands on some data that has tweet.id, tweet.author_id, and tweet.username. Let’s test out these methods from Tweepy to make sure they work.


api.get_status(1285363852851511301)

api.get_user(811267164476841984)

api.get_user('realJakeLogan')

As you’ll see with the above code, there’s a lot of information that comes back from these requests. It’s also important to note that Tweepy’s tweet objects already contain a user object. If you only need user information, you can use either method, but if you want Tweepy’s tweet information you’ll have to use the get_status method. Let’s take this a bit further. The code above is only useful for looking up one thing at a time; if you have a whole dataset, you’ll need to create a function to extract the data.


def extract_tweepy_tweet_info(row):
    tweet = api.get_status(row['Tweet Id'])
    return tweet.source

tweets_df['Tweet Source'] = tweets_df.apply(extract_tweepy_tweet_info,axis=1)

This is great; however, what happens if you want to return more than a single attribute from the tweet object? Maybe you also want tweet.user.location and tweet.user.followers_count. Well, there are two ways to go about that. Create a function to store that data in a list, then add all the data to the data frame. Or create a function that builds a Series and returns it, then use pandas’ apply method to run the function over the data frame. I’ll showcase the former as it’s easier to grasp.


# Creation of list to store scraped tweet data
tweets_holding_list = []

def extract_tweepy_tweet_info_efficient(row):
    # Using Tweepy's API to request the tweet data
    tweet = api.get_status(row['Tweet Id'])

    # Storing chosen tweet data in tweets_holding_list to be used later
    tweets_holding_list.append((tweet.source, tweet.user.statuses_count,
                                tweet.user.followers_count, tweet.user.verified))

# Applying the function to store tweet data in tweets_holding_list
tweets_df.apply(extract_tweepy_tweet_info_efficient, axis=1)

# Creating new columns from the data currently held in tweets_holding_list
tweets_df[['Tweet Source', 'User Tweet Count', 'Follower Count', 'User Verified Status']] = pd.DataFrame(tweets_holding_list)

Perfect. Instead of sending a request to Tweepy’s API for every attribute you want, you can send a single request per tweet and modify your data frame once. This is a lot more efficient and will end up saving you a lot of time if you find yourself needing multiple attributes.


That is all for my advanced tutorial of both Tweepy and GetOldTweets3, and my workaround for using Tweepy with GetOldTweets3. Hopefully, most of your questions are addressed by my previous article or this one. If you have specific questions and can’t find answers on Google, or want to reach out to me, I’m available on LinkedIn.


I’m planning to work on scripts or a desktop app to simplify everything and allow non-coders to have access to these tools. If you’re interested in helping, reach out to me so we can collaborate.


Source: https://towardsdatascience.com/how-to-scrape-more-information-from-tweets-on-twitter-44fd540b8a1f
