Retrieve Data from Twitter API and Explore Visualizations in Python

Xinqian Zhai
6 min read · Apr 1, 2022


Photo by Joshua Hoehne on Unsplash

Note: This tutorial assumes that you have obtained your Twitter API key and token credentials for authentication. If you haven’t already, you’ll need to apply for a developer account and get it approved. Check out the Twitter Developer site to get started.

First Part: Retrieving data from the Twitter API

1. Import necessary libraries

We will use the tweepy Python library to connect to and retrieve data from the Twitter API, then use the pandas library to manipulate the retrieved data. To better understand the structure of Twitter data, the Twitter API documentation is a great resource to guide us.

import tweepy
import json
import pandas as pd
import twitter_credentials

2. Authenticate with credentials and connect to Twitter API

Next, we write a function that connects to the Twitter API using our credentials. To keep our authentication credentials safe, we can save them in a Python file and import that file later without revealing the key and tokens. You may have noticed that we already imported the Twitter credentials from the twitter_credentials file in the previous step.
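
For reference, the twitter_credentials.py file can be as simple as the sketch below; the variable names match those used in the next step, and the values are placeholders rather than real keys:

# twitter_credentials.py -- placeholder values; replace them with the keys and tokens
# from your own developer account, and keep this file out of version control
CONSUMER_KEY = 'YOUR_CONSUMER_KEY'
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'
ACCESS_TOKEN_SECRET = 'YOUR_ACCESS_TOKEN_SECRET'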

def twitter_api_authentication():
    """ authenticate with credentials and connect to Twitter API """

    auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
    auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)

    return api

api = twitter_api_authentication()
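
To confirm that the connection works, a quick sanity check along these lines can help (verify_credentials and TweepyException are the tweepy v4 names; adjust them if you are on an older version):

# optional sanity check that the credentials are valid
try:
    api.verify_credentials()
    print('Authentication OK')
except tweepy.TweepyException as e:
    print(f'Error during authentication: {e}')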

3. Retrieve tweets data from API

Now we can retrieve tweets using tweepy’s Cursor, which returns Status objects that we can iterate over. Here we will retrieve three tweets from Elon Musk’s timeline as an example.

# get user timeline tweets
def get_user_timeline_tweets(user_handle, num_tweets):
    tweets = []
    for tweet in tweepy.Cursor(api.user_timeline, screen_name=user_handle, tweet_mode='extended').items(num_tweets):
        tweets.append(tweet)

    return tweets

user_handle = 'elonmusk'
num_tweets = 3
tweets = get_user_timeline_tweets(user_handle, num_tweets)

4. Handle rate limits

The Twitter API limits the number of requests a developer can make within a given time interval (for more about rate limits, check here). The limits differ across API calls. If an endpoint exceeds its rate limit, it normally throws an ugly error message and we have to wait 15 minutes before trying again. To avoid that error and handle the rate limits gracefully, we can write a function that notifies us when the rate limit is reached.

def limit_handled(cursor):
    """
    Handle Twitter rate limits.
    If the rate limit is reached, print the error message and exit the procedure.
    """
    n = 0
    while True:
        print(".", end="")
        try:
            yield cursor.next()
            n += 1
        except StopIteration:
            # the cursor has no more items to return
            break
        except tweepy.TooManyRequests as e:
            # we hit a rate limit error
            print(f"Reached rate limits after {n} iterations. Please wait 15 minutes for the next try.")
            print(f"Error message: {e}")
            break
        except Exception as e:
            print(f"Some unknown error occurred: {e}")
            break

For example, if we retrieve Elon Musk’s follower data and hit the rate limit, this is the output:

With the rate limit handled, we can repeatedly make requests every 15 minutes to retrieve more tweet data.
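
As an illustration, wrapping a followers Cursor with limit_handled might look like the sketch below. The get_followers helper here is written just for this example, and api.get_followers is the tweepy v4 endpoint name (older tweepy versions call it api.followers):

# a sketch: wrap the Cursor iterator with limit_handled when retrieving followers
def get_followers(user_handle, num_followers):
    followers = []
    for follower in limit_handled(
            tweepy.Cursor(api.get_followers, screen_name=user_handle).items(num_followers)):
        followers.append(follower)
    return followers

followers = get_followers('elonmusk', 1000)  # dots are printed while requests keep succeeding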

5. Save retrieved tweets data to a JSON file

In case we need to save the retrieved tweet data for later tweet analysis, we can save the data to a JSON file and load it when needed.

# save retrieved tweets data as a json object
def save_tweets_data_to_json_file(tweets, json_file_path):
    """ save retrieved tweets data to a json file
    """
    # get the json payload from each Status object of the retrieved tweets
    tweets_list = []
    for tweet in tweets:
        tweets_list.append(tweet._json)
    # write retrieved tweets to the json file
    with open(json_file_path, 'w') as json_file:
        json.dump(tweets_list, json_file)

# read json file
def read_json_file(json_file_path):
    """ read tweets data from the json file
    return tweets data in json format (dictionary)
    """
    with open(json_file_path) as json_file:
        tweets = json.load(json_file)

    return tweets
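
As a quick usage example, we could archive the Elon Musk tweets retrieved earlier and load them back; the file path below is just an illustration:

# example usage: archive the retrieved tweets and load them back later
json_file_path = 'assets/elonmusk_retrieved_tweets_data.json'  # hypothetical path
save_tweets_data_to_json_file(tweets, json_file_path)

archived_tweets = read_json_file(json_file_path)
print(len(archived_tweets))  # should match the number of tweets we saved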

6. Create a Twitter data frame

Finally, let’s create a Twitter data frame using the pandas library. Whether your tweet data come from raw tweets (Status objects) or from an archived JSON file (dictionaries), you can use them to build a data frame.

def create_tweets_dataframe(tweets, key_list):
    """
    Create a dataframe from a given sequence of tweets, and the columns are the key_list.
    Each tweet can be a tweepy Status object or a json object (dictionary).
    """
    df_array = []

    # if tweets are the data from a tweets json file (dictionary objects)
    if type(tweets[0]) is dict:
        for tweet in tweets:
            row = []
            for key in key_list:
                row.append(tweet[key])
            df_array.append(row)

    # if tweets are the data directly retrieved from raw tweets (Status objects)
    else:
        for tweet in tweets:
            row = []
            for key in key_list:
                row.append(tweet._json[key])
            df_array.append(row)

    t_df = pd.DataFrame(df_array, columns=key_list)
    t_df['created_at'] = pd.to_datetime(t_df['created_at'])

    return t_df
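
For example, building a data frame from 20 of Elon Musk’s timeline tweets might look roughly like this; the keys follow the Twitter v1.1 tweet object, and full_text is available because we requested tweet_mode='extended':

# example usage: build a data frame from 20 timeline tweets
key_list = ['created_at', 'full_text', 'retweet_count', 'favorite_count', 'lang']
tweets = get_user_timeline_tweets('elonmusk', 20)
tweets_df = create_tweets_dataframe(tweets, key_list)
tweets_df.head()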

We retrieved Elon Musk’s last 20 tweets, including the time each tweet was created, its text, its retweet and favorite counts, and its language, and put the data into a data frame. Below is an example of the first 5 rows.

Second Part: Exploring the retrieved tweet data

In this part, we’ll use Oprah Winfrey’s tweet data, which was previously retrieved and saved in a JSON file, to explore the tweets and gain some insights. For the visualizations, we use Altair.

First, let’s load the data and create a data frame to see what’s in the data.

def create_tweet_df():
    # load tweets data
    path = 'assets/oprah_retrieved_tweets_data.json'
    with open(path, 'r') as jf:
        tweets = json.load(jf)

    # create tweet dataframe with specified columns
    key_list = ['created_at', 'full_text', 'retweet_count', 'favorite_count', 'lang']
    df = pd.DataFrame(tweets, columns=key_list)
    df['created_at'] = pd.to_datetime(df['created_at'])
    return df

1. Heatmap of what time tweets are created

Here is a heatmap showing the time distribution of Oprah Winfrey’s most recent 5,000 tweets. The x-axis represents the hour of day (0–24), and the y-axis represents the weekday (Monday to Sunday). As we can see from the heatmap, many of her tweets were posted around 2 am on Thursday (the darkest color block). Overall, 1–3 am on Wednesday, Thursday, and Sunday, along with 3 pm on Sunday, seem to be her most frequent tweeting times.

Heatmap of Oprah Winfrey’s 5,000 most recent tweets
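
For reference, a heatmap like this can be built with Altair roughly as sketched below, assuming the data frame returned by create_tweet_df() above; the exact encoding used for the original chart may differ:

import altair as alt
import pandas as pd

WEEKDAY_ORDER = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# a sketch of the time-of-day heatmap, assuming df comes from create_tweet_df()
def plot_tweet_time_heatmap(df):
    time_df = pd.DataFrame({
        'hour': df['created_at'].dt.hour,
        'weekday': df['created_at'].dt.day_name(),
    })
    return (
        alt.Chart(time_df)
        .mark_rect()
        .encode(
            x=alt.X('hour:O', title='Hour of day'),
            y=alt.Y('weekday:N', sort=WEEKDAY_ORDER, title='Weekday'),
            color=alt.Color('count():Q', title='Number of tweets'),
        )
        .properties(title="When Oprah Winfrey's tweets were created")
    )

plot_tweet_time_heatmap(create_tweet_df())  # renders the chart inline in a notebook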

2. Bar chart of what date tweets are created

This is a bar chart showing the number of Oprah Winfrey’s tweets grouped by weekday and date. To keep the graph readable, we only show her 100 most recent tweets. As we can see, she posted more than 25 tweets on Fridays. Judging from the dates, in 2022 she posted 7 tweets in a single day on both February 4th and February 8th, and no more than 3 tweets on any other day.

Date distribution bar chart of Oprah Winfrey’s 100 most recent tweets
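
Likewise, here is a rough Altair sketch of the date bar chart, again assuming the data frame from create_tweet_df(); stacking the bars by date is one plausible way to reproduce the grouping described above:

import altair as alt

WEEKDAY_ORDER = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# a sketch of the date bar chart for the 100 most recent tweets
def plot_tweet_date_bar_chart(df, num_tweets=100):
    recent = df.sort_values('created_at', ascending=False).head(num_tweets).copy()
    recent['weekday'] = recent['created_at'].dt.day_name()
    recent['date'] = recent['created_at'].dt.strftime('%Y-%m-%d')
    return (
        alt.Chart(recent)
        .mark_bar()
        .encode(
            x=alt.X('weekday:N', sort=WEEKDAY_ORDER, title='Weekday'),
            y=alt.Y('count():Q', title='Number of tweets'),
            color=alt.Color('date:N', title='Date'),
        )
        .properties(title="Oprah Winfrey's 100 most recent tweets by weekday and date")
    )

plot_tweet_date_bar_chart(create_tweet_df())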

3. Bar chart of where followers are located

We also retrieved data on 300 of her followers and gathered the locations they share on Twitter to see where they are located. As we can see, her followers are widely distributed, including in the United States, South Africa, Haiti, Nigeria, the United Kingdom, Canada, France, the Philippines, etc. One thing to note is that some of these locations may be inaccurate or even invalid; for example, one of the locations we retrieved is simply labeled “worldwide”.

Location distribution of 300 Oprah Winfrey followers
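
Below is a sketch of how the follower locations could be collected and plotted. It reuses the get_followers helper from the rate-limit section, assumes ‘Oprah’ as the handle, and simply drops followers whose location field is empty:

import altair as alt
import pandas as pd

# collect 300 followers (get_followers wraps tweepy.Cursor with limit_handled, as sketched earlier)
followers = get_followers('Oprah', 300)
locations = [f.location for f in followers if f.location]  # skip empty location fields

# count the self-reported locations and keep the 20 most common ones
location_counts = (pd.Series(locations)
                   .value_counts()
                   .head(20)
                   .rename_axis('location')
                   .reset_index(name='followers'))

# horizontal bar chart of follower locations
alt.Chart(location_counts).mark_bar().encode(
    x=alt.X('followers:Q', title='Number of followers'),
    y=alt.Y('location:N', sort='-x', title='Location'),
).properties(title="Where Oprah Winfrey's followers say they are located")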

Summary

This is a basic exploration of tweet data. Once you know how to retrieve tweets from the Twitter API, you can make further API requests to get other types of data, such as media, trends, and geography. With the tweet data in hand, we can go on to perform sentiment analysis, topic modeling, social network analysis, and more.

My repository for the complete code can be found here. The files for this article are retrieve_data_from_Twitter_API.ipynb and twitter_data_analysis_1.ipynb.


Xinqian Zhai

Graduate student at the University of Michigan and a new learner on the road to data science.