Over the past year and a half, I've been building tools to collect, analyze, and visualize large quantities of tweets. These tools have helped me (and my colleagues at Data for Democracy) monitor trends and uncover disinformation campaigns in the French presidential election, the 2017 Virginia election, the #unitetheright rally in Charlottesville, and the #MeToo movement, among others. Over the past few months, I've been building a pre-packaged dashboard kit that lets me spin something up quickly, so I can get an at-a-glance view of the trends surrounding a hashtag, topic, or movement right away, often while they are still unfolding.
While I'm sure there will be more updates, that package, tweetmineR, is complete! It's open-source and hosted on GitHub, so anyone who wants to can download it and generate their own Twitter dashboard. While you need certain coding packages installed, you don't actually need to be a coder to make it work. Just edit a couple of lines with your search terms, enter a couple of commands at the terminal/shell/command line to run it, and voilà: instant dashboard!
Here's how it works.
Setting up the environment
My tweetmineR code requires a few open-source packages to make it work. And if I'm honest, installing these packages is the hardest part of using tweetmineR. Once all the prerequisites are up and running, it's pretty smooth sailing. So if you're new to these things, stick with it! You only have to do the hard stuff once. :)
First, tweet collection and pre-processing takes place via Python. Make sure you've got Python 3 installed. While you can install it from the Python website, if you're planning on developing your data analysis skills with Python, I recommend installing the Anaconda Distribution, which includes a number of other data analysis and visualization tools for Python, as well as several development environments.
You'll also need the Python module Tweepy to connect to the Twitter API. The easiest way to install it is with pip (included with Anaconda):
pip install tweepy
The last thing you'll need to start collecting tweets is a Twitter developer account. To get set up, follow Steps 1–3 of Zach Whalen's How to Make a Twitter Bot instructions. Note that instead of using his Google Sheet and entering your credentials there, you'll be putting those credentials in your Python code.
To do the statistical analysis, text mining, and create the dashboard, tweetmineR uses R. You'll need to install the R language, and I highly recommend RStudio, as both a development environment and an easy way to launch the dashboard locally. (It also interacts easily with Shiny Apps, should you want to publish your dashboard to the web for free.)
You'll also need several packages installed in R for the text mining and the dashboard itself. If any of these aren't already in your R environment, use install.packages('package_name'), with the appropriate package names, to install them. (They will all be well worth it!)
The hard part is now over! All the prerequisites are installed!
Installing tweetmineR and collecting tweets
Now we're ready to download the tweetmineR package from GitHub. Just download the zip file and unpack it on your computer — you can put it anywhere.
The first time you run it, you need to add your Twitter developer credentials in order to connect to Twitter and download data. Take the access tokens created when setting up your developer account and put them between the quotes in the twitter_authentication.py file.
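For reference, twitter_authentication.py boils down to a handful of string assignments. The variable names below are an assumption (check your copy of the file for the exact spelling); the placeholder strings are where your own credentials go:

```python
# Hypothetical sketch of twitter_authentication.py; variable names may
# differ in your copy. Paste your own credentials between the quotes.
CONSUMER_KEY = 'your-consumer-key'
CONSUMER_SECRET = 'your-consumer-secret'
ACCESS_TOKEN = 'your-access-token'
ACCESS_TOKEN_SECRET = 'your-access-token-secret'
```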
Now you need to set the parameters for your tweet collection. There are two options: searching and streaming. Searching goes backwards in time and can collect a large batch of tweets all at once, but it is very limited, only reaching back about 10 days. Also note that Twitter caps the number of tweets you can collect in each 15-minute window, so the script will pause whenever it hits that limit and wait until it is allowed to pull tweets again. But don't worry, it can go on for days if it needs to!
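That pause-and-resume behavior works roughly like the sketch below. This is not the actual twitter_search.py code, just an illustration of the wait-and-retry pattern, with a RuntimeError standing in for Twitter's rate-limit error:

```python
import time

def collect_with_backoff(fetch_page, max_tweets, window_seconds=15 * 60):
    """Keep requesting pages of tweets, sleeping out the rate-limit window
    whenever the API refuses us, until we have max_tweets or run dry."""
    tweets = []
    while len(tweets) < max_tweets:
        try:
            page = fetch_page()
        except RuntimeError:            # stand-in for a rate-limit error
            time.sleep(window_seconds)  # wait out the 15-minute window
            continue
        if not page:                    # no more results: stop early
            break
        tweets.extend(page)
    return tweets[:max_tweets]
```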
Streaming means listening to Twitter for new tweets and collecting them as they happen. A stream should go on indefinitely, but if the system goes down (or gets put to sleep), or if other processes overload the CPU or memory, the script can be forced to stop. (If you notice right away, you can use search to go back and collect the tweets your stream missed. There will likely be some overlap, and you'll have to remove the duplicate tweets.)
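Removing those duplicates is straightforward once you remember that every tweet carries a unique ID. A minimal sketch, assuming your CSV rows are loaded as dicts with an 'id' column (the column name may differ in your files):

```python
def dedupe_tweets(rows):
    """Keep the first occurrence of each tweet ID, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row['id'] not in seen:   # 'id' column name is an assumption
            seen.add(row['id'])
            unique.append(row)
    return unique
```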
To search for tweets, use the twitter_search.py script. Edit the following lines in that script first (near the top of the file):
# Enter each search term inside quotes, between 'OR'.
# The whole search query should be inside a single string,
# which is the single item in a list. (I know this is weird.)
search_query = ['"Twitter" OR "@twitter" OR "#ilovehashtags"']
filename = 'data/sources/tweets.csv'
maxTweets = 10000  # Some arbitrary large number
Update search_query to contain the terms you are searching for. You can add as many terms, handles, and/or hashtags as you like, and you can use boolean operators like AND, as well as advanced search features like from:username. Just make sure that each individual query is inside quotes, and that the whole string of queries (connected with OR) is also inside quotes. (Individual queries should be inside one kind of quotes, I used double quotes, and the whole string inside a different kind, I used single quotes.) Update the file name as appropriate, but don't change the folder path (unless you plan on making changes to the other scripts in tweetmineR). Finally, update maxTweets to the maximum number of tweets you want to collect.
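If you find the nested quoting fiddly, you can build the string programmatically. Here's a little helper (my own convenience, not part of tweetmineR) that turns a plain list of terms into the one-item list the search script expects:

```python
def build_search_query(terms):
    """Wrap each term in double quotes and join them with OR,
    returning the single-item list the search script expects."""
    return [' OR '.join('"{}"'.format(t) for t in terms)]

build_search_query(['Twitter', '@twitter', '#ilovehashtags'])
# ['"Twitter" OR "@twitter" OR "#ilovehashtags"']
```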
To stream tweets, use the file twitter_stream.py and edit the comparable lines at the top of the file with your query parameters:
search_query = ['Twitter', '@twitter', '#ilovehashtags']
filename = 'data/sources/stream-' + str(datetime.datetime.now()).replace(' ', '_').split('.')[0] + '.csv'
Note two things here. First, a search query for Twitter's streaming API takes a list of queries, not a single string containing boolean operators (OR). Each individual search term/handle/hashtag should be inside quotes, in a comma-separated list inside square brackets (a standard Python list). Second, I have Python generate the filename automatically with a datestamp. I've had enough streaming queries fail on me that I want to avoid accidentally overwriting a previous collection of tweets when I spin it back up. The datestamp ensures all files have unique names (and the dashboard code will collect data from every CSV file inside the sources folder). I recommend keeping this feature, but you may want to change stream to something topical if you are collecting lots of different data streams in the same place.
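To see what the automatic naming produces, here is the filename expression on its own (the same recipe as in the script, with the sub-second precision stripped off):

```python
import datetime

# Build a datestamped, collision-proof filename for a stream run.
stamp = str(datetime.datetime.now()).replace(' ', '_').split('.')[0]
filename = 'data/sources/stream-' + stamp + '.csv'
print(filename)  # e.g. data/sources/stream-2018-03-14_09:26:53.csv
```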
When you've set your parameters, you're ready to collect tweets! Go to your command line and fire it up! For searching:

python twitter_search.py

For streaming:

python twitter_stream.py
If you close your terminal window (or press CTRL-C), the script will stop. If you want to keep it running with the window closed (for example, if it's running on a server and you just want to start it from your computer), use
nohup python /path/to/twitter_stream.py
on Mac or Linux. This will export any messages to a file called nohup.out, and will keep the process running until it stops on its own, or until you kill it. To kill it, use
ps -ef | grep python
to find the process ID number for the script. Then use
kill -9 process_ID
(with the appropriate process ID number) to kill the process.
Building the dashboard
Once you have your tweets collected in the sources folder, there are three simple steps to viewing your dashboard.
First, open the file mine_tweets.R and update the line

source_folder <- 'data/'

with the path to your data folder (something like /Users/username/tweetmineR/data/, and don't forget the trailing slash!).
Second, run mine_tweets.R. You can do this by opening it in RStudio, selecting all the code in the file, and clicking "Run", or from the command line:

Rscript mine_tweets.R
This will create a number of files in your data folder containing various summary statistics from your Twitter dataset, many of which are helpful on their own (most retweeted tweets, most prolific users, most common bigrams, trigrams, etc.). They are also the data sources for the dashboard.
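For intuition about what those summary files contain, here is what a bigram count amounts to. (The real analysis happens in R with tidytext; this Python sketch is only an illustration of the idea.)

```python
from collections import Counter

def count_bigrams(text):
    """Count adjacent word pairs in a lowercased, whitespace-split text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

count_bigrams('fake news is fake news')
# ('fake', 'news') appears twice; every other pair appears once
```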
Finally, open either ui.R or server.R in RStudio and click "Run App". This will load the dashboard in a new RStudio window. (You can also deploy your dashboard to Shiny Apps to display it publicly for free.)
This dashboard will give you a bird's-eye view of the trends in your Twitter data and direct your attention to accounts, tweets, terms, and dates/times that are worth a closer look. By itself it can be helpful, but it's the connection of "distant reading" and "close reading" that leads to the greatest insights.
Perhaps more importantly for me, it allows that analytical environment to be set up rapidly. Once the prerequisites are installed, and you've got the hang of constructing a Twitter query (and running the files in the right order!), the workflow of setting up a new dashboard is almost trivial:
- Make a fresh clone of tweetmineR.
- Copy in your Twitter developer account credentials.
- Craft your search query.
- Run the search/stream script.
Once it's running (or finished running), you can generate the dashboard in two steps:

- Run mine_tweets.R.
- Open ui.R, and click "Run App".
You can even do this mid-stream (or mid-search), without waiting for the archive creation to finish.
I used to construct each of the graphs/tables on this dashboard one at a time. Even using copied-and-pasted code, it took a lot of time, and I would risk losing track of things along the way. This dashboard has saved me a lot of time and energy analyzing Twitter campaigns. And, honestly, it has allowed me to perform analyses that I simply wouldn't have done otherwise (a few of which made me some money).
I hope it helps you, too!
P.S. I am greatly indebted to Ben Starling for his help working out the kinks of my Twitter streaming script, as well as to developers like Julia Silge, David Robinson, and others who built the tools, like tidytext and Tweepy, that this dashboard is built on. Thanks for all your hard work! Now it's time for me to pay it forward.