Initial Thoughts on Twitter’s Academic Research Product

Now that Twitter has released their Academic Research product, I thought I would jot down my initial thoughts. These observations are based on beta testing I was allowed to do in 2020 with the Search endpoint and reading documentation of the other endpoints.

I have used Twitter’s API for about 7 years now. Most of that time has been with v1.1 of their API, using the filtered stream and various endpoints to download accounts’ old tweets plus follower and following relationships. The Academic Research product is associated with their rollout of v2 of their API. v2, befitting the version number, represents a major change to how Twitter delivers data. Overall, I like the changes, though they will require a major update to researchers’ data collection pipelines, mine included.

The major change, and one that is exclusive to academics, is access to the full tweet archive. This is a major major major improvement and one for which Twitter should be lauded. Previously, one had to pay lots of money to get tweets older than 7 days from the Search endpoint. A workaround was to download a user’s timeline, but Twitter only delivered the most recent 3,200 tweets; for many accounts, especially popular ones that are the focus of much research and attention, 3,200 does not go very far back. The ability to get every tweet ever (that is still public) since Jack Dorsey’s first on March 21, 2006 is AWESOME.

The catch is that Twitter has instituted a project-wide cap on the number of tweets returned per month. During beta, it was 7.5 million, and it now appears to be 10 million. While this number is big, researchers need to carefully monitor the number of tweets they have downloaded, and this monitoring should occur in real time. For example, I wanted to download tweets of a few hundred popular accounts, and my script appeared stuck on one account. It turned out there was an account associated with KLM, the Dutch airline, that tweeted hundreds of thousands of times every month. (I forget the exact name and do not think it was an official KLM account; however, it was very popular in my sample.) Once I realized this, I built logic into my script so that it would stop downloading an account’s tweets after some threshold number was reached. By tracking which accounts hit this threshold, one can then manually assess at a later date whether or not to keep downloading. Another trick to avoid downloading too many tweets is to download from a narrow date range or, if starting from the most recent tweet, to stop once a given date is reached.
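The stopping logic described above can be sketched roughly as follows. This is not my actual script, only a minimal illustration: `collect_account`, `PER_ACCOUNT_LIMIT`, and the shape of the tweet dictionaries (a `created_at` datetime on each) are assumptions made for the example, and `pages` stands in for whatever paginated results your client yields, newest tweets first.

```python
from datetime import datetime, timezone

MONTHLY_CAP = 10_000_000    # project-wide monthly cap discussed above
PER_ACCOUNT_LIMIT = 50_000  # hypothetical threshold; choose your own


def collect_account(pages, used_so_far, stop_before=None,
                    per_account_limit=PER_ACCOUNT_LIMIT,
                    monthly_cap=MONTHLY_CAP):
    """Collect one account's tweets from an iterable of pages (newest first).

    Returns (kept, delivered, reason). `delivered` counts every tweet in
    every page fetched, since delivered tweets are what use up the cap.
    `reason` is "done", "date_reached", "account_limit", or "monthly_cap".
    """
    kept = []
    delivered = 0
    for page in pages:
        delivered += len(page)
        for tweet in page:
            # Stop once we pass the earliest date we care about.
            if stop_before is not None and tweet["created_at"] < stop_before:
                return kept, delivered, "date_reached"
            kept.append(tweet)
        # Check the limits between pages, before requesting the next one.
        if len(kept) >= per_account_limit:
            return kept, delivered, "account_limit"
        if used_so_far + delivered >= monthly_cap:
            return kept, delivered, "monthly_cap"
    return kept, delivered, "done"
```

Accounts that come back with `"account_limit"` can be written to a list for manual review later, and `delivered` should be added to a running monthly total after each account.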

There are also two major changes to the streaming endpoints, filtered and sample, that affect my data collection pipeline. The first is that the 10 million cap applies to tweets delivered via the filtered stream endpoint. In v1.1, there was no cap on the filtered or sample stream. The filtered stream is the primary way I collect data because it would deliver all tweets from up to 5,000 accounts, all tweets containing one of hundreds (500? 400? I forget) of keywords, or, even better, those falling inside of a GPS bounding box. I used these capabilities to stream geolocated tweets in real time and, as projects arose, tweets from specific accounts (media or political) or languages.

While a researcher can still do that, the 10 million cap means one now has to carefully monitor the volume of tweets returned. The rule of thumb is that 1% of tweets, which is the maximum the filtered stream will deliver, is equivalent to roughly 5 million tweets per day. A filtered stream running at that maximum could therefore be cut off on the second day of a researcher’s monthly cycle. One needs to check the expected volume when testing new queries to the filtered stream and keep monitoring in real time, in case volumes surge. This engineering is a new burden.
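The arithmetic behind that warning is simple enough to put in a helper. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the 5 million figure is the rule of thumb above, read as tweets per day.

```python
MONTHLY_CAP = 10_000_000  # project-wide monthly cap discussed above


def days_until_cap(tweets_per_day, cap=MONTHLY_CAP, already_used=0):
    """Days a stream can run at a steady daily volume before hitting the cap."""
    remaining = max(cap - already_used, 0)
    return remaining / tweets_per_day


def sustainable_daily_volume(cap=MONTHLY_CAP, days_in_cycle=30):
    """Largest steady daily volume that lasts a whole monthly cycle."""
    return cap / days_in_cycle


# A stream at the rule-of-thumb maximum exhausts the cap in two days.
print(days_until_cap(5_000_000))
# Volume a query must stay under to run for a full 30-day cycle.
print(round(sustainable_daily_volume()))
```

In other words, a query has to average well under half a million tweets per day to survive a full month, before counting anything downloaded from the Search endpoint.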

Tweets delivered via the sample stream endpoint, the equivalent of the raw 1% from v1.1, do not count towards the 10 million cap. What is unclear to me right now is exactly how the sample and filtered streams differ, because the documentation for the sample stream details the same query parameters that are in the filtered stream documentation. I know how these parameters work from my experience with the Search endpoint during beta testing, so I assume they work the same way for the filtered stream. My guess is that for the sample stream, one passes in the query those parameters desired in the delivered tweet object. For example, if a tweet has a country place.field, the Search API will return tweets that contain a given country code. I assume the filtered stream works the same way, while the sample stream will only return the country when it exists.

In addition, and we are still on what I consider the first major change, there is the question of how complex queries to the filtered endpoint can be. Now, the search and stream endpoints have the same query structure. (Search was the more restrictive of the two in v1.1, so the v2 change just aligns them; the discrepancy was another reason to prefer the streaming endpoints in v1.1.) In v1.1, which is what my research has been built on, one can receive all tweets from up to 5,000 accounts, all tweets containing one of hundreds (500? 400? I forget) of keywords, or, even better, those falling inside of a GPS bounding box. Now, one can pass up to 1,000 rules of 1,024 characters each. What is not clear to me from Twitter’s documentation for building these rules is whether one rule can contain multiple requests, e.g. 5 places or user IDs per rule. Allowing 1,024 characters suggests yes, but the complicated rules that the limit seems designed for appear to be ones built with booleans. This confusion should not matter for requesting tweets from countries, but it will for smaller geographic units. All these questions can be answered by experimenting with the two streams, which is something I (and you!) should do.
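If one rule can indeed hold several accounts joined with OR (which I have not confirmed by experiment), then packing a long account list into as few rules as possible is a small exercise. The sketch below assumes the `from:` operator and `OR` combine the way they do in search queries; `pack_rules` is a made-up helper name, and the limits are the ones quoted above.

```python
def pack_rules(user_ids, max_len=1024, max_rules=1000):
    """Greedily pack 'from:<id>' terms into OR-joined rules of <= max_len chars."""
    rules = []
    current = ""
    for uid in user_ids:
        term = f"from:{uid}"
        if len(term) > max_len:
            raise ValueError(f"single term exceeds {max_len} characters: {term}")
        candidate = term if not current else f"{current} OR {term}"
        if len(candidate) > max_len:
            # The current rule is full; start a new one with this term.
            rules.append(current)
            current = term
        else:
            current = candidate
    if current:
        rules.append(current)
    if len(rules) > max_rules:
        raise ValueError(f"{len(rules)} rules needed, but only {max_rules} allowed")
    return rules


# Example with three hypothetical account IDs and a tiny length limit.
print(pack_rules(["1", "2", "3"], max_len=20))
```

Under these assumptions, 5,000 accounts with 10-digit IDs fit into 93 rules at 54 accounts per rule, so the old 5,000-account capability would still fit comfortably. Longer IDs mean fewer accounts per rule.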

Okay, now I can talk about the second major change introduced with v2 of the API. (These observations are based on my experience with the Search API, and I assume that how the data are delivered there is how they are delivered via other endpoints.) Before, the basic unit I worked with was a tweet or user object, and each was delivered in its own JSON shell. (Other objects, like follower lists, are delivered as lists. I do not know yet if they are delivered differently in v2.) Now, a query returns what to me is an incomplete tweet object. For example, if I ask for tweets from the United Kingdom, I am given a dictionary of tweets where each tweet has a place ID. I then receive a separate dictionary where each entry is a place ID and data about the place. The researcher then has to merge the tweets with their respective places using the place ID. Places are the example I am most familiar with, but I remember a similar structure for other parts of what used to be the tweet object. Perhaps a part of the user profile or the whole user profile?
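The merge itself is a lookup by place ID. Here is a minimal sketch, assuming a response shaped like `{"data": [...], "includes": {"places": [...]}}` in which each tweet carries its place ID under `geo.place_id`; check those field names against what you actually receive. `attach_places` and the sample values are made up for illustration.

```python
def attach_places(response):
    """Return tweets with their place record merged in under a 'place' key."""
    # Index the separately delivered places by ID for constant-time lookup.
    places = {p["id"]: p for p in response.get("includes", {}).get("places", [])}
    merged = []
    for tweet in response.get("data", []):
        place_id = tweet.get("geo", {}).get("place_id")
        # Tweets without a place, or with an unknown ID, get place=None.
        merged.append({**tweet, "place": places.get(place_id)})
    return merged


# Hypothetical response with one geotagged tweet and one without a place.
sample = {
    "data": [
        {"id": "1", "text": "hello from London", "geo": {"place_id": "abc"}},
        {"id": "2", "text": "no location here"},
    ],
    "includes": {
        "places": [{"id": "abc", "full_name": "London, England", "country_code": "GB"}]
    },
}

for tweet in attach_places(sample):
    print(tweet["id"], tweet["place"])
```

The same pattern should work for any other piece that is split out of the tweet object, such as user profiles: build a dictionary keyed on the ID, then look each tweet's ID up in it.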

That is all for now. I am sure there is a lot I am missing and will discover going forward. This change makes me feel old: my initial reaction is to decry it, but that is simply inertia speaking. I will have to change my data collection pipeline, but change is the only constant in this world. Overall, I have been very encouraged by the support for academic research Twitter has provided since I started using the service for research in 2013, and that feeling includes this update. Please review Twitter’s announcement for more detail. There will be more updates to the Academic Research product throughout 2021, so I look forward to continued net positive changes to the service.
