Crawling Followers with Intelligent Stopping

Like almost every other academic, I have started a Covid-19 project.  I think my team has a unique angle because of the kind of data I collect.  One dynamic we are interested in is patterns of following, and analyzing that across enough accounts required me to work with Twitter endpoints I had not touched since the waning days of my dissertation.  I am really proud of two changes I made to code that was by then almost five years old.  Thanks are due to Pablo Barbera and Alexandra Siegel for answering my questions along the way.

The first change was to hit multiple endpoints per 15-minute window.  Perhaps the chief difficulty in working with Twitter data is rate limits: most API endpoints allow only 15 queries per 15 minutes.  Even though each query can return multiple results, such as 200 tweets or 200 hydrated followers, the amount of data that can be downloaded in a reasonable amount of time is usually not enough for a research project.  I had also thought the limit was per IP address, not per token, since only one connection to the streaming API per IP address was allowed when I first started using Twitter for research.  (Various developer forum threads and a Stack Overflow post suggest at least two connections per IP address are now allowed.)  Disabused of this belief, I realized I could download about 120,000 followers per hour instead of 60,000.
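
For anyone writing this kind of loop, the v1.1 rate-limit headers make the 15-minute bookkeeping straightforward.  Here is a minimal sketch of the idea; the helper is my illustration, not the author's code, and it simply sleeps on the `x-rate-limit-reset` header when a window is exhausted:

```python
import time

import requests


def rate_limited_get(url, headers, params):
    """GET a v1.1 endpoint, sleeping out the window when the limit is hit.

    Twitter v1.1 responses carry x-rate-limit-remaining and
    x-rate-limit-reset headers; on HTTP 429 this waits until the
    endpoint's window resets instead of failing.
    """
    while True:
        r = requests.get(url, headers=headers, params=params)
        if r.status_code != 429:  # not rate limited
            r.raise_for_status()
            return r
        # x-rate-limit-reset is the epoch second when the window reopens;
        # fall back to a full 15-minute wait if the header is missing.
        reset = int(r.headers.get("x-rate-limit-reset", time.time() + 900))
        time.sleep(max(reset - time.time(), 0) + 1)
```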

This doubling comes from using both endpoints in the same window.  The fastest approach I previously knew alternated windows: download 75,000 user IDs (just numbers, no profile information) in one 15-minute window, then hydrate 30,000 of them (get the account information) in the next (it could be 90,000 using user authentication, but I stick to app authentication), and repeat that pair in the second half of the hour, for 60,000 hydrated followers per hour.  Now that I know I can pull 75,000 IDs from GET followers/ids and hydrate 30,000 of them with GET users/lookup in the same 15 minutes, I get 120,000 hydrated followers per hour.
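
To make the schedule concrete, here is a sketch of what one 15-minute window looks like against each endpoint.  This is my own illustrative code rather than the author's script, and it assumes an app-auth bearer token in `HEADERS`; in practice each call would be wrapped with a rate-limit helper like the one above:

```python
import requests

HEADERS = {"Authorization": "Bearer YOUR_APP_AUTH_TOKEN"}  # placeholder token
IDS_URL = "https://api.twitter.com/1.1/followers/ids.json"
LOOKUP_URL = "https://api.twitter.com/1.1/users/lookup.json"


def ids_window(screen_name, cursor=-1):
    """One window of GET followers/ids: up to 15 calls x 5,000 IDs = 75,000."""
    ids = []
    for _ in range(15):
        r = requests.get(IDS_URL, headers=HEADERS,
                         params={"screen_name": screen_name,
                                 "cursor": cursor, "count": 5000})
        r.raise_for_status()
        page = r.json()
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
        if cursor == 0:  # reached the end of the follower list
            break
    return ids, cursor


def lookup_window(user_ids):
    """One window of users/lookup: up to 300 calls x 100 users = 30,000.

    POST is used because the comma-separated ID list can be long.
    """
    users = []
    for i in range(0, min(len(user_ids), 30000), 100):
        batch = ",".join(str(uid) for uid in user_ids[i:i + 100])
        r = requests.post(LOOKUP_URL, headers=HEADERS, data={"user_id": batch})
        r.raise_for_status()
        users.extend(r.json())
    return users
```

Hydration is the binding constraint (30,000 versus 75,000 per window), so running both functions in the same window yields 30,000 hydrated followers every 15 minutes, which is the 120,000 per hour figure above.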

The second change is to stop downloading followers once the download reaches a certain date, saving time by skipping users I do not need.  In my work, the date I care about is the date Account B follows Account A.  My Political Analysis paper shows that this date can be estimated reliably for popular accounts.  After hydrating each batch of users, I calculate the latest possible date each one could have started following the account, and the download stops once that estimate for the most recently downloaded followers falls before the threshold date.  Because this check runs on each chunk of 5,000 users, the download usually overshoots a bit, including followers from a few days before the threshold.
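
The stopping check itself is simple once there is a follow-date estimate per chunk.  The sketch below is mine, not the paper's estimator, and it assumes the estimate for a chunk is the maximum account-creation date among its members: followers/ids returns followers newest-first, and no account can follow before it exists, so for popular accounts that figure tracks the chunk's follow dates closely.  `pages` is a hypothetical iterator over hydrated 5,000-user chunks:

```python
from datetime import datetime, timezone

# v1.1 created_at format, e.g. "Wed Oct 10 20:19:24 +0000 2018"
FMT = "%a %b %d %H:%M:%S %z %Y"


def estimated_follow_date(chunk):
    """Approximate when a chunk of hydrated followers followed the account.

    Assumption: some accounts follow soon after being created, so the
    newest created_at in a chunk is a serviceable follow-date estimate
    for popular accounts.
    """
    return max(datetime.strptime(u["created_at"], FMT) for u in chunk)


def download_until(pages, threshold):
    """Keep chunks until the estimated follow date predates the threshold.

    `pages` (hypothetical) yields hydrated followers in 5,000-user chunks,
    newest followers first; checking only once per chunk is why the
    download tends to overshoot the threshold by a few days.
    """
    kept = []
    for chunk in pages:
        kept.extend(chunk)
        if estimated_follow_date(chunk) < threshold:
            break  # everything further back followed even earlier
    return kept


# Example: stop once the download reaches followers gained before Oct 1, 2019.
# followers = download_until(pages, datetime(2019, 10, 1, tzinfo=timezone.utc))
```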

Testing on a few accounts with 1.5 to 2.0 million followers, this logic stopped the download after about 400,000 followers when the threshold was six months in the past.  The fewer followers an account has, the greater the percentage of them that gets downloaded: for an account with about 50,000 followers, 25,000 were downloaded before the stop date was reached.

Overall, between downloading users at twice the rate and stopping after as little as a third of the follower list, I can now download followers about 6x faster than before.  For the account with 1.5 million followers, that is the difference between roughly 25 hours at 60,000 per hour and a little over three hours for 400,000 at 120,000 per hour.  Not bad!

I am happy to provide the Python code upon request.  I have not put it on GitHub because my cursor handling could be more elegant, and I think the way I sleep after completing a user unnecessarily lengthens the download time.  It will probably be a while before I fix those issues (likely not until after this project), and I do not like to put rough code online.  But the code works, and I am happy to share it as needed.
