I recently spent a lot of time reviewing my Twitter data collection infrastructure in order to start some more collections. In that process, I discovered some tokens and streams I forgot about. The purpose of this post is to document what data I am collecting as of 04.29.2020 so that I have an easy reference later.
- Global geostream – This connection is my main collection process. I provide a bounding box for the world and collect tweets. Started August 2013.
- Turkish keywords – This connection collects tweets containing a random sample of Turkish keywords, politically relevant keywords, or prominent Turkish politicians. Started June 2018.
- Media streaming – This connection follows approximately 4,700 media accounts. Of these, 4,500 were identified using DocNow’s news outlet dataset, and the rest come from an iterative process of downloading Twitter media lists to identify accounts. Started December 2019.
- Random sample – This connection uses the streaming endpoint with no parameters. A colleague started this stream around March 2015.
- US geotagged – This connection uses the streaming endpoint with a United States bounding box. A colleague started this stream around March 2015.
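
For future reference, the five connections above can be sketched as request parameters. This is only an illustration, assuming the v1.1 streaming endpoints (`statuses/filter` with `locations`, `track`, or `follow`, and `statuses/sample` with no parameters). The keyword list, the account IDs, and the US bounding box coordinates are placeholders I made up for the example, not the values used in the actual collections.

```python
# Sketch of the parameters each stream might send, assuming the Twitter v1.1
# streaming endpoints. No network calls are made here.

# Bounding boxes are (west, south, east, north) in degrees.
WORLD_BBOX = (-180.0, -90.0, 180.0, 90.0)
US_BBOX = (-125.0, 24.0, -66.0, 50.0)  # assumption: rough contiguous-US box

# Hypothetical placeholders, not the real keyword or account lists.
TURKISH_KEYWORDS = ["seçim", "ekonomi"]
MEDIA_USER_IDS = ["12345", "67890"]


def bbox_param(bbox):
    """Format a bounding box as the comma-separated `locations` string."""
    return ",".join(f"{value:g}" for value in bbox)


def stream_request(endpoint, **params):
    """Describe one streaming connection as an endpoint plus its parameters."""
    return {"endpoint": endpoint, "params": params}


STREAMS = {
    "global_geostream": stream_request(
        "statuses/filter", locations=bbox_param(WORLD_BBOX)
    ),
    "turkish_keywords": stream_request(
        "statuses/filter", track=",".join(TURKISH_KEYWORDS)
    ),
    "media_streaming": stream_request(
        "statuses/filter", follow=",".join(MEDIA_USER_IDS)
    ),
    # The sample endpoint takes no filter parameters.
    "random_sample": stream_request("statuses/sample"),
    "us_geotagged": stream_request(
        "statuses/filter", locations=bbox_param(US_BBOX)
    ),
}

if __name__ == "__main__":
    for name, request in STREAMS.items():
        print(name, request["endpoint"], request["params"])
```

Keeping the definitions in one dictionary like this makes it harder to forget a stream again: each entry pairs a name with the endpoint and filter it uses.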