My primary source of data is tweets from Twitter’s POST statuses/filter endpoint, which I believe was called the “Streaming Endpoint” when I started working with Twitter data eons ago. While it has always been straightforward to use a bounding box to get tweets with geographic information, exactly what Twitter reports, and how it reports it, changes often and is frequently less detailed than people who have not worked with Twitter data expect. For example, this older post of mine talks about country code, and this one discusses tweets’ timestamps. I have since started working with users’ self-reported location and use this post to discuss what I have learned.
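As a small illustration of the bounding-box step, here is a minimal sketch of building the string passed as the filter's `locations` parameter. It assumes, as I recall from the v1.1 documentation, that each box is given as longitude/latitude pairs with the southwest corner first; the function name `locations_param` is my own and is not part of any library, so check the current documentation before relying on it.

```python
def locations_param(west, south, east, north):
    """Format one bounding box as 'west,south,east,north'.

    Assumption: the endpoint expects longitude before latitude and the
    southwest corner before the northeast corner.
    """
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be between -180 and 180")
    if not (-90 <= south <= 90 and -90 <= north <= 90):
        raise ValueError("latitudes must be between -90 and 90")
    if west >= east or south >= north:
        raise ValueError("southwest corner must come before northeast corner")
    return ",".join(str(v) for v in (west, south, east, north))


# Example box roughly covering the San Francisco area.
box = locations_param(-122.75, 36.8, -121.75, 37.8)
print(box)
```

Validating the corner order up front is worthwhile because swapping latitude and longitude is the easiest mistake to make here, and it tends to fail silently by returning tweets from the wrong place or none at all.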
There is a set of countries for which Twitter easily determines subnational variation, so the metadata provided with tweets does not require much researcher processing. While I have not analyzed this behavior enough to try to reverse engineer what Twitter is doing, several of my projects have found that the subnational detail the service provides through the API is sufficient. For example, my new paper, currently accepted and being processed into page proofs, analyzes the 2017 Women’s March. The article provides more detail about the moderate post-processing required to make the tweets’ locations match other datasets, though the locations were not good enough to provide intracity variation. Another paper, published with Jesse Driscoll, shows that the longitude and latitude coordinates provided with tweets from Ukraine can be used to measure oblast-level variation in online discourse, and this variation reveals an important dynamic that may influence policymakers. In another working paper, which theorizes why repression has heterogeneous effects and then tests those predictions with images shared in geolocated tweets, Jungseock Joo and I show that Twitter’s identification of cities provides substantial subnational variation in location for Spain, Venezuela, and South Korea. In a working paper about event data, I also show that Twitter provides sufficient subnational detail in Chile.
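To make "the metadata provided with tweets" concrete, here is a minimal sketch of pulling the location fields out of one tweet. It assumes the v1.1 tweet JSON layout, where `place` carries `place_type`, `name`, `full_name`, and `country_code`, and `coordinates` is a GeoJSON point in longitude, latitude order. The helper name `tweet_location` and the sample tweet are made up for illustration; this is not the processing used in any of the papers above.

```python
def tweet_location(tweet):
    """Return the place and coordinate fields of a v1.1-style tweet dict.

    Missing fields come back as None, since many tweets carry a place
    but no exact point, and some carry neither.
    """
    place = tweet.get("place") or {}
    point = (tweet.get("coordinates") or {}).get("coordinates")
    lon, lat = point if point else (None, None)
    is_city = place.get("place_type") == "city"
    return {
        "country_code": place.get("country_code"),
        "place_type": place.get("place_type"),
        "city": place.get("name") if is_city else None,
        "full_name": place.get("full_name"),
        "lon": lon,
        "lat": lat,
    }


# Made-up sample shaped like a v1.1 tweet (not real data).
sample = {
    "text": "example",
    "coordinates": {"type": "Point", "coordinates": [30.52, 50.45]},
    "place": {
        "place_type": "city",
        "name": "Kyiv",
        "full_name": "Kyiv, Ukraine",
        "country_code": "UA",
    },
}
print(tweet_location(sample))
```

Keeping `place_type` alongside the city name matters because the same field can hold a neighborhood, an administrative region, or a whole country, and only some of those are useful for subnational work.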
I suspect that most countries, however, require the researcher to process the user’s self-reported location. A working paper with Levi Boxell analyzes the effect of taxing social media in Uganda. Though the main effects are at the country level, we also identify accounts likely to belong to students by looking at whether their location or biography mentions words related to university. We show that students are least affected by the tax, though we are not able to look at which mechanisms (income, access to labs, greater grievances) drive that result. I am also starting work on another paper that looks at location in a populous authoritarian country, a competitive authoritarian country, and a European democracy. Twitter assigns city well in only the European democracy. The populous authoritarian country provides enough detail for the province level, though the names require some processing to account for place names in different languages. The competitive authoritarian country requires using the user’s self-reported location, so I provide a list of city names and normalize the common spelling variants that users give. My language limitation means I can only do this for profile locations in English. I believe that restriction still leaves me a majority of tweets, and someone who knows the language could easily do what I do in English.
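The two text-matching steps described above can be sketched roughly as follows. This is a simplified illustration, not my actual code: the city list, the spelling variants, and the university keywords are placeholders I made up for the example, and the function names are hypothetical.

```python
import re
import unicodedata

# Placeholder city list and variants, for illustration only.
CITY_VARIANTS = {
    "Kampala": ["kampala", "kampala city", "kampla"],
    "Entebbe": ["entebbe", "entebe"],
}
# Placeholder keywords suggesting a student account.
UNIVERSITY_WORDS = ("university", "student", "campus", "undergrad")


def normalize(text):
    """Lowercase, strip accents, and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


# (variant, canonical city) pairs, longest variant first so that
# multi-word variants are tried before shorter ones.
_LOOKUP = sorted(
    ((normalize(v), city) for city, vs in CITY_VARIANTS.items() for v in vs),
    key=lambda pair: -len(pair[0]),
)


def match_city(location):
    """Return the canonical city named in a profile location, or None."""
    if not location:
        return None
    padded = f" {normalize(location)} "
    for variant, city in _LOOKUP:
        if f" {variant} " in padded:  # whole-word match only
            return city
    return None


def mentions_university(*fields):
    """True if any profile field contains a university-related keyword."""
    text = " ".join(normalize(f) for f in fields if f)
    return any(word in text for word in UNIVERSITY_WORDS)


print(match_city("KAMPALA CITY!!"))
print(mentions_university("Kampala", "Law student, Makerere University"))
```

Matching on whole words keeps a short variant from firing inside an unrelated longer word, while the keyword check deliberately uses substrings so that "students" still counts. Both choices trade some precision for simplicity, so it is worth reading a sample of matches by hand.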
I also just learned about the carmen package in Python. It estimates the location of a tweet using the coordinates given by Twitter and the string an account provides in its profile location field. I have not used it but have heard good things, and it is certainly more sophisticated than my own code.
Overall, there is a ton of interesting political science work you can do at the subnational level using Twitter. I am sure there are lots of papers I am missing, and this post is certainly weighted too heavily toward my own work. Please share with me other papers that take advantage of subnational variation in tweets!