[UPDATE: THE BELOW CODE WAS DESIGNED FOR TWEETS THAT HAD BEEN ALREADY FORMATTED TO .CSV. I HAVE UPDATED THE CODE TO WORK WITH RAW TWEETS. THE GITHUB PAGE HAS LIKEWISE BEEN UPDATED.]
When looking at tweets, it is often important to know where the tweet was created. For tweets with GPS coordinates, Twitter is nice enough to provide metadata about those coordinates; one piece of metadata is the ISO 3166-1 alpha-2 country code, making it very easy to find tweets from any country. Unfortunately, Twitter appears not to have always assigned country codes to tweets from many countries, especially those outside of Europe, the Anglo-Saxon world, and Northeast Asia. This post documents that pattern and explains how to correct for it manually.
For a project, I am interested in tweets from Iran and Venezuela from early 2014. Below are figures that show the number of GPS-tagged tweets from those two countries during that time. Notice how none exist at first, then a few tweets start to arrive on March 14th, there is a large increase starting on the 26th, and by April 1st a normal number of tweets appears to come from each country.
What is going on? The connection to the streaming API returns tweets before March 14th, and other countries, such as Ukraine and France, have large numbers of labeled tweets before April 1st. A plot of tweets from Pakistan looks similar to those for Venezuela and Iran. This leads me to believe that Twitter was not assigning country codes to tweets from all countries until at least April 1st, 2014. Moreover, the lack of assignment was not random. I inspected random hours of data from before April 1st, and the countries that are labeled are large and tend to be wealthy, e.g. the United States, France, Japan, or Turkey.
Fortunately, it is not difficult to determine manually the country where a tweet was authored. R’s maps package has a convenient function, map.where, which returns the country that contains a GPS coordinate.* (Many thanks to Pablo Barbera for pointing me to the package.) I therefore wrote an R script which opens each file of tweets, ignores those that already have country codes, reads the (longitude, latitude) coordinate of each remaining tweet, and returns the matching country. The default map is not precise enough to resolve tweets that are on the coast of a country (many tweets from Bahrain, a small island nation, were labeled as NA, for example), so it is important to use the worldHires map from the mapdata package.** Without manual labeling, 15%-20% of tweets lack country codes; with manual labeling but without the worldHires map, 1% do; with the worldHires map, only a few hundred tweets per hour lack a country code.
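The lookup step described above can be sketched roughly as follows. This is a minimal illustration, not the script from the repository: the `tweets` data frame and its approximate city coordinates are made up for the example, and stripping everything after a colon is my assumption about how to reduce sub-region labels (such as an island or state) to the bare country name.

```r
library(maps)     # provides map.where()
library(mapdata)  # provides the "worldHires" database

# Hypothetical sample coordinates (not from the original data):
# roughly Tehran, Caracas, and Manama.
tweets <- data.frame(
  lon = c(51.39, -66.90, 50.58),
  lat = c(35.69,  10.49, 26.22)
)

# map.where() takes longitude as x and latitude as y and returns the
# name of the map region containing each point, or NA if none does.
region <- map.where(database = "worldHires", x = tweets$lon, y = tweets$lat)

# Region names can carry a sub-region after a colon; keep only the
# part before it (assumption: that part is the country name).
tweets$country <- sub(":.*$", "", region)

print(tweets)
```

Swapping `"worldHires"` for the default `"world"` database in the same call is an easy way to see how many coastal points fall back to NA.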
This code is available via this GitHub repository. Because my data are stored as thousands of flat files, the code is designed to run in parallel on multiple cores. If you do not have multiple cores available, delete lines 48-50 and 52, then change line 51 to
lapply(files, function(x) processFile(file_path=x)).
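To make the difference between the two modes concrete, here is a small sketch. It does not reproduce the repository's parallel setup, which is not shown in this post; it uses `mclapply` from R's bundled parallel package as one possible approach, and the `processFile` body and the temporary input files are placeholders invented for the example.

```r
library(parallel)

# Placeholder standing in for the script's processFile(); here it
# simply counts the lines in a file.
processFile <- function(file_path) length(readLines(file_path))

# Hypothetical input: two small temporary files of three lines each.
files <- replicate(2, tempfile(fileext = ".txt"))
for (f in files) writeLines(c("a", "b", "c"), f)

# Sequential version, one file after another.
seq_result <- lapply(files, function(x) processFile(file_path = x))

# Parallel version. mclapply() forks worker processes, which is not
# supported on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
par_result <- mclapply(files, function(x) processFile(file_path = x),
                       mc.cores = n_cores)

# Both versions return a list with one element per file.
stopifnot(identical(seq_result, par_result))
```

Because both calls share the same per-file function, switching between them changes only how the work is scheduled, not the results.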
*It can also return the United States state or county of a GPS coordinate.
**map.where returns the full country name, so I turn to the countrycode package to map between the country name and the ISO 3166-1 alpha-2 code.
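As a rough sketch of that last conversion step, assuming the country names have already been extracted (the `country_names` vector here is illustrative, not taken from the data):

```r
library(countrycode)

# Illustrative country names, as they might come back from map.where().
country_names <- c("Iran", "Venezuela", "Pakistan")

# Convert English country names to ISO 3166-1 alpha-2 codes.
# Names that cannot be matched come back as NA, with a warning.
iso2 <- countrycode(country_names,
                    origin = "country.name",
                    destination = "iso2c")

print(iso2)  # "IR" "VE" "PK"
```

Older map databases may use names that no longer correspond to a current country, so it is worth checking the NA results by hand.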