So, you want historic events data

As far as I am aware, there are no contemporary machine-coded events data if you do not want to use GDELT.*  Phil Schrodt and his colleagues are working on a GDELT replacement that promises to reduce event duplication and provide better geospatial resolution.  Once that project, Phoenix, goes live, it will create real-time data based on 542 news sources’ RSS feeds.  It will also have events from the Gigaword corpus, a collection of English-language news articles from the Agence France-Presse, Associated Press, Central News Agency of Taiwan, LA Times, Washington Post, New York Times, and Xinhua News Agency.  Unfortunately, Gigaword ends in December 2010.

But once Phoenix launches, if you want events data from January 2011 through the start of Phoenix, you will be out of luck.**  The Gigaword corpus will have stopped, Phoenix will not have started, and you will have to make your own data.  Since part of my research focuses on the Arab Spring, the main events of which occurred in 2011, this hole compelled me to create my own events data.  I ended up making 6 scripts to get the data I needed.***

The question I am interested in – How do individuals in authoritarian regimes coordinate protests? – led me to study 16 countries across 426 days.  While hand-coding is the gold standard, my data contain 6,816 country-days, too much material for one person to code.  I therefore built a series of Python scripts to download events from the BBC, Reuters, United Press International, and Xinhua News Agency.  (I tried the Associated Press and Agence France-Presse, but their interfaces are not amenable to structured searches.)  These scripts can be found here.
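The search step can be sketched roughly as follows.  This is a simplified illustration, not the actual scripts: the URL template and function names are assumptions, since each news site's real search interface differs.

```python
import datetime as dt

# Hypothetical search-URL template; each agency's actual search
# interface has its own query format.
SEARCH_URL = "http://example.com/search?q={query}&date={date}"

def daily_dates(start, end):
    """Yield each date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += dt.timedelta(days=1)

def build_queries(keywords, start, end):
    """Build one search URL per keyword per day in the range."""
    return [
        SEARCH_URL.format(query=kw, date=day.isoformat())
        for day in daily_dates(start, end)
        for kw in keywords
    ]

# Example: a slice of the late-2010 window discussed above.
urls = build_queries(["protest"], dt.date(2010, 12, 1), dt.date(2010, 12, 3))
```

Each resulting URL would then be fetched and the matching articles saved to the per-agency JSON file.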

These scripts let me search any keywords within a user-defined date range.  (While I restrict my download to the end of 2010 through 2011, the script allows for any date range.)  Any articles the websites return are then downloaded into a JSON-formatted text file, one file per news agency.  Another script then cleans each file: each source produces many generic stories, such as reprinted press releases or sports scores, that certainly do not contain needed events, so they are discarded.  Each smaller JSON file is then read by another script (based on the first script here), and 3 new files are created: one where each line is the first sentence of an article, one where each sentence of each article gets its own line, and one that provides citation information for each article.  The first two files are my event files, and they can now be turned into events data.
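The cleaning and splitting steps above can be sketched like this.  The field names (`title`, `text`, `url`), the generic-story markers, and the naive sentence splitter are all assumptions for illustration; the author's actual scripts likely differ.

```python
import re

# Assumed filters for boilerplate stories; the real cleaning rules
# would be tuned to each news agency's output.
GENERIC_MARKERS = ("press release", "sports", "scoreboard")

def is_generic(article):
    """Flag generic stories (press releases, sports scores, etc.)."""
    title = article.get("title", "").lower()
    return any(marker in title for marker in GENERIC_MARKERS)

def split_sentences(text):
    """Naive sentence splitter; a proper tokenizer would do better."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def process(articles):
    """Produce the three outputs: first sentences, all sentences, citations."""
    kept = [a for a in articles if not is_generic(a)]
    first_sentences = [split_sentences(a["text"])[0] for a in kept]
    all_sentences = [s for a in kept for s in split_sentences(a["text"])]
    citations = [{"title": a["title"], "url": a["url"]} for a in kept]
    return first_sentences, all_sentences, citations

articles = [
    {"title": "Protests in Cairo", "text": "Crowds gathered. Police responded.", "url": "u1"},
    {"title": "Weekly sports scoreboard", "text": "Team A won.", "url": "u2"},
]
firsts, sentences, citations = process(articles)
```

In the real pipeline, each of the three lists would be written out to its own text file.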

To create the events data, you can then hand code the event files or use another program to automatically generate event data.  Which approach you choose will depend on how much data you have, what programs are available, and the question you are answering.  Figuring out which approach to take is the topic of a future post.


*  Not counting ICEWS because it is not publicly available.

**  The events data may start a few months before the launch date of Phoenix, based on conversations with John Beieler.

***  For some unclear reason, WordPress.com does not provide a way to insert footnotes and does not allow plugins; WordPress.org does.  This site is hosted on the former.
