Written by ZacharyST | April 7, 2017 | March 5, 2025

Header for GDELT 2.0 and Phoenix

Working with machine-coded events data is cool. What’s not cool is that the raw data from two of the main datasets, GDELT 2.0 and Phoenix, do not include headers in their files. It is simple to create a list with the column names, but the closest I could find that already existed for GDELT 2.0 […]
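A minimal sketch of attaching column names to a headerless, tab-delimited file using only Python's standard library. The column names and sample rows below are hypothetical placeholders, not the actual GDELT 2.0 or Phoenix schema; substitute the full list for whichever dataset is in use.

```python
import csv
import io

# Hypothetical column names for illustration only; the real GDELT 2.0 and
# Phoenix files have many more columns with their own names.
COLUMNS = ["event_id", "date", "source_actor", "target_actor", "event_code"]


def read_with_header(handle, columns, delimiter="\t"):
    """Read a headerless delimited file into a list of dicts keyed by column name."""
    reader = csv.DictReader(handle, fieldnames=columns, delimiter=delimiter)
    return list(reader)


# Made-up rows standing in for a raw, headerless events file.
raw = "1\t20170407\tUSA\tRUS\t010\n2\t20170407\tGBR\tFRA\t043\n"
rows = read_with_header(io.StringIO(raw), COLUMNS)
print(rows[0]["event_code"])
```

Because the `csv` module returns every field as a string, codes with leading zeros survive unchanged; any numeric conversion is left to the caller.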

Written by ZacharyST | January 24, 2017 | March 5, 2025

Zelig for Clustered Standard Errors

In regression modeling, it is common to correct standard errors for natural groupings (clusters) in the data. There are various ways to calculate these values using R, from doing it manually to using one of many packages. Theoretically, Zelig is an R package that will cluster standard errors automatically. In my experience, however, it does […]
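For readers who want to see what the correction computes, here is a rough NumPy sketch of the basic cluster-robust "sandwich" variance for OLS. It is not Zelig's implementation and not the post's R code. It omits the finite-sample adjustments that packages often apply, so its numbers may differ from package output. The function name and the toy data are assumptions made for illustration.

```python
import numpy as np


def cluster_robust_se(X, y, groups):
    """OLS coefficients with unadjusted cluster-robust standard errors.

    Hypothetical helper: no small-sample correction is applied.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)

    # "Meat": sum over clusters of the outer product of within-cluster scores.
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        mask = groups == g
        score = X[mask].T @ resid[mask]
        meat += np.outer(score, score)

    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))


# Toy data (made up): an intercept plus one regressor, three clusters of two.
X_demo = np.column_stack([np.ones(6), np.arange(6.0)])
y_demo = np.array([1.0, 2.9, 5.2, 7.1, 8.8, 11.3])
g_demo = np.array([0, 0, 1, 1, 2, 2])
beta_demo, se_demo = cluster_robust_se(X_demo, y_demo, g_demo)
print(beta_demo, se_demo)
```

When every observation is its own cluster, this reduces to the heteroskedasticity-robust (HC0) estimator, which is a quick sanity check on the code.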

Written by ZacharyST | October 27, 2016 | March 5, 2025

Copy of Twitter REST API v1.1 Rate Limits

I’ve been writing some scripts to work with Twitter’s REST API. Naturally, I went to their developer documentation to refresh myself on their rate limits. As of today, the link they provide to their rate limit chart is broken. Fortunately, I clipped this page to Evernote a long time ago. I was therefore able to […]

Written by ZacharyST | April 6, 2016 | March 5, 2025

In R, use openxlsx instead of xlsx

I recently had to read an Excel spreadsheet into R. Why Excel? The original data were in a Google Sheet, and it appears that Google downloads everything to a .xlsx. (There HAS to be a way to download to .csv, but I did not feel like searching.) Opening the file – it was only 12 […]

Written by ZacharyST | March 31, 2016 | March 5, 2025

Parallelize a Multiargument Function in Python

How do you parallelize a function with multiple arguments in Python? It turns out that it is not much different from parallelizing a function with one argument, but I could not find any documentation of that online. An “embarrassingly parallel” computing task is one in which each calculation is independent of the ones that came […]
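One way to do this with the standard library is `multiprocessing.Pool.starmap`, which unpacks each tuple of arguments into a separate call. The sketch below is an assumption about a reasonable approach, not necessarily the method the full post uses. The function `scaled_sum` and its inputs are made up for illustration.

```python
from multiprocessing import Pool


def scaled_sum(x, y, scale):
    """Toy multi-argument function; each call is independent of the others."""
    return scale * (x + y)


def run_parallel(arg_tuples, processes=2):
    """Apply scaled_sum to each argument tuple across worker processes."""
    with Pool(processes=processes) as pool:
        # starmap unpacks each tuple, so (1, 2, 10) becomes scaled_sum(1, 2, 10).
        return pool.starmap(scaled_sum, arg_tuples)


# One tuple per call: (x, y, scale). Values are hypothetical.
ARGS = [(1, 2, 10), (3, 4, 10), (5, 6, 10)]

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import.
    print(run_parallel(ARGS))
```

Results come back in the same order as the input tuples. When only one argument varies, `functools.partial` with `Pool.map` is an alternative.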

Written by ZacharyST | November 24, 2015 | March 5, 2025

A Simple Function for Forest Plots

A great way of conveying regression results is through a forest plot. Widely used in meta-analyses to compare results across models, they are also a convenient way to visualize regression results. Wanting to make one for a presentation, I naturally turned to R and its seemingly infinite packages. The package the internet recommends is forestplot. […]

Written by ZacharyST | August 22, 2015 | March 5, 2025

Twitter Descriptive Statistics, Part 2, Or: Let’s Use Twitter to Study Antarctica

The chart below replicates the data presented in my earlier post about Twitter but ranked by accounts per million inhabitants. Antarctica, of course, does not really have the greatest number of accounts per capita. In this sample, however, Twitter identified 4 tweets from there, and Antarctica has a population of 0, as do the 2nd […]
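The ranking quirk comes down to dividing by zero. Here is a small sketch of a per-million rate that treats a positive count over a population of 0 as infinite, which is why such a place sorts to the top. The Antarctica figures (a count of 4, population 0) come from the excerpt above. "Exampleland" and its numbers are invented for contrast, and the function name is hypothetical.

```python
import math


def per_million(count, population):
    """Count per million inhabitants; a positive count over zero population is infinite."""
    if population == 0:
        return math.inf if count > 0 else 0.0
    return count * 1_000_000 / population


# (count, population). Only the Antarctica row reflects the post.
sample = {
    "Antarctica": (4, 0),
    "Exampleland": (2_500, 5_000_000),  # hypothetical
}

ranked = sorted(sample, key=lambda name: per_million(*sample[name]), reverse=True)
print(ranked)
```

Filtering out zero-population territories before ranking, or imposing a minimum population, would keep such rows from dominating the chart.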

Written by ZacharyST | July 25, 2015 | March 5, 2025

Twitter Descriptive Statistics, Part 1

How many followers does the average Twitter user have? How many accounts does the average Twitter account follow? How many times has the average account tweeted? What about the median? These questions seem simple, but it is not easy to find answers to them. Twitter only discloses how many monthly active users exist, and other […]

Written by ZacharyST | April 12, 2015 | March 5, 2025

Formatting CAMEO Event Codes in ICEWS

UPDATE: Thanks to @icews for helping me figure this out. It turns out that the CAMEO Code field is saved as a string, but Pandas interprets that column as integers and drops the leading zero. To read that column correctly, use the following line:

data = pd.read_csv('/Data/ICEWS/events.2010.20150313084533.tab', sep='\t', dtype={'CAMEO Code': object})

----------

Wanting […]
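The dtype fix described in the update can be checked on a small in-memory sample. This sketch assumes pandas is installed. The three rows and the "Event ID" column are made up for illustration. Only the "CAMEO Code" column name and the `dtype` argument come from the post.

```python
import io

import pandas as pd

# Made-up tab-delimited sample; codes with leading zeros mimic CAMEO codes.
sample = "Event ID\tCAMEO Code\n1\t010\n2\t0231\n3\t190\n"

# Default type inference parses the column as integers, losing the zeros.
inferred = pd.read_csv(io.StringIO(sample), sep="\t")

# Forcing the column to object keeps each code as the original string.
preserved = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"CAMEO Code": object})

print(inferred["CAMEO Code"].tolist())
print(preserved["CAMEO Code"].tolist())
```

Once the codes have been read as integers, the leading zero cannot be reliably restored, since CAMEO codes vary in length. That is why the dtype has to be set at read time.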