Understanding Subnational Variation in Tweets

My primary source of data is tweets I get from Twitter’s POST statuses/filter endpoint, what I believe was called the “Streaming Endpoint” when I started working with Twitter data eons ago.  While it has always been straightforward to use a bounding box to get tweets with geographic information, exactly what Twitter reports and how it […]

Crawling Followers with Intelligent Stopping

Like almost every other academic, I have started a Covid-19 project.  I think my team has a unique angle because of the kind of data I collect.  One dynamic we are interested in is patterns of following, and being able to analyze that across enough accounts required me to work with Twitter endpoints I have […]

group_by() %>% mutate() using pandas

While I have my issues with the tidyverse, one feature I am enamored with is the ability to assign values to observations in grouped data without aggregating the data.  This assigning is done by using the mutate() command instead of summarize().  I am in the middle of some data processing in a Python pipeline where I […]

Python, if any() else in list comprehension

This one took me about 20-30 minutes to figure out today and required stringing together some SO answers, so I’m putting what I learned here for future reference. The scenario: checking whether any one of several strings appears in a longer string.  In this case, some possible Twitter clients in the source field of a tweet […]
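
A minimal sketch of that scenario, not necessarily the approach the full post settles on: any() is fed a generator expression inside the conditional of a list comprehension. The client names, the sample source strings, and the "official"/"other" labels are hypothetical values chosen for illustration.

```python
# Hypothetical client names to look for in a tweet's source field.
clients = ["Twitter for iPhone", "Twitter for Android", "TweetDeck"]

# Hypothetical source values; the source field holds an HTML anchor tag.
sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="https://example.com" rel="nofollow">Some Scheduling Bot</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
]

# any() is True as soon as one client name is a substring of the source.
# The "X if condition else Y" expression goes before the "for" clause.
labels = [
    "official" if any(client in source for client in clients) else "other"
    for source in sources
]

print(labels)  # ['official', 'other', 'official']
```

The conditional expression has to come before the for clause because it picks the value for every element. An if placed after the for clause would filter elements out instead.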

I prefer simplejson to json

I thought I was going to spend some time on Friday analyzing tweets from Cameroon.  Instead, starting that process led me down a rabbit hole that has, I hope, culminated in my realizing I should have used Python’s simplejson library this whole time. A script of mine used a try-except sequence to enclose the section […]

Header for GDELT 2.0 and Phoenix

Working with machine-coded events data is cool.  What’s not cool is that the raw data from two of the main datasets, GDELT 2.0 and Phoenix, do not include headers in their files.  It is simple to create a list with the column names, but the closest I could find that already existed for GDELT 2.0 […]

Assigning the Correct Time to a Tweet

When Twitter provides a tweet, the ‘created_at’ field provides a timestamp for when the tweet was authored.  This timestamp is useful, but it cannot be used right away because it is in Coordinated Universal Time (UTC), which is equivalent to Greenwich Mean Time for this purpose.  Unless the tweet happens to have come from that timezone, its time needs to be adjusted to account for this discrepancy. […]
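
A minimal sketch of the adjustment, assuming the created_at string uses the format Twitter's v1.1 API returns (for example "Wed Oct 10 20:19:24 +0000 2018") and that the tweet's offset from UTC is already known. The function name to_local_time and its utc_offset_hours parameter are hypothetical. How to find the right offset for a given tweet is the harder problem and is not addressed here.

```python
from datetime import datetime, timedelta, timezone

# Format of the created_at string, e.g. "Wed Oct 10 20:19:24 +0000 2018".
# %a and %b expect English day and month names (the default C locale).
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_local_time(created_at, utc_offset_hours):
    """Parse created_at and shift it to a fixed offset from UTC.

    utc_offset_hours is a hypothetical input: the number of hours the
    tweet's location is ahead of (positive) or behind (negative) UTC.
    """
    # %z reads the "+0000", so the parsed datetime is timezone-aware.
    utc_time = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone)


local = to_local_time("Wed Oct 10 20:19:24 +0000 2018", 1)
print(local.isoformat())  # 2018-10-10T21:19:24+01:00
```

A fixed offset ignores daylight saving time. Where that matters, a named zone from the standard library's zoneinfo module can replace the timezone(timedelta(...)) object.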

Parallelize a Multiargument Function in Python

How do you parallelize a function with multiple arguments in Python? It turns out that it is not much different than for a function with one argument, but I could not find any documentation of that online. An “embarrassingly parallel” computing task is one in which each calculation is independent of the ones that came […]