My Twitter pipeline collects tweets with R but processes them with Python. For the event data class I just taught at APSA (material available here), however, I wanted to keep all of the code in R. I therefore needed to replace my Python code with R code that would open text files containing raw tweets, keep the desired information, and write the result to .csv.
I knew that, when I set up my work in late 2013, I chose Python because it handled JSON data more easily than R did. In the intervening seven years, I forgot just how annoying this task is in R and assumed it had gotten easier. As I wrangled my code into working order, I was viscerally reminded of why I avoid R when working with JSON data.
Three culprits make reading a file of tweets in R needlessly difficult. First, tweet files need to be read one line at a time because many come in with formatting errors that cause functions like parse_stream or fromJSON to error out. Python makes reading one line at a time easy; it is the default way you are expected to interact with a file. In R, however, that is not the case, or at least I could not find a way to do it after much searching. Instead, there is the readLines function, which takes a file and converts each line to a string. This is useful, but it means you still have to load the entire file at once. Loading the entire file at once is far from ideal, as these files run from hundreds of megabytes to gigabytes in size. For many people, loading this much data will tax their computer's RAM, which Python does not do since it keeps only one line in memory at a time. In short, with readLines, one does not scan a file one line at a time; one loads a character version of the file and reads that vector one entry at a time, hence the sarcastic quotation marks in this post's title.
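In practice, the approach boils down to something like the following sketch (the full script is at the end of this post; this just shows the shape of it):

# Load the entire file into memory: one raw tweet (a JSON string) per element
thetweets <- readLines('Data/teaching_tweets_mixture.json')

# "Read" the file by walking the character vector one entry at a time
for(raw_tweet in thetweets){
  # parse raw_tweet, keep the fields of interest, append to the results
}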
The second culprit is the difficulty of R's tryCatch. I find the syntax of this function very confusing, and skipping tweets that create an error required a Boolean flag. I have never had to code or check such a flag in Python, and I cannot see why it is needed. Error handling is significantly easier in Python, and errors when processing tweets are unavoidable.
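Concretely, the skip-on-error skeleton (adapted from a StackOverflow answer and shown in full in the script below) looks roughly like this, where parseTweet is my wrapper around fromJSON defined later and raw_tweet stands in for one line of the file:

skip_to_next <- FALSE          # flag that the error handler can flip
tryCatch(
  {
    tweet_df <- parseTweet(tweet = raw_tweet)
  },
  error = function(e){
    skip_to_next <<- TRUE      # <<- so the assignment reaches the flag outside the handler
  }
)
if(skip_to_next){              # inside the loop over tweets
  next
}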
The third difficulty is ensuring that each tweet has the same fields. This task is non-trivial because tweets are semi-structured data, meaning the content of a tweet object varies depending on what is in the tweet. For example, a tweet with hashtags will have, as part of its tweet object, a field listing the hashtags and another field giving their location in the tweet. A tweet without hashtags will not contain this information, not even as blank fields. In Python, pandas.io.json.json_normalize handles these differences easily by generating NA values for fields a tweet is missing. In R, however, I could not find an out-of-the-box equivalent. I therefore selected the fields I wanted and made those fields a data frame. This procedure means that if I later decide I want different fields, I have to process the raw tweets all over again. I do not have to do that in pandas, since pandas.io.json.json_normalize means I do not have to choose fields up front. I have never had a project where I knew on Day 1 which data I wanted, so having to revisit the start of a data processing pipeline adds complexity and time to a project.
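One way to make the field selection a little safer is a small helper that maps missing (NULL) fields to NA before they reach data.frame. The helper below is hypothetical (it is not in the script at the end), but it could replace the if() check inside parseTweet:

# Hypothetical helper: return NA for any field the tweet object does not contain
na_if_null <- function(x){
  if(is.null(x)) NA else x
}

# For example, inside the data.frame() call:
#   user.location = na_if_null(temp$user$location)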
I am glad I persevered and taught myself these steps, but I also hope to never forget why I did not teach myself these steps years ago. Use Python to process tweets.
The code is available on GitHub and pasted below. The data it loads are available at that repo as well. It runs too slowly to be of practical use: about 20 minutes to process tweets that took only 10 minutes to download. I profiled the code, and the slowdown appears to come from fromJSON; please let me know if the code can be made faster and therefore usable. Speed is another Python advantage.
library(jsonlite) # fromJSON() here is assumed to come from jsonlite

thetweets <- readLines('Data/teaching_tweets_mixture.json')
orig_n <- length(thetweets) # To know later how many tweets I lose
# This function opens a tweet, keeps certain fields.
# tweet is JSON formatted string
parseTweet <- function(tweet){
  temp <- fromJSON(tweet)
  # There must be a better way of doing this.
  if(is.null(temp$user$location)){
    temp$user$location <- NA
  }
  # How to have data.frame handle fields with NULL values? The check above is my answer, but I am sure there is a more elegant solution.
  # Keep only certain fields
  df <- data.frame(lang = temp$lang, text = temp$text, created_at = temp$created_at,
                   id = temp$id_str, source = temp$source, user.id = temp$user$id_str,
                   user.sn = temp$user$screen_name, user.location = temp$user$location,
                   user.created_at = temp$user$created_at)
  return(df)
}
tweet_mixture <- NULL # start empty so the first rbind() returns just the first parsed tweet
i <- 0
for(line in 1:length(thetweets)){
  i <- i + 1
  # This tryCatch structure is from: https://stackoverflow.com/questions/8093914/use-trycatch-skip-to-next-value-of-loop-upon-error
  skip_to_next <- FALSE
  tryCatch(
    {
      tweet_df <- parseTweet(tweet=thetweets[line])
    },
    error = function(e){
      #message(sprintf("Error: %s", e)) # No need to show the errors when compiling, though they are useful for you to see when you are working on your own.
      skip_to_next <<- TRUE
    }
  )
  if(skip_to_next){ # Go to next line if there was an error
    next
  }
  tweet_mixture <- rbind(tweet_mixture, tweet_df)
  if(i %% 1000 == 0){ # I like to know how far along I am. It calms my worry.
    print(i)
  }
}
print(paste0(round(nrow(tweet_mixture)/orig_n*100,2), '% of tweets are kept.'))
write.csv(tweet_mixture, '../Data/teaching_tweets_mixture.csv')
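A common suggestion for loops like this one, whether or not fromJSON is the real bottleneck, is to avoid growing the data frame with rbind on every iteration and instead collect each tweet's one-row data frame in a list, binding once at the end. A rough sketch of that change (untested against the file above, so treat it as a starting point rather than a drop-in fix):

# Parse every line, returning NULL for tweets that error out
rows <- lapply(thetweets, function(raw_tweet){
  tryCatch(parseTweet(tweet = raw_tweet), error = function(e) NULL)
})
# Bind all rows at once; rbind drops the NULL entries from failed tweets
tweet_mixture <- do.call(rbind, rows)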