
Written by ZacharyST, September 9, 2020

Reading Tweets “Line by Line” in R

My Twitter pipeline collects tweets with R but processes them with Python. For the event data class I just taught at APSA (material available here), however, I wanted to keep all of the code in R. I therefore needed to replace my Python code with R code that would open text files containing raw tweets, keep the desired information, and write the result to .csv.

I knew that, when I set up my pipeline in late 2013, I chose Python because working with JSON data was easier in Python than in R. In the intervening seven years, I forgot just how annoying this task is in R and assumed it had gotten easier. As I wrangled my code into working order, I was viscerally reminded of why I avoid R when working with JSON data.

Three culprits make reading a file of tweets in R needlessly difficult. First, tweet files need to be read one line at a time because many tweets come in with formatting errors that cause functions like parse_stream or fromJSON to error out. Python makes reading one line at a time easy; it is the default way you are expected to interact with a file. In R, however, that is not the case, or at least I could not find a way to do so after much searching.

Instead, there is the readLines function, which takes a file and converts each line to a string. This is useful, but it means you still have to load the entire file at once. That is far from ideal, as these files run from hundreds of megabytes to gigabytes in size. For many people, loading that much data will tax their computer’s RAM, something Python avoids since it keeps only one line in memory at a time. In short, with readLines, one does not scan a file one line at a time; one loads a character version of the entire file and then reads that vector one entry at a time, hence the sarcastic quotation marks in this post’s title.

The second culprit is R’s unwieldy tryCatch construct. I find its syntax very confusing, and skipping tweets that trigger an error required a Boolean flag. I have never had to code such a flag, or check it, in Python, and I cannot see why it is needed. Error handling is significantly easier in Python, and errors when processing tweets are unavoidable.
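For contrast, the Python error handling I have in mind is a bare try/except with continue, where control flow alone skips the bad line and no flag is needed (a sketch with illustrative names):

```python
import json

def parse_or_skip(lines):
    """Parse JSON lines, silently skipping any that are malformed."""
    parsed = []
    for line in lines:
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # malformed tweet: skip it and move on
    return parsed
```

The except clause catches only the JSON error, and `continue` jumps straight to the next line; there is no Boolean to set in one scope and test in another.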

The third difficulty is ensuring that each tweet has the same fields. This task is non-trivial because tweets are semi-structured data, meaning the content of a tweet object varies depending on what is in the tweet. For example, a tweet with hashtags will have as part of its tweet object a field listing the hashtags and another field with their location in the tweet. A tweet without hashtags will not contain this information, not even as blank fields.

In Python, pandas.io.json.json_normalize handles these differences easily by generating NA values for fields a tweet is missing. In R, however, I could not find an out-of-the-box equivalent. I therefore selected which fields I wanted and made those fields a data frame. This procedure means that if I later decide I want different fields, I have to process the raw tweets all over again. I do not have to do that in pandas, since json_normalize means I do not have to choose fields up front. I have never had a project where I knew on Day 1 which data I wanted, so having to revisit the start of a data processing pipeline adds complexity and time to a project.
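What I mean by “generating NA values” is behavior like the following. (A sketch using pandas.json_normalize, the current name for what older pandas versions exposed as pandas.io.json.json_normalize; the toy tweets below are made up, keeping only a few real tweet-object field names.)

```python
import pandas as pd

# Two toy tweet objects: only the second carries an entities.hashtags field.
tweets = [
    {"id_str": "1", "text": "no tags here",
     "user": {"screen_name": "a"}},
    {"id_str": "2", "text": "#rstats",
     "user": {"screen_name": "b"},
     "entities": {"hashtags": [{"text": "rstats"}]}},
]

# json_normalize flattens nested objects into dotted column names and
# fills fields a tweet lacks with NaN, so every row has the same columns.
df = pd.json_normalize(tweets)
```

Every row ends up with a `user.screen_name` and an `entities.hashtags` column; the first tweet simply gets NaN in the latter, with no field selection required from me.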

I am glad I persevered and taught myself these steps, but I also hope to never forget why I did not teach myself these steps years ago. Use Python to process tweets.

The code is available on GitHub and pasted below; the data it loads are available at that repo as well. It runs too slowly to be of practical use: about 20 minutes to process tweets that took only 10 minutes to download. I profiled the code, and the slowdown appears to be fromJSON; please let me know if the code can be made faster and therefore usable. Speed is another Python advantage.

library(jsonlite)  # Provides fromJSON

thetweets <- readLines('Data/teaching_tweets_mixture.json')
orig_n <- length(thetweets)  # To know later how many tweets I lose

# This function parses one tweet and keeps certain fields.
# tweet is a JSON-formatted string.
parseTweet <- function(tweet){
  temp <- fromJSON(tweet)

  # data.frame() errors on NULL fields, so replace missing values with NA
  # first.  There must be a more elegant way of doing this.
  if(is.null(temp$user$location)){
    temp$user$location <- NA
  }

  # Keep only certain fields
  df <- data.frame(lang = temp$lang,
                   text = temp$text,
                   created_at = temp$created_at,
                   id = temp$id_str,
                   source = temp$source,
                   user.id = temp$user$id_str,
                   user.sn = temp$user$screen_name,
                   user.location = temp$user$location,
                   user.created_at = temp$user$created_at)
  return(df)
}

tweet_mixture <- NULL  # rbind(NULL, df) returns df, so no dummy NA row sneaks in
for(line in 1:length(thetweets)){
  # This tryCatch structure is from: https://stackoverflow.com/questions/8093914/use-trycatch-skip-to-next-value-of-loop-upon-error
  skip_to_next <- FALSE
  tryCatch(
  {
    tweet_df <- parseTweet(tweet = thetweets[line])
  },
  error = function(e){
    # message(sprintf("Error: %s", e))  # Useful when debugging, noisy when compiling.
    skip_to_next <<- TRUE
  }
  )
  if(skip_to_next){  # Go to next line if there was an error
    next
  }
  tweet_mixture <- rbind(tweet_mixture, tweet_df)
  if(line %% 1000 == 0){  # I like to know how far along I am.  It calms my worry.
    print(line)
  }
}

print(paste0(round(nrow(tweet_mixture)/orig_n*100,2), '% of tweets are kept.'))

write.csv(tweet_mixture, 'Data/teaching_tweets_mixture.csv')

Posted in Thoughts and Things. Tagged big data, computational social science, data science, R, Twitter.



