Assigning the Correct Time to a Tweet

When Twitter provides a tweet, the ‘created_at’ field gives a timestamp for when the tweet was authored.  This timestamp is useful, but it cannot be used right away because it is always expressed in UTC (Greenwich Mean Time), not in the author’s local time.  Unless the tweet happens to have come from that time zone, its time needs to be adjusted.  This post details how to assign the correct local timestamp to a tweet that has GPS coordinates.
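As a starting point, here is a minimal sketch of reading ‘created_at’ into a timezone-aware Python datetime.  It assumes the classic string layout (for example "Wed Aug 27 13:08:45 +0000 2008"); the name parse_created_at and the sample value are mine, for illustration only.

```python
from datetime import datetime, timezone

# Assumed layout of the 'created_at' string, e.g. "Wed Aug 27 13:08:45 +0000 2008".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(created_at):
    """Return the 'created_at' string as a timezone-aware datetime in UTC."""
    parsed = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return parsed.astimezone(timezone.utc)


sample = "Wed Aug 27 13:08:45 +0000 2008"  # illustrative value, not from my data
print(parse_created_at(sample).isoformat())  # 2008-08-27T13:08:45+00:00
```

Keeping the datetime timezone-aware from the start means any later conversion to local time is explicit rather than a bare addition of hours.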

The common answer the internet gives you is to use the ‘user’:’utc_offset’ field that Twitter provides.  Unfortunately, that field has two problems.  First, it is not provided for every user.  As far as I can tell, it is present only when the user has given Twitter that information.  In the data I used for this post (1.3 million geotagged tweets from the US, collected over 24 hours), 35.3% of tweets do not have a ‘user’:’utc_offset’ field, so relying on it means throwing out more than a third of your tweets.  Second, the offset does not change with the actual location of a tweet.  For example, my time zone is Pacific Time, so all of my tweets have a utc_offset value of -25200 because, during daylight saving time, I am that many seconds (seven hours) behind UTC.  If I tweet from Baltimore or Chicago and an analyst uses the utc_offset to work out when I tweeted, the answer will be wrong by three or two hours, respectively.
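To make the second problem concrete, here is a small sketch that applies two fixed offsets to the same UTC time: the profile offset of -25200 seconds and the summer offset for Baltimore (-14400 seconds, Eastern Daylight Time).  The tweet time and the helper apply_offset are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def apply_offset(utc_time, offset_seconds):
    """Shift a UTC datetime into a fixed-offset zone (seconds east of UTC)."""
    return utc_time.astimezone(timezone(timedelta(seconds=offset_seconds)))


utc_time = datetime(2015, 7, 1, 19, 0, tzinfo=timezone.utc)  # hypothetical tweet

profile_offset = -25200    # 'user':'utc_offset' from a Pacific profile in summer
baltimore_offset = -14400  # Eastern Daylight Time, where the tweet was sent

print(apply_offset(utc_time, profile_offset).strftime("%H:%M"))    # 12:00
print(apply_offset(utc_time, baltimore_offset).strftime("%H:%M"))  # 15:00
```

The profile offset says the tweet was sent at noon, while the clock in Baltimore read 3 p.m.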

When possible, one therefore has to calculate the tweet creation time from the location of the tweet itself.  (This process, and therefore this post, only works for tweets with GPS coordinates.)  You could do this in plenty of languages; I chose Python because it is the one I am most familiar with.  The steps are straightforward:

  1. Read each tweet from a file
  2. Get the longitude and latitude of the tweet
  3. Find the timezone for that long/lat pair (I used the pytz library)
  4. Find the difference between that timezone and the UTC timezone
  5. Apply that difference to the ‘created_at’ field
  6. Save the tweet to a new file
    1. If you are reading from a database, you can overwrite the existing tweet with this updated one.  I store my tweets in flat files and so cannot do that.
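The steps above can be sketched roughly as follows.  This is not the code from my repository, only an illustration under some assumptions: tweets are stored one JSON object per line, the GPS point sits in ‘coordinates’:‘coordinates’ in [longitude, latitude] order, and ‘created_at’ uses the classic layout.  The lookup from coordinates to a timezone is passed in as a hypothetical callable, zone_for_point, because pytz only turns a zone name (via pytz.timezone) into a timezone object; finding the zone name for a long/lat pair needs a separate lookup.  The example uses a fixed-offset stand-in so that it runs with the standard library alone.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumed layout of 'created_at', e.g. "Wed Jul 01 19:00:00 +0000 2015".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def localize_tweet(tweet, zone_for_point):
    """Return a copy of `tweet` with 'created_at' in the tweet's local time.

    `zone_for_point` is a hypothetical callable (longitude, latitude) -> tzinfo.
    """
    lon, lat = tweet["coordinates"]["coordinates"]  # steps 2: [lon, lat] order
    utc_time = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    # Steps 3-5: look up the zone and let astimezone apply its UTC offset.
    local_time = utc_time.astimezone(zone_for_point(lon, lat))
    updated = dict(tweet)
    updated["created_at"] = local_time.strftime(CREATED_AT_FORMAT)
    return updated


def localize_file(in_path, out_path, zone_for_point):
    """Steps 1 and 6: read tweets from one flat file, write updated ones to another."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            tweet = json.loads(line)
            if not tweet.get("coordinates"):
                continue  # no GPS point, so this tweet is skipped
            dst.write(json.dumps(localize_tweet(tweet, zone_for_point)) + "\n")


def baltimore_stub(lon, lat):
    """Stand-in lookup: always returns a fixed UTC-4 zone (summer in Baltimore)."""
    return timezone(timedelta(hours=-4))


example = {
    "created_at": "Wed Jul 01 19:00:00 +0000 2015",  # hypothetical tweet
    "coordinates": {"type": "Point", "coordinates": [-76.61, 39.29]},
}
print(localize_tweet(example, baltimore_stub)["created_at"])
# Wed Jul 01 15:00:00 -0400 2015
```

Writing the offset back into the string (-0400 instead of +0000) keeps the updated timestamp unambiguous for anyone who reads the new file later.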

My code for this process can be found at this GitHub repository.
