With some free time on my hands, I sat down to update my code that extracts tweets from my tweet collection based on user-supplied keywords or locations. In doing that, however, I ended up making a major improvement, one that should have existed from day one.
You see, simply trying to read a file of tweets line by line sometimes generates an error. Perhaps the download was interrupted, so the file does not have a proper end of file marker. (Why the computer isn’t smart enough to ignore that, I don’t know.) More frequently, part of the tweet, usually the text, contains “\r” or “\n”. Computers interpret those two sets of characters as a new line, not as two random sets of characters, so a single tweet ends up split across lines. If a tweet has those, my later processing fails. Since a script will fail if you do not tell it how to handle an error, I told it to skip any file which contained an error, regardless of the error. The script would run, I would have tweets, and the world seemed content.
That was a bad idea; it turned out I was missing many tweets that otherwise would have a hashtag or location match. So, I made two corrections. First, I moved my try-except statements in one level, so they work per line of a file instead of per file. That way, any error skips that line and not the whole file containing it. Second, I made my except statements specific to the kind of exception, and I now write these errors to a file. Before, I used what’s called a “bare except”, meaning I did not differentiate between, say, a JSONDecodeError and any other kind of error. These two oversights meant that when my code hit an error, I did not know where it occurred or why. The improvements are noticeable. For a three-month period in Gabon, I went from 38,000 to 150,000 tweets.
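The per-line approach might look something like the sketch below. It is illustrative rather than the actual script: the function name `extract_matching_tweets`, the one-JSON-object-per-line layout, the lookup of a `text` field, and the sample data are all assumptions made for the example.

```python
import io
import json


def extract_matching_tweets(lines, keywords, error_log):
    """Return tweets whose text mentions any keyword.

    Hypothetical helper: assumes one JSON object per line with a "text" field.
    A bad line is logged and skipped; the rest of the file is still read.
    """
    matches = []
    for line_number, line in enumerate(lines, start=1):
        try:
            tweet = json.loads(line)
            text = tweet["text"].lower()
        except json.JSONDecodeError as err:
            # Malformed or truncated JSON: skip this line only.
            error_log.write(f"line {line_number}: JSONDecodeError: {err}\n")
            continue
        except (KeyError, TypeError, AttributeError) as err:
            # Valid JSON, but not an object with a string "text" field.
            error_log.write(f"line {line_number}: {type(err).__name__}: {err}\n")
            continue
        if any(keyword.lower() in text for keyword in keywords):
            matches.append(tweet)
    return matches


# Made-up sample: one good match, one broken line, one record without "text".
sample = io.StringIO(
    '{"text": "Hello from Libreville #Gabon"}\n'
    '{"text": "cut off mid-rec\n'
    '{"id": 1}\n'
    '{"text": "unrelated"}\n'
)
errors = io.StringIO()
found = extract_matching_tweets(sample, ["#gabon"], errors)
```

With a per-file try-except, the broken second line would have thrown away the whole file, including the matching first tweet. Here only the two bad lines are skipped, and the error log records which lines failed and with what kind of exception.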
The moral of the story? Code right the first time!
Here are links I found useful during this process: