I recently wrote a script that reads thousands of files of tweets, transforms them, and spits out “only” hundreds of files. Having tested the script on my computer on a few files, I was surprised to find the execution taking much longer than anticipated on my server, especially since the server’s CPUs are more powerful than my laptop’s. Since it would take at least 40 days for my script to run through all the files, I knew I had to improve my code. Time to learn how to profile.
Profiling means measuring how long each part of a script takes to run so you can see where the execution bottlenecks are. It lets you modify your code intelligently instead of misdirecting your effort. While you could use Python’s time module and manually insert timers, two other tools are easier and more informative: cProfile and Robert Kern’s line_profiler. I found Huy Nguyen’s guide to all three approaches particularly useful, though Marco Bonzanini’s sample code for line_profiler was more accurate.
cProfile can write binary output to a file you provide, and that output then needs to be read using Python’s pstats module. kernprof (the command-line tool that runs Kern’s line_profiler) prints its statistics to the terminal. Both are useful, though I find kernprof’s output easier to read. cProfile reports time per function call, and I only looked at the 100 most time-consuming entries, whereas kernprof annotates each line of code. cProfile’s output may well contain more information than kernprof’s; it certainly provides more than I cared to analyze for my script.
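As a minimal sketch of the cProfile and pstats workflow described above, the snippet below profiles a stand-in function and prints the ten most expensive entries. The function slow_sum is hypothetical and is not taken from my script; it only gives the profiler something to measure. Here the statistics stay in memory rather than going to a file.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload standing in for the real script."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Collect statistics only for the code between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# pstats reads the collected statistics; sort by cumulative time
# and keep only the 10 most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

From the command line, the equivalent for a whole script is `python -m cProfile -o output.prof script.py`, and for line_profiler it is `kernprof -l -v script.py` after decorating the functions of interest with `@profile`. The file names here are placeholders.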
Profiling showed me that my two bottlenecks were reading each file and then modifying the timestamp of each entry in each file. Since I could not avoid reading each file line by line, I focused on increasing the speed of my timestamp conversion. It turns out that Python’s datetime module is slow, at least for datetime.datetime.strptime(). But since my timestamps were all formatted the same way, I could convert each part of the string to an integer and make a datetime object from those integers, which is much faster than making one from strings. I found Vita Smid’s tutorial particularly useful, and it matched others’ answers on Stack Overflow.
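A rough sketch of that idea follows. It assumes the timestamps use Twitter’s ‘created_at’ layout (for example "Wed Aug 27 13:08:45 +0000 2008") and that the offset is always +0000; both are assumptions for illustration, and the function names are hypothetical rather than the ones in my script. Because every string has the same layout, fixed slices can be converted to integers directly.

```python
import timeit
from datetime import datetime, timezone

# Month abbreviations as they appear in the assumed timestamp layout.
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_created_at_fast(s):
    """Slice fixed positions and build the datetime from integers.

    Assumes the layout 'Wed Aug 27 13:08:45 +0000 2008' and a +0000 offset.
    """
    return datetime(
        int(s[26:30]),   # year
        MONTHS[s[4:7]],  # month
        int(s[8:10]),    # day
        int(s[11:13]),   # hour
        int(s[14:16]),   # minute
        int(s[17:19]),   # second
        tzinfo=timezone.utc,
    )


def parse_created_at_slow(s):
    """Reference implementation using strptime."""
    return datetime.strptime(s, "%a %b %d %H:%M:%S %z %Y")


sample = "Wed Aug 27 13:08:45 +0000 2008"
fast = parse_created_at_fast(sample)
slow = parse_created_at_slow(sample)
assert fast == slow  # both approaches must agree before comparing speed

# Time both approaches; the exact numbers depend on the machine.
print("strptime:", timeit.timeit(lambda: parse_created_at_slow(sample), number=10_000))
print("slicing: ", timeit.timeit(lambda: parse_created_at_fast(sample), number=10_000))
```

The trade-off is that the slicing version does no validation: a string in any other layout will either raise an error or silently produce the wrong date, so it is only safe when the format is known to be uniform.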
Three other changes were important. First, I stopped using a tweet’s ‘created_at’ field and used ‘timestamp_ms’ instead. Since that timestamp can easily be made an integer, it can be converted to a Timestamp object very quickly. Second, my code would read each file line by line, convert the tweets to a dataframe, and then read the dataframe row by row. I moved as much dataframe processing as possible to when I first read each line of the file, and I vectorized as much of the remaining row-by-row code as I could. The third change had to do with executing the code. I use Python’s multiprocessing module to parallelize my code, and I had used 8 cores. For some reason I still do not understand, had not experienced when using fewer cores, and could not find a convincing answer for on the internet, the code executed incredibly slowly with 8 cores. Running the new code on only two cores therefore proved to be the biggest contributor to executing my code faster.
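To illustrate the first two changes, here is a small sketch that converts ‘timestamp_ms’ while each line is first read, instead of in a later pass. It uses only the standard library; the sample lines and the helper name ms_to_datetime are made up for the example. In pandas, pd.to_datetime(column, unit="ms") does the same conversion for a whole column at once.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(timestamp_ms):
    """Convert milliseconds since the Unix epoch (string or int) to a datetime."""
    # Integer arithmetic via timedelta avoids float rounding of the milliseconds.
    return EPOCH + timedelta(milliseconds=int(timestamp_ms))


# Hypothetical stand-in for the lines of one input file, one JSON tweet per line.
lines = [
    '{"timestamp_ms": "1000000000123", "text": "hello"}',
    '{"timestamp_ms": "1000000060000", "text": "world"}',
]

# Do the conversion once, as each line is parsed.
records = []
for line in lines:
    tweet = json.loads(line)
    records.append({
        "time": ms_to_datetime(tweet["timestamp_ms"]),
        "text": tweet["text"],
    })

for record in records:
    print(record["time"].isoformat(), record["text"])
```

Doing the conversion at read time means the later dataframe steps never have to touch a timestamp string again.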
To summarize, profiling my code showed that making a datetime object from a string was the biggest bottleneck. Four changes to how I processed the timestamp in a tweet halved the execution time, based on testing on my laptop. Executing the code with two cores on my server provided the biggest boost, however. Whereas before I could process approximately 200 files per day, I now process just under 7,000. What would have taken longer than a month will now take just over two days. Color me happy going into the weekend!