I recently had to learn gsutil, a Python-based command-line tool for working with Google Cloud Storage on Google Cloud Platform (GCP). Overall, it was easy enough to learn, though there was one pain point. When I was in the GCP bucket containing the directories I wanted to download, I first tried to download them directly through the graphical interface. The graphical interface can only download one file at a time, which would not work for me because there are hundreds of files. Helpfully, GCP provides a message with the gsutil code to download the files. Unfortunately, this code is not quite complete: it only provides the command, options, and paths to whatever one is trying to download. Pasting that command into one’s command line interface, Terminal for me, causes an error. The error occurs because the command GCP provides does not include any syntax to select the files in the directories; it only provides paths to those directories. Once I figured out which wildcards and extra words to add, downloads went smoothly.
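To make the fix concrete, here is a minimal sketch of the kind of command that selects the files inside a directory rather than just naming the directory. The bucket path and local folder are made-up placeholders, not the actual Twitter bucket names, and the exact additions needed may differ from case to case. The sketch only builds and prints the command so it can be checked before anything is downloaded.

```shell
# Hypothetical names for illustration: swap in your own bucket path and local folder.
SRC="gs://example-bucket/example-directory"
DEST="./downloads"

# The quoted "/*" wildcard asks gsutil to match everything inside the directory.
# Quoting keeps the local shell from trying to expand the wildcard itself.
# -m runs the transfers in parallel; cp -r copies subdirectories recursively.
CMD="gsutil -m cp -r \"${SRC}/*\" \"${DEST}/\""

# Print the command first to check it before running it.
echo "$CMD"

# Once it looks right, run it (requires gsutil to be installed and authenticated):
# mkdir -p "$DEST" && gsutil -m cp -r "${SRC}/*" "${DEST}/"
```

A single `*` matches only the top level of the directory, while `**` matches across all levels beneath it, so which one is appropriate depends on how the bucket is laid out.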
It makes sense that GCP would not provide a complete code snippet because it does not know what one wants to download. However, their documentation should say that! Otherwise, it is misleadingly presented as a complete solution.
I had to mess with gsutil so that I could download data Twitter releases about state-backed information operations. These data are very rich and underexplored to date. However, there are a couple of quirks worth keeping in mind.
- There is often a delay between a takedown of a campaign’s accounts and the announcement. For example, an announcement was made in February 2021, but there is no GCP bucket for that date. There is a bucket for December 2020 without a corresponding announcement, so those two probably match. The same goes for June 2020 (announcement) and May 2020 (bucket name on GCP). It’s not really a big deal and is easily resolved by slowing down a little, but it is annoying and worth noting in case I run into this issue again.
- Twitter switches between labeling directories as “Russia” and “IRA”, the Internet Research Agency, a Russia-based organization known for foreign influence operations. I have to imagine in both cases they are the same group, or at least Twitter is not trying to convey meaning with the different words. Their press releases do not indicate as much.
- Twitter used to release README files per announcement and per takedown within each announcement. At some point, they stopped the latter. Their last two announcements have had no README files at all. I assume the thinking is that the content is the same as in previous takedowns, but Twitter should still include README files. What if a researcher is new to the data and does not realize there are earlier README files?