When I was a PhD student, one method of calming my anxiety was to read advice from professors to PhD students; The Professor is In, Fabio Rojas, and Chris Blattman are particularly helpful. Now that I won the lottery and have started my seven-year post-doc, I find myself in a position with even less structure than graduate school. With no teaching requirements this quarter, my mind has had plenty of time to worry itself. To my pleasant surprise, however, I have found plenty of online advice about this stage of my career. The purpose of this post is simply to provide links for my future self and others in the same career stage. The links appear in the reverse of the order in which I found them.

Advice for New Assistant Professors by Eric Grollman – A compendium of links to other places.

Tips for success on your path to tenure by Rodney E. Rohde – Aimed at the sciences.

Advice for Your First Year on the Tenure Track by Karen Kelsky – Very many good, concise points. Her points about applying for external funding, building your profile in your discipline, and maintaining personal time are new to this list.

My Rules of Thumb by Greg Mankiw – A short essay about how he manages his time, not aimed specifically at tenure-track professors. Some points: surround yourself with good people; manage your time wisely; write well.

Advice for New Junior Faculty by Greg Mankiw – Get your dissertation out the door; good is better than perfect; be a good citizen for your department; network; rejection happens; and don’t blog (written in 2007).

Advice to New Assistant Professors by Chris Blattman – Learn to say no, use blogs and public social media professionally, and other nuggets.

Managing Your Research Pipeline by Matthew J. Lebo – A method for tracking your progress to tenure based on research productivity, from Day 1 to the tenure decision. Slow and steady wins the race.


In other words, I wrote a script that tells me the best times to commute. I (and you, if you click on that link) add an origin and destination, and computers do the rest. Every five minutes (via cron), the script pings Google, which is nice enough to report how long the commute will take by car and by public transit. I then plot those points, color them by commute option, and smooth a line over them (though the line is overkill). Finally, I made a second copy of the script with the origin and destination swapped, so one version covers home to work and the other work to home. Below are the results.
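The logging step is only a few lines. Here is a minimal Python sketch of it; the response structure and field names are assumptions standing in for whatever the directions service actually returns, not my exact script:

```python
import csv
import time

def record_commute(response, mode, outfile):
    """Append one (timestamp, mode, duration-in-minutes) row to a CSV log.

    `response` is assumed to be a parsed directions-style payload with the
    duration in seconds at response['routes'][0]['legs'][0]['duration']['value'];
    the real API's layout may differ.
    """
    seconds = response["routes"][0]["legs"][0]["duration"]["value"]
    row = [int(time.time()), mode, round(seconds / 60, 1)]
    with open(outfile, "a", newline="") as f:
        csv.writer(f).writerow(row)
    return row

# Example with a stubbed response; a real run would query the API from cron
# every five minutes and log both the driving and transit durations.
fake = {"routes": [{"legs": [{"duration": {"value": 1860}}]}]}
print(record_commute(fake, "driving", "commute_log.csv"))  # 1860 s -> 31.0 min
```

With one row per query, the plotting script only has to read the log, color by mode, and smooth.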

**COMMUTE FROM HOME TO OFFICE**

**COMMUTE FROM OFFICE TO HOME**

Cool, there’s a lot of interesting stuff going on here! Going from home to work, there is a clear difference between weekdays and weekends. Though I have not marked which dots come from weekdays, I would bet my last dollar that the first red hump is the weekday effect. I would bet again, though less, that weekend commute times are lower throughout. The worst time to leave my place is between 8:00 a.m. and 9:00 a.m., and the evening traffic starts around 5:00 p.m. If I want to leave home late, it doesn’t matter whether I leave at 10 a.m. or 2 p.m.; the time is basically the same.

Most surprising is that the commute time to the office does not return to its pre-rush hour stillness until late at night. I have alternated between traveling to work around 7 a.m. or 10 a.m., and it is clear that 7 a.m. is shorter. 10 a.m. also has higher variance, which makes sense – sometimes there is construction, sometimes there are wrecks, etc.

Going from work to home shows similar results. There is a clear weekend effect, but I am surprised at how wide the variance around the peak transit time is. In other words, it seems like there is no good time to go home unless you go late at night or before lunch.

In both directions, mass transit clearly has higher variance than private transit. I expect this result is because buses cannot change their route in response to traffic conditions. If I take my car, I can get on the 101 and skip all of the exits or get on the 405. Or I can take Ocean Avenue to Pico then a left on Wilshire to the Santa Monica Pier. Note, however, that mass transit from my house means the bus; I expect a subway would have much less variance, possibly even less than a car. It is also interesting to note that mass transit has higher variance late at night and early in the morning; I expect this reflects decreased service levels.

So, what have I learned? First, being a contrarian would have its advantages. I should commute on the weekends and take my days off during the week, or get to the office around 5 a.m. so I can leave at noon. Second, it is better to leave home early and beat the rush hour than to try to time its back side. Third, there is no good time to come home: the commute home has high variance, and while commute times clearly start climbing at noon, I am indifferent between leaving at any point from 2:00 p.m. to 5:00 p.m. Fourth, the commute to the office is more predictable than the commute home.

Most interestingly, *commute time at rush hour on mass transit is not much slower than in a car.* For my route, I think this is because there are express buses with dedicated lanes (though not true bus rapid transit). It is thus not uncommon to pass cars during rush hour; when the express lane ends, the bus becomes just another vehicle in bumper-to-bumper traffic. Only away from the rush hour peaks are cars dramatically faster, and I bet most of that is because drivers can vary their route. While driving myself off rush hour on the same route my bus takes would be quicker, because I can accelerate around buses, I do not think it would be more than a few minutes quicker. Even better, I can do work – write some code, read papers, prepare a lecture, and so on – on a bus. The only “work” I could do in the car is making phone calls, which is a bad idea because it is dangerous and you will not really remember the conversation while you are focused on driving. Moreover, keep in mind that the gap between the two modes would be narrower, and perhaps in favor of mass transit, if a light rail or subway were available for the whole route.

If I felt like working on this script more, I would change two things. First, I would modify the code to automatically query Google for the return trip; for the charts above, I simply made a second version of the script with the origin and destination switched. Second, I would record the day of the week of each query and add it to the dataframe. That way, I could modify the plot points (probably by changing their shape) to show four dimensions: departure time, estimated travel time, day of week, and mode of transit.


Here is the header for GDELT 2.0:

`header = ['GlobalEventID', 'Day', 'MonthYear', 'Year', 'FractionDate', 'Actor1Code', 'Actor1Name', 'Actor1CountryCode', 'Actor1KnownGroupCode', 'Actor1EthnicCode', 'Actor1Religion1Code', 'Actor1Religion2Code', 'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code', 'Actor2Code', 'Actor2Name', 'Actor2CountryCode', 'Actor2KnownGroupCode', 'Actor2EthnicCode', 'Actor2Religion1Code', 'Actor2Religion2Code', 'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code', 'IsRootEvent', 'EventCode', 'EventBaseCode', 'EventRootCode', 'QuadClass', 'GoldsteinScale', 'NumMentions', 'NumSources', 'NumArticles', 'AvgTone', 'Actor1Geo_Type', 'Actor1Geo_Fullname', 'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code', 'Actor1Geo_ADM2Code', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor1Geo_FeatureID', 'Actor2Geo_Type', 'Actor2Geo_Fullname', 'Actor2Geo_CountryCode', 'Actor2Geo_ADM1Code', 'Actor2Geo_ADM2Code', 'Actor2Geo_Lat', 'Actor2Geo_Long', 'Actor2Geo_FeatureID', 'ActionGeo_Type', 'ActionGeo_Fullname', 'ActionGeo_CountryCode', 'ActionGeo_ADM1Code', 'ActionGeo_ADM2Code', 'ActionGeo_Lat', 'ActionGeo_Long', 'ActionGeo_FeatureID', 'Dateadded', 'Sourceurl']`

Here is the header for Phoenix:

`header = ['EventID', 'Date', 'Year', 'Month', 'Day', 'SourceActorFull', 'SourceActorEntity', 'SourceActorRole', 'SourceActorAttribute', 'TargetActorFull', 'TargetActorEntity', 'TargetActorRole', 'TargetActorAttribute', 'EventCode', 'EventRootCode', 'PentaClass', 'GoldsteinScore', 'Issues', 'Lat', 'Lon', 'LocationName', 'StateName', 'CountryCode', 'SentenceID', 'URLs', 'NewsSources']`
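Either header can then be zipped onto the tab-delimited rows. As a sketch, with Python's csv module and a dummy row (the values below are placeholders, purely to show the wiring):

```python
import csv
import io

# The Phoenix header from above.
header = ['EventID', 'Date', 'Year', 'Month', 'Day', 'SourceActorFull',
          'SourceActorEntity', 'SourceActorRole', 'SourceActorAttribute',
          'TargetActorFull', 'TargetActorEntity', 'TargetActorRole',
          'TargetActorAttribute', 'EventCode', 'EventRootCode', 'PentaClass',
          'GoldsteinScore', 'Issues', 'Lat', 'Lon', 'LocationName',
          'StateName', 'CountryCode', 'SentenceID', 'URLs', 'NewsSources']

# A placeholder row: one dummy value per column, tab-separated.
line = "\t".join(str(i) for i in range(len(header)))

# DictReader pairs each value with its column name.
reader = csv.DictReader(io.StringIO(line), fieldnames=header, delimiter="\t")
row = next(reader)
print(row["EventID"], row["GoldsteinScore"])  # -> 0 16
```

The same pattern works for the GDELT header; only the list of field names changes.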


The common answer the internet gives you is to use the ‘user’:’utc_offset’ field that Twitter provides. Unfortunately, there are two problems with that field. First, it is not present for every user: as far as I can tell, it appears only when the user has supplied that information to Twitter. Of the data I used for this post (1.3 million geotagged tweets from the US collected over 24 hours), 35.3% of tweets do not have a ‘user’:’utc_offset’ field, so relying on it means throwing out a great many tweets. Second, the offset does not change based on the actual location of a tweet. For example, my time zone is Pacific Time, so all of my tweets have a utc_offset value of -25200 because I am currently that many seconds (seven hours) behind UTC. If I tweet from Baltimore or Chicago and an analyst uses utc_offset to work out my local tweet time, the answer will be wrong by two or three hours.

When possible, one therefore has to manually calculate the tweet creation time based on the location of the tweet. (This process, and therefore this post, only works for tweets with GPS coordinates.) While you could do this task in plenty of languages, I chose Python because it is what I am most familiar with. The steps are straightforward:

- Read each tweet from a file
- Get the longitude and latitude of the tweet
- Find the timezone for that long/lat pair (I used the pytz library)
- Find the difference between that timezone and the UTC timezone
- Add or subtract that difference from the ‘created_at’ field
- Save the tweet to a new file
- If you are reading from a database, you can overwrite the existing tweet with this updated one. I store my tweets in flat files and so cannot do that.
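One note on the steps above: pytz converts times between zones, while mapping a coordinate to a zone name usually needs an additional lookup library (e.g. timezonefinder). Once you have a timezone name, the shift itself is a few lines. A sketch using the standard library's zoneinfo, which behaves like pytz for this purpose:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def localize(created_at, tz_name):
    """Convert Twitter's UTC 'created_at' string to local time in tz_name.

    tz_name would come from a coordinate-to-timezone lookup; here it is
    simply passed in.
    """
    # Twitter's created_at format, e.g. "Wed Aug 27 13:08:45 +0000 2014".
    utc = datetime.strptime(created_at, "%a %b %d %H:%M:%S +0000 %Y")
    utc = utc.replace(tzinfo=timezone.utc)
    return utc.astimezone(ZoneInfo(tz_name))

local = localize("Wed Aug 27 13:08:45 +0000 2014", "America/Los_Angeles")
print(local.isoformat())  # -> 2014-08-27T06:08:45-07:00
```

Note that astimezone handles daylight saving for you, which is exactly what a flat utc_offset cannot do.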

My code for this process can be found at this Github repository.


In theory, Zelig is an R package that will cluster standard errors automatically. In my experience, however, it does not actually do so: the results for a linear model with and without clustered robust standard errors are identical. (I am in a rush and so do not have time to post regression results, but you can run the comparison and see the same problem.) I have had the same experience with negative binomial models, and this person has had the same issue with logit models. In other words, Zelig seems to have a systematic bug which prevents it from (properly?) calculating robust clustered standard errors. Or, if there is no bug, the documentation’s description of how to calculate them is misleading.


Fortunately, I clipped this page to Evernote a long time ago. I was therefore able to save it to a .pdf and have uploaded it to my website. If you would like your own copy of the rate limits, download one from this link. I have also pasted the .pdf below, one page at a time.


The most common answer was to use the xlsx package (very convenient name). That package required me to install a Java plugin, which was super annoying. Even after that, the file would not load, apparently because it was too big. (12 megabytes is big data????) Starting to get really pissed off at Google and Microsoft, I tried my luck with the openxlsx package. It turns out that package worked like a charm.


An “embarrassingly parallel” computing task is one in which each calculation is independent of the ones that came before it. For example, squaring each number in the list [1, 2, 3, 4, 5] is embarrassingly parallel because the square of 2 does not depend on the square of 1, the square of 3 does not depend on the square of 2, and so on. Instead of each calculation waiting for the previous one to complete, multiple calculations can run simultaneously on different processors. Because a calculation is performed on each element of the list, for-loops (or similar structures) are usually used.

It is easy enough to find examples of how to parallelize a for loop in R or Python. The canonical Python example uses the joblib library:


```python
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```

This code uses 2 processors to take the square root of the square of each number in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].

That example assumes your function takes only 1 argument. What happens if you have multiple arguments, e.g. in a nested loop? For example, what if you want to calculate the product of each pairwise combination of [1, 2, 3, 4, 5] and [11, 12, 13, 14, 15]? I could not find a simple explanation online for this case, which is surprising; the top Google results for “joblib parallel” are this, this, this, and this (the canonical code).

Passing multiple arguments turns out to be simple, requiring just a slight modification of the canonical code.

```python
>>> from joblib import Parallel, delayed
>>> def multiple(a, b):
...     return a * b
...
>>> Parallel(n_jobs=2)(delayed(multiple)(a=i, b=j) for i in range(1, 6) for j in range(11, 16))
```

This code defines a function which takes two arguments and multiplies them together. The slightly confusing part is that the arguments to the multiple() function are passed outside of the call to that function, and keeping track of the loops can get confusing if there are many arguments to pass. Below is the call I ended up writing to generate sample network data, where the network is defined by 4 parameters.
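When the nested generator loops get hard to track, the same pairwise pattern can be written with itertools.product and an executor from the standard library (a sketch, not the post's code; ThreadPoolExecutor is used here for portability, with ProcessPoolExecutor being the closer analogue to joblib's separate worker processes):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def multiple(a, b):
    return a * b

# product() enumerates every (a, b) combination, replacing the nested loops.
pairs = list(product(range(1, 6), range(11, 16)))

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda ab: multiple(*ab), pairs))

print(results[:5])  # a=1 paired with b=11..15 -> [11, 12, 13, 14, 15]
```

Materializing the argument combinations first, as product does, also makes it easy to log or checkpoint which parameter settings have already run.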

```python
>>> from joblib import Parallel, delayed
>>> vertices = [100, 1000, 10000]
>>> edge_probabilities = [.1, .2, .3, .4, .5, .6]
>>> power_exponents = [2, 2.5, 3, 3.5, 4]
>>> graph_types = ['Erdos_Renyi', 'Barabasi', 'Watts_Strogatz']
>>> Parallel(n_jobs=6)(delayed(makeGraph)(graph_type=graph, nodes=vertex,
...                                       edge_probability=prob, power_exponent=exponent)
...                    for vertex in vertices
...                    for prob in edge_probabilities
...                    for exponent in power_exponents
...                    for graph in graph_types)
```

makeGraph is a function I wrote. It is too long to show here, but the idea should be clear: I tell the computer to use 6 processors to run the makeGraph function, and that function takes its arguments from the lists I have already defined.


For a project, I am interested in tweets from Iran and Venezuela from early 2014. Below are figures that show the number of GPS-tagged tweets from those two countries during that time. Notice how none exist at first, then a few tweets start to arrive on March 14th, there is a large increase starting on the 26th, and by April 1st a normal volume of tweets appears to come from each country.

**Iran**

**Venezuela**

What is going on? The connection to the streaming API returns tweets before March 14th, and other countries, such as Ukraine and France, have large numbers of labeled tweets before April 1st. A plot of tweets from Pakistan looks similar to Venezuela and Iran. This leads me to believe that Twitter was not assigning country codes to all countries before at least April 1st, 2014. Moreover, the lack of assignment *was not at random*. I inspected random hours of data from before April 1st, and the countries labeled are large and tend to be wealthy, e.g. the United States, France, Japan, or Turkey.

Fortunately, it is not difficult to determine manually the country from which a tweet was sent. R’s maps package has a convenient function, map.where, which returns the country that contains a GPS coordinate.* (Many thanks to Pablo Barbera for pointing me to the package.) I therefore wrote an R script which opens each file of tweets, ignores those that already have country codes, reads the (longitude, latitude) coordinate of each tweet, and returns the matching country. The default map is not precise enough to resolve tweets on a country’s coasts (many tweets from Bahrain, a small island, were labeled N/A, for example), so it is important to use the worldHires map from the mapdata package.** Without manual labeling, 15%-20% of tweets lack country codes; with the default map, about 1% still do; with the worldHires map, only a few hundred tweets per hour lack a country code.
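For intuition, here is the same reverse-geocoding idea reduced to a crude bounding-box check in Python. The boxes below are rough approximations for illustration only; real work needs proper border polygons, which is exactly what the worldHires map provides:

```python
# Very rough lat/lon bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# Approximations for illustration only -- real borders are polygons.
BOXES = {
    "IR": (25.0, 40.0, 44.0, 63.5),   # Iran, roughly
    "VE": (0.6, 12.2, -73.4, -59.8),  # Venezuela, roughly
}

def country_of(lat, lon):
    """Return the ISO-2 code of the first box containing (lat, lon), else None."""
    for code, (south, north, west, east) in BOXES.items():
        if south <= lat <= north and west <= lon <= east:
            return code
    return None

print(country_of(35.7, 51.4))   # Tehran  -> IR
print(country_of(10.5, -66.9))  # Caracas -> VE
```

The coastal-tweet problem mentioned above is exactly where this kind of coarse geometry fails and a high-resolution map earns its keep.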

This code is available via this github repository. Because my data are stored as thousands of flat files, the code is designed to run in parallel on multiple cores. If you do not have multiple cores available, delete lines 48-50 and 52, then change line 51 to `lapply(files, function(x) processFile(file_path=x))`.

Enjoy!

—————–

*It can also return the United States state or county of a GPS coordinate.

**map.where returns the full country name, so I turn to the countrycode package to map between country name and the ISO-3166-1-alpha-2 code.


The package the internet recommends is forestplot. I actually found this package very frustrating to use, for two reasons. First, it is designed to work with meta-analyses, so the user interested in plotting just one regression result has to modify a lot of function arguments. Second, it appears impossible to modify the size of the axis labels. The documentation suggests this is possible, but hours of messing around got me nowhere. This difficulty could be because forestplot uses the grid plotting framework, and I am firmly in the base R and ggplot world. Not worth learning a new approach for a small problem.

I then researched how to make a forest plot in ggplot. Forest plots in ggplot are doable, but I wasn’t pleased with the syntax required. Too much hacking for what should be really simple.

After wasting many hours, I bit the bullet and wrote my own function to make a simple forest plot. (Many thanks to Alex Hughes for the initial code.) The code is very simple, and so not worth releasing as an R package. Indeed, the initial results for your model will probably require you to tweak the code. The main benefit of this function is therefore to save the time of setting up the initial code; you should have a presentable .pdf file after just a couple of minutes of tweaking. The code is available at this Github repository and pasted below.

Some notes that may not be obvious from my comments: `model` should be the results of a regression. The plot has pretty wide margins, but they may not be wide enough depending on the length of your coefficient labels. Line width and point size are hard-coded but very easy to change if you desire. The code is designed to produce .pdf output, and the `outpath` argument should end in ‘.pdf’.

```r
# The purpose of this script is to define an R function that makes pretty
# forest plots based on regression results.
# Function arguments:
#   1. model = model results
#   2. coefficients = character vector of coefficients to keep. Easiest to
#      pass as names(model$coefficients)
#   3. coef_labels = labels for y-axis on plot
#   4. se = user can pass custom standard errors
#   5. zscore = how user defines width of confidence intervals
#   6. outpath = name, including filepath, where the output will be saved.
#      Must end in .pdf
#   7. cex = magnification for text
# NB: This plotting works best when the coefficients have similar values.

forestPlot <- function(model=NULL, coefficients=NULL, coef_labels=NULL,
                       se=NULL, zscore=1.96, outpath, cex){
  coef <- model$coefficients                  # Get coefficients from model
  if(is.null(coefficients)){                  # Generate coefficient names if none given
    coefficients <- names(model$coefficients)
  }
  coef <- coef[coefficients]                  # Keep user-specified coefficients
  coef <- as.numeric(coef)
  if(is.null(se)){                            # If no standard errors given, take from model
    stdev <- summary(model)$coefficients[,2]  # Get coefficient standard errors
    stdev <- stdev[coefficients]
    stdev <- as.numeric(stdev)
  } else {                                    # If standard errors given, use those
    stdev <- se
    stdev <- stdev[coefficients]
    stdev <- as.numeric(stdev)
  }
  lower <- coef - zscore*stdev                # Lower value of confidence interval
  upper <- coef + zscore*stdev                # Upper value of confidence interval
  vars <- length(coef)                        # Will be used to create y values
  if(is.null(coef_labels)){
    coef_labels <- coefficients
  }
  label_pos <- min(lower)*-1.05               # Define x minimum to plot y-axis labels

  # Plot all results
  pdf(outpath)
  par(mar=c(5,10,3,1))                        # Need more space on left side of plot for variable labels
  plot(x=coef, y=vars:1, xlim=c(min(lower)*1.1, max(upper)*1.5), pch=20,
       xlab='Coefficient', bty='n', ylab='', yaxt='n', xaxt='n', cex.lab=cex)
  for(i in 1:vars){
    lines(x=c(lower[i], upper[i]), y=rep(vars+1-i, each=2), lwd=2)
  }
  axis(1, cex.axis=cex)
  axis(2, at=vars:1, labels=coef_labels, las=1, lwd=0, pos=-label_pos,
       outer=TRUE, cex.axis=cex)              # pos is x-axis location of labels
  abline(v=0, lty=2)
  dev.off()
}
```
