This post is a sequel to my wildly popular – ok, not wildly popular, but very useful for me – post documenting my function to create a forest plot. Recently, I made models using cross-validation via R's caret package. The `train()` function does the heavy lifting, but the problem is that its output is an object of the `train` class, which I presume was created for `caret`. This is a problem because my function assumes it is receiving a generic model, so it fails.
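To see where the coefficients live on a `train` object, here is a minimal sketch (assuming `caret` is installed; the formula and `mtcars` data are just for illustration):

```r
library(caret)

# Fit a linear model with 5-fold cross-validation; train() returns an
# object of class "train", not a plain lm fit.
fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
             trControl = trainControl(method = "cv", number = 5))

class(fit)                   # includes "train"
fit$finalModel$coefficients  # coefficients sit on the underlying lm fit
```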

The new function, pasted at the end of this post, makes three modifications. The main modification is to take the coefficients from `model$finalModel$coefficients`. You could of course pass those directly to the `model` argument in `forestPlot`, but that would require more typing, and I am lazy. The second update, which is not because of `train`, is to add `na.rm=TRUE` to `min()` and `max()`. Those arguments should have been in the original function; oh well. The third change is simply to change the x-axis label, which is hard-coded. That should probably be an argument to pass to the function; oh well.
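The difference `na.rm=TRUE` makes is easy to see in isolation: without it, a single missing standard error propagates an `NA` into the axis limits.

```r
x <- c(1.2, NA, -0.5)

min(x)               # NA -- one missing value poisons the result
min(x, na.rm = TRUE) # -0.5 -- missing values are dropped first
```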

In response to demand from Alex Hughes, I have included an example of the output. Initial results suggest that violence, small groups, children, and seeing more individuals protest correlate with larger protests in the future, while the presence of police and large groups correlates with smaller protests.

```r
# This function has been updated to handle results of the train class, from the
# caret package. The purpose of this script is to define an R function that
# makes pretty forest plots based on regression results.
#
# Function arguments:
#   1. model = model results from caret
#   2. coefficients = character vector of coefficients to keep. Easiest to
#      pass as names(model$finalModel$coefficients)
#   3. coef_labels = labels for y-axis on plot
#   4. se = user can pass custom standard errors
#   5. zscore = how user defines width of confidence intervals
#   6. outpath = name, including filepath, where the output will be saved.
#      Must end in .pdf
#   7. cex = magnification for text
#
# NB: This plotting works best when the coefficients have similar values,
# i.e. consider standardizing your variables.

forestPlot <- function(model=NULL, coefficients=NULL, coef_labels=NULL,
                       se=NULL, zscore=1.96, outpath, cex){
  coef <- model$finalModel$coefficients  # Get coefficients from model
  if(is.null(coefficients) == TRUE){     # Generate coefficient names
    coefficients <- names(model$finalModel$coefficients)
  }
  coef <- coef[coefficients]  # Keep user-specified coefficients
  coef <- as.numeric(coef)
  if(is.null(se) == TRUE){  # If no standard errors given, take from model
    stdev <- summary(model)$coefficients[,2]  # Get coefficient standard deviations
    stdev <- stdev[coefficients]
    stdev <- as.numeric(stdev)
  }
  if(is.null(se) == FALSE){  # If standard errors given, use those
    stdev <- se
    stdev <- stdev[coefficients]
    stdev <- as.numeric(stdev)
  }
  lower <- coef - zscore*stdev  # Lower value of confidence interval
  upper <- coef + zscore*stdev  # Upper value of confidence interval
  vars <- length(coef)          # Will be used to create y values
  if(is.null(coef_labels) == TRUE){
    coef_labels <- coefficients
  }
  # Define x minimum to plot y-axis labels
  label_pos <- min(lower, na.rm=TRUE)*-1.05

  # Plot all results
  pdf(outpath)
  par(mar=c(7,10,3,1))  # Need more space on left side of plot for var. labels
  plot(x=coef, y=vars:1,
       xlim=c(min(lower, na.rm=TRUE)*1.1, max(upper, na.rm=TRUE)*1.5),
       pch=20, xlab='Standardized Coefficient', bty='n', ylab='',
       yaxt='n', xaxt='n', cex.lab=cex)
  for(i in 1:vars){
    lines(x=c(lower[i], upper[i]), y=rep(vars+1-i, each=2), lwd=2)
  }
  axis(1, cex.axis=cex)
  axis(2, at=vars:1, labels=coef_labels, las=1, lwd=0,
       pos=-label_pos, outer=TRUE, cex.axis=cex)  # pos is x-axis location of labels
  abline(v=0, lty=2)
  dev.off()
}
```
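For completeness, here is a hypothetical call to the updated function. The data frame `protests`, its column names, and the output path are all made up for illustration; only the argument names match the function above.

```r
library(caret)

# Cross-validated linear model on hypothetical protest data
fit <- train(protest_size ~ violence + police + children, data = protests,
             method = "lm",
             trControl = trainControl(method = "cv", number = 10))

# Plot the selected coefficients with readable labels
forestPlot(model        = fit,
           coefficients = c("violence", "police", "children"),
           coef_labels  = c("Violence", "Police", "Children"),
           outpath      = "forest_plot.pdf",
           cex          = 1.2)
```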