Solved: Recreate Minitab Normal Probability Plot in R
I'm taking a stats class where everything is done in Minitab, which is Teh Suck because Minitab only runs on Windows and I No Haz Windows. I've been successfully using R as a substitute, which has turned out to be a non-issue for the most part. However...
Minitab produces a very specific type of probability plot, the "Normal Probability Plot", which has no near-equivalent in R. Thus I embarked on a journey to try to recreate the damn thing. There are lots of threads all over the intertoobs that solve parts of the problem (which turns out to be fairly involved), but I couldn't find anything which puts all the pieces together into a single, ready-made solution. What follows is a pretty close recreation; I just had to move on to the rest of my life before I could get the legend right.
Presented in the hopes that it'll save other people the effort:
minitab_normal_prob_plot <- function(data, x_label) { # The labels for the y-axis, corresponding to percentiles y_axis_labels = c(1,5,10,20,30,40,50,60,70,80,90,95,99) # Lengths, mean, and sd of data n = length(data) my_mean = mean(data) my_sd = sd(data) ### Set up the y-axis values # Translate labels to decimal percentages percentages = y_axis_labels / 100 # Convert percentages to z-values and shift so that all values are >= 0 y_axis_points = qnorm(percentages) y_shift = y_axis_points[1] y_axis_points = y_axis_points - y_shift # The minimum and maximum y values y_min = y_axis_points[1] y_max = y_axis_points[length(y_axis_points)] ### Calculate the main data set # x and y values per http://en.wikipedia.org/wiki/Normal_probability_plot. x_data_points = sort(data) data_percents = c() for(i in 1:n) { if (i == 1) { data_percents[i] = 1 - 0.5^(1/n) } else if (i == n) { data_percents[i] = 0.5^(1/n) } else { data_percents[i] = (i - 0.3175)/(n+0.365) } } ### Trend line calculation # Project a line represented expected distribution values on assumption # that data is normal. trend_x0 = qnorm(percentages[1], mean = my_mean, sd = my_sd) trend_x1 = qnorm( percentages[length(percentages)], mean = my_mean, sd = my_sd ) # Convert percents to z-values and shift as before y_data_points = qnorm(data_percents) - y_shift ### Set up the envelope # Stolen from # http://stackoverflow.com/questions/3929611/recreate-minitab-normal-probability-plot library(MASS) qprobs<-qnorm(percentages) fd<-fitdistr(data, "normal") #Maximum-likelihood Fitting of Univariate Dist from MASS xp_hat<-fd$estimate[1]+qprobs*fd$estimate[2] #estimated perc. for the fitted normal v_xp_hat<- fd$sd[1]^2+qprobs^2*fd$sd[2]^2+2*qprobs*fd$vcov[1,2] #var. of estimated perc xpl<-xp_hat + qnorm(0.025)*sqrt(v_xp_hat) #lower bound xpu<-xp_hat + qnorm(0.975)*sqrt(v_xp_hat) #upper bound ### Set up the x-axis x_min = min(c(data, trend_x0, trend_x1, xpl, xpu)) x_max = max(c(data, trend_x0, trend_x1, xpl, xpu)) ### Plot it all # Data set. Points plotted twice due to keep them from getting clobbered by # white rectangle. par(bg = "beige") plot( x_data_points, y_data_points, xlim = c(x_min, x_max), ylim = c(y_min, y_max), axes = FALSE, ylab = "Percent", xlab = x_label, pch = 16, col = "red", main = paste("Probability Plot of", x_label,"\nNormal - 95% CI") ) rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "white") points(x_data_points, y_data_points, pch = 16, col = "red") # Trend line segments(trend_x0, y_min, trend_x1, y_max, col = "blue") # Lower and upper bounds lines(xpl, y_axis_points, col = "blue") lines(xpu, y_axis_points, col = "blue") # Y-axis gridlines for (i in 1:length(y_axis_points)) { abline(h = y_axis_points[i], col = "gray", lty = 2) } # Axes axis(1) axis(2, at = y_axis_points, labels = y_axis_labels) # Box and x-grid box() grid(ny = NA, lty = 2) # Legend legend( "topright", c( paste("Mean", my_mean, sep = " "), paste("StDev", my_sd, sep = " ") ), bg = "white" ) }