Solved: Recreate Minitab Normal Probability Plot in R
I'm taking a stats class where everything is done in Minitab, which is Teh Suck because Minitab only runs on Windows and I No Haz Windows. I've been successfully using R as a substitute, which has turned out to be a non-issue for the most part. However...
Minitab produces a very specific type of probability plot, the "Normal Probability Plot", which has no near-equivalent in R. Thus I embarked on a journey to try to recreate the damn thing. There are lots of threads all over the intertoobs that solve parts of the problem (which turns out to be fairly involved), but I couldn't find anything which puts all the pieces together into a single, ready-made solution. What follows is a pretty close recreation; I just had to move on to the rest of my life before I could get the legend right.
Presented in the hopes that it'll save other people the effort:
minitab_normal_prob_plot <- function(data, x_label) {
# The labels for the y-axis, corresponding to percentiles
y_axis_labels = c(1,5,10,20,30,40,50,60,70,80,90,95,99)
# Lengths, mean, and sd of data
n = length(data)
my_mean = mean(data)
my_sd = sd(data)
### Set up the y-axis values
# Translate labels to decimal percentages
percentages = y_axis_labels / 100
# Convert percentages to z-values and shift so that all values are >= 0
y_axis_points = qnorm(percentages)
y_shift = y_axis_points[1]
y_axis_points = y_axis_points - y_shift
# The minimum and maximum y values
y_min = y_axis_points[1]
y_max = y_axis_points[length(y_axis_points)]
### Calculate the main data set
# x and y values per http://en.wikipedia.org/wiki/Normal_probability_plot.
x_data_points = sort(data)
data_percents = c()
for(i in 1:n) {
if (i == 1) {
data_percents[i] = 1 - 0.5^(1/n)
} else if (i == n) {
data_percents[i] = 0.5^(1/n)
} else {
data_percents[i] = (i - 0.3175)/(n+0.365)
}
}
### Trend line calculation
# Project a line represented expected distribution values on assumption
# that data is normal.
trend_x0 = qnorm(percentages[1], mean = my_mean, sd = my_sd)
trend_x1 = qnorm(
percentages[length(percentages)], mean = my_mean, sd = my_sd
)
# Convert percents to z-values and shift as before
y_data_points = qnorm(data_percents) - y_shift
### Set up the envelope
# Stolen from
# http://stackoverflow.com/questions/3929611/recreate-minitab-normal-probability-plot
library(MASS)
qprobs<-qnorm(percentages)
fd<-fitdistr(data, "normal") #Maximum-likelihood Fitting of Univariate Dist from MASS
xp_hat<-fd$estimate[1]+qprobs*fd$estimate[2] #estimated perc. for the fitted normal
v_xp_hat<- fd$sd[1]^2+qprobs^2*fd$sd[2]^2+2*qprobs*fd$vcov[1,2] #var. of estimated perc
xpl<-xp_hat + qnorm(0.025)*sqrt(v_xp_hat) #lower bound
xpu<-xp_hat + qnorm(0.975)*sqrt(v_xp_hat) #upper bound
### Set up the x-axis
x_min = min(c(data, trend_x0, trend_x1, xpl, xpu))
x_max = max(c(data, trend_x0, trend_x1, xpl, xpu))
### Plot it all
# Data set. Points plotted twice due to keep them from getting clobbered by
# white rectangle.
par(bg = "beige")
plot(
x_data_points, y_data_points,
xlim = c(x_min, x_max), ylim = c(y_min, y_max),
axes = FALSE,
ylab = "Percent", xlab = x_label,
pch = 16, col = "red",
main = paste("Probability Plot of", x_label,"\nNormal - 95% CI")
)
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "white")
points(x_data_points, y_data_points, pch = 16, col = "red")
# Trend line
segments(trend_x0, y_min, trend_x1, y_max, col = "blue")
# Lower and upper bounds
lines(xpl, y_axis_points, col = "blue")
lines(xpu, y_axis_points, col = "blue")
# Y-axis gridlines
for (i in 1:length(y_axis_points)) {
abline(h = y_axis_points[i], col = "gray", lty = 2)
}
# Axes
axis(1)
axis(2, at = y_axis_points, labels = y_axis_labels)
# Box and x-grid
box()
grid(ny = NA, lty = 2)
# Legend
legend(
"topright",
c(
paste("Mean", my_mean, sep = " "),
paste("StDev", my_sd, sep = " ")
),
bg = "white"
)
}

0 Comments:
Post a Comment
<< Home