Sunday, March 03, 2013

Solved: Recreate Minitab Normal Probability Plot in R

I'm taking a stats class where everything is done in Minitab, which is Teh Suck because Minitab only runs on Windows and I No Haz Windows. I've been successfully using R as a substitute, which has turned out to be a non-issue for the most part. However...

Minitab produces a very specific type of probability plot, the "Normal Probability Plot", which has no near-equivalent in R. Thus I embarked on a journey to try to recreate the damn thing. There are lots of threads all over the intertoobs that solve parts of the problem (which turns out to be fairly involved), but I couldn't find anything which puts all the pieces together into a single, ready-made solution. What follows is a pretty close recreation; I just had to move on to the rest of my life before I could get the legend right.

Presented in the hopes that it'll save other people the effort:

minitab_normal_prob_plot <- function(data, x_label) {
   
    # The labels for the y-axis, corresponding to percentiles

    y_axis_labels = c(1,5,10,20,30,40,50,60,70,80,90,95,99)

    # Lengths, mean, and sd of data

    n = length(data)
    my_mean = mean(data)
    my_sd = sd(data)

    ### Set up the y-axis values

    # Translate labels to decimal percentages

    percentages = y_axis_labels / 100

    # Convert percentages to z-values and shift so that all values are >= 0

    y_axis_points = qnorm(percentages)
    y_shift = y_axis_points[1]
    y_axis_points = y_axis_points - y_shift

    # The minimum and maximum y values

    y_min = y_axis_points[1]
    y_max = y_axis_points[length(y_axis_points)]

    ### Calculate the main data set
    # x and y values per http://en.wikipedia.org/wiki/Normal_probability_plot.

    x_data_points = sort(data)

    data_percents = c()
    for(i in 1:n) {
        if (i == 1) {
            data_percents[i] = 1 - 0.5^(1/n)
        } else if (i == n) {
            data_percents[i] = 0.5^(1/n)
        } else {
            data_percents[i] = (i - 0.3175)/(n+0.365)
        }
    }

    ### Trend line calculation
    # Project a line represented expected distribution values on assumption
    # that data is normal.

    trend_x0 = qnorm(percentages[1], mean = my_mean, sd = my_sd)
    trend_x1 = qnorm(
            percentages[length(percentages)], mean = my_mean, sd = my_sd
    )

    # Convert percents to z-values and shift as before

    y_data_points = qnorm(data_percents) - y_shift

    ### Set up the envelope
    # Stolen from
    # http://stackoverflow.com/questions/3929611/recreate-minitab-normal-probability-plot

    library(MASS)
    qprobs<-qnorm(percentages)
    fd<-fitdistr(data, "normal") #Maximum-likelihood Fitting of Univariate Dist from MASS
    xp_hat<-fd$estimate[1]+qprobs*fd$estimate[2]  #estimated perc. for the fitted normal
    v_xp_hat<- fd$sd[1]^2+qprobs^2*fd$sd[2]^2+2*qprobs*fd$vcov[1,2] #var. of estimated perc
    xpl<-xp_hat + qnorm(0.025)*sqrt(v_xp_hat)  #lower bound
    xpu<-xp_hat + qnorm(0.975)*sqrt(v_xp_hat)  #upper bound

    ### Set up the x-axis

    x_min = min(c(data, trend_x0, trend_x1, xpl, xpu))
    x_max = max(c(data, trend_x0, trend_x1, xpl, xpu))

    ### Plot it all

    # Data set. Points plotted twice due to keep them from getting clobbered by
    # white rectangle.
    par(bg = "beige")
    plot(
        x_data_points, y_data_points,
        xlim = c(x_min, x_max), ylim = c(y_min, y_max),
        axes = FALSE,
        ylab = "Percent", xlab = x_label,
        pch = 16, col = "red",
        main = paste("Probability Plot of", x_label,"\nNormal - 95% CI")
    )
    rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "white")
    points(x_data_points, y_data_points, pch = 16, col = "red")

    # Trend line

    segments(trend_x0, y_min, trend_x1, y_max, col = "blue")

    # Lower and upper bounds

    lines(xpl, y_axis_points, col = "blue")
    lines(xpu, y_axis_points, col = "blue")

    # Y-axis gridlines

    for (i in 1:length(y_axis_points)) {
        abline(h = y_axis_points[i], col = "gray", lty = 2)
    }

    # Axes

    axis(1)
    axis(2, at = y_axis_points, labels = y_axis_labels)

    # Box and x-grid

    box()
    grid(ny = NA, lty = 2)

    # Legend

    legend(
        "topright",
        c(
            paste("Mean", my_mean, sep = " "),
            paste("StDev", my_sd, sep = " ")
        ),
        bg = "white"
    )
}
Blog Information Profile for gg00