Chapter 14 — Case Study 8 — Monte Carlo Percentile-Based Capability for Non-Normal Data

In GMP environments, capability indices (Cp, Cpk) are widely used but rarely questioned.
They assume a normal, symmetric distribution of data and rely entirely on the pair (mean, standard deviation). In a normal distribution, in fact, these two parameters are sufficient to describe the entire curve.

Real process data, however, are often skewed, and many organisations still calculate Cp/Cpk as if the data were normal.

This Case Study shows how to replace σ-based indices with percentile-based capability metrics derived from either a parametric Monte Carlo simulation or a non-parametric bootstrap of the observed data.

This approach is fully aligned with USP <1210>, <1220>, ICH Q14, and the GxP focus on data realism.

🧩 1. Classical Cp/Cpk and why they fail

(a) Classical definitions
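
For reference, the classical indices computed in the R code later in this chapter can be written as follows. The symbols are chosen here for illustration: \bar{x} is the sample mean and s the sample standard deviation.

```latex
C_p = \frac{USL - LSL}{6\,s}
\qquad
C_{pk} = \min\!\left( \frac{USL - \bar{x}}{3\,s},\; \frac{\bar{x} - LSL}{3\,s} \right)
```

Both indices depend only on \bar{x} and s, which is why they describe the process well only when the data are close to normal.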

(b) When the assumptions break

Intuitive example (no formulas):
Consider a lognormal process with a mean around 100 and a CV of 10%. The histogram may look tight and well-controlled, with the vast majority of values concentrated close to the median, but the distribution is skewed: the right tail reaches further from the centre than the left. The classical Cpk formula places symmetric ±3σ limits around the mean, so it understates how far the upper tail actually extends. The index can therefore look acceptable even though the real upper percentile lies closer to the USL, or beyond it, than the formula implies.
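
To make the asymmetry concrete, the following minimal sketch compares mean ± 3σ with the exact 0.135th and 99.865th percentiles of a lognormal distribution. It assumes the distribution is parameterised to have mean 100 and CV 10%; the variable names are illustrative only.

```r
# Assumed parameterisation: lognormal with mean 100 and CV 10%
m  <- 100
cv <- 0.10

# Convert mean/CV to the log-scale parameters of the lognormal
sdlog   <- sqrt(log(1 + cv^2))
meanlog <- log(m) - sdlog^2 / 2

# Standard deviation on the original scale
s <- m * cv

# Symmetric limits implied by the classical formula
limits_3s <- c(lower = m - 3 * s, upper = m + 3 * s)

# Exact percentiles with the same nominal coverage (99.73%)
limits_pct <- qlnorm(c(0.00135, 0.99865), meanlog = meanlog, sdlog = sdlog)
names(limits_pct) <- c("lower", "upper")

round(rbind(mean_pm_3sd = limits_3s, percentiles = limits_pct), 1)
```

Under these assumptions the upper percentile lies above mean + 3σ (around 134 versus 130), while the lower percentile lies above mean − 3σ (around 74 versus 70): the symmetric limits are too generous on the left and too tight on the right.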

🧠 2. Percentile-based capability: the intuitive idea

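The idea is to replace the mean with the median and the ±3σ limits with the 0.135th and 99.865th percentiles.
These two percentiles correspond to the same coverage as ±3σ in a normal distribution (99.73%),
but here they are taken directly from the observed data or from a simulated distribution.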

(a) Upper percentile-based capability
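
Written out, the upper index used in the R code later in this chapter takes the following form. The notation P_q for the q-th percentile is an assumption made here for readability.

```latex
C_{pu}^{(p)} = \frac{USL - P_{50}}{P_{99.865} - P_{50}}
```

The denominator is the actual distance from the median to the upper tail, replacing the 3σ of the classical formula.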

(b) Lower percentile-based capability
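
The lower index mirrors the upper one, and the two-sided index is the smaller of the two, as in the R code later in this chapter. As before, P_q denotes the q-th percentile, a notation assumed here for readability.

```latex
C_{pl}^{(p)} = \frac{P_{50} - LSL}{P_{50} - P_{0.135}}
\qquad
C_{pk}^{(p)} = \min\!\left( \frac{USL - P_{50}}{P_{99.865} - P_{50}},\; \frac{P_{50} - LSL}{P_{50} - P_{0.135}} \right)
```

For a symmetric, normal process these expressions reduce to the classical ones, because the median equals the mean and each tail distance equals 3σ.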

🔍 3. Monte Carlo and bootstrap: how they enter

Two approaches are possible:

(A) Parametric Monte Carlo

Assume a model (e.g., lognormal)
→ estimate its parameters
→ simulate 100k values
→ compute percentiles from the simulated distribution.
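
The four steps above can be sketched in R as follows. This is a minimal illustration, not the chapter's own implementation: the "observed" data are themselves simulated as a stand-in for real measurements, the lognormal model is assumed rather than tested, and all object names are hypothetical.

```r
set.seed(42)

# Hypothetical observed data; in practice these would be real measurements
obs <- rlnorm(200, meanlog = log(100), sdlog = 0.1)

USL <- 130
LSL <- 70

# Steps 1-2: assume a lognormal model and estimate its parameters
# from the sample mean and sd of the log-transformed data
meanlog_hat <- mean(log(obs))
sdlog_hat   <- sd(log(obs))

# Step 3: simulate 100k values from the fitted model
sim <- rlnorm(1e5, meanlog = meanlog_hat, sdlog = sdlog_hat)

# Step 4: percentiles of the simulated distribution
p <- quantile(sim, probs = c(0.00135, 0.5, 0.99865), names = FALSE)

# Percentile-based capability from the simulated percentiles
Cpu_mc <- (USL - p[2]) / (p[3] - p[2])
Cpl_mc <- (p[2] - LSL) / (p[2] - p[1])
Cpk_mc <- min(Cpu_mc, Cpl_mc)

round(c(Cpu = Cpu_mc, Cpl = Cpl_mc, Cpk = Cpk_mc), 3)
```

The large simulated sample stabilises the extreme percentiles, but the result is only as good as the assumed model: if the lognormal fit is poor, the simulated tails will be wrong as well.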

(B) Non-parametric Bootstrap

Use the actual data
→ resample with replacement
→ compute percentiles for each bootstrap dataset
→ obtain confidence intervals for the percentile-based Cpk.

🧪 4. Practical Case – A lognormal process (R)

🔧 R setup and simulation

# ==========================================================
# Case study 8 - Lognormal process & capability indices
# ==========================================================

set.seed(123)              # to make the results reproducible

# --- Process & specs ---------------------------------------

# Process parameters and specification limits
n      <- 2000             # number of simulated observations
mu_log <- log(100)         # median = 100 (lognormal position parameter)
sd_log <- 0.1              # log sd: here ≈ CV 10%
USL    <- 130              # specification limits
LSL    <- 70

# Simulation of a lognormal process (skewed but realistic)
x <- rlnorm(n, meanlog = mu_log, sdlog = sd_log)

# --- Classical Cp and Cpk (assuming normality) ------------
Cp_classic <- (USL - LSL) / (6 * sd(x))
Cpk_classic <- min((USL - mean(x)) / (3 * sd(x)),
                   (mean(x) - LSL) / (3 * sd(x)))

# --- Empirical percentiles corresponding to ±3σ ------------
q <- quantile(x, probs = c(0.00135, 0.5, 0.99865))
q_lower <- q[1]; q_med <- q[2]; q_upper <- q[3]

# --- Percentile-based capability ---------------------------
Cpu_p <- (USL - q_med) / (q_upper - q_med)
Cpl_p <- (q_med - LSL) / (q_med - q_lower)
Cpk_p <- min(Cpu_p, Cpl_p)

# --- Bootstrap for CI of Cpk_percentile --------------------
B <- 2000
Cpk_boot <- numeric(B)

for(i in 1:B){
  xb <- sample(x, replace = TRUE)
  qb <- quantile(xb, probs = c(0.00135, 0.5, 0.99865))
  Cpu_b <- (USL - qb[2]) / (qb[3] - qb[2])
  Cpl_b <- (qb[2] - LSL) / (qb[2] - qb[1])
  Cpk_boot[i] <- min(Cpu_b, Cpl_b)
}

Cpk_ci <- quantile(Cpk_boot, probs = c(0.025, 0.975))

# --- Summary results -----------------------------------------
list(
  Cp_classic = Cp_classic,
  Cpk_classic = Cpk_classic,
  Cpk_percentile = Cpk_p,
  Cpk_percentile_CI = Cpk_ci
)

📊 Visual comparison: classical vs percentile-based capability

library(ggplot2)

specs <- data.frame(
  limit = c(LSL, USL),
  type  = c("LSL", "USL")
)

# Histogram
p1 <- ggplot(data.frame(x), aes(x)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "white") +
  geom_vline(data = specs,
             aes(xintercept = limit, color = type),
             linetype = "dashed", linewidth = 1.0) +
  scale_color_manual(values = c("LSL" = "red", "USL" = "darkred")) +
  labs(title = "Lognormal process: histogram",
       x = "Measurement", y = "Frequency",
       color = "Specification limits") +
  theme_minimal(base_size = 13)

# QQ-plot
p2 <- ggplot(data.frame(x), aes(sample = x)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "QQ-plot: highly non-normal",
       x = "Theoretical quantiles", y = "Sample quantiles") +
  theme_minimal(base_size = 13)

print(p1)
print(p2)


Figure 14.1 – Histogram of the simulated lognormal process. The distribution is clearly right-skewed. The dashed red vertical lines mark the lower (LSL = 70) and upper (USL = 130) specification limits. Although the bulk of the distribution appears well-centered, the asymmetric tail plays a dominant role in capability assessment.

Figure 14.2 – Normal QQ-plot of the simulated process. The central portion of the data lies close to the normal reference line, but both tails deviate from it, which is the signature of right-skewness. This tail behaviour is what invalidates classical Cp/Cpk assumptions.

📈 Summary table — classical vs percentile-based capability

Cp ≈ 0.99: the overall spread (USL–LSL) relative to 6·σ suggests a marginally capable process.

Cpk ≈ 0.96: the classical index, based on the mean and σ, indicates borderline capability.

Percentile-based Cpk ≈ 0.79: once the real data tails are used instead of ±3σ, capability decreases notably.

95% CI (0.76–0.97): the bootstrap interval for the percentile-based Cpk is wide, and the classical Cpk (0.96) lies near its upper bound.

These values illustrate how classical Cpk can be overly optimistic when data are skewed.

Metric	Value (this example)
Classical Cp	0.988
Classical Cpk	0.962
Percentile-based Cpk	0.794
Percentile-based Cpk (95% CI, bootstrap)	0.758 – 0.974
