Chapter 13 — Case Study 7 — Microbiological Counts: Quantifying Overdispersion via Monte Carlo Simulation

 

Before diving into model selection, here is a quick overview of the chapter structure:

🗺️ Chapter roadmap

1️⃣ Conceptual background — when and why overdispersion arises
2️⃣ Monte Carlo perspective — from single-sample to campaign-level risk
3️⃣ Practical example (R) — simulation and visualization
4️⃣ GMP interpretation and takeaways

 

Microbiological count data (CFU per plate or per sample) are discrete and typically show low means, often near zero.
For such data, two statistical models are commonly used: the Poisson and the Negative Binomial (NB or NegBin).
Although closely related, they describe fundamentally different sources of variability
and the correct choice has direct implications for interpreting GMP environmental and water monitoring results.


 

🎯 Why this matters

In routine GMP monitoring:

The distinction determines whether observed “outliers” (e.g., a few plates with unusually high CFU)
represent true process excursions or just normal statistical variability.

💡 In GMP microbiology, this is not a purely theoretical issue.
The model choice directly affects how “unexpected” results are classified —
whether they are treated as Out of Trend (OOT) or as expected variability due to environmental heterogeneity.

 


 

🧩 1. Conceptual background — why overdispersion matters

(a) Poisson model — homogeneous process

\[X \sim \mathrm{Poisson}(\lambda),\quad \mathbb{E}[X]=\mathrm{Var}[X]=\lambda\]

All sampling points share the same contamination intensity ($\lambda$).
Variance equals the mean ($\mathrm{Var}(X)=\mathbb{E}[X]=\lambda$): randomness alone explains observed variation. This model fits clean, well-controlled systems (e.g., WFI, cleanrooms).

 


(b) Negative Binomial model — heterogeneous process

\[X \sim \mathrm{NegBin}(\mu, k),\quad \mathbb{E}[X]=\mu,\quad \mathrm{Var}[X]=\mu+\frac{\mu^2}{k}\]

Parameterization note. We use the mean–dispersion parameterization ($\mu, k$).
Some software reports Negative Binomial as ($\mathrm{size}=k$, $\mathrm{prob}=p$).
In our convention: $\mathbb{E}[X]=\mu$, $\mathrm{Var}(X)=\mu+\mu^2/k$,
and the Poisson–Gamma mixture uses $\lambda \sim \mathrm{Gamma}(\mathrm{shape}=k,\ \mathrm{scale}=\mu/k)$.

Here (k) is the dispersion parameter:

The Negative Binomial can be seen as a Poisson–Gamma mixture:

\[X \mid \lambda \sim \mathrm{Poisson}(\lambda),\quad \lambda \sim \mathrm{Gamma}(\mathrm{shape}=k,\ \mathrm{scale}=\mu/k)\]

This means that the rate of contamination (λ) itself varies among samples —
reflecting real heterogeneity in space or time.

 


(c) Relationship between the two

\[\lim_{k \to \infty} \mathrm{NegBin}(\mu,k) = \mathrm{Poisson}(\mu)\]

The Poisson model is thus a special case of the Negative Binomial, valid only when the environment is uniform.

🧠 Interpretation:

 

Quick contrast — operational view | Aspect | Poisson | Negative Binomial | |:–|:–|:–| | Variability | Random (homogeneous) | Structured (heterogeneous) | | Variance–Mean | $Var=\mu$ | $Var=\mu+\mu^2/k$ | | Tail behavior | Light | Heavy (higher extreme counts) | | GMP implication | True OOL likely anomaly | OOL may reflect expected heterogeneity |

 


(d) Practical tool — Index of Dispersion (ID)

A simple way to assess whether data are consistent with the Poisson model is to compute the Index of Dispersion:

\[ID = \frac{s^2}{\bar{x}}\]

where $s^2$ is the sample variance and $\bar{x}$ the sample mean.

Interpretation Condition Typical model
ID ≈ 1 Variance ≈ Mean Poisson (homogeneous process)
ID > 1 Variance > Mean Overdispersion → Negative Binomial
ID < 1 Variance < Mean Underdispersion → Binomial or constrained process

ℹ️ Underdispersion is rare in practice.
It typically emerges from constraints or inhibitory mechanisms. In GMP Microbiology it is far less frequent than overdispersion.

🧮 In practice:


 

🔍 2. Monte Carlo perspective

Monte Carlo simulation is not used here to “fit” the model, but to explore the practical consequences of overdispersion.
Even when Poisson and Negative Binomial models share the same mean (e.g. μ = 2 CFU),
their predictions differ dramatically in the tails.

For example:

Model Mean (μ) Variance % of counts > 10 CFU
Poisson (λ = 2) 2.0 2.0 0.02 %
NegBin (μ = 2, k = 1) 2.0 6.0 1.4 %

Note. Percentages in the table refer to single-sample tail probabilities (e.g., one plate, one PW sample, or one swab). ``

Terminology — “campaign” (in this chapter)
Here, campaign means a sampling session made of multiple individual samples (e.g., 10 plates, PW samples, or swabs collected under similar conditions). This is not the manufacturing “production campaign”.

A Monte Carlo simulation over thousands of hypothetical sampling campaigns allows one to estimate:

 

🔬 Monte Carlo thus provides a bridge between statistical modeling and operational risk:
it quantifies how different assumptions on variability translate into different expected rates of apparent “excursions”.

 

🧩 Having clarified the conceptual meaning of overdispersion, we now turn to a Monte Carlo experiment that visualizes its operational consequences in realistic GMP scenarios.

 


 

🧪 3. Practical Example — Monte Carlo Simulation of Overdispersed Counts

🎯 Objective

Estimate, via Monte Carlo, how often a monitoring campaign of 10 samples would yield at least one result above 10 CFU, assuming either a Poisson(λ = 2) or a NegBin(μ = 2, k = 1) model.

💻 R Code

# ================================================================
# Case Study 7b — Monte Carlo simulation of microbiological counts
# ================================================================

suppressPackageStartupMessages(library(dplyr))
set.seed(123)

# --- Parameters -------------------------------------------------
n_samples   <- 10      # samples per campaign
n_campaigns <- 10000   # Monte Carlo simulations
limit       <- 10      # CFU threshold
lambda      <- 2       # Poisson mean
mu          <- 2       # NegBin mean
k           <- 1       # NegBin dispersion (overdispersion)

# --- Helper function: simulate campaigns ------------------------
simulate_campaigns <- function(model = c("poisson", "negbin")){
  model <- match.arg(model)
  
  counts <- switch(model,
                   poisson = matrix(rpois(n_samples * n_campaigns, lambda),
                                    ncol = n_samples, byrow = TRUE),
                   negbin  = matrix(rnbinom(n_samples * n_campaigns,
                                            size = k, mu = mu),
                                    ncol = n_samples, byrow = TRUE))
  
  data.frame(
    model     = model,
    max_count = apply(counts, 1, max),
    sum_count = rowSums(counts),
    any_OOL   = apply(counts > limit, 1, any)
  )
}

# --- Run simulations --------------------------------------------
poiss_df <- simulate_campaigns("poisson")
negb_df  <- simulate_campaigns("negbin")
sim_df   <- bind_rows(poiss_df, negb_df)

# --- Summaries --------------------------------------------------
summary_df <- sim_df %>%
  group_by(model) %>%
  summarise(
    P_any_OOL  = mean(any_OOL),
    Mean_max   = mean(max_count),
    Mean_total = mean(sum_count)
  )

print(summary_df)

Reproducibility note. All simulations set set.seed(123) so results are replicable on re-run.

 

Example output from R simulation:

Model P(max > 10) Mean max CFU Mean total CFU
Poisson 0.0001 (0.01 %) 4.39 20.0
NegBin 0.112 (11.2 %) 6.72 20.0

*Table 13.1 – Monte Carlo summary statistics over 10 000 campaigns.

Despite identical means (20 CFU per campaign), the Negative Binomial model produces far more campaigns with max > 10 CFU.*

Higher tail probability → higher chance that a multi-sample session contains at least one apparent OOL.

 

📊 Visual comparison

library(ggplot2)

ggplot(sim_df, aes(x = max_count, fill = model)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.5) +
  geom_vline(xintercept = limit, linetype = 2, color = "red") +
  labs(title = "Monte Carlo distributions of maximum CFU per campaign",
       subtitle = sprintf("%d samples per campaign, %d simulations",
                          n_samples, n_campaigns),
       x = "Maximum CFU observed", y = "Frequency") +
  theme_minimal(base_size = 13)

 

The histograms below visually highlight how overdispersion amplifies the probability of extreme counts.

 

Monte Carlo distributions of maximum CFU per campaign (Poisson vs. Negative Binomial)
Figure 13.1 – Monte Carlo distributions of maximum CFU per campaign. Each campaign consists of 10 samples, and the histogram shows the simulated distribution of the highest CFU observed across 10 000 campaigns under Poisson (blue) and Negative Binomial (pink) assumptions. The dashed red line marks the 10 CFU limit used to classify “out-of-limit” events.

 

🧮 Index of Dispersion check

The ID is computed on raw per-sample counts (all samples from all campaigns), as defined in §1(d).

# Re-simulate raw per-sample counts for ID (same parameters, same seed)
set.seed(123)
counts_poiss <- rpois(n_samples * n_campaigns, lambda)
counts_negb  <- rnbinom(n_samples * n_campaigns, size = k, mu = mu)

ID_df <- dplyr::tibble(
  Model = c("Poisson", "NegBin"),
  Mean  = c(mean(counts_poiss), mean(counts_negb)),
  Var   = c(var(counts_poiss),  var(counts_negb))
) |>
  dplyr::mutate(ID = Var / Mean)

print(ID_df)

 

Example output from R simulation:

Model Mean Variance Index of Dispersion (ID)
Poisson 2.00 2.00 0.999
NegBin 2.00 6.02 3.02

Table 13.2 – Index of Dispersion (ID) calculated on raw per-sample counts (all samples across all campaigns).
Consistent with theory: Poisson → ID ≈ 1; Negative Binomial → ID ≈ 3 (overdispersion).

Empirical ID confirms theoretical expectations: Poisson ≈ 1, NegBin ≫ 1.

 


 

🧭 GMP interpretation

 

Model Typical environment Expected OOL frequency (with §3 parameters) GMP interpretation
Poisson Highly controlled areas (Grade A/B, WFI) ≈ 1 in 10 000 campaigns OOL → likely true anomaly
NegBin Variable areas (Grade C/D, non-sterile zones) ≈ 1 in 9 campaigns OOL → possibly expected (natural variability)

Table 13.3 – Expected frequency of “out-of-limit” results under Poisson vs. Negative Binomial models
(given the Monte Carlo setup in §3). The Poisson assumption suits uniform, low-variability systems, while the Negative Binomial model accommodates heterogeneous microbial loads across space or time.

🧠 Monte Carlo bridges statistical theory and operational interpretation:
it shows that apparent “outliers” can arise naturally from overdispersed processes
and should not always trigger full investigations if supported by historical evidence.

Overdispersion is not an error, but a measurable property of the process —
it reflects heterogeneity rather than instability.

 

✅ Key Takeaways

 

💡 Key insight — Monte Carlo as a quantitative risk estimator
In microbiological monitoring, the Monte Carlo method acts as a quantitative risk estimator, translating the inherent randomness of counts into measurable probabilities of excursions. This transforms subjective notions of “rare” or “unexpected” events into data-driven, reproducible indicators of process control — a decisive step toward quantitative risk management in GMP microbiology.

 


📘 Summary

 

🧩 Next connection: The same Monte Carlo logic used here for microbiological counts extends naturally to other low-rate phenomena (e.g., excursions in stability or sporadic process anomalies).


← Previous: Case Study 6 — Monte Carlo Sampling Plans and Empirical OC Curves | ▲ Back to top | Next: Case Study 8 — Percentile-Based Capability for Non-Normal Data →