Contents

Why this step matters
What the script does in one sentence
The algorithm, step by step
Default parameters and why they are what they are
When and why we fall back to defaults
What the output file looks like
Takeaway
Sources

Step 12 — Alpha Diversity

Script: scripts/mbx_alpha_diversity_run.sh

Companion files in this folder:

- 12_alpha_diversity.html: same content with copy buttons.
- 12_alpha_diversity.pptx: slide deck for the talk.

Why this step matters

Alpha diversity answers the simplest question we ask of a microbiome sample: how diverse is this single community? Two intuitions go into the answer:

Richness — how many different things are there?
Evenness — are they all roughly equal, or does one species dominate?

Step 12 computes five alpha-diversity metrics, each weighting richness, evenness, or phylogeny differently, so a reviewer can see whether a finding is robust to the choice of metric. It then asks the next question: are the alpha-diversity differences between treatment groups statistically significant? It produces the boxplots and tests that answer it.

What the script does in one sentence

It rarefies the feature table to the depth Step 11 chose, computes the five canonical alpha-diversity vectors (Observed Features, Shannon, Simpson, Pielou, Faith PD), merges them into one tidy alpha_diversity.xlsx with the metadata, then runs Kruskal-Wallis + pairwise Dunn + compact letter displays per metric per categorical variable and produces the boxplots.

The algorithm, step by step

1. Gate on Step 11's verdict

First the script reads 11_pre_diversity/mbx_pre_diversity_info.txt and refuses to run unless READY_FOR_DIVERSITY=yes (or the user passes --force). This is the gate; if Step 11 said the depth choice was unreliable, Step 12 honours that.
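The gate logic can be sketched in a few lines of Python. This is an illustration only, not the script's code: it assumes the info file holds plain KEY=VALUE lines, and the helper names `parse_info` and `may_run` are hypothetical.

```python
def parse_info(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and '#' comments."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        info[key.strip()] = value.strip()
    return info


def may_run(info, force=False):
    """Return True when Step 11 approved the run or the user forced it."""
    return force or info.get("READY_FOR_DIVERSITY", "") == "yes"


example = "READY_FOR_DIVERSITY=no\n"  # assumed file layout
print(may_run(parse_info(example)))              # gate closed
print(may_run(parse_info(example), force=True))  # explicit override
```

A missing key is treated the same as `no`, so an absent or truncated info file closes the gate rather than opening it.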

2. Rarefy the feature table

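Then it runs qiime feature-table rarefy at the depth Step 11 recommended, producing a rarefied_table.qza. Every diversity metric operates on this rarefied table, in which each retained sample has the same number of reads, so differences in sequencing depth do not masquerade as differences in diversity. Samples with fewer reads than the chosen depth are dropped by the rarefaction.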
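To make the idea concrete, here is a minimal Python sketch of rarefying a single sample by drawing reads without replacement. It is not how QIIME 2 implements the step; the function name `rarefy` and the toy counts are assumptions for illustration.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample one sample's counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, mirroring
    the fact that shallow samples are dropped.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand to one entry per read, then draw `depth` reads at random.
    reads = [feature for feature, n in sorted(counts.items()) for _ in range(n)]
    drawn = random.Random(seed).sample(reads, depth)
    rarefied = {}
    for feature in drawn:
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


sample = {"ASV1": 60, "ASV2": 30, "ASV3": 10}  # toy counts, 100 reads
subsampled = rarefy(sample, depth=50, seed=42)
print(sum(subsampled.values()))   # always equals the requested depth
print(rarefy(sample, depth=200))  # too shallow, so the sample is dropped
```

Fixing the seed makes the draw repeatable, which is the role the pipeline's seed setting plays for reproducibility.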

3. Compute five alpha metrics

Now the script runs QIIME 2's q2-diversity plugin five times, once per metric:

Observed Features — just the count of distinct ASVs per sample (pure richness, no evenness contribution).
Shannon index — H = − Σ p_i log p_i. Combines richness and evenness; sensitive to rare taxa.
Simpson diversity — 1 − Σ p_i². Combines richness and evenness; weights common taxa more than Shannon does.
Pielou's evenness — J = H / log(S), where S is observed richness. Pure evenness — how close to the maximum entropy is the community?
Faith's phylogenetic diversity — the total branch length of the sub-tree spanned by the ASVs present in the sample. The phylogenetically-aware richness measure.

Five values per sample, exported to TSV.
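The formulas above can be checked by hand with a short Python sketch. It is a teaching aid, not the QIIME 2 implementation: the natural logarithm is an assumption (H scales with the log base, while J does not as long as both logs share a base), and the tree encoding as a `parent` map is a hypothetical convenience.

```python
import math


def alpha_metrics(counts):
    """Compute the four non-phylogenetic metrics from one sample's counts."""
    present = [n for n in counts if n > 0]
    total = sum(present)
    props = [n / total for n in present]
    richness = len(present)
    shannon = -sum(p * math.log(p) for p in props)
    simpson = 1.0 - sum(p * p for p in props)
    # J is undefined when only one feature is present (log(1) = 0).
    pielou = shannon / math.log(richness) if richness > 1 else float("nan")
    return {
        "observed_features": richness,
        "shannon": shannon,
        "simpson": simpson,
        "pielou": pielou,
    }


def faith_pd(present_tips, parent):
    """Sum branch lengths on the paths from the present tips to the root.

    `parent` maps node -> (parent_node, branch_length); the root has no entry.
    """
    edges = {}
    for tip in present_tips:
        node = tip
        while node in parent:
            up, length = parent[node]
            edges[node] = length  # shared branches are counted once
            node = up
    return sum(edges.values())


# Toy rooted tree: ((A:1.0, B:1.0):0.5, C:2.0)
tree = {"A": ("n1", 1.0), "B": ("n1", 1.0), "n1": ("root", 0.5), "C": ("root", 2.0)}

print(alpha_metrics([25, 25, 25, 25]))  # perfectly even community, so J is 1
print(faith_pd(["A", "B"], tree))       # 1.0 + 1.0 + 0.5
```

In the even four-feature community, Shannon equals log(4) and Pielou reaches its maximum of 1; skewing the counts lowers Shannon, Simpson, and Pielou while Observed Features stays at 4.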

4. Build the consolidated XLSX

Then the script joins the five TSV files with the metadata in R and writes one tidy table:

sample-id	Treatment	ASVs_or_Features	Shannon_Index	Simpson_Diversity	Faith_PD	Pielou_Evenness
SampleA	High	412	5.23	0.91	28.4	0.866
SampleB	Low	367	4.92	0.88	25.1	0.834

That XLSX is the single source of truth that the rest of Step 12 — and the final report — reads.
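The join itself is done in R by the script; the same idea is sketched below in Python for illustration. The function name `build_rows` and the choice to keep only samples present in every metric are assumptions, and the values are the example rows shown above.

```python
def build_rows(metadata, metrics):
    """Join per-metric {sample: value} maps onto metadata rows by sample id."""
    rows = []
    for sample_id, meta in metadata.items():
        # Keep only samples that have a value for every metric
        # (assumed policy: rarefaction may have dropped some samples).
        if not all(sample_id in values for values in metrics.values()):
            continue
        row = {"sample-id": sample_id, **meta}
        for name, values in metrics.items():
            row[name] = values[sample_id]
        rows.append(row)
    return rows


metadata = {"SampleA": {"Treatment": "High"}, "SampleB": {"Treatment": "Low"}}
metrics = {
    "ASVs_or_Features": {"SampleA": 412, "SampleB": 367},
    "Shannon_Index": {"SampleA": 5.23, "SampleB": 4.92},
    "Simpson_Diversity": {"SampleA": 0.91, "SampleB": 0.88},
    "Faith_PD": {"SampleA": 28.4, "SampleB": 25.1},
    "Pielou_Evenness": {"SampleA": 0.866, "SampleB": 0.834},
}

for row in build_rows(metadata, metrics):
    print("\t".join(str(v) for v in row.values()))
```

One row per sample with one column per metric is what lets the later statistics loop over (metric × variable) pairs without reshaping the table.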

5. Statistics per (metric × variable)

For every categorical variable discovered in the metadata (same auto-detection as Step 9/10), the script runs:

Kruskal-Wallis (3+ groups) or Wilcoxon rank-sum (2 groups) on each metric.
Benjamini-Hochberg correction across the five metrics within each variable. Each variable's five p-values are adjusted together as one family, so the correction is per-variable, not pipeline-wide.
Dunn's pairwise post-hoc test on each significant Kruskal-Wallis result (a "KW hit").
Compact letter display per metric per variable, written to CLD_Summary_<metric>_by_<variable>.xlsx.
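Two pieces of the list above are small enough to sketch in Python: choosing the omnibus test from the number of groups, and the Benjamini-Hochberg adjustment across one variable's five p-values. The function names are hypothetical and the p-values are made up for illustration; the script itself runs these statistics in R.

```python
def choose_test(n_groups):
    """Pick the omnibus test from the number of groups in a variable."""
    if n_groups < 2:
        raise ValueError("need at least two groups to compare")
    return "wilcoxon_rank_sum" if n_groups == 2 else "kruskal_wallis"


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for the five metrics of one variable.
raw = {
    "Observed Features": 0.01,
    "Shannon": 0.04,
    "Simpson": 0.03,
    "Pielou": 0.20,
    "Faith PD": 0.005,
}
print(choose_test(3))
for metric, q in zip(raw, bh_adjust(list(raw.values()))):
    print(f"{metric}: adjusted p = {q:.3f}")
```

Because the family is the five metrics of a single variable, adding more metadata variables does not change any individual adjusted p-value.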

6. Boxplots

Then it draws one boxplot per (metric × variable) and a multi-panel boxplot showing all five metrics next to each other for each variable. The compact letter display from step 5 is annotated on top of each box so a reviewer can read significance directly off the figure.

Each figure is written as PNG and SVG; a PDF is added when --publication-figures is passed.

7. Write `mbx_alpha_diversity_info.txt`

Finally the cross-step info file records the rarefaction depth used, every XLSX produced, the categorical variables actually analysed, and STATUS=COMPLETE.

Default parameters and why they are what they are

Default	Value	Why this default
Rarefaction depth	from Step 11	Step 11 is the single source of truth — never overridden here.
Five metrics	Observed Features, Shannon, Simpson, Pielou, Faith PD	The five most commonly reported in microbiome papers, each capturing richness/evenness/phylogeny differently.
Test (2 groups)	Wilcoxon rank-sum	Non-parametric — alpha-diversity distributions are rarely normal.
Test (3+ groups)	Kruskal-Wallis	Same reasoning.
Post-hoc on KW hits	Dunn's test	The standard non-parametric post-hoc.
Multiple-testing correction	Benjamini-Hochberg across the 5 metrics within each variable	Controls the false discovery rate without the power loss of Bonferroni; field-standard.
CLD algorithm	`multcompView::multcompLetters`	Standard R implementation.
Plot formats	PNG + SVG always; PDF on `--publication-figures`	Publication-ready by default.
Threads	`MBX_THREADS`	Single source of truth.
Seed	`MBX_SEED`	Single source of truth for reproducibility.

When and why we fall back to defaults

Fallback	When it triggers	Why this fallback exists
Refuse to run	`READY_FOR_DIVERSITY=no` from Step 11	Step 11's verdict gates us; the override is `--force` and is recorded.
Skip a metric	A required artifact (e.g. tree) is missing	Faith PD needs the rooted tree; if the tree wasn't produced, only the four non-phylogenetic metrics are computed.
Skip a comparison	A categorical variable has singleton groups after NA filter	Same logic as Step 10; the skip is logged and the run continues.
Re-use rarefied table	Existing `rarefied_table.qza` at the same depth	Saves a re-rarefaction on re-runs.
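The "skip a comparison" fallback can be sketched as a small Python predicate. This is one plausible reading, not the script's code: it assumes that a variable is skipped when any group is a singleton or fewer than two groups remain, and the NA spellings and the name `comparable` are hypothetical.

```python
from collections import Counter

NA_TOKENS = {"", "NA", "NaN", "nan"}  # assumed spellings of missing values


def comparable(values, min_per_group=2):
    """True when, after dropping NA, 2+ groups remain and none is a singleton."""
    sizes = Counter(v for v in values if v is not None and v not in NA_TOKENS)
    return len(sizes) >= 2 and all(n >= min_per_group for n in sizes.values())


print(comparable(["High", "High", "Low", "Low"]))  # testable
print(comparable(["High", "High", "Low", "NA"]))   # "Low" is a singleton
```

Checking after the NA filter matters because a group that looks large enough in the raw metadata can shrink to a single sample once missing values are removed.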

What the output file looks like

alpha_diversity.xlsx (the consolidated table):

sample-id	Treatment	ASVs_or_Features	Shannon_Index	Simpson_Diversity	Faith_PD	Pielou_Evenness
SampleA	High	412	5.23	0.91	28.4	0.866
SampleB	Low	367	4.92	0.88	25.1	0.834
...

Plus per-variable subfolders containing the KW/Dunn/CLD XLSX files and the boxplot PNG/SVG/PDF, plus the mbx_alpha_diversity_info.txt contract for the final report.

Takeaway

Step 12 answers the simplest microbiome question (how diverse is each sample?) with five metrics so a reader can check that the answer does not depend on the metric chosen, then applies the same Kruskal-Wallis + Dunn + BH + CLD stack as Step 10 to test group differences. The boxplots are annotated with the CLD so significance is readable directly off the figure.

Sources

The script: mbXPro/scripts/mbx_alpha_diversity_run.sh
Shannon index: Shannon (1948), A mathematical theory of communication, BSTJ 27:379–423.
Simpson diversity: Simpson (1949), Measurement of diversity, Nature 163:688.
Pielou's evenness: Pielou (1966), The measurement of diversity in different types of biological collections, J Theor Biol 13:131–144.
Faith PD: Faith (1992), Conservation evaluation and phylogenetic diversity, Biol Conserv 61:1–10.
QIIME2 q2-diversity plugin: https://docs.qiime2.org/2025.4/plugins/available/diversity/