Contents
  1. Why this step matters
  2. What the script does in one sentence
  3. The algorithm, step by step
  4. Default parameters and why they are what they are
  5. When and why we fall back to defaults
  6. What the output file looks like
  7. Takeaway
  8. Sources

Step 12 — Alpha Diversity

Script: scripts/mbx_alpha_diversity_run.sh

Companion files in this folder:
  - 12_alpha_diversity.html: same content with copy buttons.
  - 12_alpha_diversity.pptx: slide deck for the talk.


Why this step matters

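Alpha diversity answers the simplest question we ask of a microbiome sample: how diverse is this single community? Two intuitions go into the answer: richness (how many distinct features are present) and evenness (how evenly the reads are spread across those features).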

Step 12 computes five alpha-diversity metrics, each weighting those two intuitions slightly differently (Faith PD also brings in phylogeny), so a reviewer can see whether a finding is robust to the choice of metric. It then asks the next question: are the alpha-diversity differences between treatment groups statistically significant? The tests and boxplots it produces answer that.


What the script does in one sentence

It rarefies the feature table to the depth Step 11 chose, computes the five canonical alpha-diversity vectors (Observed Features, Shannon, Simpson, Pielou, Faith PD), merges them into one tidy alpha_diversity.xlsx with the metadata, then runs Kruskal-Wallis + pairwise Dunn + compact letter displays per metric per categorical variable and produces the boxplots.


The algorithm, step by step

1. Gate on Step 11's verdict

First the script reads 11_pre_diversity/mbx_pre_diversity_info.txt and refuses to run unless READY_FOR_DIVERSITY=yes (or the user passes --force). This is the gate; if Step 11 said the depth choice was unreliable, Step 12 honours that.
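
The gate can be pictured with a few lines of shell. This is a minimal sketch, not the script's actual code: the function name, the KEY=value layout of the info file, and the use of a plain "force" argument to stand in for --force are assumptions made for illustration.

```shell
# Sketch of the Step 11 gate (hypothetical helper; the KEY=value layout of
# the info file is an assumption).
#   $1 = path to mbx_pre_diversity_info.txt
#   $2 = "force" to override a negative verdict (stands in for --force)
# Returns 0 if Step 12 may run, 1 otherwise.
gate_on_step11() {
    local info_file="$1" force="${2:-}" verdict=""

    if [ -f "$info_file" ]; then
        # Keep the last READY_FOR_DIVERSITY= line; strip the key and any CR.
        verdict="$(sed -n 's/^READY_FOR_DIVERSITY=//p' "$info_file" | tail -n 1 | tr -d '\r')"
    fi

    if [ "$verdict" = "yes" ]; then
        return 0
    fi
    if [ "$force" = "force" ]; then
        # The override is allowed but leaves a trace in the log.
        echo "WARNING: READY_FOR_DIVERSITY=${verdict:-missing}; continuing because of --force" >&2
        return 0
    fi
    echo "ERROR: Step 11 did not report READY_FOR_DIVERSITY=yes; refusing to run" >&2
    return 1
}

# Usage: gate_on_step11 11_pre_diversity/mbx_pre_diversity_info.txt || exit 1
```

Treating a missing file or a missing key the same as a "no" verdict keeps the gate closed by default, which matches the intent of the step.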

2. Rarefy the feature table

Then it runs qiime feature-table rarefy at the depth Step 11 recommended, producing rarefied_table.qza. Every diversity metric operates on this rarefied table, in which every retained sample has the same number of reads, so differences in sequencing depth cannot masquerade as differences in diversity.
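
A sketch of what the rarefaction call might look like. The qiime flags are those of the feature-table rarefy action; the function name, the file names, and the RUN dry-run variable are illustrative assumptions rather than the script's actual code.

```shell
# Sketch of the rarefaction call (hypothetical wrapper).
# Set RUN=echo for a dry run that only prints the command.
RUN="${RUN:-}"

#   $1 = input feature table (.qza)
#   $2 = sampling depth recommended by Step 11
#   $3 = output rarefied table (.qza)
rarefy_table() {
    local table="$1" depth="$2" out="$3"
    # The depth must be a positive integer before it is handed to qiime.
    case "$depth" in
        ''|*[!0-9]*|0)
            echo "ERROR: invalid rarefaction depth '$depth'" >&2
            return 1 ;;
    esac
    $RUN qiime feature-table rarefy \
        --i-table "$table" \
        --p-sampling-depth "$depth" \
        --o-rarefied-table "$out"
}

# Usage: rarefy_table table.qza 10000 rarefied_table.qza
```

Note that rarefy discards samples whose total count is below the sampling depth, which is why the depth chosen in Step 11 matters so much.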

3. Compute five alpha metrics

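Now the script runs QIIME2's alpha-diversity plugin five times, once per metric:

  - Observed Features (richness: the number of distinct features)
  - Shannon (richness and evenness combined)
  - Simpson (dominance-weighted diversity)
  - Pielou (evenness)
  - Faith PD (phylogenetic diversity; needs the rooted tree)

The result is five values per sample, exported to TSV.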
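
The five runs could be sketched as follows. The qiime actions and metric identifiers (diversity alpha, diversity alpha-phylogenetic, tools export) are real QIIME 2 commands; the function name, output file names, and the RUN dry-run variable are assumptions, and the actual script may organise the calls differently.

```shell
# Sketch of the five metric runs (hypothetical wrapper and file layout).
# Set RUN=echo to print the commands instead of running them.
RUN="${RUN:-}"

#   $1 = rarefied table (.qza)   $2 = rooted tree (.qza)   $3 = output dir
compute_alpha_metrics() {
    local table="$1" tree="$2" outdir="$3" metric

    # The four non-phylogenetic metrics share one action.
    for metric in observed_features shannon simpson pielou_e; do
        $RUN qiime diversity alpha \
            --i-table "$table" \
            --p-metric "$metric" \
            --o-alpha-diversity "$outdir/${metric}_vector.qza"
        # Export unpacks the vector to a TSV inside the given directory.
        $RUN qiime tools export \
            --input-path "$outdir/${metric}_vector.qza" \
            --output-path "$outdir/$metric"
    done

    # Faith PD additionally needs the rooted tree.
    $RUN qiime diversity alpha-phylogenetic \
        --i-table "$table" \
        --i-phylogeny "$tree" \
        --p-metric faith_pd \
        --o-alpha-diversity "$outdir/faith_pd_vector.qza"
    $RUN qiime tools export \
        --input-path "$outdir/faith_pd_vector.qza" \
        --output-path "$outdir/faith_pd"
}

# Usage: compute_alpha_metrics rarefied_table.qza rooted_tree.qza alpha_out
```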

4. Build the consolidated XLSX

Then the script joins the five TSV files with the metadata in R and writes one tidy table:

sample-id  Treatment  ASVs_or_Features  Shannon_Index  Simpson_Diversity  Faith_PD  Pielou_Evenness
SampleA    High       412               5.23           0.91               28.4      0.866
SampleB    Low        367               4.92           0.88               25.1      0.834

That XLSX is the single source of truth that the rest of Step 12 — and the final report — reads.
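
The script performs this join in R. To show the shape of the operation without R, here is an awk sketch of the same idea: a left join of each two-column metric TSV onto the metadata by sample id. The function name is hypothetical, and filling missing values with NA is an assumption about how unmatched samples are handled.

```shell
# join_alpha METADATA.tsv METRIC1.tsv [METRIC2.tsv ...]
# Sketch only: left-joins each metric TSV (sample id, value) onto the
# metadata by the first column; samples absent from a metric file get NA.
join_alpha() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { file++ }                    # a new input file starts
        file == 1 {                            # metadata: remember row order
            order[++n] = $1; row[$1] = $0; next
        }
        FNR == 1 { header[file] = $2; next }   # metric header names the column
        { value[file, $1] = $2 }               # metric value per sample
        END {
            line = row[order[1]]               # metadata header row
            for (f = 2; f <= file; f++) line = line OFS header[f]
            print line
            for (i = 2; i <= n; i++) {
                id = order[i]; line = row[id]
                for (f = 2; f <= file; f++)
                    line = line OFS (((f, id) in value) ? value[f, id] : "NA")
                print line
            }
        }' "$@"
}

# Usage: join_alpha metadata.tsv shannon.tsv faith_pd.tsv > alpha_diversity.tsv
```

Keeping the metadata row order makes the consolidated table easy to compare against the original sample sheet.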

5. Statistics per (metric × variable)

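For every categorical variable discovered in the metadata (same auto-detection as Step 9/10), the script runs:

  - a Wilcoxon rank-sum test when the variable has two groups, or Kruskal-Wallis when it has three or more;
  - Dunn's pairwise post-hoc test on Kruskal-Wallis hits;
  - Benjamini-Hochberg correction across the five metrics within each variable;
  - a compact letter display (CLD) built with multcompView::multcompLetters.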
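
The Benjamini-Hochberg step is the least obvious part of this stack, so here is a small worked sketch of its arithmetic. The script does this in R; the awk version below is a stand-in for illustration only, and the function name and the "label, p-value" input layout are assumptions.

```shell
# bh_adjust: read "label<TAB>p" lines on stdin and print
# "label<TAB>p<TAB>p_adj" in the original order.
# Sketch of Benjamini-Hochberg: adjusted p at rank r is p * m / r, then made
# monotone by stepping down from the largest p-value.
bh_adjust() {
    awk -F'\t' '
        { label[NR] = $1; raw[NR] = $2; p[NR] = $2 + 0; idx[NR] = NR }
        END {
            m = NR
            # Insertion sort of row indices by ascending p-value.
            for (i = 2; i <= m; i++) {
                k = idx[i]
                for (j = i - 1; j >= 1 && p[idx[j]] > p[k]; j--)
                    idx[j + 1] = idx[j]
                idx[j + 1] = k
            }
            # Walk from the largest p down, capping each value at the
            # previous one (and at 1).
            prev = 1
            for (r = m; r >= 1; r--) {
                a = p[idx[r]] * m / r
                if (a > prev) a = prev
                adj[idx[r]] = a
                prev = a
            }
            for (i = 1; i <= m; i++)
                printf "%s\t%s\t%.4f\n", label[i], raw[i], adj[i]
        }'
}

# Usage: one p-value per metric for a single variable, e.g.
#   printf 'Shannon\t0.04\nSimpson\t0.03\n' | bh_adjust
```

Correcting within each variable means m is the number of metrics tested for that variable (five when Faith PD is available), not the total number of tests in the whole run.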

6. Boxplots

Then it draws one boxplot per (metric × variable) and a multi-panel boxplot showing all five metrics next to each other for each variable. The compact letter display from step 5 is annotated on top of each box so a reviewer can read significance directly off the figure.

Every figure is written as PNG and SVG; a PDF is added when --publication-figures is passed.

7. Write mbx_alpha_diversity_info.txt

Finally the cross-step info file records the rarefaction depth used, every XLSX produced, the categorical variables actually analysed, and STATUS=COMPLETE.
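
A sketch of how such an info file might be written. Only STATUS=COMPLETE is taken from the description above; the other key names, the function name, and the comma-separated variable list are assumptions for illustration.

```shell
# write_alpha_info OUTFILE DEPTH VARIABLES XLSX...
# Sketch only: every key except STATUS=COMPLETE is a hypothetical name.
write_alpha_info() {
    local out="$1" depth="$2" variables="$3" f
    shift 3
    {
        echo "RAREFACTION_DEPTH=$depth"
        echo "CATEGORICAL_VARIABLES=$variables"
        for f in "$@"; do
            echo "XLSX=$f"
        done
        # Written last so its presence implies the lines above are complete.
        echo "STATUS=COMPLETE"
    } > "$out.tmp"
    # Rename at the end so a reader never sees a half-written file.
    mv "$out.tmp" "$out"
}

# Usage:
#   write_alpha_info mbx_alpha_diversity_info.txt 10000 "Treatment" alpha_diversity.xlsx
```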


Default parameters and why they are what they are

Default | Value | Why this default
Rarefaction depth | from Step 11 | Step 11 is the single source of truth; never overridden here.
Five metrics | Observed Features, Shannon, Simpson, Pielou, Faith PD | The five most commonly reported in microbiome papers, each capturing richness/evenness/phylogeny differently.
Test (2 groups) | Wilcoxon rank-sum | Non-parametric; alpha-diversity distributions are rarely normal.
Test (3+ groups) | Kruskal-Wallis | Same reasoning.
Post-hoc on KW hits | Dunn's test | The standard non-parametric post-hoc.
Multiple-testing correction | Benjamini-Hochberg across the 5 metrics within each variable | Conservative enough; field-standard.
CLD algorithm | multcompView::multcompLetters | Standard R implementation.
Plot formats | PNG + SVG always; PDF on --publication-figures | Publication-ready by default.
Threads | MBX_THREADS | Single source of truth.
Seed | MBX_SEED | Single source of truth for reproducibility.

When and why we fall back to defaults

Fallback | When it triggers | Why this fallback exists
Refuse to run | READY_FOR_DIVERSITY=no from Step 11 | Step 11's verdict gates us; the override is --force and is recorded.
Skip a metric | A required artifact (e.g. tree) is missing | Faith PD needs the rooted tree; if the tree wasn't produced, only the four non-phylogenetic metrics are computed.
Skip a comparison | A categorical variable has singleton groups after NA filter | Same logic as Step 10; logged + continued.
Re-use rarefied table | Existing rarefied_table.qza at the same depth | Saves a re-rarefaction on re-runs.
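
Two of these fallbacks are simple enough to sketch in shell. Both helpers below are hypothetical: the function names, the metric identifiers as a space-separated list, and the treatment of empty or "NA" cells as missing are assumptions, not the script's actual code.

```shell
# Print the metrics to compute; Faith PD is dropped when no rooted tree exists.
#   $1 = path to the rooted tree (.qza), possibly empty
select_metrics() {
    local tree="$1"
    local metrics="observed_features shannon simpson pielou_e"
    if [ -n "$tree" ] && [ -f "$tree" ]; then
        metrics="$metrics faith_pd"
    else
        echo "WARNING: rooted tree not found; skipping Faith PD" >&2
    fi
    echo "$metrics"
}

# Return 0 when some level of column $2 (1-based) in the TSV $1 has exactly
# one sample after dropping empty and NA cells, 1 otherwise.
has_singleton_group() {
    awk -F'\t' -v col="$2" '
        NR > 1 && $col != "" && $col != "NA" { n[$col]++ }
        END {
            for (g in n) if (n[g] == 1) found = 1
            exit(found ? 0 : 1)
        }' "$1"
}

# Usage:
#   for metric in $(select_metrics rooted_tree.qza); do ...; done
#   if has_singleton_group metadata.tsv 2; then echo "skip Treatment"; fi
```

Logging the skip and continuing, rather than aborting, lets the remaining metrics and variables still be analysed.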

What the output file looks like

alpha_diversity.xlsx (the consolidated table):

sample-id  Treatment  ASVs_or_Features  Shannon_Index  Simpson_Diversity  Faith_PD  Pielou_Evenness
SampleA    High       412               5.23           0.91               28.4      0.866
SampleB    Low        367               4.92           0.88               25.1      0.834
...

The step also writes per-variable subfolders containing the KW/Dunn/CLD XLSX files and the boxplot PNG/SVG/PDF, and the mbx_alpha_diversity_info.txt contract for the final report.


Takeaway

Step 12 answers the simplest microbiome question (how diverse is each sample?) five different ways so the answer is robust, then applies the same Kruskal-Wallis + Dunn + BH + CLD stack as Step 10 to test group differences. The boxplots are annotated with the CLD so significance is readable directly off the figure.


Sources