Script: scripts/mbx_alpha_diversity_run.sh
Companion files in this folder:
- 12_alpha_diversity.html — same content with copy buttons.
- 12_alpha_diversity.pptx — slide deck for the talk.
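Alpha diversity answers the simplest question we ask of a microbiome sample: how diverse is this single community? Two intuitions go into the answer:
- Richness: how many distinct features are present in the sample.
- Evenness: how evenly the reads are spread across those features.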
Step 12 computes five alpha-diversity metrics, each capturing those two intuitions slightly differently — so a reviewer can see if the finding is robust to the choice of metric. It then asks the next question: are the alpha-diversity differences between treatment groups statistically significant? — and produces the boxplots + tests that answer it.
It rarefies the feature table to the depth Step 11 chose, computes the
five canonical alpha-diversity vectors (Observed Features, Shannon,
Simpson, Pielou, Faith PD), merges them into one tidy
alpha_diversity.xlsx with the metadata, then runs Kruskal-Wallis +
pairwise Dunn + compact letter displays per metric per categorical
variable and produces the boxplots.
First the script reads 11_pre_diversity/mbx_pre_diversity_info.txt
and refuses to run unless READY_FOR_DIVERSITY=yes (or the user passes
--force). This is the gate; if Step 11 said the depth choice was
unreliable, Step 12 honours that.
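The gate can be pictured as a small key=value check. The sketch below is a Python illustration, not the script's actual code: the function names `read_info` and `may_run` are hypothetical, and it assumes the info file holds one KEY=VALUE pair per line.

```python
from pathlib import Path


def read_info(path):
    """Parse a KEY=VALUE info file into a dict (assumed format: one pair per line)."""
    info = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not a KEY=VALUE pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        info[key.strip()] = value.strip()
    return info


def may_run(info, force=False):
    """Return True when Step 11 approved the depth, or when the user forces the run."""
    ready = info.get("READY_FOR_DIVERSITY", "no") == "yes"
    return ready or force
```

A missing key is treated as "no" in this sketch, so an incomplete Step 11 run would also block Step 12 unless `force` is set.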
Then it runs qiime feature-table rarefy at the depth Step 11
recommended, producing a rarefied_table.qza. Every diversity metric
operates on this rarefied table — the same number of reads per sample —
so apples-to-apples comparisons are guaranteed.
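To make "the same number of reads per sample" concrete, the sketch below subsamples one sample's counts without replacement down to a fixed depth. It is a plain-Python illustration of the idea, not QIIME 2's implementation; `rarefy_counts` and the feature names are hypothetical, and a sample with fewer reads than the depth returns `None` here to mirror its being dropped from the rarefied table.

```python
import random


def rarefy_counts(counts, depth, seed=0):
    """Subsample a {feature: count} dict without replacement to exactly `depth` reads.

    Returns None when the sample has fewer than `depth` reads.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand counts into one entry per read, then draw `depth` reads.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)  # fixed seed so re-runs give the same table
    rarefied = {}
    for feature in rng.sample(reads, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


# Hypothetical sample with 805 reads, rarefied to 100 reads.
sample = {"ASV1": 500, "ASV2": 300, "ASV3": 5}
rarefied = rarefy_counts(sample, depth=100, seed=42)
```

Rare features such as `ASV3` may disappear after subsampling, which is why richness-type metrics depend on the chosen depth.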
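Now the script runs QIIME2's alpha-diversity plugin five times, once per metric:
- Observed Features: S, the number of distinct features (ASVs) present in the sample. Pure richness.
- Shannon: H = − Σ p_i log p_i, where p_i is the relative abundance of feature i. Combines richness and evenness; sensitive to rare taxa.
- Simpson: 1 − Σ p_i². Combines richness and evenness; weights common taxa more than Shannon does.
- Pielou: J = H / log(S), where S is observed richness. Pure evenness: how close to the maximum entropy is the community?
- Faith PD: the total branch length of the phylogenetic tree spanning the features present in the sample. Richness weighted by phylogeny; requires the rooted tree.
Five values per sample, exported to TSV.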
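The formulas above translate almost line for line into code. The following is a minimal Python sketch for one sample's count vector, not the plugin's implementation; `alpha_metrics` is a hypothetical name, and the natural logarithm is an assumption (tools differ on the base, which rescales Shannon but leaves Pielou unchanged). Faith PD is left out because it needs the phylogenetic tree.

```python
import math


def alpha_metrics(counts, log=math.log):
    """Compute four non-phylogenetic alpha-diversity metrics from feature counts."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    richness = len(present)                    # S: observed features
    props = [c / total for c in present]       # p_i: relative abundances
    shannon = -sum(p * log(p) for p in props)  # H = -sum p_i log p_i
    simpson = 1.0 - sum(p * p for p in props)  # 1 - sum p_i^2
    # J = H / log(S); undefined for fewer than two features, reported as 0.0 here.
    pielou = shannon / log(richness) if richness > 1 else 0.0
    return {
        "observed_features": richness,
        "shannon": shannon,
        "simpson": simpson,
        "pielou": pielou,
    }


# Two hypothetical communities with the same richness but different evenness.
even = alpha_metrics([25, 25, 25, 25])
skewed = alpha_metrics([97, 1, 1, 1])
```

Both communities have four observed features, but the skewed one scores lower on Shannon, Simpson, and Pielou, which is the kind of difference the richness-only metric cannot show.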
Then the script joins the five TSV files with the metadata in R and writes one tidy table:
| sample-id | Treatment | ASVs_or_Features | Shannon_Index | Simpson_Diversity | Faith_PD | Pielou_Evenness |
|---|---|---|---|---|---|---|
| SampleA | High | 412 | 5.23 | 0.91 | 28.4 | 0.866 |
| SampleB | Low | 367 | 4.92 | 0.88 | 25.1 | 0.834 |
That XLSX is the single source of truth that the rest of Step 12 — and the final report — reads.
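The join itself is done in R by the script; the sketch below only illustrates the shape of the operation in Python, using in-memory TSV text. The helper names are hypothetical, and the inner-join behaviour (a sample is kept only if every metric table contains it) is an assumption rather than a documented rule of the script.

```python
import csv
import io


def read_tsv(text, key="sample-id"):
    """Read TSV text into {sample_id: row_dict}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[key]: row for row in reader}


def merge_tables(metadata, metric_tables):
    """Inner-join metadata with every metric table on sample-id."""
    merged = []
    for sample_id, meta_row in metadata.items():
        # Keep only samples present in every metric table
        # (for example, those that survived rarefaction).
        if not all(sample_id in table for table in metric_tables):
            continue
        row = dict(meta_row)
        for table in metric_tables:
            row.update(table[sample_id])
        merged.append(row)
    return merged


metadata = read_tsv("sample-id\tTreatment\nSampleA\tHigh\nSampleB\tLow\nSampleC\tLow\n")
observed = read_tsv("sample-id\tASVs_or_Features\nSampleA\t412\nSampleB\t367\n")
shannon = read_tsv("sample-id\tShannon_Index\nSampleA\t5.23\nSampleB\t4.92\n")
tidy = merge_tables(metadata, [observed, shannon])
```

Here `SampleC` has metadata but no metric values, so it does not appear in `tidy`.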
For every categorical variable discovered in the metadata (same auto-detection as Step 9/10), the script runs, per metric:
- a Kruskal-Wallis test across the groups (Wilcoxon rank-sum when there are only two groups);
- pairwise Dunn's tests on the Kruskal-Wallis hits;
- a compact letter display (CLD), written to CLD_Summary_<metric>_by_<variable>.xlsx.
Then it draws one boxplot per (metric × variable) and a multi-panel boxplot showing all five metrics next to each other for each variable. The compact letter display produced by these tests is annotated on top of each box so a reviewer can read significance directly off the figure.
Figures are written as PNG (always) + SVG (always) + PDF (with --publication-figures).
Finally the cross-step info file, mbx_alpha_diversity_info.txt, records the rarefaction depth used, every XLSX produced, the categorical variables actually analysed, and STATUS=COMPLETE.
| Default | Value | Why this default |
|---|---|---|
| Rarefaction depth | from Step 11 | Step 11 is the single source of truth — never overridden here. |
| Five metrics | Observed Features, Shannon, Simpson, Pielou, Faith PD | The five most commonly reported in microbiome papers, each capturing richness/evenness/phylogeny differently. |
| Test (2 groups) | Wilcoxon rank-sum | Non-parametric — alpha-diversity distributions are rarely normal. |
| Test (3+ groups) | Kruskal-Wallis | Same reasoning. |
| Post-hoc on KW hits | Dunn's test | The standard non-parametric post-hoc. |
| Multiple-testing correction | Benjamini-Hochberg across the 5 metrics within each variable | Conservative enough; field-standard. |
| CLD algorithm | multcompView::multcompLetters | Standard R implementation. |
| Plot formats | PNG + SVG always; PDF on --publication-figures | Publication-ready by default. |
| Threads | MBX_THREADS | Single source of truth. |
| Seed | MBX_SEED | Single source of truth for reproducibility. |
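The Benjamini-Hochberg row can be illustrated with a short sketch. This is a generic Python version of the standard step-up adjustment, not the script's R code; `benjamini_hochberg` is a hypothetical name and the five p-values are made up to stand in for one variable's five metrics.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for the five metrics within one categorical variable.
raw = [0.01, 0.04, 0.03, 0.20, 0.005]
adjusted = benjamini_hochberg(raw)
```

Each adjusted value is at least as large as its raw p-value, and the ordering of the metrics by significance is preserved.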
| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
| Refuse to run | READY_FOR_DIVERSITY=no from Step 11 | Step 11's verdict gates us; the override is --force and is recorded. |
| Skip a metric | A required artifact (e.g. tree) is missing | Faith PD needs the rooted tree; if the tree wasn't produced, only the four non-phylogenetic metrics are computed. |
| Skip a comparison | A categorical variable has singleton groups after NA filter | Same logic as Step 10; logged + continued. |
| Re-use rarefied table | Existing rarefied_table.qza at the same depth | Saves a re-rarefaction on re-runs. |
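The "skip a comparison" fallback amounts to a group-size check after missing values are dropped. A minimal Python sketch follows; the function name and the set of missing-value markers are assumptions, and whether the script skips the whole variable or only the offending group is not stated here.

```python
from collections import Counter

# Assumed markers for missing metadata values.
NA_VALUES = {"", "NA", "NaN", "nan"}


def has_singleton_group(labels):
    """Return True if, after dropping missing values, any group has only one sample."""
    kept = [label for label in labels if label not in NA_VALUES]
    sizes = Counter(kept)
    return any(n < 2 for n in sizes.values())


# "Low" has a single sample once the NA is removed, so this variable would be flagged.
flagged = has_singleton_group(["High", "High", "Low", "NA"])
usable = not has_singleton_group(["High", "High", "Low", "Low", ""])
```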
alpha_diversity.xlsx (the consolidated table):
| sample-id | Treatment | ASVs_or_Features | Shannon_Index | Simpson_Diversity | Faith_PD | Pielou_Evenness |
|---|---|---|---|---|---|---|
| SampleA | High | 412 | 5.23 | 0.91 | 28.4 | 0.866 |
| SampleB | Low | 367 | 4.92 | 0.88 | 25.1 | 0.834 |
| ... |
Plus per-variable subfolders containing the KW/Dunn/CLD XLSX files and
the boxplot PNG/SVG/PDF, plus the mbx_alpha_diversity_info.txt
contract for the final report.
Step 12 answers the simplest microbiome question (how diverse is each sample?) five different ways so the answer is robust, then applies the same Kruskal-Wallis + Dunn + BH + CLD stack as Step 10 to test group differences. The boxplots are annotated with the CLD so significance is readable directly off the figure.
- mbXPro/scripts/mbx_alpha_diversity_run.sh
- q2-diversity plugin: https://docs.qiime2.org/2025.4/plugins/available/diversity/