Contents

Why this step matters
What the script does in one sentence
The algorithm, step by step
Default parameters and why they are what they are
When and why we fall back to defaults
What the output file looks like
Takeaway
Sources

Step 11 — Pre-diversity (phylogeny + depth selection)

Script: scripts/mbx_pre_diversity_parameters.sh

Companion files in this folder:

11_pre_diversity_parameters.html: same content with copy buttons.
11_pre_diversity_parameters.pptx: slide deck for the talk.

Why this step matters

Steps 12 and 13 — alpha and beta diversity — both need two things that nothing upstream has produced:

A phylogenetic tree of every ASV in the dataset, because the phylogenetic diversity metrics (Faith's PD, weighted/unweighted UniFrac) need to know how related the ASVs are.
A scientifically defensible rarefaction depth: the number of reads we'll subsample every sample down to before computing diversity. A poorly chosen depth is a common source of misleading microbiome results in the literature.

The depth question is the substantive one. If the depth is too low, we throw away usable reads, and samples that were sequenced deeply end up looking artificially impoverished. If the depth is too high, we lose every sample that didn't reach it, and we lose them non-randomly (the shallowest samples are often the hardest libraries to prepare). Either failure mode biases every downstream comparison.

Step 11's job is to answer the depth question quantitatively, using three concurrent criteria that can be defended to any reviewer, and to produce a STATUS field that blocks the diversity steps when the depth verdict isn't strong enough.

What the script does in one sentence

It builds a rooted phylogenetic tree from the ASV sequences, then computes the largest rarefaction depth that simultaneously satisfies ≥ 90 % sample retention, ≥ 0.98 mean Good's coverage, and a flat analytical rarefaction curve (Hurlbert 1971 closed-form slope < 0.5 features / 1,000 reads).

The algorithm, step by step

1. Pre-flight + version logging

First the script confirms QIIME2, R, and the input artifacts are all present, logs every tool version (so the final report can cite them exactly), and creates the 11_pre_diversity/ output directory.

2. QC visualisations

Then it produces two QIIME2 QZVs — metadata_summary.qzv and feature_table_summary.qzv — useful for the user's sanity check before they commit hours to diversity computation.

3. Build the rooted phylogenetic tree

Next the script runs the align-to-tree-mafft-fasttree QIIME2 pipeline:

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences         representative_sequences.qza \
  --o-alignment         aligned-rep-seqs.qza \
  --o-masked-alignment  masked-aligned-rep-seqs.qza \
  --o-tree              unrooted-tree.qza \
  --o-rooted-tree       rooted-tree.qza \
  --p-n-threads         <MBX_THREADS>

Under the hood this runs four sub-steps:

MAFFT multiple-sequence alignment of every ASV against every other ASV. MAFFT is fast and accurate; FFT-NS-2 is the default mode.
Mask the alignment to remove poorly conserved and highly gapped positions that confuse tree-building (QIIME2 defaults; columns are filtered by conservation and gap frequency).
FastTree builds an approximately-maximum-likelihood tree from the masked alignment. Much faster than RAxML for the ~hundreds-to-thousands of ASVs we typically have.
Mid-point root the tree so the diversity metrics have a defined root (UniFrac requires it; Faith's PD doesn't strictly but uses one).
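
To make the four sub-steps concrete, here is a minimal Python sketch that spells out the individual QIIME2 commands the pipeline chains together. The function names and the output directory argument are hypothetical and only for illustration; the script itself calls the single pipeline command shown above.

```python
import subprocess


def tree_pipeline_commands(rep_seqs, out_dir, threads=1):
    """Return the four QIIME2 commands equivalent to align-to-tree-mafft-fasttree."""
    aligned = f"{out_dir}/aligned-rep-seqs.qza"
    masked = f"{out_dir}/masked-aligned-rep-seqs.qza"
    unrooted = f"{out_dir}/unrooted-tree.qza"
    rooted = f"{out_dir}/rooted-tree.qza"
    return [
        # 1. Multiple-sequence alignment with MAFFT
        ["qiime", "alignment", "mafft", "--i-sequences", rep_seqs,
         "--o-alignment", aligned, "--p-n-threads", str(threads)],
        # 2. Mask uninformative alignment columns
        ["qiime", "alignment", "mask", "--i-alignment", aligned,
         "--o-masked-alignment", masked],
        # 3. Approximately-maximum-likelihood tree with FastTree
        ["qiime", "phylogeny", "fasttree", "--i-alignment", masked,
         "--o-tree", unrooted, "--p-n-threads", str(threads)],
        # 4. Mid-point rooting
        ["qiime", "phylogeny", "midpoint-root", "--i-tree", unrooted,
         "--o-rooted-tree", rooted],
    ]


def run_tree_pipeline(rep_seqs, out_dir, threads=1):
    """Run the four commands in order, stopping at the first failure."""
    for cmd in tree_pipeline_commands(rep_seqs, out_dir, threads):
        subprocess.run(cmd, check=True)
```

Running the steps separately like this is mainly useful for debugging a single stage; for routine use the one-line pipeline is simpler.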

4. Tabulate per-sample frequencies

Now the script exports the feature table to TSV + BIOM and counts the total non-chimeric reads per sample. That distribution is the input to the depth-selection math.

5. The depth-selection algorithm — three concurrent criteria

This is the heart of the step. Let f_i be the read count in sample i, and let d be a candidate rarefaction depth. The algorithm sweeps d from 1,000 up to max(f_i) and picks the largest d that simultaneously satisfies:

(a) Overall sample-retention rule: The fraction of samples with f_i ≥ d must be ≥ 0.90 (MIN_OVERALL). Equivalently: at most 10 % of samples are allowed to be dropped by the rarefaction. If a --group-col is supplied, an additional per-group retention check requires ≥ MIN_GROUP of each group's samples to survive — otherwise we'd risk wiping out an entire treatment group.
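
Rule (a) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script's code: the function name, the dict-based inputs, and the return layout are hypothetical; only the thresholds (0.90 overall, 0.80 per group) come from this document.

```python
from collections import defaultdict


def retention_ok(read_counts, d, groups=None, min_overall=0.90, min_group=0.80):
    """Check the retention rule at candidate depth d.

    read_counts: {sample_id: f_i}
    groups:      {sample_id: group label}, or None when no --group-col is given
    Returns (passed, overall_fraction, per_group_fractions).
    """
    kept = {s for s, f in read_counts.items() if f >= d}
    overall = len(kept) / len(read_counts)

    per_group = {}
    if groups is not None:
        totals, survivors = defaultdict(int), defaultdict(int)
        for s in read_counts:
            g = groups[s]
            totals[g] += 1
            survivors[g] += s in kept  # True counts as 1
        per_group = {g: survivors[g] / totals[g] for g in totals}

    passed = overall >= min_overall and all(v >= min_group for v in per_group.values())
    return passed, overall, per_group
```

With ten samples of which one falls below d, the overall fraction is 0.9 and the rule passes; dropping a second sample brings it to 0.8 and the rule fails.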

(b) Coverage rule (Good's coverage): For each retained sample, the analytical Good's coverage at depth d is computed:

C(d) = 1 − f₁(d) / d

where f₁(d) is the expected number of singletons (ASVs seen exactly once) in a subsample of size d from the sample's count vector. The mean of C(d) across retained samples must be ≥ 0.98 (GOOD_COV_MIN), i.e. on average, at most an estimated 2 % of reads at the rarefied depth would come from ASVs the subsample has not yet observed.
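
As a hedged illustration of rule (b), the expected singleton count can be written in closed form if we assume subsampling without replacement: the number of reads drawn from ASV j is then hypergeometric, and the probability of drawing exactly one is n_j · binom(N − n_j, d − 1) / binom(N, d). That formula is our reading of "expected number of singletons in a subsample"; the script may compute f₁(d) differently, and the function names below are hypothetical.

```python
from math import comb


def expected_singletons(counts, d):
    """Expected number of ASVs seen exactly once when d reads are drawn
    without replacement from a sample with per-ASV read counts `counts`."""
    n_total = sum(counts)
    if not 1 <= d <= n_total:
        raise ValueError("d must lie between 1 and the sample's read count")
    denom = comb(n_total, d)
    # Hypergeometric P(exactly one read of ASV j in the subsample)
    return sum(n * comb(n_total - n, d - 1) / denom for n in counts if n > 0)


def goods_coverage(counts, d):
    """Analytical Good's coverage C(d) = 1 - f1(d) / d."""
    return 1 - expected_singletons(counts, d) / d


def mean_goods_coverage(samples, d):
    """Mean C(d) over the samples that survive rarefaction to depth d."""
    retained = [c for c in samples if sum(c) >= d]
    return sum(goods_coverage(c, d) for c in retained) / len(retained)
```

Two sanity checks follow from the formula: a sample made only of singletons has coverage 0 at full depth, and a sample with no rare ASVs has coverage 1.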

(c) Plateau rule (Hurlbert 1971): The expected number of observed ASVs at depth d, given a sample with f_i reads and ASV counts n_j, is the Hurlbert rarefaction closed-form:

E[S | d, sample i] = Σ_j  [ 1 − C(f_i − n_j, d) / C(f_i, d) ]

where C(a, b) here denotes the binomial coefficient "a choose b" (distinct from the coverage C(d) in rule (b)) and the sum is over every ASV j in sample i. (This is the exact expectation, not a simulation approximation, so it is much faster and bit-reproducible.) For each candidate d, the algorithm computes the mean slope of that curve in a small window around d, expressed as features gained per 1,000 additional reads. That slope must be < 0.5 (PLATEAU_SLOPE_MAX), meaning the curve has effectively flattened, so we're not throwing information away by stopping here.
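
The Hurlbert expectation and the plateau slope can be sketched as follows. The closed form is the one given above; the window half-width of 50 reads and the function names are assumptions for illustration, since the document only says "a small window around d".

```python
from math import comb


def hurlbert_expected_richness(counts, d):
    """Expected number of ASVs observed in a subsample of d reads
    (Hurlbert 1971 closed form, sampling without replacement)."""
    n_total = sum(counts)
    if not 0 <= d <= n_total:
        raise ValueError("d must lie between 0 and the sample's read count")
    denom = comb(n_total, d)
    # comb(a, b) is 0 when b > a, so abundant ASVs contribute exactly 1
    return sum(1 - comb(n_total - n, d) / denom for n in counts if n > 0)


def plateau_slope(counts, d, half_window=50):
    """Mean slope of the rarefaction curve around d, in features per 1,000 reads.
    half_window is a hypothetical choice, not the script's value."""
    n_total = sum(counts)
    lo, hi = max(d - half_window, 0), min(d + half_window, n_total)
    if hi <= lo:
        raise ValueError("window around d is empty")
    gained = hurlbert_expected_richness(counts, hi) - hurlbert_expected_richness(counts, lo)
    return 1000 * gained / (hi - lo)
```

A sample dominated by a few abundant ASVs plateaus almost immediately, whereas a sample of 1,000 singletons gains one feature per read, a slope of 1,000 features per 1,000 reads that never passes the rule.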

The recommended depth is the largest d satisfying all three criteria. If no d does, the algorithm reports which criterion is binding (the script can then say "criterion (c) was binding — the curves haven't plateaued at any feasible depth").
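
Putting the three rules together, the selection itself is a simple scan. The sketch below assumes the per-depth metrics have already been computed and are passed in as tuples; the function name and tuple layout are hypothetical. It also assumes one particular definition of "binding" (the criterion that fails at the most candidate depths), which the script may define differently.

```python
def pick_depth(candidates, min_overall=0.90, good_cov_min=0.98, plateau_slope_max=0.5):
    """Pick the largest depth that passes all three criteria.

    candidates: iterable of (d, retention, mean_coverage, slope) tuples.
    Returns (depth, None) on success, or (None, binding_criterion) otherwise.
    """
    failures = {"a": 0, "b": 0, "c": 0}
    best = None
    for d, retention, coverage, slope in candidates:
        checks = {
            "a": retention >= min_overall,    # sample retention
            "b": coverage >= good_cov_min,    # Good's coverage
            "c": slope < plateau_slope_max,   # plateau
        }
        if all(checks.values()):
            best = d if best is None else max(best, d)
        else:
            for name, ok in checks.items():
                failures[name] += not ok
    if best is not None:
        return best, None
    # Assumed definition: the criterion that failed most often is "binding"
    return None, max(failures, key=failures.get)
```

Because retention falls as d grows while coverage and the plateau improve, the depths that pass all three typically form a single interval, and the scan returns its upper end.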

6. ID consistency checks

Then the script runs three ID-overlap checks:

Metadata IDs ↔ feature-table IDs.
Feature-table IDs ↔ representative-sequences IDs.
Representative-sequences IDs ↔ tree tip IDs.

A mismatch at any of these levels would cause the diversity step to silently drop samples or ASVs. The script lists every mismatching ID by name and sets OVERALL_STATUS=FAIL.
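
Each of the three checks is a set comparison. The following Python sketch shows the idea; the function names and the report layout are hypothetical, and reading the IDs out of the QIIME2 artifacts is left out.

```python
def id_overlap_report(name_a, ids_a, name_b, ids_b):
    """Compare two ID collections and list every mismatching ID by name."""
    a, b = set(ids_a), set(ids_b)
    return {
        "pair": f"{name_a} vs {name_b}",
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "shared": len(a & b),
    }


def consistency_status(reports):
    """FAIL if any pair has a mismatching ID, as described above; PASS otherwise."""
    mismatch = any(r["only_in_a"] or r["only_in_b"] for r in reports)
    return "FAIL" if mismatch else "PASS"
```

Sorting the mismatching IDs keeps the report stable between runs, which makes it easier to diff.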

7. Official QIIME alpha-rarefaction

For documentation, the script also runs qiime diversity alpha-rarefaction with observed_features + shannon + faith_pd, max depth = max sample frequency, 10 iterations, 20 steps. The resulting QZV is what the user actually clicks through in their browser; the analytical preview (the closed-form Hurlbert plot from step 5) goes into the final report.

8. Decision-supporting visualisations

The script then writes four PNGs:

sequencing_depth_distribution.png — histogram of f_i.
depth_vs_retention.png — the (d, retained-fraction) curve, with the 90 % threshold marked.
alpha_rarefaction_curves.png — mean Hurlbert curve, annotated with the chosen depth.
group_depth_summary.png — per-treatment-group depth distributions when a --group-col is supplied.

9. Write the cross-step info file + `STATUS`

Finally the script writes mbx_pre_diversity_info.txt with every output path, the recommended depth, the binding criterion (if any), and an OVERALL_STATUS field that takes one of four values:

PASS: all three criteria met cleanly.
PASS_WITH_WARNINGS: criteria met but borderline; non-blocking warnings (e.g. one group exactly at its retention floor).
REVIEW_REQUIRED: fallback rule used, or a criterion failed; user should look at the summary before continuing.
FAIL: pipeline-stopping problem (less than 50 % overlap between metadata and feature-table sample IDs; a group entirely wiped out). The next step refuses to run unless --force is passed.

READY_FOR_DIVERSITY=yes|no is also written — Steps 12, 13, 14, 15 all read it and refuse to run when set to no.

Default parameters and why they are what they are

Default	Value	Why this default
Tree builder	`align-to-tree-mafft-fasttree`	MAFFT + FastTree is the QIIME2-recommended pipeline; fast + accurate enough for community-level diversity.
Alignment mode	MAFFT `FFT-NS-2` (default)	Best speed/accuracy trade-off for ~10²–10³ ASVs.
Tree rooting	mid-point	UniFrac requires a rooted tree. Mid-point root is unbiased w.r.t. group structure.
Overall sample retention floor	0.90 (`MIN_OVERALL`)	Microbial ecology convention; losing > 10 % of samples non-randomly is unacceptable.
Per-group retention floor (when `--group-col`)	0.80 (`MIN_GROUP`)	Slightly more permissive — small groups need extra leeway.
Good's coverage floor	0.98 (`GOOD_COV_MIN`)	At most an estimated 2 % of reads at the chosen depth come from ASVs not yet observed.
Plateau slope ceiling	0.5 features / 1,000 reads (`PLATEAU_SLOPE_MAX`)	Below this slope the curve is effectively flat, so additional reads recover very few new ASVs.
Depth sweep step size	adaptive (log-linear)	Faster than a constant grid without losing resolution near the candidate depth.
Rarefaction iterations (QIIME visualisation)	10	QIIME2 default; enough to smooth the curves for visual interpretation.
Rarefaction steps (QIIME visualisation)	20	Denser than the QIIME2 default of 10, so the plotted curves have finer depth resolution.
Phylogeny threads	`MBX_THREADS`	Single source of truth for parallelism.

When and why we fall back to defaults

Fallback	When it triggers	Why this fallback exists
`PASS_WITH_WARNINGS`	All three criteria met but borderline (e.g. coverage only just above 0.98)	Non-blocking: downstream steps run, but the report flags the concern.
`REVIEW_REQUIRED`	No `d` satisfies all three; the algorithm uses the largest `d` satisfying any two	The user should inspect the curves; by default downstream steps are held (`READY_FOR_DIVERSITY=no`) and run only with `--force`, with the warning carried into the report.
`FAIL` (depth-based)	A `--group-col` group would be entirely wiped at every candidate depth	Continuing would silently drop a treatment group; this must be a hard stop.
`FAIL` (ID-based)	< 50 % overlap between metadata and feature-table sample IDs	Almost always a metadata-vs-FASTQ-sample-ID mismatch that Step 1 missed.
`READY_FOR_DIVERSITY=no`	`STATUS ∈ {FAIL, REVIEW_REQUIRED}` (configurable)	Downstream steps 12, 13, 14, 15 all gate on this.
`--force` flag on downstream steps	User wants to run anyway	The override exists; the report records that it was used.

What the output file looks like

RECOMMENDED_DEPTH=12450
DEPTH_CRITERION_OVERALL=PASS  (94 % of samples ≥ 12,450)
DEPTH_CRITERION_COVERAGE=PASS (mean Good's coverage = 0.987)
DEPTH_CRITERION_PLATEAU=PASS  (slope = 0.31 features / 1,000 reads)
BINDING_CRITERION=none

ROOTED_TREE_QZA=/.../11_pre_diversity/rooted-tree.qza
FEATURE_TABLE_QZA=/.../7_taxonomy_csv/feature_table_filtered.qza
METADATA_TXT=/.../metadata.txt

OVERALL_STATUS=PASS
READY_FOR_DIVERSITY=yes
STATUS=COMPLETE

Plus four PNGs that the final report embeds, the official QIIME rarefaction QZV for interactive inspection, and a plain-language summary file (mbx_pre_diversity_summary.txt) the user can hand to a reviewer.
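
For readers who want to consume the info file from their own tooling, here is a minimal Python sketch of a reader and a gate. It assumes the KEY=VALUE layout shown above, with optional parenthesised annotations after the value; the function names are hypothetical, and the downstream steps' real gating logic lives in their own scripts.

```python
def parse_info(text):
    """Parse KEY=VALUE lines, dropping any trailing "(...)" annotation.

    Assumes values themselves contain no "(" (true for the example above,
    but not guaranteed for arbitrary file paths).
    """
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        info[key.strip()] = value.split("(", 1)[0].strip()
    return info


def may_run_diversity(info, force=False):
    """Mirror the gate described above: run only when ready, unless forced."""
    return force or info.get("READY_FOR_DIVERSITY") == "yes"
```

A missing READY_FOR_DIVERSITY key is treated as "not ready" here, which errs on the side of stopping.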

Takeaway

Step 11 builds the tree the diversity metrics need, but more importantly it picks the rarefaction depth with three concurrent criteria so the decision is reviewer-defensible. Hurlbert 1971's closed-form rarefaction lets us evaluate the plateau analytically — bit-reproducible without Monte Carlo. The STATUS field gates Steps 12 through 15 from running when the depth verdict is too weak.

Sources

The script: mbXPro/scripts/mbx_pre_diversity_parameters.sh
Hurlbert rarefaction: Hurlbert (1971), The non-concept of species diversity, Ecology 52:577–586.
Good's coverage: Good (1953), The population frequencies of species and the estimation of population parameters, Biometrika 40:237–264.
MAFFT: Katoh & Standley (2013), MBE 30:772–780.
FastTree: Price et al. (2010), PLOS ONE 5:e9490.
Why depth choice matters: Weiss et al. (2017), Microbiome 5:27.