Script: scripts/mbx_pre_diversity_parameters.sh
Companion files in this folder:
- 11_pre_diversity_parameters.html — same content with copy buttons.
- 11_pre_diversity_parameters.pptx — slide deck for the talk.
Steps 12 and 13 — alpha and beta diversity — both need two things that nothing upstream has produced:
The depth question is the substantive one. If the depth is too low, we throw away usable read information; samples that started with deep sequencing end up looking artificially impoverished. If the depth is too high, we lose every sample that didn't reach it — and we lose them non-randomly (the deepest samples are often the easiest libraries to prepare). Either failure mode biases every downstream comparison.
Step 11's job is to answer the depth question quantitatively, using
three concurrent statistical criteria that any reviewer can defend, and
to produce a STATUS field that gates the diversity steps from running
if the depth verdict isn't strong enough.
It builds a rooted phylogenetic tree from the ASV sequences, then computes the smallest rarefaction depth that simultaneously satisfies ≥ 90 % sample retention, ≥ 0.98 mean Good's coverage, and a flat analytical rarefaction curve (Hurlbert 1971 closed-form slope < 0.5 features / 1,000 reads).
First the script confirms QIIME2, R, and the input artifacts are all
present, logs every tool version (so the final report can cite them
exactly), and creates the 11_pre_diversity/ output directory.
Then it produces two QIIME2 QZVs — metadata_summary.qzv and
feature_table_summary.qzv — useful for the user's sanity check before
they commit hours to diversity computation.
Next the script runs the align-to-tree-mafft-fasttree QIIME2 pipeline:
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences representative_sequences.qza \
--o-alignment aligned-rep-seqs.qza \
--o-masked-alignment masked-aligned-rep-seqs.qza \
--o-tree unrooted-tree.qza \
--o-rooted-tree rooted-tree.qza \
--p-n-threads <MBX_THREADS>
Under the hood this runs four sub-steps:
Now the script exports the feature table to TSV + BIOM and counts the total non-chimeric ASVs per sample. That distribution is the input to the depth-selection math.
This is the heart of the step. Let f_i be the read count in sample
i, and let d be a candidate rarefaction depth. The algorithm sweeps
d from 1,000 up to max(f_i) and picks the largest d that
simultaneously satisfies:
(a) Overall sample-retention rule:
The fraction of samples with f_i ≥ d must be ≥ 0.90
(MIN_OVERALL). Equivalently: at most 10 % of samples are allowed to be
dropped by the rarefaction. If a --group-col is supplied, an additional
per-group retention check requires ≥ MIN_GROUP of each group's samples
to survive — otherwise we'd risk wiping out an entire treatment group.
(b) Coverage rule (Good's coverage):
For each retained sample, the analytical Good's coverage at depth
d is computed:
C(d) = 1 − f₁(d) / d
where f₁(d) is the expected number of singletons in a subsample of
size d from the sample's count vector. The mean of C(d) across
retained samples must be ≥ 0.98 (GOOD_COV_MIN) — i.e. on average,
at most 2 % of the community at the rarefied depth would be unobserved
singletons.
(c) Plateau rule (Hurlbert 1971):
The expected number of observed ASVs at depth d, given a sample with
f_i reads and ASV counts n_j, is the Hurlbert rarefaction
closed-form:
E[S | d, sample i] = Σ_j [ 1 − C(f_i − n_j, d) / C(f_i, d) ]
where C(a, b) is the binomial coefficient and the sum is over every
ASV j in sample i. (This is the exact expectation, not a simulation
approximation — much faster and bit-reproducible.) For each candidate
d, the algorithm computes the mean slope of that curve in a small
window around d, expressed as features gained per 1,000 additional
reads. That slope must be < 0.5 (PLATEAU_SLOPE_MAX) — meaning the
curve has effectively flattened, so we're not throwing information away
by stopping here.
The recommended depth is the largest d satisfying all three
criteria. If no d does, the algorithm reports which criterion is
binding (the script can then say "criterion (c) was binding — the
curves haven't plateaued at any feasible depth").
Then the script runs three ID-overlap checks:
A mismatch at any of these levels would cause the diversity step to
silently drop samples. The script lists every mismatching ID by name and
sets OVERALL_STATUS=FAIL.
For documentation, the script also runs qiime diversity
alpha-rarefaction with observed_features + shannon + faith_pd, max
depth = max sample frequency, 10 iterations, 20 steps. The resulting
QZV is what the user actually clicks through in their browser; the
analytical preview (the closed-form Hurlbert plot from step 5) goes into
the final report.
The script then writes four PNGs:
sequencing_depth_distribution.png — histogram of f_i.depth_vs_retention.png — the (d, retained-fraction) curve, with the
90 % threshold marked.alpha_rarefaction_curves.png — mean Hurlbert curve, annotated with
the chosen depth.group_depth_summary.png — per-treatment-group depth distributions
when a --group-col is supplied.STATUSFinally the script writes mbx_pre_diversity_info.txt with every
output path, the recommended depth, the binding criterion (if any), and
one of four STATUS values:
PASS — all three criteria met cleanly.PASS_WITH_WARNINGS — criteria met but borderline; non-blocking
warnings (e.g. one group at exactly 90 % retention).REVIEW_REQUIRED — fallback rule used, or a criterion failed; user
should look at the summary before continuing.FAIL — pipeline-stopping problem (zero overlap between metadata and
table; a group entirely wiped out). The next step refuses to run
unless --force is passed.READY_FOR_DIVERSITY=yes|no is also written — Steps 12, 13, 14, 15 all
read it and refuse to run when set to no.
| Default | Value | Why this default |
|---|---|---|
| Tree builder | align-to-tree-mafft-fasttree |
MAFFT + FastTree is the QIIME2-recommended pipeline; fast + accurate enough for community-level diversity. |
| Alignment mode | MAFFT FFT-NS-2 (default) |
Best speed/accuracy trade-off for ~10²–10³ ASVs. |
| Tree rooting | mid-point | UniFrac requires a rooted tree. Mid-point root is unbiased w.r.t. group structure. |
| Overall sample retention floor | 0.90 (MIN_OVERALL) |
Microbial ecology convention; losing > 10 % of samples non-randomly is unacceptable. |
Per-group retention floor (when --group-col) |
0.80 (MIN_GROUP) |
Slightly more permissive — small groups need extra leeway. |
| Good's coverage floor | 0.98 (GOOD_COV_MIN) |
At most 2 % of the community can be unobserved singletons at the chosen depth. |
| Plateau slope ceiling | 0.5 features / 1,000 reads (PLATEAU_SLOPE_MAX) |
Below this slope the curve is statistically flat — additional reads aren't recovering new biology. |
| Depth sweep step size | adaptive (log-linear) | Faster than a constant grid without losing resolution near the candidate depth. |
| Rarefaction iterations (QIIME visualisation) | 10 | QIIME2 default; enough to smooth the curves for visual interpretation. |
| Rarefaction steps (QIIME visualisation) | 20 | Same. |
| Phylogeny threads | MBX_THREADS |
Single source of truth for parallelism. |
| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
PASS_WITH_WARNINGS |
All three criteria met but borderline (e.g. coverage 0.978) | Non-blocking — downstream steps run, but the report flags the concern. |
REVIEW_REQUIRED |
No d satisfies all three; the algorithm uses the largest d satisfying any two |
The user should inspect the curves; downstream steps run only with the report-issued warning. |
FAIL (depth-based) |
A --group-col group would be entirely wiped at every candidate depth |
Continuing would silently drop a treatment group; this must be a hard stop. |
FAIL (ID-based) |
< 50 % overlap between metadata and feature-table sample IDs | Almost always a metadata-vs-FASTQ-sample-ID mismatch that Step 1 missed. |
READY_FOR_DIVERSITY=no |
STATUS ∈ {FAIL, REVIEW_REQUIRED} (configurable) |
Downstream steps 12, 13, 14, 15 all gate on this. |
--force flag on downstream steps |
User wants to run anyway | The override exists; the report records that it was used. |
RECOMMENDED_DEPTH=12450
DEPTH_CRITERION_OVERALL=PASS (94 % of samples ≥ 12,450)
DEPTH_CRITERION_COVERAGE=PASS (mean Good's coverage = 0.987)
DEPTH_CRITERION_PLATEAU=PASS (slope = 0.31 features / 1,000 reads)
BINDING_CRITERION=none
ROOTED_TREE_QZA=/.../11_pre_diversity/rooted-tree.qza
FEATURE_TABLE_QZA=/.../7_taxonomy_csv/feature_table_filtered.qza
METADATA_TXT=/.../metadata.txt
OVERALL_STATUS=PASS
READY_FOR_DIVERSITY=yes
STATUS=COMPLETE
Plus four PNGs that the final report embeds, the official QIIME
rarefaction QZV for interactive inspection, and a plain-language summary
file (mbx_pre_diversity_summary.txt) the user can hand to a reviewer.
Step 11 builds the tree the diversity metrics need, but more importantly it picks the rarefaction depth with three concurrent criteria so the decision is reviewer-defensible. Hurlbert 1971's closed-form rarefaction lets us evaluate the plateau analytically — bit-reproducible without Monte Carlo. The
STATUSfield gates Steps 12 through 15 from running when the depth verdict is too weak.
mbXPro/scripts/mbx_pre_diversity_parameters.sh