Script: scripts/mbx_picrust_run.sh
Companion files in this folder:
- 15_picrust2.html — same content with copy buttons.
- 15_picrust2.pptx — slide deck for the talk.
Every step up to now has answered "what's there" questions. Step 15 answers a "what is it doing" question — at least, predicted from the phylogenetic placement of each ASV.
16S sequencing tells us identity but nothing direct about function. PICRUSt2 (Douglas et al. 2020) fills that gap by exploiting the fact that closely-related organisms tend to share gene families. The algorithm places every ASV on a reference phylogenetic tree, looks up the known gene content of its nearest reference genomes, and infers the expected gene-family abundance per sample. The output is four functional abundance tables:
Step 15's job is to run PICRUSt2 reliably (it lives in its own conda env — see "fallbacks"), flag samples whose ASVs are too distantly related to the reference to trust the prediction, run differential abundance across treatments on each database, and produce a self-contained HTML report so the user can read the functional result alongside the taxonomic ones.
It runs PICRUSt2 in a pre-baked conda env to predict KO/EC/COG/MetaCyc abundances per sample, computes per-sample NSTI (the phylogenetic distance from each ASV to its nearest reference genome, the official trustworthiness metric), runs Kruskal-Wallis + Dunn differential abundance per categorical variable, optionally correlates genera × pathways with the Step 14 results, and writes a self-contained HTML report.
First, same READY_FOR_DIVERSITY gate. PICRUSt2 uses raw counts
(non-rarefied) like ANCOMBC2 — it needs absolute count information for
its NSTI weighting.
mbx_picrust2 conda envThen the script confirms a conda env named mbx_picrust2 exists. If
the user is running inside the Docker image, the env is baked in at
build time (see Phase 3.3 in CHANGELOG.md) — picrust2=2.6.3 from
bioconda. If running on a fresh host, the script creates the env on first
use (~2 GB download). All picrust2_*.py commands are invoked via
conda run -n mbx_picrust2, so the script never modifies the active
QIIME2 env.
Next the script converts the Step-4 representative_sequences.qza to a plain FASTA and the Step-7 filtered feature table to a BIOM/TSV pair. Those are PICRUSt2's required inputs.
place_seqs.pyNow PICRUSt2 places every ASV on its reference 20,000-genome tree
using epa-ng (placement) + gappa (re-formatting):
place_seqs.py \
-s rep_seqs.fasta \
-o placed_seqs.tre \
-p $THREADS
This is the most computationally expensive sub-step (a few minutes per 1,000 ASVs).
Then PICRUSt2 computes the Nearest Sequenced Taxon Index for each ASV — the phylogenetic distance from the ASV to its closest reference genome on the tree. A low NSTI (≤ 0.15) means the prediction is well-grounded in a real reference; a high NSTI (> 1.0) means the ASV is so distantly related to anything sequenced that the predicted gene content is essentially a guess.
The script computes the weighted mean NSTI per sample (read-count- weighted, not just ASV-mean) and **flags any sample with weighted NSTI
1.0** in the final report. Those samples are excluded from the per- variable DA analysis by default.
For each of KO, EC, COG:
metagenome_pipeline.py \
-i feature_table.biom \
-m marker_nsti_predicted.tsv.gz \
-f <KO|EC|COG>_predicted.tsv.gz \
--strat_out \
-o <KO|EC|COG>_metagenome/
Output is per-sample predicted abundance of every KO/EC/COG term.
Then the script runs pathway_pipeline.py to roll the EC predictions
into MetaCyc pathway abundances. MetaCyc is the standard
human-readable pathway database — much friendlier than raw KO codes for
biology-focused readers.
For every categorical metadata variable (auto-detected, same as Steps 9/10) the script runs:
Outputs: DA_KO_KW_<Variable>.xlsx, DA_EC_KW_<Variable>.xlsx,
DA_COG_KW_<Variable>.xlsx, DA_metacyc_pathways_KW_<Variable>.xlsx,
each with the per-term p, p_adj, group means, and Dunn pairs for
significant terms.
If Step 14 ran, the script computes Spearman correlations between
every (significant differentially-abundant genus from Step 14) and
every (significant differentially-abundant MetaCyc pathway). The result
is correlation_DAgenera_x_DApathways_<Variable>.{tsv,png,pdf} — a
heatmap that lets the reader say "the increase in pathway X is driven
by the increase in genus Y".
PNG + SVG always; PDF on --publication-figures.
Finally the script writes picrust2_report.html, embedding every
plot as inline <img src="data:image/..."> so the user can email the
single file. Per-sample NSTI flagging is shown at the top so the reader
can see immediately which samples had unreliable predictions.
| Default | Value | Why this default |
|---|---|---|
| Conda env name | mbx_picrust2 |
Isolated from the QIIME2 env so we don't pollute the active scientific env. Baked into the Docker image. |
| PICRUSt2 version | 2.6.3 | The bioconda-published latest stable at image build. Pinned in the Dockerfile. |
| NSTI threshold (per-sample flag) | mean NSTI > 1.0 | PICRUSt2 maintainers' own recommendation — above this the predictions are too unreliable to act on. |
| Databases predicted | KO, EC, COG, MetaCyc | The four most-used functional databases in microbiome papers. |
| Test (per-variable) | Kruskal-Wallis (or Wilcoxon for 2 groups) | Functional abundances are not normal; same reasoning as Step 10. |
| Multiple-testing correction | Benjamini-Hochberg | Standard FDR for the per-database panel. |
| Per-database significance threshold | p_adj ≤ 0.05 | Field convention; user can re-filter. |
| Cross-reference with Step 14 | always when Step 14 outputs are present | Adds the genus-drives-pathway interpretation for free. |
| Plot formats | PNG + SVG always; PDF on --publication-figures |
Publication-ready by default. |
| Threads | MBX_THREADS |
Single source of truth. |
| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
| Refuse to run | READY_FOR_DIVERSITY=no |
Same gate. |
| Skip COG | Known PICRUSt2 2.6.x bug — COG mapping file uses an HDF5 format that the pathway pipeline can't read | KO + EC + MetaCyc still ship; the report explains the gap. Once PICRUSt2 fixes it, COG will run automatically. |
| Exclude high-NSTI samples from DA | Weighted mean NSTI > 1.0 | Their predictions are unreliable; including them would bias the test. They're still in the per-sample table. |
| Skip genus×pathway correlation | Step 14 didn't run or had no significant genera | No data; the correlation block is omitted from the report. |
| Reuse cached intermediates | A previous run produced placed_seqs.tre + the predicted tables | Saves the most expensive sub-step on re-runs. |
| First-run env creation | mbx_picrust2 env doesn't exist (running outside Docker) |
One-time ~2 GB bioconda download. The Docker image avoids this. |
Summary_picrust2_NSTI.xlsx:
| sample-id | mean_NSTI | weighted_mean_NSTI | flag |
|---|---|---|---|
| SampleA | 0.041 | 0.029 | RELIABLE |
| SampleB | 0.053 | 0.038 | RELIABLE |
| SampleC | 1.214 | 1.142 | UNRELIABLE — excluded from DA |
DA_metacyc_pathways_KW_<Variable>.xlsx (per-pathway KW results):
| pathway | description | KW_chi2 | p_value | p_adj | significant |
|---|---|---|---|---|---|
| PWY-7242 | D-fructuronate degradation | 13.92 | 0.0009 | 0.0048 | TRUE |
| PWY-6383 | Mono-trans, poly-cis decaprenyl phosphate biosynthesis | 7.81 | 0.020 | 0.041 | TRUE |
| ... |
Plus the cross-referenced correlation_DAgenera_x_DApathways_<Variable>.tsv
when Step 14 is present, and the self-contained HTML report.
Step 15 turns the taxonomic 16S story into a functional story — phylogenetic placement + nearest-relative gene lookup + per-sample NSTI trust scoring. The script keeps PICRUSt2 in its own conda env so it doesn't poison the QIIME2 env, and bakes that env into the Docker image so the pipeline is fully offline-capable.
mbXPro/scripts/mbx_picrust_run.sh