Contents

Why this step matters
What the script does in one sentence
The algorithm, step by step
Default parameters and why they are what they are
When and why we fall back to defaults
What the output file looks like
Takeaway
Sources

Step 15 — PICRUSt2 (Functional Prediction)

Script: scripts/mbx_picrust_run.sh

Companion files in this folder: - 15_picrust2.html — same content with copy buttons. - 15_picrust2.pptx — slide deck for the talk.

Why this step matters

Every step up to now has answered "what's there" questions. Step 15 answers a "what is it doing" question — at least, predicted from the phylogenetic placement of each ASV.

16S sequencing tells us identity but nothing direct about function. PICRUSt2 (Douglas et al. 2020) fills that gap by exploiting the fact that closely-related organisms tend to share gene families. The algorithm places every ASV on a reference phylogenetic tree, looks up the known gene content of its nearest reference genomes, and infers the expected gene-family abundance per sample. The output is four functional abundance tables:

KO (KEGG Orthology) — individual gene families.
EC (Enzyme Commission) — enzymatic activities.
COG (Clusters of Orthologous Groups) — broad functional categories.
MetaCyc pathways — high-level metabolic pathways.

Step 15's job is to run PICRUSt2 reliably (it lives in its own conda env — see "fallbacks"), flag samples whose ASVs are too distantly related to the reference to trust the prediction, run differential abundance across treatments on each database, and produce a self-contained HTML report so the user can read the functional result alongside the taxonomic ones.

What the script does in one sentence

It runs PICRUSt2 in a pre-baked conda env to predict KO/EC/COG/MetaCyc abundances per sample, computes per-sample NSTI (the phylogenetic distance from each ASV to its nearest reference genome, the official trustworthiness metric), runs Kruskal-Wallis + Dunn differential abundance per categorical variable, optionally correlates genera × pathways with the Step 14 results, and writes a self-contained HTML report.

The algorithm, step by step

1. Gate on Step 11

First, same READY_FOR_DIVERSITY gate. PICRUSt2 uses raw counts (non-rarefied) like ANCOMBC2 — it needs absolute count information for its NSTI weighting.

2. Use the pre-baked `mbx_picrust2` conda env

Then the script confirms a conda env named mbx_picrust2 exists. If the user is running inside the Docker image, the env is baked in at build time (see Phase 3.3 in CHANGELOG.md) — picrust2=2.6.3 from bioconda. If running on a fresh host, the script creates the env on first use (~2 GB download). All picrust2_*.py commands are invoked via conda run -n mbx_picrust2, so the script never modifies the active QIIME2 env.

3. Export the rep-seqs and feature table to PICRUSt2-friendly formats

Next the script converts the Step-4 representative_sequences.qza to a plain FASTA and the Step-7 filtered feature table to a BIOM/TSV pair. Those are PICRUSt2's required inputs.

4. Run `place_seqs.py`

Now PICRUSt2 places every ASV on its reference 20,000-genome tree using epa-ng (placement) + gappa (re-formatting):

place_seqs.py \
  -s rep_seqs.fasta \
  -o placed_seqs.tre \
  -p $THREADS

This is the most computationally expensive sub-step (a few minutes per 1,000 ASVs).

5. Compute NSTI per ASV

Then PICRUSt2 computes the Nearest Sequenced Taxon Index for each ASV — the phylogenetic distance from the ASV to its closest reference genome on the tree. A low NSTI (≤ 0.15) means the prediction is well-grounded in a real reference; a high NSTI (> 1.0) means the ASV is so distantly related to anything sequenced that the predicted gene content is essentially a guess.

The script computes the weighted mean NSTI per sample (read-count- weighted, not just ASV-mean) and **flags any sample with weighted NSTI

1.0** in the final report. Those samples are excluded from the per- variable DA analysis by default.

6. Predict gene-family abundance — three databases

For each of KO, EC, COG:

metagenome_pipeline.py \
  -i feature_table.biom \
  -m marker_nsti_predicted.tsv.gz \
  -f <KO|EC|COG>_predicted.tsv.gz \
  --strat_out          \
  -o <KO|EC|COG>_metagenome/

Output is per-sample predicted abundance of every KO/EC/COG term.

7. Predict MetaCyc pathway abundance

Then the script runs pathway_pipeline.py to roll the EC predictions into MetaCyc pathway abundances. MetaCyc is the standard human-readable pathway database — much friendlier than raw KO codes for biology-focused readers.

8. Per-variable Kruskal-Wallis + Dunn on each database

For every categorical metadata variable (auto-detected, same as Steps 9/10) the script runs:

Kruskal-Wallis on each KO / EC / COG / MetaCyc-pathway abundance.
Benjamini-Hochberg across the gene families.
Dunn pairwise on KW hits.
Compact letter displays on significant pathways.

Outputs: DA_KO_KW_<Variable>.xlsx, DA_EC_KW_<Variable>.xlsx, DA_COG_KW_<Variable>.xlsx, DA_metacyc_pathways_KW_<Variable>.xlsx, each with the per-term p, p_adj, group means, and Dunn pairs for significant terms.

9. Cross-reference with Step 14 (when available)

If Step 14 ran, the script computes Spearman correlations between every (significant differentially-abundant genus from Step 14) and every (significant differentially-abundant MetaCyc pathway). The result is correlation_DAgenera_x_DApathways_<Variable>.{tsv,png,pdf} — a heatmap that lets the reader say "the increase in pathway X is driven by the increase in genus Y".

10. Plots

Stacked-bar top-20 MetaCyc pathways per variable.
Heatmap of differentially-abundant pathways with log-fold-change + significance stars.
Functional PCoA based on Bray-Curtis of MetaCyc abundance — the functional analogue of Step 13's taxonomic PCoA.
Genus × pathway correlation heatmap (when Step 14 is available).

PNG + SVG always; PDF on --publication-figures.

11. Self-contained HTML report

Finally the script writes picrust2_report.html, embedding every plot as inline <img src="data:image/..."> so the user can email the single file. Per-sample NSTI flagging is shown at the top so the reader can see immediately which samples had unreliable predictions.

Default parameters and why they are what they are

Default	Value	Why this default
Conda env name	`mbx_picrust2`	Isolated from the QIIME2 env so we don't pollute the active scientific env. Baked into the Docker image.
PICRUSt2 version	2.6.3	The bioconda-published latest stable at image build. Pinned in the Dockerfile.
NSTI threshold (per-sample flag)	mean NSTI > 1.0	PICRUSt2 maintainers' own recommendation — above this the predictions are too unreliable to act on.
Databases predicted	KO, EC, COG, MetaCyc	The four most-used functional databases in microbiome papers.
Test (per-variable)	Kruskal-Wallis (or Wilcoxon for 2 groups)	Functional abundances are not normal; same reasoning as Step 10.
Multiple-testing correction	Benjamini-Hochberg	Standard FDR for the per-database panel.
Per-database significance threshold	p_adj ≤ 0.05	Field convention; user can re-filter.
Cross-reference with Step 14	always when Step 14 outputs are present	Adds the genus-drives-pathway interpretation for free.
Plot formats	PNG + SVG always; PDF on `--publication-figures`	Publication-ready by default.
Threads	`MBX_THREADS`	Single source of truth.

When and why we fall back to defaults

Fallback	When it triggers	Why this fallback exists
Refuse to run	`READY_FOR_DIVERSITY=no`	Same gate.
Skip COG	Known PICRUSt2 2.6.x bug — COG mapping file uses an HDF5 format that the pathway pipeline can't read	KO + EC + MetaCyc still ship; the report explains the gap. Once PICRUSt2 fixes it, COG will run automatically.
Exclude high-NSTI samples from DA	Weighted mean NSTI > 1.0	Their predictions are unreliable; including them would bias the test. They're still in the per-sample table.
Skip genus×pathway correlation	Step 14 didn't run or had no significant genera	No data; the correlation block is omitted from the report.
Reuse cached intermediates	A previous run produced placed_seqs.tre + the predicted tables	Saves the most expensive sub-step on re-runs.
First-run env creation	`mbx_picrust2` env doesn't exist (running outside Docker)	One-time ~2 GB bioconda download. The Docker image avoids this.

What the output file looks like

Summary_picrust2_NSTI.xlsx:

sample-id	mean_NSTI	weighted_mean_NSTI	flag
SampleA	0.041	0.029	RELIABLE
SampleB	0.053	0.038	RELIABLE
SampleC	1.214	1.142	UNRELIABLE — excluded from DA

DA_metacyc_pathways_KW_<Variable>.xlsx (per-pathway KW results):

pathway	description	KW_chi2	p_value	p_adj	significant
PWY-7242	D-fructuronate degradation	13.92	0.0009	0.0048	TRUE
PWY-6383	Mono-trans, poly-cis decaprenyl phosphate biosynthesis	7.81	0.020	0.041	TRUE
...

Plus the cross-referenced correlation_DAgenera_x_DApathways_<Variable>.tsv when Step 14 is present, and the self-contained HTML report.

Takeaway

Step 15 turns the taxonomic 16S story into a functional story — phylogenetic placement + nearest-relative gene lookup + per-sample NSTI trust scoring. The script keeps PICRUSt2 in its own conda env so it doesn't poison the QIIME2 env, and bakes that env into the Docker image so the pipeline is fully offline-capable.

Sources

The script: mbXPro/scripts/mbx_picrust_run.sh
PICRUSt2: Douglas et al. (2020), PICRUSt2 for prediction of metagenome functions, Nature Biotechnology 38:685–688.
NSTI: Langille et al. (2013), Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences, Nature Biotechnology 31:814.
MetaCyc: Caspi et al. (2020), The MetaCyc database of metabolic pathways and enzymes, NAR 48:D445.
KEGG: Kanehisa et al. (2017), KEGG: new perspectives on genomes, pathways, diseases and drugs, NAR 45:D353.