Contents
  1. Why this step matters
  2. What the script does in one sentence
  3. The algorithm, step by step
  4. Default parameters and why they are what they are
  5. When and why we fall back to defaults
  6. What the output file looks like
  7. Takeaway
  8. Sources

Step 15 — PICRUSt2 (Functional Prediction)

Script: scripts/mbx_picrust_run.sh

Companion files in this folder: - 15_picrust2.html — same content with copy buttons. - 15_picrust2.pptx — slide deck for the talk.


Why this step matters

Every step up to now has answered "what's there" questions. Step 15 answers a "what is it doing" question — at least, predicted from the phylogenetic placement of each ASV.

16S sequencing tells us identity but nothing direct about function. PICRUSt2 (Douglas et al. 2020) fills that gap by exploiting the fact that closely-related organisms tend to share gene families. The algorithm places every ASV on a reference phylogenetic tree, looks up the known gene content of its nearest reference genomes, and infers the expected gene-family abundance per sample. The output is four functional abundance tables:

Step 15's job is to run PICRUSt2 reliably (it lives in its own conda env — see "fallbacks"), flag samples whose ASVs are too distantly related to the reference to trust the prediction, run differential abundance across treatments on each database, and produce a self-contained HTML report so the user can read the functional result alongside the taxonomic ones.


What the script does in one sentence

It runs PICRUSt2 in a pre-baked conda env to predict KO/EC/COG/MetaCyc abundances per sample, computes per-sample NSTI (the phylogenetic distance from each ASV to its nearest reference genome, the official trustworthiness metric), runs Kruskal-Wallis + Dunn differential abundance per categorical variable, optionally correlates genera × pathways with the Step 14 results, and writes a self-contained HTML report.


The algorithm, step by step

1. Gate on Step 11

First, same READY_FOR_DIVERSITY gate. PICRUSt2 uses raw counts (non-rarefied) like ANCOMBC2 — it needs absolute count information for its NSTI weighting.

2. Use the pre-baked mbx_picrust2 conda env

Then the script confirms a conda env named mbx_picrust2 exists. If the user is running inside the Docker image, the env is baked in at build time (see Phase 3.3 in CHANGELOG.md) — picrust2=2.6.3 from bioconda. If running on a fresh host, the script creates the env on first use (~2 GB download). All picrust2_*.py commands are invoked via conda run -n mbx_picrust2, so the script never modifies the active QIIME2 env.

3. Export the rep-seqs and feature table to PICRUSt2-friendly formats

Next the script converts the Step-4 representative_sequences.qza to a plain FASTA and the Step-7 filtered feature table to a BIOM/TSV pair. Those are PICRUSt2's required inputs.

4. Run place_seqs.py

Now PICRUSt2 places every ASV on its reference 20,000-genome tree using epa-ng (placement) + gappa (re-formatting):

place_seqs.py \
  -s rep_seqs.fasta \
  -o placed_seqs.tre \
  -p $THREADS

This is the most computationally expensive sub-step (a few minutes per 1,000 ASVs).

5. Compute NSTI per ASV

Then PICRUSt2 computes the Nearest Sequenced Taxon Index for each ASV — the phylogenetic distance from the ASV to its closest reference genome on the tree. A low NSTI (≤ 0.15) means the prediction is well-grounded in a real reference; a high NSTI (> 1.0) means the ASV is so distantly related to anything sequenced that the predicted gene content is essentially a guess.

The script computes the weighted mean NSTI per sample (read-count- weighted, not just ASV-mean) and **flags any sample with weighted NSTI

1.0** in the final report. Those samples are excluded from the per- variable DA analysis by default.

6. Predict gene-family abundance — three databases

For each of KO, EC, COG:

metagenome_pipeline.py \
  -i feature_table.biom \
  -m marker_nsti_predicted.tsv.gz \
  -f <KO|EC|COG>_predicted.tsv.gz \
  --strat_out          \
  -o <KO|EC|COG>_metagenome/

Output is per-sample predicted abundance of every KO/EC/COG term.

7. Predict MetaCyc pathway abundance

Then the script runs pathway_pipeline.py to roll the EC predictions into MetaCyc pathway abundances. MetaCyc is the standard human-readable pathway database — much friendlier than raw KO codes for biology-focused readers.

8. Per-variable Kruskal-Wallis + Dunn on each database

For every categorical metadata variable (auto-detected, same as Steps 9/10) the script runs:

Outputs: DA_KO_KW_<Variable>.xlsx, DA_EC_KW_<Variable>.xlsx, DA_COG_KW_<Variable>.xlsx, DA_metacyc_pathways_KW_<Variable>.xlsx, each with the per-term p, p_adj, group means, and Dunn pairs for significant terms.

9. Cross-reference with Step 14 (when available)

If Step 14 ran, the script computes Spearman correlations between every (significant differentially-abundant genus from Step 14) and every (significant differentially-abundant MetaCyc pathway). The result is correlation_DAgenera_x_DApathways_<Variable>.{tsv,png,pdf} — a heatmap that lets the reader say "the increase in pathway X is driven by the increase in genus Y".

10. Plots

PNG + SVG always; PDF on --publication-figures.

11. Self-contained HTML report

Finally the script writes picrust2_report.html, embedding every plot as inline <img src="data:image/..."> so the user can email the single file. Per-sample NSTI flagging is shown at the top so the reader can see immediately which samples had unreliable predictions.


Default parameters and why they are what they are

Default Value Why this default
Conda env name mbx_picrust2 Isolated from the QIIME2 env so we don't pollute the active scientific env. Baked into the Docker image.
PICRUSt2 version 2.6.3 The bioconda-published latest stable at image build. Pinned in the Dockerfile.
NSTI threshold (per-sample flag) mean NSTI > 1.0 PICRUSt2 maintainers' own recommendation — above this the predictions are too unreliable to act on.
Databases predicted KO, EC, COG, MetaCyc The four most-used functional databases in microbiome papers.
Test (per-variable) Kruskal-Wallis (or Wilcoxon for 2 groups) Functional abundances are not normal; same reasoning as Step 10.
Multiple-testing correction Benjamini-Hochberg Standard FDR for the per-database panel.
Per-database significance threshold p_adj ≤ 0.05 Field convention; user can re-filter.
Cross-reference with Step 14 always when Step 14 outputs are present Adds the genus-drives-pathway interpretation for free.
Plot formats PNG + SVG always; PDF on --publication-figures Publication-ready by default.
Threads MBX_THREADS Single source of truth.

When and why we fall back to defaults

Fallback When it triggers Why this fallback exists
Refuse to run READY_FOR_DIVERSITY=no Same gate.
Skip COG Known PICRUSt2 2.6.x bug — COG mapping file uses an HDF5 format that the pathway pipeline can't read KO + EC + MetaCyc still ship; the report explains the gap. Once PICRUSt2 fixes it, COG will run automatically.
Exclude high-NSTI samples from DA Weighted mean NSTI > 1.0 Their predictions are unreliable; including them would bias the test. They're still in the per-sample table.
Skip genus×pathway correlation Step 14 didn't run or had no significant genera No data; the correlation block is omitted from the report.
Reuse cached intermediates A previous run produced placed_seqs.tre + the predicted tables Saves the most expensive sub-step on re-runs.
First-run env creation mbx_picrust2 env doesn't exist (running outside Docker) One-time ~2 GB bioconda download. The Docker image avoids this.

What the output file looks like

Summary_picrust2_NSTI.xlsx:

sample-id mean_NSTI weighted_mean_NSTI flag
SampleA 0.041 0.029 RELIABLE
SampleB 0.053 0.038 RELIABLE
SampleC 1.214 1.142 UNRELIABLE — excluded from DA

DA_metacyc_pathways_KW_<Variable>.xlsx (per-pathway KW results):

pathway description KW_chi2 p_value p_adj significant
PWY-7242 D-fructuronate degradation 13.92 0.0009 0.0048 TRUE
PWY-6383 Mono-trans, poly-cis decaprenyl phosphate biosynthesis 7.81 0.020 0.041 TRUE
...

Plus the cross-referenced correlation_DAgenera_x_DApathways_<Variable>.tsv when Step 14 is present, and the self-contained HTML report.


Takeaway

Step 15 turns the taxonomic 16S story into a functional story — phylogenetic placement + nearest-relative gene lookup + per-sample NSTI trust scoring. The script keeps PICRUSt2 in its own conda env so it doesn't poison the QIIME2 env, and bakes that env into the Docker image so the pipeline is fully offline-capable.


Sources