Script: scripts/mbx_classifier_run.sh
Companion files in this folder:
- 6_classifier_run.html — same content with copy buttons on every code block.
- 6_classifier_run.pptx — slide deck for the talk.
Step 5 set the table — decided region-specific vs full-length, downloaded or staged the reference data, sha256-verified anything that came from Zenodo. Step 6 is where the actual taxonomic classification runs: given an ASV sequence, what organism does it come from?
The answer has to be defensible, reproducible, and fall back gracefully
if the Zenodo download Step 5 thought was good turns out to be unloadable
in classify-sklearn. That last part is the trick. Zenodo verifies file
integrity by sha256 — but sha256 only proves "the bits are what we
expected", not "the bits are loadable in the user's exact sklearn
build". A subtle pickle incompatibility (sklearn 1.4.0 vs 1.4.2, say) can
slip through the sha check and only fail when classify-sklearn
actually tries to load the model.
Step 6's job is to handle that gracefully — detect the failure, delete the bad artifact, train a fresh classifier locally, retry, and only fail the pipeline if local training also fails. The user shouldn't have to know any of this happened unless they want to.
It reads Step 5's run-info to know the mode + source, then runs three QIIME2 commands (extract-reads → fit-classifier-naive-bayes → classify-sklearn) — skipping the first two when a pre-trained classifier was downloaded, and falling back to local training if the downloaded classifier turns out to be unloadable.
First the script reads 5_classifier_working_dir/mbx_classifier_run_info.txt
and pulls out CLASSIFIER_MODE, CLASSIFIER_SOURCE, and every path it
needs. No CLI arguments — the script gets everything from Step 5's
canonical output.
Then the script decides which of the three QIIME2 commands to actually execute, based on the source:
CLASSIFIER_SOURCE |
extract-reads | fit-classifier | classify-sklearn |
|---|---|---|---|
local-training (region-specific) |
RUN | RUN | RUN |
local-training (full-length) |
SKIP | RUN | RUN |
zenodo or cached |
SKIP | SKIP | RUN |
A zenodo source skips the two expensive sub-steps entirely — that's where
the 30–90 minutes of saving comes from.
For region-specific local training, the script runs:
qiime feature-classifier extract-reads \
--i-sequences gg2.backbone.full-length.fna.qza \
--p-f-primer <forward primer from Step 0> \
--p-r-primer <reverse primer from Step 0> \
--p-min-length <min ASV length from Step 5> \
--p-max-length <max ASV length from Step 5> \
--o-reads gg2_extracted_reads.qza
This slices the GG2 backbone to just the V-region the user actually sequenced, at the same length distribution Step 4 produced. The resulting reference is much smaller — training on it is faster and species-level accuracy goes up a few points.
Then the script runs:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads <extracted_reads or full-length backbone>.qza \
--i-reference-taxonomy gg2.backbone.tax.qza \
--o-classifier gg2_trained_classifier.qza
This is the expensive step — 30–90 minutes of CPU + ~15 GB RAM peak. We only run it when we don't have a pre-trained artifact from Zenodo.
Now the actual classification:
qiime feature-classifier classify-sklearn \
--i-classifier <pre-trained or freshly-trained>.qza \
--i-reads representative_sequences.qza \
--p-n-jobs <MBX_THREADS> \
--o-classification taxonomy.qza
This is where the fall-back gets interesting:
zenodo AND classify-sklearn fails, the script
catches the failure, deletes the downloaded artifact, updates
CLASSIFIER_SOURCE to local-training-fallback, re-runs Step 5's
local-training prep, then re-runs fit-classifier-naive-bayes and
classify-sklearn. The user sees a single retry message; the pipeline
keeps going.classify-sklearn fails,
the script aborts with a detailed error. That failure is almost always a
GG2 version mismatch or a corrupted reference — nothing the script can
fix automatically.Finally the script runs qiime metadata tabulate on the classification
result to produce taxonomy.qzv — an interactive HTML table the user can
open in a browser. Every ASV gets its assigned taxonomy string and a
confidence score.
All three QIIME2 commands write to disk; the script checks for those outputs before running each. So a partial run that died after extract-reads is safe to re-run — it'll skip the expensive completed step and pick up where it left off.
| Default | Value | Why this default |
|---|---|---|
extract-reads --p-f-primer / --p-r-primer |
from Step 0 | We use exactly the primers Step 0 detected. If Step 0 was wrong, we'd rather see that here than mask it. |
extract-reads --p-min-length / --p-max-length |
from Step 5 (the ASV length distribution) | The trained classifier should see the same length range it'll classify. |
fit-classifier --p-classify-with |
Naive Bayes (Multinomial) | The QIIME2-default classifier. Faster than scikit-learn's vanilla MultinomialNB on the same problem, and better calibrated for amplicon data. |
classify-sklearn --p-n-jobs |
MBX_THREADS (orchestrator-exported) |
Scales near-linearly. Single source of truth for parallelism. |
classify-sklearn --p-confidence |
0.7 (QIIME2 default) | Below this confidence the taxonomy string is truncated to the deepest defensible level. |
classify-sklearn --p-pre-dispatch |
'2*n_jobs' |
sklearn's default; balances memory vs throughput. |
| Skip extract-reads | when CLASSIFIER_MODE = full-length |
Full-length training doesn't need a region extracted. |
| Skip fit-classifier | when CLASSIFIER_SOURCE ∈ {zenodo, cached} |
A pre-trained classifier already exists; training again would produce the same artifact for 30–90 min of wasted compute. |
| Reuse on existing outputs | always | Step 6 is idempotent — re-running is safe and cheap when outputs already exist. |
| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
| Skip extract-reads | CLASSIFIER_MODE = full-length |
Full-length mode trains on the whole backbone — no slice needed. |
| Skip fit-classifier | CLASSIFIER_SOURCE ∈ {zenodo, cached} |
A pre-trained artifact exists; rebuilding would produce the same result. |
| Reuse cached intermediate | A previous run already produced gg2_extracted_reads.qza or gg2_trained_classifier.qza for the same parameters |
Saves wall-clock when the user reruns Step 6 after a downstream failure. |
local-training-fallback |
classify-sklearn failed on a zenodo artifact |
The bits matched the sha256 but the classifier won't load — usually a subtle scikit-learn pickle issue. Script deletes the bad file, retrains locally, retries. |
Use orchestrator MBX_THREADS |
mbXPro --threads N was set |
One source of truth for parallelism. |
| Hard fail on local-training failure | Local training fitting failed | This is almost always GG2 corruption or a missing reference — nothing the script can fix automatically. We error loudly so the user can re-download. |
taxonomy.qza is a zip archive containing one row per ASV:
Feature ID Taxon Confidence
ASV_001 d__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus 0.9912
ASV_002 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides 0.9847
ASV_003 d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales 0.6334
ASV_004 Unassigned 0.0142
...
d__ /
p__ / c__ / o__ / f__ / g__ / s__ prefixes.--p-confidence cutoff get the
string truncated to the deepest level above the cutoff.Unassigned.The companion taxonomy.qzv opens this same data in a browser as a
searchable, sortable table.
Step 6 is the actual classification execution. The interesting part isn't the QIIME2 commands — those are standard — it's the fall-back logic: if a Zenodo-sourced classifier fails to load (subtle sklearn pickle mismatch), the script automatically deletes it, regresses to local training, retries, and surfaces the whole event in
CLASSIFIER_SOURCE = local-training-fallbackso the final report can explain what happened. The pipeline doesn't abort because of an upstream cache mismatch.
mbXPro/scripts/mbx_classifier_run.shq2-feature-classifier plugin:
https://docs.qiime2.org/2025.4/plugins/available/feature-classifier/