Contents
  1. Why this step matters
  2. What the script does in one sentence
  3. The algorithm, step by step
  4. Default parameters and why they are what they are
  5. When and why we fall back to defaults
  6. What the output file looks like
  7. Takeaway
  8. Sources

Step 7 — Taxonomy CSVs + Mito/Chloro Filter

Script: scripts/mbx_taxonomy_run.sh

Companion files in this folder:
  - 7_taxonomy_run.html: same content with copy buttons on every code block.
  - 7_taxonomy_run.pptx: slide deck for the talk.


Why this step matters

Steps 4 and 6 left us with two artifacts: the feature table (Step 4) and the taxonomy assignments for its features (Step 6).

The rest of the pipeline (diversity statistics, differential abundance, machine learning) needs these two combined into per-level relative abundance tables: a domain table, a phylum table, a class table, and so on down to species. The downstream R steps consume those tables, not the QIIME2 QZA files.

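But there's also a contamination problem specific to 16S amplicon analysis. The 16S rRNA primers also amplify the rRNA from mitochondria and chloroplasts, the organelles carried by any animal or plant material in the sample.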

If we leave those reads in, they get classified as plant or animal mitochondrial / chloroplast lineages and show up as bizarre "bacteria" in every downstream analysis. Worse, they often dominate the read counts in any sample with plant or animal material — a single bovine mitochondrial ASV can hijack 30 % of a rumen sample's reads.

Step 7's job is to remove that contamination before anyone downstream ever sees the data, and to produce the seven level CSVs cleanly.


What the script does in one sentence

It filters mitochondrial and chloroplast features out of the Step 4 feature table using the Step 6 taxonomy strings, builds a QIIME2 taxa barplot from the filtered data, exports the barplot to seven level-N.csv files (one per taxonomic level), and writes the cross-step paths Step 8 onwards will read.


The algorithm, step by step

1. Read the upstream info files

First the script reads 5_classifier_working_dir/mbx_classifier_run_info.txt to find the feature-table path and the taxonomy path. It uses Step 5's canonical paths so a re-run with a different working directory layout still works.
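A minimal sketch of pulling one path out of the upstream info file. It assumes the file stores one KEY=VALUE pair per line; the real file format is not shown here, and read_info_value is a hypothetical helper, not a function taken from the script.

```shell
# Print the value stored under a key in a KEY=VALUE info file.
# Assumption: one pair per line, no quoting; the real file format may differ.
# $1 = info file, $2 = key
read_info_value() {
    awk -v key="$2" '
        # Match only lines that start with "KEY=" and print what follows the first "=".
        index($0, key "=") == 1 { print substr($0, length(key) + 2); found = 1; exit }
        # Exit non-zero when the key was never seen.
        END { exit found ? 0 : 1 }
    ' "$1"
}
```

With a hypothetical key name such as FEATURE_TABLE, a call would look like read_info_value 5_classifier_working_dir/mbx_classifier_run_info.txt FEATURE_TABLE. The non-zero exit status on a missing key lets the caller stop early instead of passing an empty path to QIIME2.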

2. Filter mitochondria and chloroplasts

Then the script runs:

qiime taxa filter-table \
  --i-table        feature_table.qza \
  --i-taxonomy     taxonomy.qza \
  --p-mode         contains \
  --p-exclude      mitochondria,chloroplast \
  --o-filtered-table feature_table_filtered.qza

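Together, --p-mode contains and --p-exclude mitochondria,chloroplast tell QIIME2 to drop any feature whose taxonomy string contains either keyword, matched case-insensitively. This catches annotations such as c__Mitochondria, f__Mitochondria and c__Chloroplast, at whatever rank the keyword appears.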

The script logs how many features were removed and what fraction of the total reads they accounted for — that number is a useful sanity check (typically 0–5 % in stool / soil; up to 30 % in plant material or rumen samples).
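To reason about which taxonomy strings the filter would drop, the matching rule can be mimicked in plain shell. This is only an illustration of a case-insensitive substring test, not QIIME2's own implementation; taxon_is_excluded is a hypothetical helper.

```shell
# Return 0 if the taxonomy string contains any of the comma-separated terms,
# ignoring case; return 1 otherwise.
# $1 = taxonomy string, $2 = terms such as "mitochondria,chloroplast"
taxon_is_excluded() {
    awk -v taxon="$1" -v terms="$2" 'BEGIN {
        n = split(tolower(terms), term, ",")
        taxon = tolower(taxon)
        for (i = 1; i <= n; i++)
            if (term[i] != "" && index(taxon, term[i]) > 0) exit 0
        exit 1
    }'
}
```

For example, taxon_is_excluded 'd__Bacteria;p__Bacillota' 'mitochondria,chloroplast' returns 1 (kept), while any string containing Chloroplast or Mitochondria at any rank returns 0 (dropped).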

3. Build the taxa barplot

Next the script runs:

qiime taxa barplot \
  --i-table       feature_table_filtered.qza \
  --i-taxonomy    taxonomy.qza \
  --m-metadata-file metadata.txt \
  --o-visualization taxa_bar_plots.qzv

taxa_bar_plots.qzv is the interactive HTML the user can open to look at the stacked-bar composition of every sample at every taxonomic level. It's the first qualitative result the user actually sees, and it tells them whether the data looks right.

4. Export the seven level CSVs

Then the script unzips the .qzv file (a QZV is also a zip archive), finds the seven level-N.csv files inside (one for each of N = 1 through 7 → domain, phylum, class, order, family, genus, species), and copies them to the output directory.
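A small sketch of two checks that fit this step: mapping a level number to its rank name, and confirming that all seven files landed in the output directory. Both function names are hypothetical and are not taken from the script.

```shell
# Map a level number (1-7) to the rank name used in this document.
level_rank_name() {
    case "$1" in
        1) echo domain ;;
        2) echo phylum ;;
        3) echo class ;;
        4) echo order ;;
        5) echo family ;;
        6) echo genus ;;
        7) echo species ;;
        *) echo "unknown level: $1" >&2; return 1 ;;
    esac
}

# Print the name of every level-N.csv (N = 1..7) missing from a directory.
# Prints nothing when all seven files are present.
# $1 = output directory
missing_level_csvs() {
    for n in 1 2 3 4 5 6 7; do
        [ -f "$1/level-$n.csv" ] || echo "level-$n.csv"
    done
}
```

An empty result from missing_level_csvs 7_taxonomy_csv means the export is complete; anything else names the files a downstream step would fail to find.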

These seven CSVs are the single source of truth that every downstream analysis step consumes. They're plain CSV — one row per sample, one column per taxon at that level, plus the metadata columns. No QIIME2 wrapping; any tool (R, pandas, Excel) can read them.
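Because the files are plain CSV, even a one-line awk call can inspect them. The sketch below counts samples and columns; it assumes a single header row and no quoted commas inside fields, which may not hold if metadata values contain commas. level_csv_shape is a hypothetical helper.

```shell
# Print "<samples> <columns>" for a level CSV.
# Assumption: one header row, comma-separated, no quoted commas inside fields.
# $1 = path to a level-N.csv file
level_csv_shape() {
    awk -F',' '
        NR == 1 { cols = NF }          # header row defines the column count
        NR > 1 && NF > 0 { rows++ }    # every non-empty line after it is one sample
        END { print rows + 0, cols + 0 }
    ' "$1"
}
```

Comparing the sample count against the metadata file is a quick way to spot samples that dropped out during filtering.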

5. Write the cross-step info file

Finally the script writes mbx_taxonomy_info.txt with the cross-step paths that Step 8 onwards reads, including the LEVEL_7_CSV field.


Default parameters and why they are what they are

| Default | Value | Why this default |
|---|---|---|
| Filter mode | contains | Case-insensitive substring match. Catches c__Mitochondria, f__Mitochondria, c__Chloroplast, and anything weird with those keywords. |
| Filter terms | mitochondria,chloroplast | The two contamination sources 16S primers always amplify in eukaryotic-containing samples. |
| Levels exported | 1 through 7 (domain → species) | Every taxonomic level downstream R + Python steps need. Level 7 in particular contains the full string from which higher levels can be reconstructed. |
| Barplot output | taxa_bar_plots.qzv | QIIME2's interactive HTML viewer, the first place a user spots whether the data looks reasonable. |
| Cross-step contract field | LEVEL_7_CSV | The full hierarchy lives in level-7. Step 8 (ezclean) parses it to rebuild higher levels with consistent naming. |
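A minimal sketch of writing the LEVEL_7_CSV contract field. It assumes the info file uses KEY=VALUE lines; any other fields the real script writes are not shown in this document and are left out here. write_taxonomy_info is a hypothetical helper.

```shell
# Write the cross-step info file with the LEVEL_7_CSV field.
# Assumption: KEY=VALUE lines; other fields written by the real script are omitted.
# $1 = directory holding the level CSVs, $2 = info file to write
write_taxonomy_info() {
    printf 'LEVEL_7_CSV=%s\n' "$1/level-7.csv" > "$2"
}
```

Publishing only a path keeps the contract small: a later step needs no knowledge of Step 7's directory layout, just the one key it reads.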

When and why we fall back to defaults

| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
| Re-use cached feature_table_filtered.qza | A previous run already produced it with the same input hashes | Saves a re-filter on repeated runs. |
| Log + continue when 0 features removed | Sample is purely microbial (faecal, soil, water) with no eukaryotic host | Not an error, just unusual. The log says "0 mito/chloro features removed". |
| Log + continue when > 30 % features removed | Plant or animal-tissue-dominated sample (rumen, root, tongue swab) | Not an error, but the user should know. The final report flags this number. |
| Skip mitochondria term but keep chloroplast | Not yet exposed as a flag, but reserved for the rare case of needing mitochondrial rRNA itself (e.g. eDNA studies) | Documented for future extension; default behaviour removes both. |
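The two "log + continue" rows can be sketched as a single function that never fails and only changes the message. The 30 % threshold and the "0 mito/chloro features removed" text come from the table; the WARNING wording and the name describe_removal are assumptions made for illustration.

```shell
# Print a log line describing the filter outcome.
# Always exits 0: both edge cases are "log + continue", not errors.
# $1 = feature count before filtering, $2 = feature count after filtering
describe_removal() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        removed = before - after
        pct = (before > 0) ? 100 * removed / before : 0
        if (removed == 0)
            print "0 mito/chloro features removed"
        else if (pct > 30)
            printf "WARNING: %d mito/chloro features removed (%.1f %%)\n", removed, pct
        else
            printf "%d mito/chloro features removed (%.1f %%)\n", removed, pct
    }'
}
```

The same arithmetic applies to read totals instead of feature counts if the fraction of reads is the number being reported.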

What the output file looks like

7_taxonomy_csv/level-7.csv (example, truncated for readability):

sample-id   d__Bacteria;p__Bacillota;...;s__Lactobacillus_acidophilus   d__Bacteria;p__Bacteroidota;...;s__Bacteroides_fragilis   ...
SampleA     0.0214                                                     0.0089                                                   ...
SampleB     0.0142                                                     0.0312                                                   ...
...

Plus six analogous files for levels 1 through 6, the filtered QZA, and the interactive taxa_bar_plots.qzv for the user's first sanity check.


Takeaway

Step 7 is the last QIIME2 step in the pipeline. It's also where the most common 16S-data gotcha (mitochondrial and chloroplast contamination) gets caught — once, here, for the whole downstream pipeline. The seven level CSVs it produces become the single source of truth that every R and Python step from Step 8 onwards consumes. From now on we never touch QIIME2 again.


Sources