Step 3 — DADA2 Parameter Finder

How mbX Pro picks the four numbers DADA2 needs, from the data itself.

Script: scripts/mbx_dada2_parameter_finder.sh (orchestrator) + scripts/create_dada2_parameters_txt.sh (core algorithm)

Contents
  1. Why this step matters
  2. What we tell DADA2
  3. The algorithm, step by step
  4. Default parameters and why
  5. When and why we fall back to defaults
  6. What the output file looks like
  7. Takeaway

Why this step matters

Illumina sequencers don't return perfect reads. Every position in every read comes with a Phred quality score (a number from 0–40 that says how confident the sequencer is about that base). Quality drops off near the back of every read, and it drops off faster on the reverse read than on the forward read.

If we feed those raw reads straight into the denoiser (DADA2 — the algorithm that turns noisy reads into the actual biological variants, called ASVs), two bad things can happen:

The job of this step is to land on the sweet spot — pick the four numbers that DADA2 needs (--p-trim-left-f, --p-trim-left-r, --p-trunc-len-f, --p-trunc-len-r) using the data itself, not a hand-waved guess.


What we tell DADA2

ParameterWhat it means in plain English
--p-trim-left-fHow many bases to chop off the front of the forward read (the primer — the synthetic anchor used during PCR, not biology).
--p-trim-left-rSame, for the front of the reverse read.
--p-trunc-len-fThe length to keep on the forward read (everything past this point is too noisy).
--p-trunc-len-rThe length to keep on the reverse read.

Four numbers. The whole step exists to find these four numbers, defensibly, from the data.


The algorithm, step by step

1Open the QIIME2 artifact

First, the script opens the .qza file. A QIIME2 artifact is just a zip archive (rename it .zip and macOS would happily unzip it). Inside, two things matter:

2Sub-sample reads

Then, instead of reading every single read (hundreds of millions in a real run), the script samples up to 1,000 read pairs per sample (READS_PER_SAMPLE). Quality letters are decoded into integers using Phred+33 encoding (the FASTQ standard — the character '!' means quality 0, 'I' means quality 40, etc.).

This sub-sample is enough to estimate every distribution we care about, and it keeps the whole step under a minute on a normal laptop.

3Figure out how long the biological piece is

Now the clever part. DADA2 needs to know the insert length (the actual DNA between the two primers — the molecule the sequencer was reading). For 16S V4 it's about 253 bases, but every primer pair and every species mix gives a slightly different answer.

Instead of asking the user, the script figures it out from the reads:

This is what lets the script pick a forward-cut and reverse-cut that still leave enough overlap to merge the two reads back together.

4Decide trim-left (where to chop off the primer)

Next, the script decides where each read's biological content actually starts:

It also computes a sanity check: of the reads we sampled, in what fraction did that primer actually match at the 5′ end? The comparison is IUPAC-aware (so Y matches both C and T, N matches anything) and tolerates up to 15 % mismatch. That percentage goes straight into the output file so the user can see at a glance whether the primer was real.

5Walk every read base-by-base — the expected-error trick

Then comes the heart of the algorithm. This replaces "look at the quality plot and guess where to cut".

For every read pair:

6The suffix-sum lookup table

Now we have a scale problem. For every possible (forward-cut, reverse-cut) we want to know: how many of my read pairs survive both cuts? Naively that's millions of look-ups.

So the script builds a 2-D suffix sum — a cumulative-total table in two dimensions (the same trick spreadsheets use for running totals, generalized to a grid). Once built, asking "how many pairs survive at forward-cut=240 and reverse-cut=180?" becomes one array access. That's how it can score hundreds of candidate truncation pairs in a fraction of a second.

7Apply the quality cutoffs

Now we filter and rank. For every candidate cut pair, the script computes three pass rates:

It also requires forward-cut + reverse-cut ≥ insert-length + 12 — that 12 is the minimum merge overlap DADA2 needs to confidently stitch the forward and reverse reads back together. Without it, the reads don't overlap and the pair gets thrown away.

8Pick the winner

Finally, among the eligible candidates, the script:

  1. Keeps the ones within 1 base (LENGTH_SLACK) of the longest total trimmed length — so we discard as little real biology as possible.
  2. Among those, prefers the candidate with the highest pair-pass rate, then the most balanced forward and reverse cuts (asymmetric cuts make the overlap awkward and can introduce bias), then the longest overlap.

The winner gets written into dada2_parameters.txt along with the top 20 runners-up, all three pass rates, the insert length it inferred, and the percentage of reads whose primers actually matched.


Default parameters and why they are what they are

DefaultValueWhy this default
Reads sampled per sample (READS_PER_SAMPLE)1,000Enough to estimate the expected-error distribution within ~1 %; small enough that the step finishes in seconds on a laptop.
Pairs aligned for insert estimation (INSERT_SAMPLE_PAIRS)2,500Enough overlapping pairs to get a clean P99 of the insert length; covers the typical sequencing-batch heterogeneity.
Insert-overlap k-mer size13 basesLong enough to be specific (a random 13-mer hits any genome by chance ~once per 67 million bases); short enough that one or two sequencing errors don't break every seed.
Insert-overlap identity floor≥ 90 %Below this, the alignment is almost certainly spurious. 90 % is the standard cut for paired-end merge tools (vsearch, fastp, PEAR).
Insert length usedP99 of cleaned distributionWe need to cover almost every real fragment but ignore extreme outliers caused by chimeras. P99 strikes that balance.
Trim-left default (no primer known)20 basesMost published 16S primers are 18–22 bases. Twenty is a defensive guess that's never too short on real data.
Primer match tolerance≤ 15 % mismatchIUPAC-aware comparison. 15 % is tight enough to avoid false matches yet loose enough to absorb a couple of sequencing errors in the primer-landing zone.
max-ee (expected-error cap)2.0The DADA2 paper's own default. "At most ~2 wrong bases over the kept length".
Pair-pass rate floor0.80At least 80 % of read pairs must survive the cut — anything less means we're throwing away a quarter of the data.
Forward-pass rate floor0.90Forward reads are good quality on Illumina — we expect 90 %+ to survive. If they don't, the run itself is suspect.
Reverse-pass rate floor0.84Reverse reads are systematically worse; 0.84 is empirically the lowest floor that still rejects clearly broken runs.
Minimum merge overlap12 basesDADA2's own default. Anything shorter and the merge call becomes statistically ambiguous.
Length slack (winner band)1 base"Keep candidates within 1 bp of the longest total length, then pick on quality."

These defaults are not magic numbers — they come from the DADA2 paper (Callahan et al. 2016, Nature Methods) and from empirical sweeps the maintainers ran on real datasets across multiple 16S regions.


When and why we fall back to defaults

The algorithm is designed to always return a recommendation — even on imperfect data — but it tells you, in plain language, every time it had to compromise. There are five places it can fall back:

FallbackWhen it triggersWhy this fallback exists
trim-left = 20 (no primer) User did not pass --forward-primer/--reverse-primer AND Step 0 reported DETECTION_STATUS = UNKNOWN AND no --assume-primer-length was given. Cutting nothing risks leaving primer in the data, which DADA2 would treat as biological signal. 20 bp is the median 16S primer length — a defensive over-trim that's harmless if the primer was shorter and protective if it was longer.
trim-left = 0 (sequencer already trimmed) Step 0 reported DETECTION_STATUS = TRIMMED (it scanned the reads and found no primer-like 5′ anchor). Trimming 20 bp off an already-trimmed read would delete 20 bp of real biology. Better to trust the upstream verdict and not cut.
Use --amplicon-length instead of estimating User supplied --amplicon-length. Estimation needs the forward and reverse reads to actually overlap. If they don't (very long amplicons, short reads), estimation fails. The user override skips estimation entirely.
Relax pass-rate filters to "pair-pass ≥ 0.80 only" No candidate meets all three pass-rate floors at once. A bad run might have a great forward read but a beaten-up reverse read. Rather than fail, the script keeps the most important global cutoff (80 % of pairs survive) and lets the forward/reverse floors slide. It flags the relaxation in the output.
Take any candidate with valid overlap Even the relaxed filter leaves zero candidates. Last-resort. Better to return something the user can inspect than to abort the pipeline at step 3. The output file says clearly that the strict filters could not be met.
The principle: no silent compromises. Every fallback is recorded in the output file under a section called fallback_reasons, so the user can see exactly what gave way and why. Nothing is hidden.

What the output file looks like

The script writes 3_dada2_parameters/dada2_parameters.txt. The interesting keys downstream steps actually consume:

--p-trim-left-f             19
--p-trim-left-r             20
--p-trunc-len-f             240
--p-trunc-len-r             180

INSERT_LENGTH_USED          253
FORWARD_PRIMER_MATCH_RATE   97.3%
REVERSE_PRIMER_MATCH_RATE   95.1%
PAIR_PASS_RATE              0.86
FORWARD_PASS_RATE           0.94
REVERSE_PASS_RATE           0.89
FALLBACK_REASONS            (none)

Plus a "top-20 leaderboard" of candidate truncation pairs so the user can do a five-second sanity check before pressing Run on Step 4.


Takeaway

The DADA2 parameter finder replaces "stare at the quality plot and guess" with a small, fast, deterministic algorithm that uses the data itself: figure out the real insert length from the reads, walk each read until the expected error gets too high, then pick the cut that keeps the most reads while still leaving enough overlap to merge.

Same logic a careful bioinformatician would apply by hand — made reproducible, defensible, and ~60× faster.