Contents
  1. Why this step matters
  2. What the script does in one sentence
  3. The algorithm, step by step
  4. Default parameters and why they are what they are
  5. When and why we fall back to defaults
  6. What the output file looks like
  7. Takeaway
  8. Sources

Step 1 — Manifest Builder

Script: scripts/create_manifest.sh

Companion files in this folder:
- 1_create_manifest.html: same content with copy buttons on every code block.
- 1_create_manifest.pptx: slide deck for the talk.


Why this step matters

QIIME2 — the platform we use for everything from denoising to taxonomy — can't read raw FASTQ files directly. Before it can do anything, it needs an import manifest: a small tab-separated table that says for every sample, which FASTQ file on disk is the forward read and which is the reverse.

That sounds trivial. It is not. Real sequencing data shows up with five different naming conventions depending on which facility delivered it: R1/R2, _1/_2, READ1/READ2, forward/reverse, and unmarked single-end files.

A mismatch — forward labelled as reverse, two samples merged into one, or a sample present in the FASTQ folder but missing from the metadata — produces silent errors that the downstream steps inherit as "weird results" that take hours to debug. The manifest builder's job is to never let that happen.


What the script does in one sentence

It scans a FASTQ directory, classifies every file by direction and sample, reconciles the sample IDs against the user's metadata file, and produces a QIIME2-compatible TSV with absolute paths — failing loudly the moment anything inconsistent appears.


The algorithm, step by step

1. Walk the FASTQ directory

First, the script finds every file matching *.fastq.gz, *.fq.gz, *.fastq, or *.fq in the directory the user pointed to. By default it also peeks one level above and one level below, so users who keep their FASTQs in a subdirectory don't have to point at the exact folder.
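As a rough illustration of that walk, here is a minimal Python sketch. The function names and the search order (given directory first, then its immediate children, then its parent) are assumptions made for the example; the real script may order or combine the levels differently.

```python
from pathlib import Path

# The four extensions the step recognises.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def is_fastq(path: Path) -> bool:
    """True for regular files whose name ends in a recognised FASTQ extension."""
    return path.is_file() and path.name.endswith(FASTQ_SUFFIXES)


def find_fastqs(start: Path) -> list[Path]:
    """Return the FASTQ files of the first directory level that has any.

    Assumed order: the given directory, its immediate children, its parent.
    """
    start = Path(start).resolve()
    children = sorted(p for p in start.iterdir() if p.is_dir())
    for directory in [start, *children, start.parent]:
        hits = sorted(p for p in directory.iterdir() if is_fastq(p))
        if hits:
            return hits
    return []
```

Returning only the first level that contains FASTQ files keeps two unrelated runs from being merged into one manifest by accident, which is why the sketch does not pool all three levels.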

2. Classify each filename

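Then an embedded Python helper applies a regex cascade to every filename, in this priority order: R1/R2, then _1/_2, then READ1/READ2, then forward/reverse, and finally single-end for anything that carries no direction marker.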

Both the classifier and the priority order are identical to the ones Step 0 (the primer identifier) uses internally — so Step 0 and Step 1 can never disagree about which file is which.
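To make the cascade idea concrete, here is a small sketch of a first-match-wins classifier. The regular expressions are illustrative guesses, not the patterns the script or Step 0 actually use; only the priority order comes from this document.

```python
import re

# Hypothetical patterns, listed in the documented priority order.
CASCADE = [
    ("R1/R2", re.compile(r"(?:^|[._-])R([12])(?=[._-])")),
    ("_1/_2", re.compile(r"_([12])(?=\.(?:fastq|fq)(?:\.gz)?$)")),
    ("READ1/READ2", re.compile(r"(?:^|[._-])READ([12])(?=[._-])", re.IGNORECASE)),
    ("forward/reverse", re.compile(r"(?:^|[._-])(forward|reverse)(?=[._-])", re.IGNORECASE)),
]


def classify(filename: str) -> tuple[str, str]:
    """Return (direction, matched_convention) for one FASTQ filename.

    The first pattern that matches wins; a file that matches nothing
    is treated as single-end.
    """
    for convention, pattern in CASCADE:
        match = pattern.search(filename)
        if match:
            token = match.group(1).lower()
            direction = "forward" if token in ("1", "forward") else "reverse"
            return direction, convention
    return "single", "single"
```

Because the loop stops at the first hit, a name such as SampleA_R1_001.fastq.gz is settled by the R1/R2 rule and never reaches the looser patterns further down.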

3. Infer the sample ID

Now the script strips the direction marker out of each filename to recover the sample ID: SampleA_R1_001.fastq.gz → SampleA_001 → SampleA. The stripping rules are conservative: anything that looks like an Illumina lane or chunk number (_L001, _001) gets dropped only if doing so keeps every sample ID unique.
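The two-stage stripping can be sketched as follows. The patterns and the all-or-nothing uniqueness check are assumptions for illustration; the script's own rules may decide per sample rather than per directory.

```python
import re

EXTENSION = re.compile(r"\.(?:fastq|fq)(?:\.gz)?$")

# Hypothetical direction markers, tried in the documented priority order.
DIRECTION_MARKERS = [
    re.compile(r"[._-]R[12](?=[._-]|$)"),
    re.compile(r"_[12]$"),
    re.compile(r"[._-]READ[12](?=[._-]|$)", re.IGNORECASE),
    re.compile(r"[._-](?:forward|reverse)(?=[._-]|$)", re.IGNORECASE),
]

# Illumina-style lane (_L001) or chunk (_001) tokens.
LANE_OR_CHUNK = re.compile(r"_(?:L\d{3}|\d{3})(?=_|$)")


def strip_direction(filename: str) -> str:
    """Drop the extension and the first direction marker that matches."""
    stem = EXTENSION.sub("", filename)
    for marker in DIRECTION_MARKERS:
        stripped, count = marker.subn("", stem, count=1)
        if count:
            return stripped
    return stem


def infer_sample_ids(filenames: list[str]) -> dict[str, str]:
    """Map each filename to a sample ID, shortening only when it is safe."""
    long_ids = {name: strip_direction(name) for name in filenames}
    short_ids = {name: LANE_OR_CHUNK.sub("", sid) for name, sid in long_ids.items()}
    # Shortening is safe only if no two distinct samples collapse into one.
    if len(set(short_ids.values())) == len(set(long_ids.values())):
        return short_ids
    return long_ids
```

With a directory holding S_R1_001.fastq.gz and S_R1_002.fastq.gz, dropping the chunk number would merge two samples into one, so the sketch keeps S_001 and S_002 instead.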

4. Reconcile against the metadata

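Critical step. If the user supplied a metadata file, the script reads its first column (sample-id / sampleid / id) and tries to match every FASTQ sample-id against it. An exact match is used as-is. Failing that, a case-insensitive match is accepted only if it is unique, and the metadata's casing is adopted. A FASTQ sample-id with no metadata row at all is a hard failure.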
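A minimal sketch of that matching logic, assuming the metadata IDs have already been read from the first column. The function name and error messages are hypothetical.

```python
def reconcile(fastq_ids: list[str], metadata_ids: list[str]) -> dict[str, str]:
    """Map every FASTQ sample ID to the metadata ID it should be written as.

    Exact matches win. Otherwise a case-insensitive match is accepted only
    when it is unique, and the metadata's spelling is adopted. Anything else
    raises, so the run stops instead of importing a mismatched sample.
    """
    exact = set(metadata_ids)
    by_folded: dict[str, list[str]] = {}
    for metadata_id in metadata_ids:
        by_folded.setdefault(metadata_id.casefold(), []).append(metadata_id)

    resolved = {}
    for fastq_id in fastq_ids:
        if fastq_id in exact:
            resolved[fastq_id] = fastq_id
            continue
        candidates = by_folded.get(fastq_id.casefold(), [])
        if len(candidates) == 1:
            resolved[fastq_id] = candidates[0]
        elif not candidates:
            raise ValueError(f"FASTQ sample {fastq_id!r} has no row in the metadata")
        else:
            raise ValueError(
                f"FASTQ sample {fastq_id!r} matches several metadata rows "
                f"after case-folding: {candidates}"
            )
    return resolved
```

For example, reconcile(["Sample01"], ["sample01"]) resolves to the metadata spelling sample01, while a metadata file containing both sample01 and SAMPLE01 makes the same call fail rather than guess.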

5. Decide paired-end vs single-end at the directory level

A directory is paired-end if and only if every sample has both a forward and a reverse file. Mixed directories (some samples paired, some single) are rejected — they're almost always a mistake (e.g. the reverse file failed to copy from the sequencer).
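The directory-level decision is simple enough to sketch in a few lines. The input shape (a mapping from sample ID to the set of directions found for it) is an assumption for the example, as is treating a directory with no complete pairs as single-end.

```python
def detect_layout(directions_by_sample: dict[str, set[str]]) -> str:
    """Return "paired" or "single" for the whole directory, or raise if mixed."""
    if not directions_by_sample:
        raise ValueError("no samples found")

    # A sample is complete when both a forward and a reverse file were seen.
    complete = {
        sample
        for sample, directions in directions_by_sample.items()
        if {"forward", "reverse"} <= set(directions)
    }
    if len(complete) == len(directions_by_sample):
        return "paired"
    if not complete:
        return "single"

    incomplete = sorted(set(directions_by_sample) - complete)
    raise ValueError(
        f"mixed paired/single directory; samples missing a read: {incomplete}"
    )
```

Naming the incomplete samples in the error message points the user straight at the files that probably failed to copy.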

6. Write the manifest

Finally, the script writes the QIIME2-compatible TSV with absolute paths (so QIIME2 can find the files regardless of where the user runs qiime tools import from). Paired-end format:

sample-id    forward-absolute-filepath    reverse-absolute-filepath
SampleA      /abs/path/SampleA_R1.fastq.gz    /abs/path/SampleA_R2.fastq.gz
...

Single-end format:

sample-id    absolute-filepath
SampleA      /abs/path/SampleA.fastq.gz
...

The manifest goes into 1_manifest_file/manifest.txt inside the run's output tree.
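Writing the file itself could look roughly like this. The function name and the row shape (sample ID followed by one or two paths) are assumptions for the sketch; the column headers are the ones shown above.

```python
import csv
from pathlib import Path

PAIRED_HEADER = ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]
SINGLE_HEADER = ["sample-id", "absolute-filepath"]


def write_manifest(rows, out_path, paired: bool) -> None:
    """Write a tab-separated manifest, converting every path to absolute.

    rows is an iterable of (sample_id, path) for single-end data or
    (sample_id, forward_path, reverse_path) for paired-end data.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    expected_paths = 2 if paired else 1

    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(PAIRED_HEADER if paired else SINGLE_HEADER)
        for sample_id, *paths in rows:
            if len(paths) != expected_paths:
                raise ValueError(
                    f"{sample_id}: expected {expected_paths} path(s), got {len(paths)}"
                )
            writer.writerow([sample_id, *(str(Path(p).resolve()) for p in paths)])
```

Resolving each path at write time is what makes the manifest independent of the directory the import is later run from.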


Default parameters and why they are what they are

| Default | Value | Why this default |
| --- | --- | --- |
| FASTQ extensions recognised | .fastq.gz .fq.gz .fastq .fq | The four spellings every common pipeline produces. We don't try .bam or .cram; those need a different import path. |
| Search depth | parent + child of given dir | Users often keep FASTQs in a subfolder (e.g. ~/project/FASTQ/). Searching one level above and below catches both layouts without requiring the user to think about it. |
| Name-pattern priority | R1/R2 → _1/_2 → READ1/READ2 → forward/reverse → single | Most-specific to least-specific so an ambiguous filename can't accidentally match multiple patterns. |
| Sample-ID matching | exact, then unique case-insensitive | Catches the common case-mismatch bug (FASTQ has Sample01, metadata has sample01) without inviting collisions. |
| Output location | <parent_of_fastq>/mbX_pro_outputs_<TS>/1_manifest_file/manifest.txt | We never write inside the FASTQ folder; those files should be treated as read-only. |

When and why we fall back to defaults

| Fallback | When it triggers | Why this fallback exists |
| --- | --- | --- |
| Search siblings + children | User pointed at a directory that contains only sub-directories of FASTQ files. | Common in shared-storage setups; easier than asking the user to re-aim. |
| Adopt metadata casing | FASTQ sample-id matches a metadata row only after case-folding. | Avoids the most common merge-failure mode without inviting silent ambiguity (we require the case-insensitive match to be unique). |
| Refuse mixed paired + single | Some samples have both reads, others have only one. | Mixed-mode runs are almost always a delivery error. Continuing would silently drop the broken samples or import them with wrong direction labels. |
| Hard-fail on missing metadata row | A FASTQ sample-id has no row in the metadata. | The alternative, silently importing it as an "unknown" sample, would cause it to vanish at every downstream --m-metadata-file command, looking like a phantom data loss. |

What the output file looks like

sample-id   forward-absolute-filepath   reverse-absolute-filepath
SampleA /Users/.../FASTQ/SampleA_R1_001.fastq.gz    /Users/.../FASTQ/SampleA_R2_001.fastq.gz
SampleB /Users/.../FASTQ/SampleB_R1_001.fastq.gz    /Users/.../FASTQ/SampleB_R2_001.fastq.gz
...

Plain text, tab-separated, absolute paths. Step 2 reads the header to detect paired vs single-end and feeds the file straight into qiime tools import.
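For orientation, here is a sketch of how a consumer could read that header and assemble the matching qiime tools import call. The function name and the demux.qza output name are hypothetical, and the Phred33 V2 manifest formats are an assumption about the data's quality encoding; Step 2's actual code may differ.

```python
def build_import_command(manifest_path: str, output_qza: str) -> list[str]:
    """Inspect the manifest header and return a qiime tools import argv list."""
    with open(manifest_path) as handle:
        header = handle.readline().rstrip("\n").split("\t")

    # The reverse column only exists in the paired-end layout.
    paired = "reverse-absolute-filepath" in header
    if paired:
        semantic_type = "SampleData[PairedEndSequencesWithQuality]"
        input_format = "PairedEndFastqManifestPhred33V2"  # assumes Phred33 scores
    else:
        semantic_type = "SampleData[SequencesWithQuality]"
        input_format = "SingleEndFastqManifestPhred33V2"  # assumes Phred33 scores

    return [
        "qiime", "tools", "import",
        "--type", semantic_type,
        "--input-path", manifest_path,
        "--output-path", output_qza,
        "--input-format", input_format,
    ]
```

The returned list can be handed to subprocess.run(command, check=True) on a machine where QIIME2 is installed; building it as a list avoids shell-quoting problems with the square brackets in the type name.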


Takeaway

The manifest builder doesn't sound like an algorithm — but it's the place where five vendor naming conventions meet the user's metadata, and where silent merge errors are most likely to creep in. The whole script is a series of small consistency checks designed to fail loudly the moment anything is wrong, so the downstream pipeline can trust that the file layout actually matches the metadata.


Sources