Script: scripts/create_manifest.sh
Companion files in this folder:
- 1_create_manifest.html — same content with copy buttons on every code block.
- 1_create_manifest.pptx — slide deck for the talk.
QIIME2 — the platform we use for everything from denoising to taxonomy — can't read raw FASTQ files directly. Before it can do anything, it needs an import manifest: a small tab-separated table that says for every sample, which FASTQ file on disk is the forward read and which is the reverse.
That sounds trivial. It is not. Real sequencing data shows up with five different naming conventions depending on which facility delivered it:
- SampleA_R1_001.fastq.gz / SampleA_R2_001.fastq.gz
- SRR123_1.fastq.gz / SRR123_2.fastq.gz
- gut_READ1.fastq.gz / gut_READ2.fastq.gz
- s12.forward.fastq.gz / s12.reverse.fastq.gz
- SRR123.fastq.gz
A mismatch (forward labelled as reverse, two samples merged into one, or a sample present in the FASTQ folder but missing from the metadata) produces silent errors that the downstream steps inherit as "weird results" that take hours to debug. The manifest builder's job is to never let that happen.
It scans a FASTQ directory, classifies every file by direction and sample, reconciles the sample IDs against the user's metadata file, and produces a QIIME2-compatible TSV with absolute paths — failing loudly the moment anything inconsistent appears.
First, the script finds every file matching *.fastq.gz, *.fq.gz,
*.fastq, or *.fq in the directory the user pointed to. By default it
also peeks one level above and one level below, so users who keep their
FASTQs in a subdirectory don't have to know about that.
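The discovery step could be sketched in Python roughly as follows. This is a minimal illustration, not the script's actual code: the function names and the search order (given directory first, then its sub-directories, then its parent) are assumptions.

```python
from pathlib import Path

# The four extensions the document lists.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def is_fastq(path: Path) -> bool:
    """True for regular files whose name ends in a recognised FASTQ extension."""
    return path.is_file() and path.name.endswith(FASTQ_SUFFIXES)


def find_fastqs(directory) -> list[Path]:
    """Return FASTQ files from the first location that contains any.

    Assumed search order: the directory itself, its immediate
    sub-directories (one level below), then its parent (one level above).
    """
    directory = Path(directory).resolve()
    search_groups = [
        [directory],
        [p for p in sorted(directory.iterdir()) if p.is_dir()],
        [directory.parent],
    ]
    for group in search_groups:
        hits = [f for d in group for f in sorted(d.iterdir()) if is_fastq(f)]
        if hits:
            return hits
    return []
```

Stopping at the first location that yields files keeps unrelated FASTQs in a neighbouring folder from being mixed into the run; whether the real script does the same is not stated in the text.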
Then an embedded Python helper applies a regex cascade, in priority order, to every filename:
- R1 / R2: Illumina default
- _1 / _2: SRA / ENA archive convention
- READ1 / READ2: some commercial vendors
- forward / reverse: older pipelines
Both the classifier and the priority order are identical to the ones Step 0 (the primer identifier) uses internally, so Step 0 and Step 1 can never disagree about which file is which.
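A regex cascade of this kind might look like the sketch below. The exact patterns are assumptions made for illustration (the real helper's regexes are not shown in this document); only the priority order is taken from the text.

```python
import re

# Patterns are tried top to bottom; the first match wins.
# The lookarounds require the marker to stand alone between separators,
# so the "R1" inside an accession such as "SRR123" is not a match.
CASCADE = [
    ("R1/R2", re.compile(r"(?<![A-Za-z0-9])R([12])(?![A-Za-z0-9])")),
    ("_1/_2", re.compile(r"_([12])(?=\.(?:fastq|fq)(?:\.gz)?$)")),
    ("READ1/READ2",
     re.compile(r"(?<![A-Za-z0-9])READ([12])(?![A-Za-z0-9])", re.IGNORECASE)),
    ("forward/reverse",
     re.compile(r"(?<![A-Za-z0-9])(forward|reverse)(?![A-Za-z0-9])", re.IGNORECASE)),
]


def classify(filename: str) -> tuple[str, str]:
    """Return (direction, convention) for a FASTQ filename.

    direction is "forward", "reverse", or "single" when no marker matches.
    """
    for convention, pattern in CASCADE:
        match = pattern.search(filename)
        if match:
            token = match.group(1).lower()
            direction = "forward" if token in ("1", "forward") else "reverse"
            return direction, convention
    return "single", "single"
```

With these patterns, `classify("SRR123_2.fastq.gz")` falls through the R1/R2 rule and is picked up by the `_1/_2` rule, which is the behaviour the priority order is meant to guarantee.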
Now the script strips the direction marker out of each filename to
recover the sample ID. For example, SampleA_R1_001.fastq.gz first becomes
SampleA_001 and then SampleA. The stripping rules are conservative:
anything that looks like an Illumina lane or chunk number (_L001, _001)
gets dropped only if doing so preserves uniqueness.
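One way to express the "strip only if uniqueness is preserved" rule is sketched below. The regexes and function names are illustrative assumptions, and the uniqueness check is applied to the whole batch at once, which may differ from what the script actually does.

```python
import re

EXTENSION = re.compile(r"\.(?:fastq|fq)(?:\.gz)?$")
# Direction marker: a separator plus R1/R2, READ1/READ2 or forward/reverse,
# or a trailing _1/_2 (checked after the extension has been removed).
DIRECTION = re.compile(
    r"(?:[._-](?:R[12]|READ[12]|forward|reverse)(?![A-Za-z0-9]))|(?:_[12]$)",
    re.IGNORECASE,
)
# Illumina-style lane (_L001) or chunk (_001) suffixes.
LANE_OR_CHUNK = re.compile(r"_(?:L\d{3}|\d{3})(?![A-Za-z0-9])")


def base_id(filename: str) -> str:
    """Drop the extension and the first direction marker."""
    stem = EXTENSION.sub("", filename)
    return DIRECTION.sub("", stem, count=1)


def sample_ids(filenames) -> dict[str, str]:
    """Map each filename to a sample ID.

    Lane/chunk numbers are removed only if that does not merge two
    previously distinct IDs; otherwise the longer IDs are kept.
    """
    full = {name: base_id(name) for name in filenames}
    short = {name: LANE_OR_CHUNK.sub("", sid) for name, sid in full.items()}
    if len(set(short.values())) == len(set(full.values())):
        return short
    return full
```

For instance, `run_001_R1.fastq.gz` and `run_002_R1.fastq.gz` keep the IDs `run_001` and `run_002`, because stripping `_001` and `_002` would collapse both into `run`.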
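Critical step. If the user supplied a metadata file, the script reads
its first column (sample-id / sampleid / id) and tries to match every
FASTQ sample-id against it: first exactly, then case-insensitively,
accepting the case-insensitive match only if it is unique and adopting
the metadata's casing. A FASTQ sample-id with no metadata row is a hard
failure.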
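The reconciliation rule documented here (exact match, then a unique case-insensitive match, otherwise a hard failure) can be sketched as below. The function name, return shape, and error messages are hypothetical.

```python
def reconcile(fastq_ids, metadata_ids) -> dict[str, str]:
    """Map each FASTQ sample ID to the spelling used in the metadata.

    Raises ValueError listing every ID that has no metadata row or whose
    case-insensitive match is ambiguous.
    """
    exact = set(metadata_ids)
    folded: dict[str, list[str]] = {}
    for mid in metadata_ids:
        folded.setdefault(mid.casefold(), []).append(mid)

    resolved, problems = {}, []
    for fid in fastq_ids:
        if fid in exact:
            resolved[fid] = fid
            continue
        candidates = folded.get(fid.casefold(), [])
        if len(candidates) == 1:
            # Unique case-insensitive match: adopt the metadata casing.
            resolved[fid] = candidates[0]
        elif not candidates:
            problems.append(f"{fid}: no row in metadata")
        else:
            problems.append(f"{fid}: ambiguous case-insensitive match {candidates}")

    if problems:
        raise ValueError("sample-id reconciliation failed:\n" + "\n".join(problems))
    return resolved
```

Collecting all problems before raising lets the user fix every mismatch in one pass instead of rerunning once per broken sample; that is a design choice of this sketch, not something the source specifies.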
A directory is paired-end if and only if every sample has both a forward and a reverse file. Mixed directories (some samples paired, some single) are rejected — they're almost always a mistake (e.g. the reverse file failed to copy from the sequencer).
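The all-or-nothing layout rule could be checked along these lines. The input shape (a mapping from sample ID to the set of directions found for it) is an assumption made for this sketch.

```python
def detect_layout(directions_by_sample: dict[str, set[str]]) -> str:
    """Return "paired" or "single"; raise ValueError on a mixed directory.

    directions_by_sample maps each sample ID to the directions seen for it,
    e.g. {"SampleA": {"forward", "reverse"}}.
    """
    if not directions_by_sample:
        raise ValueError("no FASTQ samples found")

    paired = {
        sample
        for sample, directions in directions_by_sample.items()
        if {"forward", "reverse"} <= directions
    }
    if len(paired) == len(directions_by_sample):
        return "paired"
    if not paired:
        return "single"

    incomplete = sorted(set(directions_by_sample) - paired)
    raise ValueError(
        "mixed paired/single directory; samples missing a read: "
        + ", ".join(incomplete)
    )
```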
Finally, the script writes the QIIME2-compatible TSV with absolute
paths (so QIIME2 can find the files regardless of where the user runs
qiime tools import from). Paired-end format:
sample-id forward-absolute-filepath reverse-absolute-filepath
SampleA /abs/path/SampleA_R1.fastq.gz /abs/path/SampleA_R2.fastq.gz
...
Single-end format:
sample-id absolute-filepath
SampleA /abs/path/SampleA.fastq.gz
...
The manifest goes into 1_manifest_file/manifest.txt inside the run's
output tree.
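Writing the two manifest layouts shown above might look like this sketch. The function signature and the input shapes are assumptions; the header names are the ones given in the text.

```python
import csv
from pathlib import Path

PAIRED_HEADER = ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]
SINGLE_HEADER = ["sample-id", "absolute-filepath"]


def write_manifest(out_path, samples: dict, paired: bool) -> None:
    """Write a tab-separated manifest with absolute paths.

    samples maps sample ID to (forward, reverse) when paired is True,
    or to a single path when paired is False.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(PAIRED_HEADER if paired else SINGLE_HEADER)
        for sample_id in sorted(samples):
            paths = samples[sample_id] if paired else (samples[sample_id],)
            # resolve() makes each path absolute regardless of the caller's cwd.
            writer.writerow([sample_id, *(str(Path(p).resolve()) for p in paths)])
```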
| Default | Value | Why this default |
|---|---|---|
| FASTQ extensions recognised | .fastq.gz .fq.gz .fastq .fq | The four spellings every common pipeline produces. We don't try .bam or .cram; those need a different import path. |
| Search depth | parent + child of given dir | Users often keep FASTQs in a subfolder (e.g. ~/project/FASTQ/). Searching one level above and below catches both layouts without requiring the user to think about it. |
| Name-pattern priority | R1/R2 → _1/_2 → READ1/READ2 → forward/reverse → single | Most-specific to least-specific so an ambiguous filename can't accidentally match multiple patterns. |
| Sample-ID matching | exact, then unique case-insensitive | Catches the common case-mismatch bug (FASTQ has Sample01, metadata has sample01) without inviting collisions. |
| Output location | <parent_of_fastq>/mbX_pro_outputs_<TS>/1_manifest_file/manifest.txt | We never write inside the FASTQ folder; it should be treated as read-only. |
| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
| Search siblings + children | User pointed at a directory that contains only sub-directories of FASTQ files. | Common in shared-storage setups; easier than asking the user to re-aim. |
| Adopt metadata casing | FASTQ sample-id matches a metadata row only after case-folding. | Avoids the most common merge-failure mode without inviting silent ambiguity (we require the case-insensitive match to be unique). |
| Refuse mixed paired + single | Some samples have both reads, others have only one. | Mixed-mode runs are almost always a delivery error. Continuing would silently drop the broken samples or import them with wrong direction labels. |
| Hard-fail on missing metadata row | A FASTQ sample-id has no row in the metadata. | The alternative — silently importing it as an "unknown" sample — would cause it to vanish at every downstream --m-metadata-file command, looking like a phantom data loss. |
sample-id forward-absolute-filepath reverse-absolute-filepath
SampleA /Users/.../FASTQ/SampleA_R1_001.fastq.gz /Users/.../FASTQ/SampleA_R2_001.fastq.gz
SampleB /Users/.../FASTQ/SampleB_R1_001.fastq.gz /Users/.../FASTQ/SampleB_R2_001.fastq.gz
...
Plain text, tab-separated, absolute paths. Step 2 reads the header to detect
paired vs single-end and feeds the file straight into qiime tools import.
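Detecting the layout from the header, as Step 2 is described as doing, could be as simple as the following sketch. The function name is hypothetical, and Step 2's real logic is not shown in this document.

```python
PAIRED_HEADER = ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]
SINGLE_HEADER = ["sample-id", "absolute-filepath"]


def manifest_layout(path) -> str:
    """Return "paired" or "single" based on the manifest's header row."""
    with open(path, encoding="utf-8") as handle:
        header = handle.readline().rstrip("\r\n").split("\t")
    if header == PAIRED_HEADER:
        return "paired"
    if header == SINGLE_HEADER:
        return "single"
    raise ValueError(f"unrecognised manifest header: {header}")
```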
The manifest builder doesn't sound like an algorithm — but it's the place where five vendor naming conventions meet the user's metadata, and where silent merge errors are most likely to creep in. The whole script is a series of small consistency checks designed to fail loudly the moment anything is wrong, so the downstream pipeline can trust that the file layout actually matches the metadata.