Script: scripts/artifact_creator.sh
Companion files in this folder:
- 2_artifact_creator.html — same content with copy buttons on every code block.
- 2_artifact_creator.pptx — slide deck for the talk.
QIIME2 organises every dataset, every intermediate, and every result inside
artifacts — zip-archive .qza files with strict provenance,
type-checking, and metadata. The reason the rest of the pipeline (steps 3
through 18) can re-use, re-run, and audit everything is that every blob of
data flows through QIIME2 as a typed artifact.
To get there from raw FASTQ files, exactly one command has to run correctly:
qiime tools import. That command needs:
- The correct import type (SampleData[PairedEndSequencesWithQuality] for paired-end,
  SampleData[SequencesWithQuality] for single-end).
- The matching manifest view (PairedEndFastqManifestPhred33V2 etc.).
- An output .qza filename.

Getting any one of those wrong produces a confusing error from QIIME2 about the manifest format. The artifact creator's job is to detect the right flavour automatically and never run the wrong import.
It reads the header of the Step-1 manifest, detects paired-end vs single-end
from the column count, then runs qiime tools import with the matching
type + view, and writes the resulting .qza next to the manifest.
First, the script confirms the manifest file exists, is non-empty, and its first line is a valid header. It refuses to run on an empty file (the QIIME2 error would be cryptic) or on a file that obviously isn't a manifest (e.g. the user pointed at a FASTQ by mistake).
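The three checks described above could look roughly like the sketch below. The function name `validate_manifest` and the error wording are assumptions made for illustration, not the script's actual code. The header test only checks that the first line starts with `sample-id` followed by a tab.

```shell
# Hypothetical helper: returns 0 if the file looks like a Step-1 manifest.
validate_manifest() {
  local manifest="$1" header="" tab cr
  tab="$(printf '\t')"
  cr="$(printf '\r')"

  # The file must exist and be a regular file.
  if [ ! -f "$manifest" ]; then
    echo "ERROR: manifest not found: $manifest" >&2
    return 1
  fi

  # The file must not be empty.
  if [ ! -s "$manifest" ]; then
    echo "ERROR: manifest is empty: $manifest" >&2
    return 1
  fi

  # Read only the first line; tolerate a missing final newline and a trailing CR.
  IFS= read -r header < "$manifest" || true
  header="${header%"$cr"}"

  # A manifest header starts with the sample-id column followed by a tab.
  case "$header" in
    "sample-id${tab}"*) return 0 ;;
    *)
      echo "ERROR: first line is not a manifest header: $manifest" >&2
      return 1
      ;;
  esac
}
```

A FASTQ file passed by mistake fails the last check, because its first line starts with `@` rather than `sample-id`.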
Then it reads only the header line and counts tab-separated columns:
- Three columns (sample-id, forward-absolute-filepath, reverse-absolute-filepath) → paired-end.
- Two columns (sample-id, absolute-filepath) → single-end.

The script never reads further than the header for the detection: it trusts Step 1 to have written it correctly, and Step 1 either wrote both forward and reverse columns or only one.
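The column-count detection can be sketched as follows. The function name `detect_layout` and the `paired` / `single` return strings are assumptions for illustration. The real script may count columns differently.

```shell
# Hypothetical helper: prints "paired" or "single" based on the header's column count.
detect_layout() {
  local manifest="$1" header="" ncols

  # Only the first line is read; the data rows are never touched.
  IFS= read -r header < "$manifest" || true

  # Count tab-separated fields in the header.
  ncols="$(printf '%s\n' "$header" | awk -F'\t' '{ print NF }')"

  case "$ncols" in
    3) echo "paired" ;;
    2) echo "single" ;;
    *)
      echo "ERROR: expected 2 or 3 tab-separated columns, found $ncols" >&2
      return 1
      ;;
  esac
}
```

Any other column count is treated as an error rather than guessed at, since a wrong guess would lead to the wrong import.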
Next it confirms that qiime is on PATH. If not, it prints the exact
conda activate qiime2-amplicon-2025.4 command the user needs to run, then
exits with status 1, which is much friendlier than a missing-binary error trace.
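A minimal sketch of that PATH check, using the shell built-in `command -v`. The function name `require_qiime` and the message wording are assumptions. Only the `conda activate` line comes from the description above.

```shell
# Hypothetical helper: succeeds only if a qiime executable is reachable via PATH.
require_qiime() {
  if ! command -v qiime >/dev/null 2>&1; then
    echo "ERROR: 'qiime' was not found on PATH." >&2
    echo "Activate the QIIME2 environment first:" >&2
    echo "  conda activate qiime2-amplicon-2025.4" >&2
    return 1
  fi
}
```

A caller would typically run `require_qiime || exit 1` before building the import command.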
Now it constructs one of two QIIME2 imports:
Paired-end:

    qiime tools import \
      --type 'SampleData[PairedEndSequencesWithQuality]' \
      --input-path <manifest> \
      --input-format PairedEndFastqManifestPhred33V2 \
      --output-path Paired_End_artifact.qza

Single-end:

    qiime tools import \
      --type 'SampleData[SequencesWithQuality]' \
      --input-path <manifest> \
      --input-format SingleEndFastqManifestPhred33V2 \
      --output-path Single_End_artifact.qza

The Phred33V2 view tells QIIME2 two things: the FASTQ quality characters use
the standard Phred+33 encoding (the only encoding any current sequencer
produces), and the manifest is in format version 2 (the manifest format
Step 1 actually wrote).
Finally the script executes qiime tools import. In --dry-run mode it
just prints the command — useful for showing exactly what QIIME2 will do in
talks like this one.
The artifact lands in 2_first_artifact_file/Paired_End_artifact.qza
(or Single_End_artifact.qza). Step 3 (the DADA2 parameter finder) reads
from exactly that path with no further configuration.
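Putting the last two steps together, the sketch below picks the type, view, and output name from the detected layout, places the artifact in 2_first_artifact_file/ next to the manifest, and either prints or runs the command. The function name `run_import`, its argument order, and the `1` flag standing in for --dry-run are assumptions for illustration.

```shell
# Hypothetical helper: run_import <paired|single> <manifest> [dry_run: 1 or 0]
run_import() {
  local layout="$1" manifest="$2" dry_run="${3:-0}"
  local type format name out_dir output

  case "$layout" in
    paired)
      type='SampleData[PairedEndSequencesWithQuality]'
      format='PairedEndFastqManifestPhred33V2'
      name='Paired_End_artifact.qza'
      ;;
    single)
      type='SampleData[SequencesWithQuality]'
      format='SingleEndFastqManifestPhred33V2'
      name='Single_End_artifact.qza'
      ;;
    *)
      echo "ERROR: unknown layout: $layout" >&2
      return 1
      ;;
  esac

  # The artifact goes into 2_first_artifact_file/ next to the manifest.
  out_dir="$(dirname "$manifest")/2_first_artifact_file"
  output="$out_dir/$name"

  if [ "$dry_run" = "1" ]; then
    # Dry run: print the command only, without creating anything.
    printf "qiime tools import --type '%s' --input-path '%s' --input-format %s --output-path '%s'\n" \
      "$type" "$manifest" "$format" "$output"
    return 0
  fi

  mkdir -p "$out_dir"
  qiime tools import \
    --type "$type" \
    --input-path "$manifest" \
    --input-format "$format" \
    --output-path "$output"
}
```

For example, `run_import paired path/to/manifest.tsv 1` (a hypothetical path) prints the paired-end command without touching the filesystem.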
| Default | Value | Why this default |
|---|---|---|
| Import type (paired) | SampleData[PairedEndSequencesWithQuality] | The QIIME2-mandated type string for paired-end FASTQ. We don't try to use anything else. |
| Import type (single) | SampleData[SequencesWithQuality] | Same, for single-end. |
| Manifest view (paired) | PairedEndFastqManifestPhred33V2 | V2 is the format Step 1 writes. Phred+33 is the only encoding any current sequencer produces (we don't support Phred+64; that's a 2009 problem). |
| Manifest view (single) | SingleEndFastqManifestPhred33V2 | Same idea, single-end. |
| Output filename | Paired_End_artifact.qza or Single_End_artifact.qza | Step 3 looks for these exact names. We never rename. |
| Working directory | 2_first_artifact_file/ next to the manifest | Keeps the artifact in the same mbX_pro_outputs_<TS>/ tree as everything else. |

| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
| Detect single-end from header | Manifest has two columns instead of three. | Step 1 already validated the data is consistently single-end; we just need to use the matching QIIME2 type. |
| --dry-run mode | User passed --dry-run. | Shows the exact qiime tools import command without consuming a minute of actual import time, which is useful for demos and CI gates. |
| Detect qiime not on PATH | The user is running outside their QIIME2 conda env. | Prints the exact conda activate qiime2-amplicon-2025.4 command and exits 1, which is far friendlier than the shell's own missing-command error. |

Step 2 writes a .qza: a zip archive containing the FASTQ files, the
manifest, a UUID, and a metadata.yaml describing the type. You can rename
it to .zip and explore it in Finder or any other archive browser:

    Paired_End_artifact.qza
    ├── data/                      <- copies of the FASTQ files
    │   ├── SampleA_R1.fastq.gz
    │   ├── SampleA_R2.fastq.gz
    │   ├── ...
    │   └── MANIFEST               <- a copy of Step 1's manifest
    ├── metadata.yaml              <- type, format, UUID
    └── provenance/                <- which command produced this artifact

The provenance directory is what makes QIIME2 results scientifically defensible — every downstream artifact carries a chain back to this one.
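The archive can also be inspected from a terminal without renaming it. The sketch below assumes `unzip` is installed. The function name `inspect_artifact` is hypothetical, and the `qiime tools peek` step runs only when qiime is on PATH.

```shell
# Hypothetical helper: list a .qza's contents and, if possible, ask QIIME2 to summarise it.
inspect_artifact() {
  local qza="$1"

  if [ ! -f "$qza" ]; then
    echo "ERROR: artifact not found: $qza" >&2
    return 1
  fi

  # A .qza is a zip archive, so unzip can list it without extracting anything.
  unzip -l "$qza"

  # qiime tools peek reports the artifact's UUID, type, and data format.
  if command -v qiime >/dev/null 2>&1; then
    qiime tools peek "$qza"
  fi
}
```

For example, `inspect_artifact 2_first_artifact_file/Paired_End_artifact.qza` lists the archive's contents and then prints QIIME2's summary.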
Step 2 is a 350-line wrapper around exactly one QIIME2 command. The reason it's not a one-liner is that the correct one-liner is two different one-liners depending on paired vs single-end — and using the wrong one produces a cryptic, hours-of-debugging error. The whole script exists to detect "which one-liner" automatically, every time, from the manifest header.
mbXPro/scripts/artifact_creator.sh