# Step 8: ezclean (per-level cleaning)

**Script:** `scripts/mbx_ezclean_all_levels.sh` + the R function `mbX::ezclean()` from CRAN

**Companion files in this folder:**

- `8_ezclean.html`: same content with copy buttons on every code block.
- `8_ezclean.pptx`: slide deck for the talk.

---

## Why this step matters

Step 7 gave us seven `level-N.csv` files with the raw GTDB taxonomy strings. Those strings are precise but ugly to look at and worse to consume:

```
d__Bacteria;p__Bacillota_A_368345;c__Clostridia_258483;o__Lachnospirales;
f__Lachnospiraceae;g__Lachnospira;s__Lachnospira multipara
```

A string like that ends up in every plot title, every column header, every report row. Multiply by hundreds of taxa per sample and the result is unreadable. Worse, two distinct ASVs can be assigned the *same* taxon under slightly different GTDB cluster suffixes, and without consolidation they show up as two separate "taxa" in the analysis.

This is where the **mbX::ezclean()** R function comes in. ezclean() parses every taxonomy string, **consolidates synonymous taxa** (the same genus seen as `g__Bacteroides` and `g__Bacteroides_C_123` gets merged), and emits short, human-readable names: `Bilophila.wadsworthia` instead of the full GTDB hierarchy.

Step 8's job is to run ezclean() for every taxonomic level (domain → species) and produce seven cleaned XLSX files that downstream steps (ezviz, ezstat, ANCOMBC2's taxon column, the final report) consume.

---

## What the script does in one sentence

It installs (or verifies) the pinned R + mbX environment, then calls `mbX::ezclean()` seven times, once per taxonomic level, passing `level-7.csv` (which holds the full GTDB string) as input each time, and produces one human-readable XLSX per level.

---

## The algorithm, step by step

### 1. Locate system R and verify the pinned environment

**First** the script searches for `Rscript` outside the QIIME2 conda env (the conda env's R has different package versions than the host system R; mixing them produces "shared object not found" errors that take hours to debug). It checks, in order:

- `/opt/homebrew/bin/Rscript` (Apple Silicon Homebrew)
- `/usr/local/bin/Rscript` (Intel Mac Homebrew or Linux user-install)
- `/Library/Frameworks/R.framework/Resources/bin/Rscript` (macOS CRAN installer)
- `/usr/bin/Rscript` (Debian/Ubuntu system R)
- `command -v Rscript` (whatever PATH finds)

It also **strips conda's `R_LIBS_USER`-style env vars** before invoking R, so the system R doesn't try to load packages from the conda env's library tree.

Then it calls `scripts/lib/install_r_deps.R` to install (or verify) every package in the pinned `r_packages.lock`.

### 2. Read the Step 7 contract

**Then** the script reads `7_taxonomy_csv/mbx_taxonomy_info.txt` and pulls out:

- `LEVEL_7_CSV`: the level-7 (species) CSV. The species CSV is special because its column headers contain the **full GTDB hierarchy string**, not just the species. ezclean parses that string to reconstruct any higher level, so one input file is enough for all seven runs.
- `METADATA_TXT`: the metadata file, which every step receives.

### 3. Set the working directory to `8_cleaned_files/`

**Critical step.** `mbX::ezclean()` writes about a dozen intermediate XLSX files relative to the R working directory, then cleans them up at the end. If we let R use the user's launching directory, those files would litter the user's home folder. The script changes R's working directory to `8_cleaned_files/` so all that ephemeral state is contained.

### 4. Run ezclean once per level

**Now the main work.** The script loops over seven level letters: `d, p, c, o, f, g, s` (domain through species). For each:

```
Rscript --vanilla <
```

…

- `…_XLSX=` line per level.
- `MBX_PACKAGE_VERSION`: which mbX version produced these files.
- `STATUS=COMPLETE` (or `STATUS=PARTIAL` if some levels failed).

---

## Default parameters and why they are what they are

| Default | Value | Why this default |
|---|---|---|
| R location | system R, NOT conda | Conda's R has different package versions; mixing causes opaque library errors. |
| `R_LIBS_USER` stripping | always | Prevents conda from poisoning system R's library lookup. |
| mbX package version | **CRAN 0.2.0** | The pinned version in `r_packages.lock`. Updates land via the lockfile, not by spontaneous CRAN bumps. |
| Levels run | **d, p, c, o, f, g, s** | All seven, every time. Skipping any of them just defers the failure to a downstream step. |
| Input file | `level-7.csv` | The level-7 column headers contain the full GTDB hierarchy; ezclean reconstructs any level from it. |
| R working directory | `8_cleaned_files/` | Contains the ~12 intermediate XLSX files ezclean writes while it works. |
| Unidentified naming | `unidentified___from__` (e.g. `unidentified_genus_1_from_Lachnospiraceae_family`) | Stable, human-readable, deduplicated by parent so two rows share the same counter. |
| Species-name format | `Genus.species` (dot-joined) | R-column-name-safe. The space in `Genus species` would otherwise be illegal in many downstream contexts. |

---

## When and why we fall back to defaults

| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
| **Continue with N<7 levels** | One level's `ezclean()` call returned empty (usually species, for low-classification data) | Species-level failures are common and not fatal; downstream steps can use genus or a higher rank instead. The script logs which levels failed. |
| **Per-level retry on transient errors** | A package failed to load on the first try (slow disk, BiocManager just resolved a dependency) | Common in CI environments; the second attempt almost always succeeds. |
| **Bypass install check** | User passed `--skip-install` after a previous successful run | Speeds up reruns when the user knows the env is fine. |
| **`STATUS=PARTIAL`** instead of `COMPLETE` | At least one level failed | Tells `mbXPro --resume` that this step isn't fully done; rerunning will fill the gap. |

---

## What the output file looks like

`8_cleaned_files/mbX_cleaned_genera_level-7/mbX_cleaned_genera_level-7.xlsx`:

| sample-id | Treatment | Bacteroides | Lactobacillus | Bilophila | unidentified_genus_1_from_Lachnospiraceae_family | ... |
|---|---|---|---|---|---|---|
| SampleA | High | 0.0214 | 0.0089 | 0.0034 | 0.0011 | ... |
| SampleB | Low | 0.0142 | 0.0312 | 0.0007 | 0.0089 | ... |

The metadata columns are joined in for convenience (downstream steps don't have to re-merge), and every taxon column is a clean, short, R-safe name.

---

## Takeaway

> Step 8 turns the precise-but-unreadable GTDB strings into short,
> human-readable names that downstream R steps, the ANCOMBC2 step we
> already simplified, and the final report can all use. It does so
> by delegating to `ezclean()` from the mbX CRAN package, the same
> package that gives the project its name. From here on, every
> XLSX, every plot, every table reads cleanly.

---

## Sources

- The wrapper script: `mbXPro/scripts/mbx_ezclean_all_levels.sh`
- The R function: `mbX::ezclean()` on CRAN, version pinned in `scripts/lib/r_packages.lock`.
- mbX package: https://cran.r-project.org/package=mbX
- Greengenes2 nomenclature: McDonald et al. (2024), Nature Biotechnology 42:715–718.
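
---

The R lookup from step 1 of the algorithm can be sketched in shell as below. This is a minimal illustration, not the wrapper script's actual code: the function names `first_executable`, `find_system_rscript`, and `run_without_r_lib_vars` are hypothetical, and the exact set of variables stripped (`R_LIBS`, `R_LIBS_USER`, `R_LIBS_SITE`) is an assumption based on the "`R_LIBS_USER`-style" wording above.

```shell
# Hypothetical helper: print the first argument that is an executable file.
first_executable() {
  local candidate
  for candidate in "$@"; do
    if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Try the candidate locations listed in step 1, in order, and fall back
# to whatever Rscript is on PATH if none of them exists.
find_system_rscript() {
  first_executable \
    /opt/homebrew/bin/Rscript \
    /usr/local/bin/Rscript \
    /Library/Frameworks/R.framework/Resources/bin/Rscript \
    /usr/bin/Rscript \
  || command -v Rscript
}

# Run a command with R's library-path variables removed from its
# environment (assumed set of variables; adjust to match the real script).
run_without_r_lib_vars() {
  env -u R_LIBS -u R_LIBS_USER -u R_LIBS_SITE "$@"
}
```

With these in place, a call might look like `run_without_r_lib_vars "$(find_system_rscript)" --vanilla some_script.R`, where `some_script.R` is a placeholder. Passing the candidates as arguments to `first_executable` keeps the search order in one visible place and makes the lookup easy to exercise against temporary files.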