Script: scripts/mbx_ezstat_all_levels_all_treatments.sh
+ the R function mbX::ezstat() from CRAN
Companion files in this folder:
- 10_ezstat.html — same content with copy buttons on every code block.
- 10_ezstat.pptx — slide deck for the talk.
Step 9's stacked-bar plots show what's there. The next question — and the one every reviewer asks — is which taxa actually differ between treatment groups, and is that difference statistically real?
That's a multi-step inferential job. For every (taxonomic level × every categorical metadata column × every taxon), you need to:
a,
b, ab letters that mark which groups differ from which on a
boxplot).Doing this by hand for one comparison is tedious. Doing it for every combination — seven levels × N variables × hundreds of taxa — is error-prone in a way no human should accept. Step 10's job is to automate the whole grid, consistently, with one stat per cell.
Note that Step 10 produces per-taxon, per-level statistics (univariate, taxon-by-taxon). Step 14 (ANCOMBC2) does the same job in a fundamentally different way — modelling all taxa together and correcting for compositionality. Both views are useful; reviewers like to see both.
It calls mbX::ezstat() once per (taxonomic level × categorical metadata
column), and ezstat runs Kruskal-Wallis (or pairwise Wilcoxon) per taxon,
adjusts p-values with Benjamini-Hochberg, performs Dunn post-hocs for
multi-group designs, and writes one XLSX per combination.
First, the script applies the same categorical-variable detection as Step 9 — drop sample-id, drop numeric-only columns, drop singleton-value columns, drop all-unique columns. What remains is the set of comparisons to run.
Then it reads 8_cleaned_files/mbx_ezclean_info.txt to find the per-
level paths. Each (level × variable) combination becomes one ezstat()
call.
Rscript --vanilla <<RSCRIPT
library(mbX)
setwd("10_stats/<variable>")
ezstat(
microbiome_data = "mbX_cleaned_<level>_level-7.xlsx",
metadata = "metadata.txt",
level = "<letter>",
selected_metadata = "<column>"
)
RSCRIPT
Inside ezstat():
p.adjust(..., method = "BH") in R) — the most powerful FDR control
for univariate microbiome panels.KW_<level>_by_<variable>.xlsx — the per-taxon KW (or Wilcoxon)
stats: p, p_adj, test statistic.
- Pairwise_<level>_by_<variable>.xlsx — every significant pairwise
comparison from Dunn.
- CLD_Summary_<level>_by_<variable>.xlsx — the compact letter display
per significant taxon (used by the boxplot in Step 12 and the final
report).
- Summary_KW_all_metrics_by_<variable>.xlsx — one summary row per
level.The script handles four edge cases gracefully:
mbx_ezstat_info.txtFinally the cross-step info file records every XLSX produced, the
metadata variables actually tested, and STATUS=COMPLETE. The final
report reads this to know which comparisons to embed.
| Default | Value | Why this default |
|---|---|---|
| Test (two groups) | Wilcoxon rank-sum | Non-parametric — microbiome relative abundances are almost never normal. Robust to outliers. |
| Test (three or more groups) | Kruskal-Wallis | Non-parametric ANOVA equivalent. Same robustness argument. |
| Post-hoc (multi-group, on KW hits) | Dunn's test | The standard non-parametric post-hoc; matches the KW assumptions. |
| Multiple-testing correction | Benjamini-Hochberg (BH) FDR | Most powerful FDR control for the kind of correlated multi-taxon panel we have. |
| Significance threshold | p_adj ≤ 0.05 | Field convention; the user can re-filter the XLSX with any threshold downstream. |
| Compact letter display algorithm | multcompView::multcompLetters |
The standard R implementation; widely used + cited. |
| Per-taxon prevalence filter | none, by default | Filtering hides genuinely-low-abundance signals. We trust BH FDR to handle the false-positive risk. |
| Levels | all seven | Mirrors Step 9 — every level gets statistics. |
| Variables | all auto-detected | Same as Step 9; no enumeration burden on the user. |
| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
| Wilcoxon instead of KW | The variable has exactly two groups | KW reduces to a Wilcoxon equivalent — we use the actual Wilcoxon for the slightly tighter test. |
| Skip taxa with all-zero rows | A taxon has zero abundance in every sample | Nothing to test; skipping avoids a spurious p-value. |
| Skip singleton-group cells | Any group has only one sample | The non-parametric tests need ≥ 2 observations per group. The cell is logged as SKIPPED — singleton group. |
| Skip variables with one valid group after NA filter | All but one group's samples were dropped because of NAs | Nothing to compare; logged + continued. |
| Continue with N<7 levels | A level's XLSX was empty from Step 8 | We never crash because of a downstream level being thin. |
10_stats/<variable>/KW_<level>_by_<variable>.xlsx (per-taxon KW results):
| taxon | n_groups | KW_chi2 | p_value | p_adj_BH | significant |
|---|---|---|---|---|---|
| Bacteroides | 3 | 12.41 | 0.0020 | 0.0118 | TRUE |
| Lactobacillus | 3 | 0.84 | 0.6582 | 0.6582 | FALSE |
| Bilophila | 3 | 7.92 | 0.0190 | 0.0451 | TRUE |
| ... |
CLD_Summary_<level>_by_<variable>.xlsx (compact letter display):
| taxon | group_A | group_B | group_C |
|---|---|---|---|
| Bacteroides | a | ab | b |
| Bilophila | a | b | b |
| ... |
Groups sharing a letter are not significantly different. The boxplot in the final report places these letters on top of each box to summarise the pairwise comparison in one glance.
Step 10 turns the Step-8 abundance tables into a defensible "which-taxa-differ-where" answer. Wilcoxon for two groups, Kruskal-Wallis + Dunn + BH for more, compact letter displays for the boxplots. The work is mechanical but only because Step 10 mechanises it — done by hand across seven levels × N variables × hundreds of taxa, this is the place a human introduces errors.
mbXPro/scripts/mbx_ezstat_all_levels_all_treatments.shmbX::ezstat() on CRAN.