Contents
  1. Why this step matters
  2. What the script does in one sentence
  3. The algorithm, step by step
  4. Default parameters and why they are what they are
  5. When and why we fall back to defaults
  6. What the output file looks like
  7. Takeaway
  8. Sources

Step 16 — Random Forest Biomarker Classifier

Script: scripts/mbx_ml_classifier_run.sh

Companion files in this folder: - 16_ml_classifier.html — same content with copy buttons. - 16_ml_classifier.pptx — slide deck for the talk.


Why this step matters

Steps 10 and 14 (per-taxon stats and ANCOMBC2) answer a descriptive question: which taxa differ between groups? Step 16 answers the matching predictive question: can we PREDICT the group from the taxa?

Both questions matter, and reviewers ask them both:

The classifier of choice is Random Forest (Breiman 2001), because:

Step 16's job is to run the classifier for every (taxonomic level × categorical variable), report cross-validated accuracy + AUC + F1 + confusion matrix + ROC, and identify the taxa that drive the prediction via permutation importance + Saabas-style local SHAP- equivalent contributions.


What the script does in one sentence

For every (taxonomic level × categorical variable), it trains a Random Forest using ranger, with cross-validation auto-selected (5-fold stratified when N ≥ 20, leave-one-out when N < 20), reports accuracy/AUC/F1/sens/spec, computes permutation feature importance and Saabas-style SHAP contributions, and writes per-(level × variable) and per-variable summary spreadsheets so the user can compare classifiers across levels.


The algorithm, step by step

1. Gate on Steps 7 + 8

First the script reads 7_taxonomy_csv/mbx_taxonomy_info.txt for the metadata path and 8_cleaned_files/mbx_ezclean_info.txt for the seven cleaned XLSX paths. It refuses to run if either contract is missing.

2. Auto-detect categorical variables

Then it applies the same categorical-variable detection used by Step 9/10/12 — every column that isn't sample-id, numeric-only, singleton, or all-unique.

3. For every (level × variable): one Random Forest run

For each combination, the script invokes R:

library(ranger)
set.seed(MBX_SEED)

X <- read.xlsx(level_xlsx)        # samples × taxa wide table
y <- X[[selected_variable]]       # the response

# Drop the response column from features
X[[selected_variable]] <- NULL

# Drop other metadata columns
X <- X[, sapply(X, is.numeric), drop = FALSE]

# Cross-validation strategy
cv <- if (N >= 20) "stratified-5-fold" else "LOOCV"

# Train, predict, aggregate per-fold metrics
rf <- ranger(
  dependent.variable.name = "y",
  data            = data,
  num.trees       = NUM_TREES,             # default 500
  importance      = "permutation",
  probability     = TRUE,
  case.weights    = ...,                   # for class imbalance
  num.threads     = MBX_THREADS
)

4. Cross-validation auto-selection

Critical for small-N microbiome data:

5. Handle class imbalance

By default, the script uses ranger's case.weights to weight samples inversely to their class frequency. This stops the classifier from getting "high accuracy" by always predicting the majority class — a known pathology of microbiome classifiers when one treatment group is twice the size of another.

6. Metrics + confusion matrix + ROC

For each run the script computes:

7. Permutation feature importance

For honest variable importance, the script uses permutation importance instead of Gini importance:

This is more honest than Gini importance, which is biased toward high-cardinality features. Output: feature_importance.xlsx and a top20_importance.png barplot.

8. Saabas-style SHAP-equivalent local contributions

For per-sample interpretability, the script computes Saabas contributions (a per-tree decomposition of each prediction into a sum of feature contributions). The result is a per-sample × per-feature matrix saying "this feature pushed this sample's predicted probability up by 0.04". A heatmap shows the top features driving each sample's prediction.

Saabas is the computationally-tractable SHAP equivalent for trees — it gives the same per-sample feature-attribution interpretation without the (much more expensive) full TreeSHAP algorithm.

9. Per-variable summary across the seven levels

Finally the script aggregates the (level × variable) runs into Summary_RF_<Variable>.xlsx:

level accuracy macro_AUC F1 n_features n_used runtime_s
domain 0.55 0.61 0.52 2 2 0.4
phylum 0.71 0.74 0.69 12 12 0.8
class 0.78 0.82 0.77 23 22 1.2
...

That summary is the canonical "which taxonomic level is the most predictive for this variable?" answer.

10. Save the model

Then the script saves the trained model as an RDS file. Advanced users can load it in R and apply it to a new dataset (e.g. cross- validation against an independent cohort).


Default parameters and why they are what they are

Default Value Why this default
Classifier Random Forest (ranger) Robust to non-linearity, non-normal data, small N. The default microbiome ML choice.
Number of trees 500 Empirical sweet spot — more trees rarely changes accuracy; fewer increases variance.
Cross-validation 5-fold stratified when N ≥ 20, else LOOCV Auto-selected per (level × variable). Stratification keeps small groups represented in every fold.
Class imbalance handling case.weights inverse to class frequency Stops the classifier from gaming accuracy by always predicting the majority class.
Importance metric Permutation importance Unbiased; Gini importance favours high-cardinality features.
Local interpretation Saabas SHAP-equivalent Per-sample × per-feature contributions; the tree-friendly SHAP approximation.
Seed MBX_SEED (default 42, L'Ecuyer-CMRG) Reproducibility — required because ranger uses parallel under the hood.
Threads MBX_THREADS Single source of truth.
Plot formats PNG + SVG always; PDF on --publication-figures Publication-ready by default.

When and why we fall back to defaults

Fallback When it triggers Why this fallback exists
LOOCV instead of 5-fold N < 20 samples 5-fold on N = 12 gives 2-sample test folds — noisier than the model.
Skip a (level × variable) cell < 2 valid groups, < 5 samples per group, or no features after filtering Insufficient data to train.
Class-weight balancing One class has ≥ 2× another class's samples Prevents the majority-class gaming.
Skip SHAP if --skip-shap Large N + many features makes Saabas slow Saabas is per-sample × per-feature — quadratic in feature count. User can opt out.
Reuse cached RDS Existing trained model at the expected path Saves the most expensive sub-step on re-runs.
Hard-fail on factor-level mismatch The response column has only one level after filtering Nothing to classify; logged + skipped.

What the output file looks like

Summary_RF_<Variable>.xlsx (the canonical "which level is most predictive?" answer):

level accuracy macro_AUC F1 n_features n_used runtime_s
domain 0.55 0.61 0.52 2 2 0.4
phylum 0.71 0.74 0.69 12 12 0.8
class 0.78 0.82 0.77 23 22 1.2
order 0.83 0.87 0.82 47 41 2.0
family 0.86 0.90 0.85 84 71 2.9
genus 0.88 0.93 0.87 142 109 4.1
species 0.85 0.91 0.83 218 178 5.3

Plus per-(level × variable) RF_<level>_by_<var>/ directories with:


Takeaway

Step 16 is the predictive complement to Steps 10 and 14. A signal that shows up in ANCOMBC2 and drives a Random Forest classification is hard to dismiss. The cross-validation strategy auto-adapts to small-N microbiome reality. Permutation importance + Saabas SHAP give a reviewer the which taxa actually drive the prediction answer.


Sources