Script: scripts/mbx_network_run.sh
Companion files in this folder:
- 17_networks.html — same content with copy buttons.
- 17_networks.pptx — slide deck for the talk.
Every step up to now has treated taxa independently — even ANCOMBC2, which fits one model per taxon. But communities aren't lists; they're networks of co-occurrence and co-exclusion. Two taxa that consistently rise and fall together may share a niche, depend on each other's metabolic byproducts, or compete for the same substrate. Two taxa that consistently substitute for each other may be ecological alternatives — where one thrives, the other can't.
Co-occurrence networks make this structure visible. A node is a taxon; an edge is a statistically supported positive (co-occurrence) or negative (co-exclusion) correlation. The shape of the graph — its hubs, its modules, its density — captures community structure that no per-taxon test can see.
The trick is doing it compositionally. The same problem ANCOMBC2 solved for differential abundance shows up here: raw correlations on relative abundances are biased toward spurious negative correlations (because if one taxon goes up, the rest have to go down by the same total). Step 17 solves it the standard way: CLR transform + Spearman correlation (Gloor et al. 2017), which removes the compositional constraint and uses rank correlation for robustness to the heavy tails microbiome data carries.
For every (taxonomic level × categorical variable), it CLR-transforms the
abundance table, computes Spearman correlations between every pair of
taxa, adjusts the resulting p-values with Benjamini-Hochberg, filters
edges by |ρ| ≥ ρ_threshold AND q ≤ q_threshold, builds an igraph object,
detects modules via Louvain, identifies hub taxa, and exports per-group
network files plus a multi-panel comparison plot.
First, same gate as Step 16 — read the metadata + the seven cleaned
XLSX paths from 8_cleaned_files/mbx_ezclean_info.txt. Disk-discovery
fallback if the info file is missing (rare).
For each level requested (genus, family, species by default — those are the levels biologists actually interpret as ecological units):
--min-group-n samples (default 5).
Below that, the correlations are too noisy to be informative.Drop taxa present in fewer than --prevalence-threshold fraction of
samples (default 30 %). Rare taxa add noise without contributing signal
to the correlation matrix.
Add a small pseudocount (pseudocount = 1) to handle zeros, then
apply the Centred Log-Ratio (CLR) transform per sample:
CLR(x_i) = log( x_i / g(x) )
where g(x) is the geometric mean of the sample's relative abundances.
The CLR transform removes the unit-sum constraint of compositional data:
two CLR values can both go up at the same time. From this point on, the
data behaves like normal continuous variables.
Compute the Spearman rank correlation between every pair of taxa. Rank-based is preferred over Pearson because the CLR transform is log-distributed and heavy-tailed; Spearman handles that without an explicit normality assumption.
For each pairwise p-value, compute the BH-adjusted q-value across the
entire correlation matrix (p × (p − 1) / 2 tests). FDR control is
necessary because we're testing thousands of pairs.
Keep an edge only if BOTH:
|ρ| ≥ --rho-threshold (default 0.30) — the effect size is large enough
to mean something biologically.q ≤ --q-threshold (default 0.05) — the BH-adjusted p-value is
significant.Both filters are necessary. A very low |ρ| with a tiny q (because N is large) is a real but biologically trivial correlation. A high |ρ| with a non-significant q is suggestive but underpowered.
Construct an undirected igraph with one node per taxon and one edge per surviving pair. Compute per-node:
Run the Louvain algorithm (a greedy modularity-maximization heuristic)
to partition the graph into communities. The output modules.tsv says
which module each taxon belongs to. Modules often correspond to
ecological guilds.
Pick the top 10 % of taxa by combined degree + betweenness +
hub_score, normalised. These are the network's "keystone-suspect"
taxa — taxa whose removal would most disconnect the graph.
When per-group networks exist, the script computes:
edges / (n × (n − 1) / 2)).These go into group_comparison.xlsx and a multi-panel plot.
For each network:
MBX_PLOT_SEED) so re-runs give identical layouts.ρ (co-exclusion), blue for positive
(co-occurrence).Exported as PNG + PDF + GraphML. The GraphML can be opened in Cytoscape for interactive exploration.
mbx_networks_info.txtFinally the cross-step contract records every graph path + every
metric XLSX, plus STATUS=COMPLETE.
| Default | Value | Why this default |
|---|---|---|
| Approach | CLR + Spearman | The standard lightweight compositional-aware approach (Gloor et al. 2017). SparCC is Python-only; SpiecEasi pulls heavy Bioconductor deps. |
| Levels run | genus, family, species | The levels biologists interpret as ecological units. Higher levels are too coarse. |
| Per-group minimum N | 5 | Below five, the correlation estimates are too noisy. |
| Prevalence threshold | 0.30 | Drop taxa present in < 30 % of samples; they add noise without signal. |
| Pseudocount | 1 | Standard for CLR on count data; small enough to barely affect the transform but eliminates log(0). |
| ρ threshold | 0.30 | Below this, the effect is biologically negligible even when statistically significant. |
| q threshold | 0.05 | Standard FDR cut. |
| Multiple-testing correction | Benjamini-Hochberg | Standard FDR control. |
| Module detection | Louvain | Fast, greedy modularity maximisation; widely used. |
| Hub-detection metric | degree + betweenness + hub_score, top 10 % | Combines three orthogonal notions of "central". |
| Layout | Fruchterman-Reingold, seeded | Reproducible across re-runs (MBX_PLOT_SEED). |
| Plot formats | PNG + PDF + GraphML | GraphML opens in Cytoscape for interactive exploration. |
| Threads | MBX_THREADS |
Single source of truth. |
| Seed | MBX_SEED |
Single source of truth. |
| Fallback | When it triggers | Why this fallback exists |
|---|---|---|
| Skip per-group network | Group has < --min-group-n samples |
Too noisy to interpret. The global network still runs. |
| Skip a level | After prevalence filter, < 5 taxa remain | Nothing to correlate. |
| Disk-discovery for cleaned XLSX | Step 8 info file missing | Find the seven cleaned XLSX by their canonical filename pattern. |
| Hub list = top 5 % instead of 10 % | User opted in via --hubs-fraction |
Stricter hub definition for sparse networks. |
| Re-use cached correlation matrix | Previous run produced it | Saves the matrix computation on re-runs. |
network_summary.xlsx per network:
| metric | value |
|---|---|
| n_nodes | 142 |
| n_edges | 487 |
| density | 0.049 |
| n_modules | 6 |
| modularity | 0.418 |
| n_positive_edges | 312 |
| n_negative_edges | 175 |
hub_taxa.xlsx:
| taxon | degree | betweenness | hub_score | combined |
|---|---|---|---|---|
| Bacteroides | 24 | 0.082 | 0.81 | 1.00 (rank 1) |
| Faecalibacterium | 19 | 0.067 | 0.74 | 0.83 |
| Bilophila.wadsworthia | 17 | 0.039 | 0.62 | 0.71 |
| ... |
Plus edges.tsv, nodes.tsv, modules.tsv, network.graphml, and the
network_plot.{png,pdf}. When per-group networks ran,
group_comparison.xlsx + multi_panel_plot.{png,pdf} show side-by-side.
Step 17 is the only step that treats the community as a network rather than a list. CLR + Spearman handles the compositional problem correctly without heavy Bioconductor dependencies. Louvain modules often correspond to ecological guilds; hub taxa are the keystone-suspect species worth following up on. The group-comparison plot is the "did treatment change the network's structure?" answer.
mbXPro/scripts/mbx_network_run.sh