Contents
  1. Why this step matters
  2. What the script does in one sentence
  3. The algorithm, step by step
  4. Default parameters and why they are what they are
  5. When and why we fall back to defaults
  6. What the output file looks like
  7. Takeaway
  8. Sources

Step 17 — Co-occurrence Networks

Script: scripts/mbx_network_run.sh

Companion files in this folder: - 17_networks.html — same content with copy buttons. - 17_networks.pptx — slide deck for the talk.


Why this step matters

Every step up to now has treated taxa independently — even ANCOMBC2, which fits one model per taxon. But communities aren't lists; they're networks of co-occurrence and co-exclusion. Two taxa that consistently rise and fall together may share a niche, depend on each other's metabolic byproducts, or compete for the same substrate. Two taxa that consistently substitute for each other may be ecological alternatives — where one thrives, the other can't.

Co-occurrence networks make this structure visible. A node is a taxon; an edge is a statistically supported positive (co-occurrence) or negative (co-exclusion) correlation. The shape of the graph — its hubs, its modules, its density — captures community structure that no per-taxon test can see.

The trick is doing it compositionally. The same problem ANCOMBC2 solved for differential abundance shows up here: raw correlations on relative abundances are biased toward spurious negative correlations (because if one taxon goes up, the rest have to go down by the same total). Step 17 solves it the standard way: CLR transform + Spearman correlation (Gloor et al. 2017), which removes the compositional constraint and uses rank correlation for robustness to the heavy tails microbiome data carries.


What the script does in one sentence

For every (taxonomic level × categorical variable), it CLR-transforms the abundance table, computes Spearman correlations between every pair of taxa, adjusts the resulting p-values with Benjamini-Hochberg, filters edges by |ρ| ≥ ρ_threshold AND q ≤ q_threshold, builds an igraph object, detects modules via Louvain, identifies hub taxa, and exports per-group network files plus a multi-panel comparison plot.


The algorithm, step by step

1. Gate on Steps 7 + 8

First, same gate as Step 16 — read the metadata + the seven cleaned XLSX paths from 8_cleaned_files/mbx_ezclean_info.txt. Disk-discovery fallback if the info file is missing (rare).

2. For every (level × variable), build one global and N per-group networks

For each level requested (genus, family, species by default — those are the levels biologists actually interpret as ecological units):

3. Prevalence filter

Drop taxa present in fewer than --prevalence-threshold fraction of samples (default 30 %). Rare taxa add noise without contributing signal to the correlation matrix.

4. Pseudocount + CLR transform

Add a small pseudocount (pseudocount = 1) to handle zeros, then apply the Centred Log-Ratio (CLR) transform per sample:

CLR(x_i) = log( x_i / g(x) )

where g(x) is the geometric mean of the sample's relative abundances. The CLR transform removes the unit-sum constraint of compositional data: two CLR values can both go up at the same time. From this point on, the data behaves like normal continuous variables.

5. Spearman correlation matrix

Compute the Spearman rank correlation between every pair of taxa. Rank-based is preferred over Pearson because the CLR transform is log-distributed and heavy-tailed; Spearman handles that without an explicit normality assumption.

6. Benjamini-Hochberg FDR

For each pairwise p-value, compute the BH-adjusted q-value across the entire correlation matrix (p × (p − 1) / 2 tests). FDR control is necessary because we're testing thousands of pairs.

7. Edge filtering

Keep an edge only if BOTH:

Both filters are necessary. A very low |ρ| with a tiny q (because N is large) is a real but biologically trivial correlation. A high |ρ| with a non-significant q is suggestive but underpowered.

8. Build the graph and compute node metrics

Construct an undirected igraph with one node per taxon and one edge per surviving pair. Compute per-node:

9. Module detection via Louvain

Run the Louvain algorithm (a greedy modularity-maximization heuristic) to partition the graph into communities. The output modules.tsv says which module each taxon belongs to. Modules often correspond to ecological guilds.

10. Hub identification

Pick the top 10 % of taxa by combined degree + betweenness + hub_score, normalised. These are the network's "keystone-suspect" taxa — taxa whose removal would most disconnect the graph.

11. Group comparison

When per-group networks exist, the script computes:

These go into group_comparison.xlsx and a multi-panel plot.

12. Visualisation

For each network:

Exported as PNG + PDF + GraphML. The GraphML can be opened in Cytoscape for interactive exploration.

13. Write mbx_networks_info.txt

Finally the cross-step contract records every graph path + every metric XLSX, plus STATUS=COMPLETE.


Default parameters and why they are what they are

Default Value Why this default
Approach CLR + Spearman The standard lightweight compositional-aware approach (Gloor et al. 2017). SparCC is Python-only; SpiecEasi pulls heavy Bioconductor deps.
Levels run genus, family, species The levels biologists interpret as ecological units. Higher levels are too coarse.
Per-group minimum N 5 Below five, the correlation estimates are too noisy.
Prevalence threshold 0.30 Drop taxa present in < 30 % of samples; they add noise without signal.
Pseudocount 1 Standard for CLR on count data; small enough to barely affect the transform but eliminates log(0).
ρ threshold 0.30 Below this, the effect is biologically negligible even when statistically significant.
q threshold 0.05 Standard FDR cut.
Multiple-testing correction Benjamini-Hochberg Standard FDR control.
Module detection Louvain Fast, greedy modularity maximisation; widely used.
Hub-detection metric degree + betweenness + hub_score, top 10 % Combines three orthogonal notions of "central".
Layout Fruchterman-Reingold, seeded Reproducible across re-runs (MBX_PLOT_SEED).
Plot formats PNG + PDF + GraphML GraphML opens in Cytoscape for interactive exploration.
Threads MBX_THREADS Single source of truth.
Seed MBX_SEED Single source of truth.

When and why we fall back to defaults

Fallback When it triggers Why this fallback exists
Skip per-group network Group has < --min-group-n samples Too noisy to interpret. The global network still runs.
Skip a level After prevalence filter, < 5 taxa remain Nothing to correlate.
Disk-discovery for cleaned XLSX Step 8 info file missing Find the seven cleaned XLSX by their canonical filename pattern.
Hub list = top 5 % instead of 10 % User opted in via --hubs-fraction Stricter hub definition for sparse networks.
Re-use cached correlation matrix Previous run produced it Saves the matrix computation on re-runs.

What the output file looks like

network_summary.xlsx per network:

metric value
n_nodes 142
n_edges 487
density 0.049
n_modules 6
modularity 0.418
n_positive_edges 312
n_negative_edges 175

hub_taxa.xlsx:

taxon degree betweenness hub_score combined
Bacteroides 24 0.082 0.81 1.00 (rank 1)
Faecalibacterium 19 0.067 0.74 0.83
Bilophila.wadsworthia 17 0.039 0.62 0.71
...

Plus edges.tsv, nodes.tsv, modules.tsv, network.graphml, and the network_plot.{png,pdf}. When per-group networks ran, group_comparison.xlsx + multi_panel_plot.{png,pdf} show side-by-side.


Takeaway

Step 17 is the only step that treats the community as a network rather than a list. CLR + Spearman handles the compositional problem correctly without heavy Bioconductor dependencies. Louvain modules often correspond to ecological guilds; hub taxa are the keystone-suspect species worth following up on. The group-comparison plot is the "did treatment change the network's structure?" answer.


Sources