Gut microbiome-based ML pipeline for early CRC and adenoma screening

crc-microbiome-pipeline

A Python implementation of the gut microbiome-based machine learning pipeline for early colorectal cancer and adenoma screening, based on the paper:

Gut microbiome-based machine learning model for early colorectal cancer and adenoma screening
Gut Pathogens (2025) 17:80 — doi:10.1186/s13099-025-00750-z

Overview

This pipeline reproduces the analysis described in the paper. It takes 16S rRNA amplicon sequencing data through differential abundance analysis, random forest classification, and microbial risk score (MRS) calculation to distinguish healthy controls, adenoma patients, and colorectal cancer (CRC) patients.

Pipeline Stages

  1. Preprocessing — Normalize ASV/OTU tables (total-sum scaling), create binary presence/absence matrices, filter low-prevalence features (<5%)
  2. Differential Abundance — ANCOM-BC and chi-square testing to identify discriminatory microbial taxa
  3. Random Forest Classification — Nested stratified 10-fold cross-validation with hyperparameter tuning
  4. Microbial Risk Score (MRS) — Pruning-and-thresholding approach combining alpha diversity indices
  5. Evaluation — AUC, sensitivity, specificity, calibration metrics
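
As a rough illustration of the preprocessing stage, the sketch below applies total-sum scaling, builds a presence/absence matrix, and drops low-prevalence features with pandas. It is not the package's own code: the function name `preprocess`, the `min_prevalence` parameter, and the choice to scale before filtering are all assumptions made for the example.

```python
import pandas as pd


def preprocess(counts: pd.DataFrame, min_prevalence: float = 0.05):
    """Hypothetical helper: return (relative_abundance, presence_absence).

    `counts` has samples as rows and features (ASVs/OTUs) as columns.
    """
    # Binary presence/absence matrix: 1 where a feature was observed at all.
    presence = (counts > 0).astype(int)

    # Prevalence is the fraction of samples in which each feature is present.
    prevalence = presence.mean(axis=0)
    keep = prevalence[prevalence >= min_prevalence].index

    # Total-sum scaling: divide each sample's counts by that sample's total.
    # Samples with a zero total are left as all zeros instead of NaN.
    totals = counts.sum(axis=1)
    rel_abund = counts.div(totals.where(totals > 0), axis=0).fillna(0.0)

    # Assumption: scaling uses all features, filtering happens afterwards,
    # so the filtered rows no longer have to sum to exactly 1.
    return rel_abund[keep], presence[keep]


counts = pd.DataFrame(
    {"ASV_001": [125, 0, 10, 0], "ASV_002": [0, 87, 0, 0], "ASV_003": [43, 12, 5, 7]},
    index=["S1", "S2", "S3", "S4"],
)
# A high threshold is used here only so that the toy table loses a column.
rel_abund, presence = preprocess(counts, min_prevalence=0.5)
print(rel_abund.round(3))
```

With the toy table, `ASV_002` is present in one of four samples and is removed, while the other two features are kept.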

Installation

pip install git+https://git.reslate.solutions/ross/crc-microbiome-pipeline.git

Or for development:

git clone https://git.reslate.solutions/ross/crc-microbiome-pipeline.git
cd crc-microbiome-pipeline
pip install -e ".[dev]"

Quick Start

CLI Usage

# Run the full pipeline on an ASV/OTU table
crc-pipeline run \
  --asv-table data/asv_table.tsv \
  --metadata data/metadata.tsv \
  --output results/

# Run with pre-split train/test sets
crc-pipeline run \
  --train-asv data/train_asv.tsv \
  --train-meta data/train_meta.tsv \
  --test-asv data/test_asv.tsv \
  --test-meta data/test_meta.tsv \
  --output results/

# Just compute microbial risk score
crc-pipeline mrs \
  --asv-table data/asv_table.tsv \
  --metadata data/metadata.tsv \
  --output results/mrs.csv

# Evaluate a trained model
crc-pipeline evaluate \
  --model results/model.pkl \
  --asv-table data/test_asv.tsv \
  --metadata data/test_meta.tsv

Python API

from crc_microbiome_pipeline import CRCPipeline

# Load and preprocess data
pipeline = CRCPipeline()
pipeline.load_data(asv_table="data/asv_table.tsv", metadata="data/metadata.tsv")

# Run the full pipeline
results = pipeline.run()

# Access results
print(f"AUC: {results['auc']:.3f}")
print(f"Sensitivity: {results['sensitivity']:.3f}")
print(f"Specificity: {results['specificity']:.3f}")

# Microbial risk scores
mrs = pipeline.compute_mrs()
print(mrs.head())
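
For readers unfamiliar with the reported metrics, here is a minimal, dependency-free sketch of how sensitivity, specificity, and AUC can be derived from true labels and predicted scores. The helper names and the 0.5 decision threshold are assumptions for illustration; the README does not state how the pipeline itself chooses a threshold.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """y_true: 1 = case, 0 = control. y_score: predicted probability of case."""
    tp = fn = tn = fp = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if truth == 1:
            tp += predicted
            fn += 1 - predicted
        else:
            fp += predicted
            tn += 1 - predicted
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def auc(y_true, y_score):
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    cases = [s for t, s in zip(y_true, y_score) if t == 1]
    controls = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0 for c in cases for k in controls
    )
    return wins / (len(cases) * len(controls))


y_true = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.3, 0.4, 0.2, 0.7, 0.1]
sens, spec = sensitivity_specificity(y_true, y_score)
print(f"Sensitivity: {sens:.3f}")  # 2 of 3 cases detected
print(f"Specificity: {spec:.3f}")  # 3 of 4 controls correctly negative
print(f"AUC: {auc(y_true, y_score):.3f}")
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small illustration but not for large cohorts.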

Input Format

ASV/OTU Table

Tab-separated file with samples as rows and features (ASVs/OTUs/taxa) as columns:

sample_id	ASV_001	ASV_002	ASV_003	...
SAMPLE_01	125	0	43	...
SAMPLE_02	0	87	12	...
...

Metadata

Tab-separated file with sample metadata:

sample_id	group	age	sex	...
SAMPLE_01	CRC	65	M	...
SAMPLE_02	Control	54	F	...
...

Each value in the group column must be one of Control, Adenoma, or CRC; matching is case-insensitive.
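
The two input files can be checked before running the pipeline. The sketch below reads both tables with pandas, normalizes the group labels case-insensitively, and keeps only samples present in both files. The function `load_inputs` and the `VALID_GROUPS` mapping are hypothetical names for this example and are not part of the package's API.

```python
import io

import pandas as pd

# Canonical spelling for each accepted label, keyed by its lowercase form.
VALID_GROUPS = {"control": "Control", "adenoma": "Adenoma", "crc": "CRC"}


def load_inputs(asv_source, meta_source):
    """Hypothetical loader: return (asv_table, metadata) aligned on sample_id."""
    asv = pd.read_csv(asv_source, sep="\t", index_col="sample_id")
    meta = pd.read_csv(meta_source, sep="\t", index_col="sample_id")

    # Accept any capitalisation of the three group names, reject anything else.
    lowered = meta["group"].astype(str).str.strip().str.lower()
    unknown = sorted(set(lowered) - set(VALID_GROUPS))
    if unknown:
        raise ValueError(f"Unrecognised group labels: {unknown}")
    meta["group"] = lowered.map(VALID_GROUPS)

    # Keep only samples that appear in both tables, in the same order.
    shared = asv.index.intersection(meta.index)
    return asv.loc[shared], meta.loc[shared]


# In-memory stand-ins for data/asv_table.tsv and data/metadata.tsv.
asv_text = "sample_id\tASV_001\tASV_002\tASV_003\nSAMPLE_01\t125\t0\t43\nSAMPLE_02\t0\t87\t12\n"
meta_text = "sample_id\tgroup\tage\tsex\nSAMPLE_01\tcrc\t65\tM\nSAMPLE_02\tCONTROL\t54\tF\n"

asv, meta = load_inputs(io.StringIO(asv_text), io.StringIO(meta_text))
print(meta["group"].tolist())
```

File paths can be passed in place of the `io.StringIO` objects, since `pandas.read_csv` accepts both.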

Reproducing Paper Results

# Download the datasets (SRA accessions are listed in the paper)
crc-pipeline fetch-data --output data/

# Run with paper-specific parameters
crc-pipeline run \
  --asv-table data/processed/asv_table.tsv \
  --metadata data/processed/metadata.tsv \
  --ancom-bc \
  --chi2 \
  --prevalence 0.05 \
  --nested-cv 10 \
  --output results/
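
To show what nested stratified cross-validation with hyperparameter tuning looks like in practice, here is a minimal scikit-learn sketch. It is an illustration under stated assumptions, not the pipeline's implementation: the function name `nested_cv_auc`, the default hyperparameter grid, and the restriction to a binary comparison (for example Control vs. CRC) are all choices made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score


def nested_cv_auc(X, y, outer_folds=10, inner_folds=10, param_grid=None, seed=0):
    """Hypothetical helper: return one ROC AUC per outer fold (binary labels)."""
    if param_grid is None:
        # Illustrative grid only; the paper's actual search space may differ.
        param_grid = {"n_estimators": [100, 500], "max_features": ["sqrt", "log2"]}

    inner = StratifiedKFold(n_splits=inner_folds, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=outer_folds, shuffle=True, random_state=seed + 1)

    # Inner loop: tune hyperparameters using only the outer training portion.
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid,
        scoring="roc_auc",
        cv=inner,
    )

    # Outer loop: score the tuned model on folds it never saw during tuning.
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
```

A call such as `nested_cv_auc(X, y).mean()` then gives a performance estimate that is not biased by the hyperparameter search, because tuning never touches the held-out outer fold. Each class needs at least as many samples as there are folds for the stratified splits to succeed.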

Dependencies

  • Python ≥ 3.9
  • scikit-learn ≥ 1.2
  • pandas, numpy, scipy
  • statannotations (for ANCOM-BC-like analysis)

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Commit convention
# Uses conventional commits: feat:, fix:, chore:, docs:, etc.

License

MIT