Gut microbiome-based ML pipeline for early CRC and adenoma screening


Latest commit: Ross Fox, 73e850c875, 2026-05-08 21:55:01 +00:00. chore: migrate from master to main branch (CI trigger branch: master -> main; semantic-release branches: master -> main; Forgejo default branch updated to main). Checks: Release / test (3.10, 3.11, 3.12, 3.14) successful; Release / release failing.
.forgejo/workflows	chore: migrate from master to main branch	2026-05-08 21:55:01 +00:00
.husky	feat: initial implementation of CRC microbiome screening pipeline	2026-05-08 21:10:54 +00:00
scripts	feat: initial implementation of CRC microbiome screening pipeline	2026-05-08 21:10:54 +00:00
src/crc_microbiome_pipeline	fix: add missing click import in slurm module	2026-05-08 21:49:30 +00:00
tests	feat: initial implementation of CRC microbiome screening pipeline	2026-05-08 21:10:54 +00:00
.commitlintrc.json	feat: initial implementation of CRC microbiome screening pipeline	2026-05-08 21:10:54 +00:00
.gitignore	feat: initial implementation of CRC microbiome screening pipeline	2026-05-08 21:10:54 +00:00
.releaserc.json	chore: migrate from master to main branch	2026-05-08 21:55:01 +00:00
CHANGELOG.md	feat: add Slurm job submission support	2026-05-08 21:35:17 +00:00
package.json	feat: initial implementation of CRC microbiome screening pipeline	2026-05-08 21:10:54 +00:00
pyproject.toml	feat: initial implementation of CRC microbiome screening pipeline	2026-05-08 21:10:54 +00:00
README.md	feat: initial implementation of CRC microbiome screening pipeline	2026-05-08 21:10:54 +00:00
uv.lock	feat: initial implementation of CRC microbiome screening pipeline	2026-05-08 21:10:54 +00:00

README.md

crc-microbiome-pipeline

A Python implementation of the gut microbiome-based machine learning pipeline for early colorectal cancer and adenoma screening described in the following paper:

Gut microbiome-based machine learning model for early colorectal cancer and adenoma screening
Gut Pathogens (2025) 17:80 — doi:10.1186/s13099-025-00750-z

Overview

This pipeline reproduces the analysis described in the paper. It takes 16S rRNA amplicon sequencing data through differential abundance analysis, random forest classification, and microbial risk score (MRS) calculation in order to distinguish healthy controls, adenoma patients, and colorectal cancer (CRC) patients.
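As a rough illustration of the chi-square part of the differential abundance analysis, the sketch below tests whether the presence of one taxon is associated with case status. It assumes a two-group comparison and uses `scipy.stats.chi2_contingency`. The function name `presence_chi2` and the toy vectors are hypothetical and are not taken from the package.

```python
import numpy as np
from scipy.stats import chi2_contingency


def presence_chi2(present, is_case):
    """Chi-square test of association between taxon presence and case status.

    present: one boolean per sample, True if the taxon was detected.
    is_case: one boolean per sample, True for cases, False for controls.
    """
    present = np.asarray(present, dtype=bool)
    is_case = np.asarray(is_case, dtype=bool)

    # 2x2 contingency table: rows are case/control, columns are present/absent.
    table = np.array([
        [np.sum(present & is_case), np.sum(~present & is_case)],
        [np.sum(present & ~is_case), np.sum(~present & ~is_case)],
    ])

    # chi2_contingency raises ValueError when an expected count is zero,
    # for example when the taxon is present in every sample.
    stat, p_value, _dof, _expected = chi2_contingency(table)
    return table, float(stat), float(p_value)


# Toy data (made up): 5 cases followed by 5 controls.
demo_present = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
demo_is_case = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
demo_table, demo_stat, demo_p = presence_chi2(demo_present, demo_is_case)
print(demo_table, demo_stat, demo_p)
```

In practice the test would be repeated for every retained feature, and the resulting p-values would normally be corrected for multiple testing.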

Pipeline Stages

Preprocessing: normalize ASV/OTU tables by total-sum scaling, create binary presence/absence matrices, and remove low-prevalence features (those present in fewer than 5% of samples)
Differential Abundance: ANCOM-BC and chi-square testing to identify discriminatory microbial taxa
Random Forest Classification: nested stratified 10-fold cross-validation with hyperparameter tuning
Microbial Risk Score (MRS): pruning-and-thresholding approach combining alpha diversity indices
Evaluation: AUC, sensitivity, specificity, and calibration metrics
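The preprocessing stage can be sketched in a few lines of pandas. This is a minimal illustration, not the package's own code. It assumes samples are rows and features are columns, as described under Input Format. The function name `preprocess` and its `min_prevalence` argument are hypothetical.

```python
import pandas as pd


def preprocess(counts: pd.DataFrame, min_prevalence: float = 0.05):
    """Return (relative_abundance, presence_absence) for the retained features.

    counts: raw read counts, samples as rows and features as columns.
    min_prevalence: minimum fraction of samples in which a feature must occur.
    """
    # Presence/absence matrix: 1 where a feature was observed at least once.
    presence = (counts > 0).astype(int)

    # Prevalence is the fraction of samples in which each feature is present.
    prevalence = presence.mean(axis=0)
    keep = prevalence[prevalence >= min_prevalence].index

    # Total-sum scaling: divide every sample by its total read count.
    # Samples with zero reads become all-zero rows instead of NaN.
    totals = counts.sum(axis=1)
    rel_abund = counts.div(totals.where(totals > 0), axis=0).fillna(0.0)

    return rel_abund[keep], presence[keep]


# Toy table (made up); ASV_004 occurs in only one of four samples.
demo_counts = pd.DataFrame(
    {
        "ASV_001": [125, 0, 10, 0],
        "ASV_002": [0, 87, 5, 20],
        "ASV_003": [43, 12, 0, 0],
        "ASV_004": [0, 0, 0, 1],
    },
    index=["SAMPLE_01", "SAMPLE_02", "SAMPLE_03", "SAMPLE_04"],
)
demo_rel, demo_presence = preprocess(demo_counts, min_prevalence=0.5)
print(demo_rel.round(3))
```

Here scaling is applied before filtering, so the relative abundances of a sample sum to 1 over all features rather than over the retained ones. Whether the package uses this order is not stated in this README.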

Installation

pip install git+https://git.reslate.solutions/ross/crc-microbiome-pipeline.git

Or for development:

git clone https://git.reslate.solutions/ross/crc-microbiome-pipeline.git
cd crc-microbiome-pipeline
pip install -e ".[dev]"

Quick Start

CLI Usage

# Run the full pipeline on an ASV/OTU table
crc-pipeline run \
  --asv-table data/asv_table.tsv \
  --metadata data/metadata.tsv \
  --output results/

# Run with pre-split train/test sets
crc-pipeline run \
  --train-asv data/train_asv.tsv \
  --train-meta data/train_meta.tsv \
  --test-asv data/test_asv.tsv \
  --test-meta data/test_meta.tsv \
  --output results/

# Just compute microbial risk score
crc-pipeline mrs \
  --asv-table data/asv_table.tsv \
  --metadata data/metadata.tsv \
  --output results/mrs.csv

# Evaluate a trained model
crc-pipeline evaluate \
  --model results/model.pkl \
  --asv-table data/test_asv.tsv \
  --metadata data/test_meta.tsv
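The Evaluation stage lists AUC, sensitivity, and specificity. For a binary comparison these can be computed from true labels and predicted probabilities with scikit-learn, as in the sketch below. The helper name `binary_metrics` and the 0.5 threshold are assumptions for illustration. The output of `crc-pipeline evaluate` may be organized differently.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def binary_metrics(y_true, y_score, threshold=0.5):
    """Compute AUC, sensitivity, and specificity for a binary classifier.

    y_true: 0/1 labels (1 = case).
    y_score: predicted probability of the positive class.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    # Turn probabilities into hard calls at the chosen threshold.
    y_pred = (y_score >= threshold).astype(int)

    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    return {
        "auc": float(roc_auc_score(y_true, y_score)),
        "sensitivity": float(tp / (tp + fn)),  # true positive rate
        "specificity": float(tn / (tn + fp)),  # true negative rate
    }


# Toy predictions (made up).
demo_metrics = binary_metrics([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9])
print(demo_metrics)
```

AUC does not depend on the threshold, while sensitivity and specificity do, so the threshold should be reported alongside them.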

Python API

from crc_microbiome_pipeline import CRCPipeline

# Load and preprocess data
pipeline = CRCPipeline()
pipeline.load_data(asv_table="data/asv_table.tsv", metadata="data/metadata.tsv")

# Run the full pipeline
results = pipeline.run()

# Access results
print(f"AUC: {results['auc']:.3f}")
print(f"Sensitivity: {results['sensitivity']:.3f}")
print(f"Specificity: {results['specificity']:.3f}")

# Microbial risk scores
mrs = pipeline.compute_mrs()
print(mrs.head())
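The MRS stage is described as combining alpha diversity indices, but this README does not say which ones. As an illustration only, the sketch below computes two common per-sample indices, Shannon diversity and observed richness, from a vector of counts. The function names are hypothetical, and the way `compute_mrs()` combines such indices into a score is not shown.

```python
import numpy as np


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over features with p_i > 0."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0  # an empty sample has no diversity
    p = counts[counts > 0] / total
    return float(-(p * np.log(p)).sum())


def observed_richness(counts):
    """Number of features observed at least once in the sample."""
    return int((np.asarray(counts) > 0).sum())


# Counts taken from the first example row of the ASV table layout.
demo_sample = [125, 0, 43]
print(shannon_index(demo_sample), observed_richness(demo_sample))
```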

Input Format

ASV/OTU Table

Tab-separated file with samples as rows and features (ASVs/OTUs/taxa) as columns:

sample_id	ASV_001	ASV_002	ASV_003	...
SAMPLE_01	125	0	43	...
SAMPLE_02	0	87	12	...
...

Metadata

Tab-separated file with sample metadata:

sample_id	group	age	sex	...
SAMPLE_01	CRC	65	M	...
SAMPLE_02	Control	54	F	...
...

Each value in the group column must be one of Control, Adenoma, or CRC (matched case-insensitively).
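A minimal sketch of reading the two files and checking them against the format above. It assumes the column names `sample_id` and `group` from the examples. The helper `load_inputs` is hypothetical and is not the loader used by `CRCPipeline.load_data`.

```python
import io

import pandas as pd

# Canonical spelling for each accepted group label, keyed by lowercase form.
VALID_GROUPS = {"control": "Control", "adenoma": "Adenoma", "crc": "CRC"}


def load_inputs(asv_source, meta_source):
    """Read the ASV/OTU table and metadata, validate groups, align samples."""
    asv = pd.read_csv(asv_source, sep="\t", index_col="sample_id")
    meta = pd.read_csv(meta_source, sep="\t", index_col="sample_id")

    # Match group labels case-insensitively; unknown labels map to NaN.
    normalized = meta["group"].astype(str).str.strip().str.lower().map(VALID_GROUPS)
    bad = meta.index[normalized.isna()].tolist()
    if bad:
        raise ValueError(f"Unrecognized group for samples: {bad}")
    meta = meta.assign(group=normalized)

    # Keep only samples present in both files, in the ASV table's order.
    shared = asv.index.intersection(meta.index)
    return asv.loc[shared], meta.loc[shared]


# In-memory stand-ins for the two TSV files (made-up values).
demo_asv_text = "sample_id\tASV_001\tASV_002\nSAMPLE_01\t125\t0\nSAMPLE_02\t0\t87\n"
demo_meta_text = (
    "sample_id\tgroup\tage\tsex\n"
    "SAMPLE_01\tcrc\t65\tM\n"
    "SAMPLE_02\tControl\t54\tF\n"
    "SAMPLE_03\tADENOMA\t60\tF\n"
)
demo_asv, demo_meta = load_inputs(io.StringIO(demo_asv_text), io.StringIO(demo_meta_text))
print(demo_meta["group"].tolist())
```

Failing early on an unrecognized label avoids silently dropping samples before classification.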

Reproducing Paper Results

# Download datasets (SRA accessions in paper)
crc-pipeline fetch-data --output data/

# Run with paper-specific parameters
crc-pipeline run \
  --asv-table data/processed/asv_table.tsv \
  --metadata data/processed/metadata.tsv \
  --ancom-bc \
  --chi2 \
  --prevalence 0.05 \
  --nested-cv 10 \
  --output results/
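For readers unfamiliar with nested cross-validation, the sketch below shows the general pattern with scikit-learn: an inner stratified loop tunes the random forest and an outer stratified loop estimates performance on data the tuning never saw. The parameter grid, the `roc_auc` scoring, the synthetic data, and the name `nested_cv_auc` are all assumptions for illustration. The paper's actual settings are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score


def nested_cv_auc(X, y, n_splits=10, seed=0):
    """Return one outer-fold AUC per split from nested stratified CV."""
    inner = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + 1)

    # Illustrative grid only; a real search would likely be wider.
    grid = {"n_estimators": [25, 50], "max_depth": [None, 3]}

    # Inner loop: pick hyperparameters using the outer training fold only.
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid=grid,
        scoring="roc_auc",
        cv=inner,
    )

    # Outer loop: score the tuned model on each held-out fold.
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer)


# Synthetic binary data stands in for a real abundance table.
X_demo, y_demo = make_classification(
    n_samples=90, n_features=20, n_informative=5, random_state=0
)

# Three folds keep the demo fast; the README specifies 10.
demo_scores = nested_cv_auc(X_demo, y_demo, n_splits=3)
print(demo_scores.mean())
```

Because hyperparameters are chosen inside each outer training fold, the outer scores are not biased by the tuning step.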

Dependencies

Python ≥ 3.9
scikit-learn ≥ 1.2
pandas, numpy, scipy
statannotations (for ANCOM-BC-like analysis)

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Commit convention
# Uses conventional commits: feat:, fix:, chore:, docs:, etc.

License

MIT