crc-microbiome-pipeline
A Python implementation of the gut microbiome-based machine learning pipeline for early colorectal cancer and adenoma screening, based on the paper:
Gut microbiome-based machine learning model for early colorectal cancer and adenoma screening
Gut Pathogens (2025) 17:80 — doi:10.1186/s13099-025-00750-z
Overview
This pipeline reproduces the analysis described in the paper. It takes 16S rRNA amplicon sequencing data through differential abundance analysis, random forest classification, and microbial risk score (MRS) calculation to distinguish among healthy controls, adenoma patients, and colorectal cancer (CRC) patients.
Pipeline Stages
- Preprocessing: normalize ASV/OTU tables by total-sum scaling, build binary presence/absence matrices, and filter out low-prevalence features (present in fewer than 5% of samples)
- Differential Abundance: ANCOM-BC and chi-square testing to identify discriminatory microbial taxa
- Random Forest Classification: nested stratified 10-fold cross-validation with hyperparameter tuning
- Microbial Risk Score (MRS): pruning-and-thresholding approach combining alpha diversity indices
- Evaluation: AUC, sensitivity, specificity, and calibration metrics
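As a rough illustration of the preprocessing stage, the sketch below applies total-sum scaling, builds a presence/absence matrix, and drops low-prevalence features using pandas. The function name `preprocess` and its `min_prevalence` parameter are hypothetical and are not part of this package's API; the package's own implementation may differ in order of operations or edge-case handling.

```python
import pandas as pd


def preprocess(counts: pd.DataFrame, min_prevalence: float = 0.05):
    """Return (relative_abundance, presence_absence) after prevalence filtering.

    `counts` has samples as rows and features (ASVs/OTUs) as columns.
    Hypothetical helper for illustration only.
    """
    # Binary presence/absence: 1 where a feature has a non-zero count.
    presence = (counts > 0).astype(int)

    # Prevalence = fraction of samples in which each feature is present.
    prevalence = presence.mean(axis=0)
    keep = prevalence[prevalence >= min_prevalence].index

    # Total-sum scaling: divide each sample's counts by that sample's total.
    # Samples with a zero total are left as all zeros instead of NaN.
    totals = counts.sum(axis=1)
    relative = counts.div(totals.where(totals > 0, 1), axis=0)

    return relative[keep], presence[keep]


# Toy table: ASV_003 is absent everywhere and is filtered out.
counts = pd.DataFrame(
    {"ASV_001": [125, 0, 10, 0], "ASV_002": [0, 87, 5, 3], "ASV_003": [0, 0, 0, 0]},
    index=["S1", "S2", "S3", "S4"],
)
relative, presence = preprocess(counts, min_prevalence=0.25)
print(relative)
```

Here scaling is computed on the full table before filtering, so the retained columns of a sample need not sum to 1. Filtering first and scaling afterwards is an equally plausible choice.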
Installation
pip install git+https://git.reslate.solutions/ross/crc-microbiome-pipeline.git
Or for development:
git clone https://git.reslate.solutions/ross/crc-microbiome-pipeline.git
cd crc-microbiome-pipeline
pip install -e ".[dev]"
Quick Start
CLI Usage
# Run the full pipeline on an ASV/OTU table
crc-pipeline run \
--asv-table data/asv_table.tsv \
--metadata data/metadata.tsv \
--output results/
# Run with pre-split train/test sets
crc-pipeline run \
--train-asv data/train_asv.tsv \
--train-meta data/train_meta.tsv \
--test-asv data/test_asv.tsv \
--test-meta data/test_meta.tsv \
--output results/
# Compute only the microbial risk score
crc-pipeline mrs \
--asv-table data/asv_table.tsv \
--metadata data/metadata.tsv \
--output results/mrs.csv
# Evaluate a trained model
crc-pipeline evaluate \
--model results/model.pkl \
--asv-table data/test_asv.tsv \
--metadata data/test_meta.tsv
Python API
from crc_microbiome_pipeline import CRCPipeline
# Load and preprocess data
pipeline = CRCPipeline()
pipeline.load_data(asv_table="data/asv_table.tsv", metadata="data/metadata.tsv")
# Run the full pipeline
results = pipeline.run()
# Access results
print(f"AUC: {results['auc']:.3f}")
print(f"Sensitivity: {results['sensitivity']:.3f}")
print(f"Specificity: {results['specificity']:.3f}")
# Microbial risk scores
mrs = pipeline.compute_mrs()
print(mrs.head())
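For readers who want to check the reported metrics independently, the following sketch shows one common way to compute AUC, sensitivity, and specificity with scikit-learn for a binary comparison (for example CRC versus Control). The helper `binary_metrics` and the 0.5 threshold are assumptions made for illustration; how `CRCPipeline` computes these values internally is not documented here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def binary_metrics(y_true, y_score, threshold=0.5):
    """Compute AUC, sensitivity, and specificity for a binary task (1 = case).

    Hypothetical helper; assumes both classes occur in `y_true`.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)

    # labels=[0, 1] fixes the matrix layout to [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, y_score),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Made-up labels and scores, not results from the paper.
metrics = binary_metrics(
    y_true=[0, 0, 0, 1, 1, 1],
    y_score=[0.1, 0.4, 0.6, 0.35, 0.8, 0.9],
)
print(metrics)
```

AUC does not depend on the threshold, while sensitivity and specificity do, so the two kinds of metric should be reported together with the cutoff used.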
Input Format
ASV/OTU Table
Tab-separated file with samples as rows and features (ASVs/OTUs/taxa) as columns:
sample_id ASV_001 ASV_002 ASV_003 ...
SAMPLE_01 125 0 43 ...
SAMPLE_02 0 87 12 ...
...
Metadata
Tab-separated file with sample metadata:
sample_id group age sex ...
SAMPLE_01 CRC 65 M ...
SAMPLE_02 Control 54 F ...
...
Each value in the group column must be one of Control, Adenoma, or CRC (matched case-insensitively).
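To check files against this format before running the pipeline, something like the sketch below can be used. It reads both tab-separated inputs with pandas, maps group labels to a canonical spelling, and keeps only samples found in both files. The function `load_inputs` and the `VALID_GROUPS` mapping are hypothetical names for illustration, not the package's loader.

```python
import io

import pandas as pd

# Lower-cased label -> canonical spelling (hypothetical mapping).
VALID_GROUPS = {"control": "Control", "adenoma": "Adenoma", "crc": "CRC"}


def load_inputs(asv_source, metadata_source):
    """Read both TSV inputs, canonicalize group labels, and align samples."""
    asv = pd.read_csv(asv_source, sep="\t", index_col="sample_id")
    meta = pd.read_csv(metadata_source, sep="\t", index_col="sample_id")

    # Group labels are case-insensitive, so map them to one spelling.
    groups = meta["group"].astype(str).str.strip().str.lower().map(VALID_GROUPS)
    if groups.isna().any():
        bad = sorted(meta.loc[groups.isna(), "group"].astype(str).unique())
        raise ValueError(f"Unrecognized group labels: {bad}")
    meta = meta.assign(group=groups)

    # Keep only samples present in both files, in the same order.
    shared = asv.index.intersection(meta.index)
    return asv.loc[shared], meta.loc[shared]


# In-memory stand-ins for the two files, mirroring the examples above.
asv_tsv = "sample_id\tASV_001\tASV_002\nSAMPLE_01\t125\t0\nSAMPLE_02\t0\t87\n"
meta_tsv = (
    "sample_id\tgroup\tage\tsex\n"
    "SAMPLE_01\tcrc\t65\tM\n"
    "SAMPLE_02\tControl\t54\tF\n"
)
asv, meta = load_inputs(io.StringIO(asv_tsv), io.StringIO(meta_tsv))
print(meta["group"].tolist())
```

File paths can be passed in place of the `io.StringIO` objects, since `pandas.read_csv` accepts either.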
Reproducing Paper Results
# Download datasets (SRA accessions are listed in the paper)
crc-pipeline fetch-data --output data/
# Run with paper-specific parameters
crc-pipeline run \
--asv-table data/processed/asv_table.tsv \
--metadata data/processed/metadata.tsv \
--ancom-bc \
--chi2 \
--prevalence 0.05 \
--nested-cv 10 \
--output results/
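The `--nested-cv 10` option presumably sets the number of folds for nested cross-validation. As a sketch of what nested stratified cross-validation with a random forest looks like in scikit-learn, the example below wraps a `GridSearchCV` (inner loop) inside `cross_val_score` (outer loop). The function name `nested_cv_auc`, the hyperparameter grid, and the synthetic data are all assumptions; the paper's actual grid and the package's implementation are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score


def nested_cv_auc(X, y, n_splits=10, seed=0):
    """Return one outer-fold AUC per split from nested stratified CV."""
    # Hypothetical grid, chosen small to keep the example quick.
    param_grid = {"n_estimators": [25, 50], "max_features": ["sqrt", 0.5]}

    inner_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    outer_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + 1)

    # Inner loop: tune hyperparameters using only the outer-training portion.
    tuned_rf = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid=param_grid,
        scoring="roc_auc",
        cv=inner_cv,
    )

    # Outer loop: score the whole tune-then-fit procedure on held-out folds.
    return cross_val_score(tuned_rf, X, y, scoring="roc_auc", cv=outer_cv)


# Synthetic stand-in for a filtered feature table (not real microbiome data).
X, y = make_classification(
    n_samples=60, n_features=20, n_informative=5, random_state=0
)

# n_splits=3 keeps the demo fast; the pipeline stages describe 10 folds.
outer_auc = nested_cv_auc(X, y, n_splits=3)
print(f"Nested CV AUC: {outer_auc.mean():.3f} +/- {outer_auc.std():.3f}")
```

Because tuning happens inside each outer training fold, the outer scores are not biased by hyperparameter selection, which is the reason for nesting the two loops.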
Dependencies
- Python ≥ 3.9
- scikit-learn ≥ 1.2
- pandas, numpy, scipy
- statannotations (for ANCOM-BC-like analysis)
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Commit convention
# Uses conventional commits: feat:, fix:, chore:, docs:, etc.
License
MIT