Metadata-Version: 2.4
Name: cell-extract
Version: 1.3.1
Summary: Extract specific columns from multiple tabular files and merge by row identifier.
License: MIT
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Bioinformatics
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: python-semantic-release>=9.0; extra == "dev"
Dynamic: license-file

# cell-extract

**Extract specific columns from multiple tabular files and merge them by row identifier.**

`cell-extract` reads CSV, TSV, or custom-delimited files, extracts a user-specified column from each, and produces a single merged output table. Duplicate row IDs within a file are handled by averaging their numeric values.

## Installation

From the root of a source checkout, install in editable mode:

```bash
pip install -e .
```

To include the development dependencies (`pytest` and `python-semantic-release`):

```bash
pip install -e ".[dev]"
```

## Quick Start

Given two files:

**alpha.csv**
```
gene,expr,pval
BRCA1,2.3,0.01
TP53,5.1,0.001
```

**beta.csv**
```
gene,expr,pval
BRCA1,4.7,0.02
TP53,3.2,0.05
```

Run:

```bash
cell-extract alpha.csv beta.csv --column expr
```

Output, with one column per input file, named after the file with its extension removed:

```csv
Origin,alpha,beta
BRCA1,2.3,4.7
TP53,5.1,3.2
```
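
The merge above can be sketched in a few lines of Python. This is an illustrative approximation, not the tool's actual implementation: the helper names `extract_column` and `merge_columns` are hypothetical, and the first-seen row order is an assumption.

```python
import csv
import io
from pathlib import PurePath


def extract_column(text, column, delimiter=","):
    """Map each row identifier (first column) to its value in `column`."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    col = header.index(column)
    return {row[0]: row[col] for row in reader if row}


def merge_columns(named_texts, column):
    """Build the merged table: one output column per input file."""
    # Output columns are named after the file stem ("alpha.csv" -> "alpha").
    extracted = {
        PurePath(name).stem: extract_column(text, column)
        for name, text in named_texts.items()
    }
    # Collect row identifiers in first-seen order (an assumption).
    row_ids = list(dict.fromkeys(rid for values in extracted.values() for rid in values))
    rows = [["Origin", *extracted]]
    for rid in row_ids:
        rows.append([rid, *(values.get(rid, "") for values in extracted.values())])
    return rows


files = {
    "alpha.csv": "gene,expr,pval\nBRCA1,2.3,0.01\nTP53,5.1,0.001\n",
    "beta.csv": "gene,expr,pval\nBRCA1,4.7,0.02\nTP53,3.2,0.05\n",
}
for row in merge_columns(files, "expr"):
    print(",".join(row))
```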

## Usage

```
cell-extract [OPTIONS] FILE [FILE ...]
```

### Required Argument

| Option | Description |
|--------|-------------|
| `--column COLUMN`, `-c COLUMN` | Name of the column to extract from each file |

### Options

| Option | Description |
|--------|-------------|
| `--output FILE`, `-o FILE` | Write output to FILE instead of stdout |
| `--output-format {csv,tsv}`, `-f {csv,tsv}` | Output format (default: csv) |
| `--with-source`, `-s` | Add a `Source` column recording which original column was extracted |
| `--delimiter DELIM`, `-d DELIM` | Input delimiter override (e.g. `tab`, `pipe`, `\|`, `;`) |
| `--skip-lines N` | Skip N leading lines in each file before parsing headers |
| `--id-column COLUMN`, `-i COLUMN` | Column to use as row identifier (default: first column) |
| `--version`, `-V` | Show version and exit |
| `--help`, `-h` | Show help message |

## Behavior

- **Row union**: All row identifiers from all files are merged. If a row exists in file A but not file B, the cell for file B will be empty.
- **Duplicate rows averaged**: If a row identifier appears more than once within a file, its values in the target column are averaged. Non-numeric values in the target column are skipped with a warning.
- **Integer formatting**: Averages that are whole numbers are written without a decimal point (e.g. `6`); fractional results keep the decimal (e.g. `1.5`).
- **Empty cells**: Missing values are written as empty fields.
- **Delimiter detection**: The input delimiter is inferred from the file extension: `.csv` → comma; `.tsv` or `.tab` → tab. Use `--delimiter` to override.
- **Output format**: CSV by default; use `-f tsv` for TSV output.
- **Row identifier**: Defaults to the first column of each file. Use `--id-column` to specify a different column.

## Examples

### Multiple files with different row sets

```bash
cell-extract set1.csv set2.csv set3.csv -c expression
```

### TSV input and output

```bash
cell-extract -c count -f tsv sample_A.tsv sample_B.tsv
```

### Save to file

```bash
cell-extract -c fold_change -o merged.csv *.csv
```

### With source tracing column

```bash
cell-extract alpha.csv beta.csv -c expr -s
# Origin,alpha,beta,Source
# BRCA1,2.3,4.7,expr
# TP53,5.1,3.2,expr
```

### Custom delimiter (pipe-delimited files)

```bash
# These two commands are equivalent: the alias "pipe" and the literal "|"
cell-extract -c val -d pipe data1.csv data2.csv
cell-extract -c val -d "|" data1.csv data2.csv
```
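
One way to picture how an override interacts with extension-based detection is the sketch below. It is not taken from the tool's source: `resolve_delimiter` is a hypothetical name, only the `tab` and `pipe` aliases come from the option table, and both the comma fallback for unknown extensions and the case-insensitive extension match are assumptions.

```python
from pathlib import PurePath

# Aliases documented for --delimiter; any other value is used literally.
ALIASES = {"tab": "\t", "pipe": "|"}
# Extension-based defaults described under "Delimiter detection".
BY_EXTENSION = {".csv": ",", ".tsv": "\t", ".tab": "\t"}


def resolve_delimiter(filename, override=None):
    """Return the delimiter for `filename`, honoring an explicit override."""
    if override is not None:
        return ALIASES.get(override, override)
    suffix = PurePath(filename).suffix.lower()  # case-insensitivity is assumed
    return BY_EXTENSION.get(suffix, ",")  # comma fallback is assumed


print(repr(resolve_delimiter("data1.csv", "pipe")))
print(repr(resolve_delimiter("sample_A.tsv")))
```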

### Skip metadata header lines

```bash
# Files with experiment labels before the column headers:
#   Experiment: RNA-seq
#   Date: 2025-02-14
#   gene,expr,pval
#   BRCA1,2.3,0.01

cell-extract -c expr --skip-lines 2 file1.csv file2.csv
```
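
The effect of `--skip-lines` can be approximated with the standard library as follows. This is a minimal sketch under the assumption that skipped lines are simply discarded before the header row is read; `read_rows` is a hypothetical helper, not part of the package.

```python
import csv
import io
from itertools import islice


def read_rows(stream, skip_lines=0, delimiter=","):
    """Discard `skip_lines` leading lines, then parse the rest as a headed table."""
    remaining = islice(stream, skip_lines, None)
    return list(csv.DictReader(remaining, delimiter=delimiter))


text = (
    "Experiment: RNA-seq\n"
    "Date: 2025-02-14\n"
    "gene,expr,pval\n"
    "BRCA1,2.3,0.01\n"
)
rows = read_rows(io.StringIO(text), skip_lines=2)
print(rows[0]["gene"], rows[0]["expr"])  # BRCA1 2.3
```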

### Custom row identifier column

```bash
# Use the "label" column instead of the first column as the row key
cell-extract -c val -i label data.csv
```
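
The choice of row key can be sketched like this, assuming the default is whatever header appears first in the file. The function `column_by_id` and the sample table are hypothetical and only illustrate the documented default.

```python
import csv
import io


def column_by_id(text, column, id_column=None, delimiter=","):
    """Return (row_id, value) pairs, keyed by `id_column` or the first column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    key = id_column if id_column is not None else reader.fieldnames[0]
    return [(row[key], row[column]) for row in reader]


text = "sample,label,val\ns1,ctrl,5\ns2,treated,7\n"
print(column_by_id(text, "val"))                     # keyed by "sample"
print(column_by_id(text, "val", id_column="label"))  # keyed by "label"
```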

### Duplicate rows averaged

**input.csv**
```
id,val
X,10
X,30
```

```bash
cell-extract input.csv -c val
# Origin,input
# X,20        ← (10 + 30) / 2 = 20, written without a decimal point
```
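
The averaging and number-formatting rules can be approximated as below. This is a sketch under stated assumptions rather than the package's own code: `average_duplicates` and `format_number` are hypothetical names, and "numeric" is taken to mean anything Python's `float()` accepts.

```python
import warnings
from collections import defaultdict


def average_duplicates(pairs):
    """Average numeric values per row identifier, skipping non-numeric ones."""
    grouped = defaultdict(list)
    for row_id, raw in pairs:
        try:
            number = float(raw)
        except ValueError:
            warnings.warn(f"skipping non-numeric value {raw!r} for row {row_id!r}")
            continue
        grouped[row_id].append(number)
    return {row_id: sum(vals) / len(vals) for row_id, vals in grouped.items()}


def format_number(value):
    """Write whole numbers without a decimal point, keep fractions as-is."""
    return str(int(value)) if value.is_integer() else str(value)


averages = average_duplicates([("X", "10"), ("X", "30")])
print(format_number(averages["X"]))  # 20
```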

### All features together

```bash
cell-extract -c result -i sample_id -d pipe --skip-lines 3 -s -f tsv -o out.tsv *.csv
```

## Development

Run tests:

```bash
pip install -e ".[dev]"
pytest tests/
```

## License

MIT. See the `LICENSE` file.
