features2proteins: Unified Pipeline
The features2proteins command is the recommended way to go from raw feature data to protein quantification. A single invocation handles loading, filtering, normalization, quantification, batch correction, internal reference scaling (IRS), differential expression, and visualization.
Basic Usage
from mokume.pipeline import QuantificationPipeline, PipelineConfig
from mokume.pipeline.config import (
    InputConfig, QuantificationConfig, NormalizationConfig,
)
# Describe the inputs and choose the quantification method
config = PipelineConfig(
    input=InputConfig(parquet="features.parquet", sdrf="experiment.sdrf.tsv"),
    quantification=QuantificationConfig(method="maxlfq"),
)
# Build the pipeline and run all configured steps
pipeline = QuantificationPipeline(config)
proteins = pipeline.run()
Quantification Methods
| Method | CLI Flag | FASTA Required | Description |
|---|---|---|---|
| MaxLFQ | --quant-method maxlfq | No | Delayed normalization (default) |
| DirectLFQ | --quant-method directlfq | No | Hierarchical alignment (requires extra) |
| iBAQ | --quant-method ibaq | Yes | Absolute quantification |
| TopN | --quant-method topn | No | Average of N most intense peptides |
| Sum | --quant-method sum | No | Sum of all peptides |
| Median | --quant-method median | No | Median peptide intensity |
| Ratio | --quant-method ratio | No | Log2 sample/reference (TMT) |
In practice:
- Use maxlfq as the default starting point for standard LFQ workflows.
- Use directlfq when you explicitly want the DirectLFQ package to handle normalization and quantification together.
- Use ibaq when you need absolute-style quantification and have a FASTA file.
- Use ratio for TMT PS-style reference-based analysis.
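The simpler strategies in the table (TopN, Sum, Median) differ only in how one protein's peptide intensities are collapsed into a single value. The sketch below illustrates that idea on a plain list of intensities. The function name `summarize_protein` is hypothetical and not part of mokume, whose implementation may treat missing values and log scaling differently.

```python
from statistics import mean, median


def summarize_protein(intensities, method="topn", n=3):
    """Collapse one protein's peptide intensities into a single value."""
    # Ignore missing or non-positive measurements (an assumption of this sketch).
    values = [v for v in intensities if v is not None and v > 0]
    if not values:
        return None
    if method == "sum":
        return sum(values)
    if method == "median":
        return median(values)
    if method == "topn":
        # Average of the n most intense peptides.
        top = sorted(values, reverse=True)[:n]
        return mean(top)
    raise ValueError(f"unknown method: {method}")


peptides = [100.0, 400.0, 200.0, 50.0, None]
print(summarize_protein(peptides, "sum"))        # 750.0
print(summarize_protein(peptides, "median"))     # 150.0
print(summarize_protein(peptides, "topn", n=2))  # 300.0
```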
# iBAQ (requires FASTA)
mokume features2proteins \
-p features.parquet -o proteins.csv \
--quant-method ibaq --fasta proteome.fasta
# TopN (Top5)
mokume features2proteins \
-p features.parquet -o proteins.csv \
--quant-method topn --topn 5
# DirectLFQ (pip install mokume[directlfq])
mokume features2proteins \
-p features.parquet -o proteins.csv \
--quant-method directlfq --directlfq-cores 4
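iBAQ needs the FASTA file because it divides a protein's summed peptide intensity by the number of peptides the protein could theoretically produce. The sketch below shows that calculation under stated assumptions: tryptic cleavage after K or R unless followed by P, and a peptide length window of 6 to 30 residues. Both function names and both defaults are illustrative; the digestion settings mokume actually uses may differ.

```python
import re


def tryptic_peptides(sequence, min_len=6, max_len=30):
    """Cleave after K or R (not before P) and keep peptides within a length window."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def ibaq(peptide_intensities, sequence):
    """Summed intensity divided by the number of theoretical peptides."""
    n_theoretical = len(tryptic_peptides(sequence))
    if n_theoretical == 0:
        return None
    return sum(peptide_intensities) / n_theoretical


toy_sequence = "AAAAAAKCCCCCCCRPDDDDDDKEE"
print(tryptic_peptides(toy_sequence))      # ['AAAAAAK', 'CCCCCCCRPDDDDDDK']
print(ibaq([300.0, 500.0], toy_sequence))  # 400.0
```

Because of the division, a long protein with many observable peptides is not scored as more abundant than a short one simply for yielding more signal.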
Normalization Options
Run-Level Normalization
Adjusts for intensity differences between MS runs within each sample.
mokume features2proteins \
-p features.parquet -o proteins.csv \
--run-normalization median # median, mean, max, global, max_min, iqr, none
Sample-Level Normalization
Adjusts for systematic differences across samples.
# Global median (default)
mokume features2proteins -p data.parquet -o out.csv \
--sample-normalization globalMedian
# Hierarchical (DirectLFQ-style)
mokume features2proteins -p data.parquet -o out.csv \
--sample-normalization hierarchical
# TMM normalization
mokume features2proteins -p data.parquet -o out.csv \
--sample-normalization tmm
# With specific normalization proteins
mokume features2proteins -p data.parquet -o out.csv \
--sample-normalization hierarchical \
--normalization-proteins housekeeping.txt
- globalMedian is the default and a good general-purpose starting point.
- hierarchical is useful when you want DirectLFQ-style normalization with a non-DirectLFQ quantification method.
- tmm is available for composition-bias-aware sample normalization.
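As a rough illustration of what a global-median normalization does, the sketch below scales every sample so that its median intensity matches the median of all sample medians, optionally computing the medians from a restricted set of normalization proteins. It is a minimal sketch on nested dictionaries and is not taken from mokume: the function name, the data layout, and the choice of linear rather than log scale are all assumptions.

```python
from statistics import median


def global_median_normalize(samples, normalization_proteins=None):
    """Scale each sample so its median matches the median of all sample medians.

    samples: {sample_name: {protein: intensity}}
    normalization_proteins: optional set of proteins used to compute the medians
    """
    def sample_median(values):
        if normalization_proteins is not None:
            values = {p: v for p, v in values.items() if p in normalization_proteins}
        return median(values.values())

    medians = {name: sample_median(values) for name, values in samples.items()}
    target = median(medians.values())
    # Every intensity in a sample is multiplied by the same factor.
    return {
        name: {p: v * target / medians[name] for p, v in values.items()}
        for name, values in samples.items()
    }


samples = {
    "S1": {"P1": 10.0, "P2": 20.0, "P3": 30.0},
    "S2": {"P1": 20.0, "P2": 40.0, "P3": 60.0},  # same sample, loaded twice as heavily
}
normalized = global_median_normalize(samples)
print(normalized["S1"] == normalized["S2"])  # True
```

Passing `normalization_proteins={"P1"}` derives the scaling factors from that protein alone while still rescaling every protein, which is the general idea behind supplying a list of normalization proteins.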
IRS Normalization (Multi-Plex TMT)
For TMT experiments with shared reference channels across plexes:
# Auto-detect references from SDRF
mokume features2proteins \
-p features.parquet -o proteins.csv -s experiment.sdrf.tsv \
--quant-method median \
--irs --irs-remove-reference
# Explicit reference samples
mokume features2proteins \
-p features.parquet -o proteins.csv -s experiment.sdrf.tsv \
--quant-method median \
--irs --irs-reference-samples "p1_11,p2_11"
# Custom regex for reference detection
mokume features2proteins \
-p features.parquet -o proteins.csv -s experiment.sdrf.tsv \
--irs --irs-reference-regex "pool|bridge|control"
| IRS Option | Default | Description |
|---|---|---|
| --irs | off | Enable IRS normalization |
| --irs-reference-samples | auto | Comma-separated reference sample names |
| --irs-sdrf-column | auto | SDRF column for reference detection |
| --irs-sdrf-values | auto | Values indicating reference samples |
| --irs-reference-regex | pool\|powder\|ref\|reference\|bridge | Regex for auto-detection |
| --irs-stat | median | Statistic for plex reference: median or mean |
| --irs-remove-reference | off | Remove reference samples from output |
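The options above map onto the usual IRS recipe: find the reference channels in each plex, summarize them per protein, and rescale each plex so that its reference matches the cross-plex geometric mean. The sketch below follows that recipe on nested dictionaries. It is an illustration rather than mokume's implementation: the function name and data layout are hypothetical, matching the regex case-insensitively is an assumption, and every protein is assumed to have a positive value in every channel.

```python
import math
import re
from statistics import mean, median

# Default pattern as listed for --irs-reference-regex.
DEFAULT_REFERENCE_REGEX = r"pool|powder|ref|reference|bridge"


def irs_normalize(plexes, reference_regex=DEFAULT_REFERENCE_REGEX,
                  stat="median", remove_reference=False):
    """plexes: {plex: {sample: {protein: intensity}}} -> same shape, IRS-scaled."""
    pattern = re.compile(reference_regex, re.IGNORECASE)  # assumption: case-insensitive
    combine = median if stat == "median" else mean

    # 1. Per-plex reference intensity for every protein.
    plex_refs = {}
    for plex, samples in plexes.items():
        refs = [values for name, values in samples.items() if pattern.search(name)]
        if not refs:
            raise ValueError(f"no reference sample found in plex {plex!r}")
        plex_refs[plex] = {p: combine([r[p] for r in refs]) for p in refs[0]}

    # 2. Per-protein target: geometric mean of the plex references.
    proteins = list(next(iter(plex_refs.values())))
    target = {
        p: math.exp(mean([math.log(plex_refs[plex][p]) for plex in plex_refs]))
        for p in proteins
    }

    # 3. Scale every channel in a plex by target / plex reference.
    return {
        plex: {
            name: {p: v * target[p] / plex_refs[plex][p] for p, v in values.items()}
            for name, values in samples.items()
            if not (remove_reference and pattern.search(name))
        }
        for plex, samples in plexes.items()
    }


plexes = {
    "p1": {"p1_pool": {"A": 100.0}, "p1_s1": {"A": 50.0}},
    "p2": {"p2_pool": {"A": 400.0}, "p2_s1": {"A": 200.0}},
}
normalized = irs_normalize(plexes, remove_reference=True)
print(normalized)
```

In this toy input both samples sit at half of their plex reference, so after scaling they end up at approximately the same intensity even though the raw plexes differ fourfold.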
Ratio Quantification (TMT PS Protocol)
For multi-plex TMT with per-plex reference division:
mokume features2proteins \
-p features.parquet -o proteins.csv -s experiment.sdrf.tsv \
--quant-method ratio \
--coverage-threshold 0.65 \
--ratio-fraction-merge mean
Info
Ratio quantification handles cross-plex normalization inherently via per-plex reference division. The --irs flag is ignored in ratio mode.
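The reason no separate IRS step is needed becomes clear from the arithmetic: dividing each channel by the reference of its own plex cancels any plex-specific scale factor. The sketch below computes log2 sample/reference values for one plex. The function name and data layout are hypothetical, and the coverage threshold and fraction merging are not modeled.

```python
import math


def log2_ratios(plex, reference_sample):
    """plex: {sample: {protein: intensity}} -> {sample: {protein: log2(sample / reference)}}."""
    reference = plex[reference_sample]
    return {
        name: {
            p: math.log2(v / reference[p])
            for p, v in values.items()
            # Skip proteins without a usable reference or sample intensity.
            if p in reference and reference[p] > 0 and v > 0
        }
        for name, values in plex.items()
        if name != reference_sample
    }


plex = {
    "ref": {"A": 100.0, "B": 50.0},
    "s1": {"A": 400.0, "B": 25.0},
}
print(log2_ratios(plex, "ref"))  # {'s1': {'A': 2.0, 'B': -1.0}}
```

Multiplying every intensity in the plex by a constant leaves these ratios unchanged, which is why values from different plexes are directly comparable.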
Batch Correction
mokume features2proteins \
-p features.parquet -o proteins.csv -s experiment.sdrf.tsv \
--quant-method maxlfq \
--batch-correction \
--batch-method sample_prefix \
--batch-covariates "characteristics[sex],characteristics[organism part]"
# The same batch-correction settings through the Python API
from mokume.pipeline import PipelineConfig
from mokume.pipeline.config import (
    BatchCorrectionConfig, InputConfig, QuantificationConfig,
)
config = PipelineConfig(
    input=InputConfig(parquet="data.parquet", sdrf="experiment.sdrf.tsv"),
    quantification=QuantificationConfig(method="maxlfq"),
    batch=BatchCorrectionConfig(
        enabled=True,
        method="sample_prefix",
        covariates=["characteristics[sex]", "characteristics[organism part]"],
    ),
)
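To give a feel for what a prefix-based batch assignment implies, the sketch below derives a batch label from each sample name and then removes per-batch offsets for one protein by mean-centering in log space. Both parts are assumptions made for illustration: the naming convention (batch, underscore, channel) is a guess, and mean-centering is a deliberately simple stand-in that ignores covariates, not the correction model mokume applies.

```python
from collections import defaultdict
from statistics import mean


def batch_from_prefix(sample_name, separator="_"):
    """Assumed convention: everything before the first separator names the batch."""
    return sample_name.split(separator, 1)[0]


def center_batches(log_values):
    """log_values: {sample: log2 intensity of one protein}.

    Shift each batch so that its mean equals the overall mean.
    """
    batches = defaultdict(list)
    for name in log_values:
        batches[batch_from_prefix(name)].append(name)
    overall = mean(list(log_values.values()))
    centered = {}
    for members in batches.values():
        shift = overall - mean([log_values[m] for m in members])
        for m in members:
            centered[m] = log_values[m] + shift
    return centered


one_protein = {"p1_1": 10.0, "p1_2": 12.0, "p2_1": 14.0, "p2_2": 16.0}
print(center_batches(one_protein))
# {'p1_1': 12.0, 'p1_2': 14.0, 'p2_1': 12.0, 'p2_2': 14.0}
```

Plain centering like this would also erase real biological differences that happen to align with the batches, which is the motivation for passing covariates to a proper correction model.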
Differential Expression
Contrasts must be explicitly specified via --de-contrasts (inline) or --de-contrasts-file (TSV). Both can be combined.
| DE Option | Default | Description |
|---|---|---|
| --de | off | Enable differential expression |
| --de-contrasts | — | Comma-separated contrasts (e.g., "A vs B,A vs C") |
| --de-contrasts-file | — | TSV file with columns group1, group2 |
| --de-method | auto | Method: auto, limrots, deqms, or proda |
| --de-log2fc | 0.5 | Minimum absolute log2 fold change |
| --de-fdr | 0.05 | Maximum FDR threshold |
| --de-fdr-method | bh | FDR correction: bh or ihw |
| --de-output | auto | Output file for DE results |
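The two thresholds act on different quantities: the fold-change cutoff on the effect size, the FDR cutoff on multiplicity-adjusted p-values. The sketch below implements the standard Benjamini-Hochberg adjustment (the bh option) and a combined filter using the default cutoffs from the table. The function names are hypothetical, and whether mokume treats the cutoffs as inclusive is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_significant(log2fc, fdr, min_log2fc=0.5, max_fdr=0.05):
    """Apply both cutoffs; inclusive comparison is an assumption of this sketch."""
    return abs(log2fc) >= min_log2fc and fdr <= max_fdr


p_values = [0.01, 0.04, 0.03, 0.20]
log2fcs = [1.2, -0.8, 0.3, 2.0]
fdrs = benjamini_hochberg(p_values)
hits = [is_significant(fc, q) for fc, q in zip(log2fcs, fdrs)]
print(hits)  # [True, False, False, False]
```

Only the first protein passes: the second and third rise above 0.05 after adjustment, and the fourth has a large fold change but weak statistical support.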
Contrasts are required
If --de is enabled but no contrasts are provided (neither --de-contrasts nor --de-contrasts-file), the pipeline raises an error listing available conditions. Use " vs " as the delimiter to support hyphenated condition names.
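The " vs " delimiter matters because a condition such as NASH-F4 contains a hyphen itself. The sketch below shows one way the two contrast sources could be parsed and checked against the known conditions. All three function names are hypothetical and the error wording is invented; only the inline format and the group1/group2 column names come from the options described above.

```python
import csv
import io


def parse_inline_contrasts(text):
    """'A vs B,A vs C' -> [('A', 'B'), ('A', 'C')]."""
    contrasts = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split(" vs ")
        if len(parts) != 2 or not all(p.strip() for p in parts):
            raise ValueError(f"expected 'GROUP1 vs GROUP2', got {item!r}")
        contrasts.append((parts[0].strip(), parts[1].strip()))
    return contrasts


def parse_contrast_tsv(handle):
    """Read contrasts from a TSV with the columns group1 and group2."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["group1"], row["group2"]) for row in reader]


def validate(contrasts, conditions):
    """Reject contrasts that mention a condition missing from the design."""
    unknown = {g for pair in contrasts for g in pair} - set(conditions)
    if unknown:
        raise ValueError(
            f"unknown condition(s) {sorted(unknown)}; available: {sorted(conditions)}"
        )


inline = parse_inline_contrasts("NASH-F4 vs HL,NASH vs HL")
from_file = parse_contrast_tsv(io.StringIO("group1\tgroup2\nNASH\tHL\n"))
validate(inline + from_file, {"NASH", "NASH-F4", "HL"})
print(inline)     # [('NASH-F4', 'HL'), ('NASH', 'HL')]
print(from_file)  # [('NASH', 'HL')]
```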
Tip
--de-method auto chooses deqms for directlfq quantification and limrots for all others. Use proda explicitly when dropout-aware modeling is more appropriate for your matrix. See Differential Expression concepts for a detailed comparison of methods.
Plots and Reports
mokume features2proteins \
-p features.parquet -o proteins.csv -s experiment.sdrf.tsv \
--quant-method maxlfq \
--de --de-contrasts "NASH vs HL" \
--plot-dir plots/ \
--plot-volcano --plot-heatmap --plot-pca \
--highlight-genes "COL10A1,FN1,ALB" \
--interactive-report --report-output qc_report.html
Exporting Intermediate Data
# Export normalized peptides and ions
mokume features2proteins \
-p features.parquet -o proteins.csv \
--quant-method directlfq \
--export-peptides peptides.csv \
--export-ions ions.csv
Full Example
A complete TMT multi-plex analysis:
mokume features2proteins \
-p features.parquet \
-o proteins.csv \
-s experiment.sdrf.tsv \
--quant-method median \
--run-normalization median \
--sample-normalization globalMedian \
--min-unique 2 \
--remove-contaminants \
--irs --irs-remove-reference \
--batch-correction --batch-method sample_prefix \
--de --de-contrasts "NASH vs HL" --de-method limrots --de-fdr-method ihw \
--plot-dir plots/ --plot-volcano --plot-pca \
--interactive-report --report-output qc_report.html
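When a command of this length is launched from a script, assembling the argument list programmatically avoids shell quoting mistakes, for example around a contrast that contains spaces. The helper below is a hypothetical convenience wrapper, not part of mokume: it only turns keyword arguments into the flags documented on this page and assumes the mokume executable is on PATH.

```python
import shlex


def build_features2proteins_command(parquet, output, sdrf=None, **options):
    """Build an argv list; keyword names map to flags (de_contrasts -> --de-contrasts)."""
    command = ["mokume", "features2proteins", "-p", parquet, "-o", output]
    if sdrf is not None:
        command += ["-s", sdrf]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            command.append(flag)  # boolean switch such as --irs
        elif value is not False and value is not None:
            command += [flag, str(value)]
    return command


command = build_features2proteins_command(
    "features.parquet", "proteins.csv", sdrf="experiment.sdrf.tsv",
    quant_method="median", irs=True, irs_remove_reference=True,
    de=True, de_contrasts="NASH vs HL",
)
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```

Passing the list directly to `subprocess.run` bypasses the shell, so the contrast string arrives as a single argument without extra quoting.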