Normalization¶
Normalization corrects systematic biases so that intensity differences between samples reflect true biological variation rather than technical artifacts.
mokume applies normalization at two levels: run-level (within samples) and sample-level (across samples).
Pipeline Overview¶
```mermaid
graph LR
    A[Raw Features] --> B[Run Normalization]
    B --> C[Peptidoform Aggregation]
    C --> D[Sample Normalization]
    D --> E[Protein Quantification]
    style B fill:#e8eaf6
    style D fill:#e8eaf6
```
Run-Level Normalization¶
Run normalization (`--run-normalization`) adjusts for intensity differences between technical replicates within each sample. It is applied only when `technical_replicates > 1`.
| Method | Description | Formula |
|---|---|---|
| `median` | Normalize by median | `intensity / median(intensity)` |
| `mean` | Normalize by mean | `intensity / mean(intensity)` |
| `max` | Normalize by max | `intensity / max(intensity)` |
| `global` | Normalize by sum | `intensity / sum(intensity)` |
| `max_min` | Min-max scaling | `(intensity - min) / (max - min)` |
| `iqr` | Interquartile range | Uses IQR for scaling |
| `none` | No normalization | — |
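The scaling rules in the table above can be sketched as a single dispatch function. This is an illustrative NumPy sketch, not mokume's internal API; the function name and the exact IQR variant are assumptions:

```python
import numpy as np

def run_normalize(intensity: np.ndarray, method: str = "median") -> np.ndarray:
    """Scale intensities within one technical run (illustrative sketch)."""
    if method == "median":
        return intensity / np.median(intensity)
    if method == "mean":
        return intensity / np.mean(intensity)
    if method == "max":
        return intensity / np.max(intensity)
    if method == "global":
        return intensity / np.sum(intensity)
    if method == "max_min":
        lo, hi = np.min(intensity), np.max(intensity)
        return (intensity - lo) / (hi - lo)
    if method == "iqr":  # one possible IQR scaling; mokume's exact formula may differ
        q1, q3 = np.percentile(intensity, [25, 75])
        return intensity / (q3 - q1)
    return intensity  # "none"

x = np.array([10.0, 20.0, 30.0, 40.0])
run_normalize(x, "median")  # -> array([0.4, 0.8, 1.2, 1.6])
```

After `median` normalization the replicate's median intensity is exactly 1, which makes replicates directly comparable before peptidoform aggregation.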
Sample-Level Normalization¶
Sample normalization (`--sample-normalization`) adjusts for systematic differences across samples. These methods fall into two categories:
Per-Sample Methods¶
Applied during data loading, one sample at a time:
| Method | Description |
|---|---|
| `globalMedian` | Scales each sample by the ratio of its median to the global median |
| `conditionMedian` | Same as `globalMedian`, but computed within each experimental condition |
| `none` | No normalization |
Dataset-Level Methods¶
Applied after all samples are loaded, operating on the complete dataset:
| Method | Description |
|---|---|
| `hierarchical` | DirectLFQ-style hierarchical clustering normalization |
| `tmm` | Trimmed Mean of M-values (Robinson & Oshlack, 2010) |
**When to use hierarchical normalization:** use `--sample-normalization hierarchical` when you want DirectLFQ-style normalization combined with a different quantification method (e.g., iBAQ). This gives you the normalization quality of DirectLFQ with the quantification approach of your choice.
Global Median¶
The default method. For each sample, computes:
$$\text{normalized} = \frac{\text{intensity}}{\text{sample\_median} / \text{global\_median}}$$
This ensures all samples have comparable median intensities.
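In code, the formula amounts to the following sketch (not mokume's implementation; the function name and the dict-of-arrays layout are assumptions for illustration):

```python
import numpy as np

def global_median_normalize(samples: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Scale each sample so its median matches the global median (sketch)."""
    # Global median is taken over all intensities pooled across samples.
    global_median = np.median(np.concatenate(list(samples.values())))
    return {
        name: x / (np.median(x) / global_median)  # intensity / (sample_median / global_median)
        for name, x in samples.items()
    }
```

After this step every sample has the same median intensity, so between-sample fold changes are no longer dominated by differences in overall loading or signal.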
Hierarchical Normalization¶
Uses the DirectLFQ hierarchical clustering approach (Ammar et al., 2023) implemented natively in mokume:
1. Convert intensities to the log2 scale
2. Align samples using variance-guided pairwise normalization
3. Convert back to the linear scale
You can optionally specify a set of proteins to use for normalization:
```bash
mokume features2proteins -p data.parquet -o out.csv \
    --sample-normalization hierarchical \
    --normalization-proteins housekeeping_proteins.txt
```
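The log2-align-invert pattern can be illustrated with a deliberately simplified toy: real DirectLFQ-style normalization aligns samples pairwise, guided by between-sample variance, whereas this sketch merely shifts each sample's log2 median onto the first sample's. Everything here (names, matrix layout with rows as features and columns as samples) is assumed for illustration:

```python
import numpy as np

def hierarchical_sketch(matrix: np.ndarray) -> np.ndarray:
    """Toy log-space alignment (NOT the real variance-guided algorithm)."""
    log2 = np.log2(matrix)                                   # 1) to log2 scale
    shifts = np.median(log2, axis=0) - np.median(log2[:, 0]) # per-sample offset
    aligned = log2 - shifts                                  # 2) align samples (simplified)
    return 2.0 ** aligned                                    # 3) back to linear scale
```

Working in log space turns multiplicative intensity biases into additive shifts, which is why the alignment step reduces to subtracting a per-sample offset.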
TMM Normalization¶
Trimmed Mean of M-values computes normalization factors robust to composition bias from highly abundant proteins. Based on Robinson & Oshlack (2010).
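A much-simplified TMM sketch follows: it trims extreme log-ratios (M) and log-abundances (A) before averaging, but omits the precision weighting used by the reference edgeR implementation. All names and the default trim fractions here are illustrative assumptions:

```python
import numpy as np

def tmm_factor(sample: np.ndarray, reference: np.ndarray,
               m_trim: float = 0.30, a_trim: float = 0.05) -> float:
    """Simplified trimmed mean of M-values: scaling factor of `sample`
    relative to `reference`, robust to a few highly abundant proteins."""
    mask = (sample > 0) & (reference > 0)
    # Library-size normalize so M reflects composition, not total signal.
    s = sample[mask] / sample.sum()
    r = reference[mask] / reference.sum()
    m = np.log2(s / r)            # M: per-feature log fold changes
    a = 0.5 * np.log2(s * r)      # A: per-feature average log abundances
    keep = (
        (m > np.quantile(m, m_trim)) & (m < np.quantile(m, 1 - m_trim)) &
        (a > np.quantile(a, a_trim)) & (a < np.quantile(a, 1 - a_trim))
    )
    return float(2.0 ** np.mean(m[keep]))  # linear normalization factor
```

Trimming both tails of M and A is what makes the factor insensitive to a handful of very abundant, sample-specific proteins that would otherwise skew a plain mean.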
DirectLFQ Mode¶
**DirectLFQ handles its own normalization:** when using `--quant-method directlfq`, mokume delegates all processing (normalization and quantification) to the DirectLFQ package. The `--run-normalization` and `--sample-normalization` options are ignored.