Next Generation Sequencing Data Normalization

ChIP-seq peak score normalization

A recent review, Beyond library size: a field guide to NGS Normalization, published last week, nicely summarizes the effect that the choice of normalization technique can have on the number of genes called in differential expression experiments and the number of peaks called in ChIP-seq. The article emphasizes a point we frequently try to convey to researchers beginning the analysis of their NGS data sets: the appropriate normalization method depends on the data being analyzed, the experimental conditions and how the experiment was performed. It highlights several points researchers must take into consideration during normalization:

  1. Library size
  2. Technical variation amongst samples
  3. Biases during sequencing, e.g. longer fragments are sampled more frequently
  4. Preferential enrichment of specific sequences (ChIP-seq)

The challenge is to account for these factors without masking real biological differences between samples.

Three popular normalization methods for RNA-seq data were compared: RPKM, library size (total count) scaling and DESeq scaling factors. In terms of differentially expressed genes, DESeq and library size normalization identified more than 90% of the same genes, while RPKM identified a smaller number of genes. As expected, however, the genes identified with RPKM more closely matched the length distribution of genes in the genome, while DESeq and library size normalization were biased toward longer genes.
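To make the difference between the three approaches concrete, here is a minimal sketch in Python with NumPy. It is not the code used in the review: the function names, the toy count matrix and the gene lengths are hypothetical, and the median-of-ratios function is a simplified stand-in for the scaling factors that DESeq estimates.

```python
import numpy as np

def total_count_factors(counts):
    """Library size (total count) scaling factors, one per sample (column)."""
    lib_sizes = counts.sum(axis=0)
    return lib_sizes / lib_sizes.mean()

def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase of gene per million mapped reads."""
    lib_sizes = counts.sum(axis=0)
    return counts * 1e9 / (gene_lengths_bp[:, None] * lib_sizes[None, :])

def median_of_ratios_factors(counts):
    """DESeq-style size factors: median ratio of each sample to the
    per-gene geometric mean across samples (simplified)."""
    expressed = (counts > 0).all(axis=1)  # skip genes with a zero in any sample
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Hypothetical toy data: 4 genes x 2 samples, sample 2 sequenced twice as deeply.
counts = np.array([[10, 20], [100, 200], [1000, 2000], [50, 100]], dtype=float)
gene_lengths_bp = np.array([1000, 2000, 500, 1000], dtype=float)

tc = total_count_factors(counts)
sf = median_of_ratios_factors(counts)
rpkm_values = rpkm(counts, gene_lengths_bp)

# Dividing by the per-sample factors puts both samples on a common scale.
tc_normalized = counts / tc
deseq_normalized = counts / sf
```

In this toy case all three methods remove the twofold depth difference. Only RPKM also divides by gene length, which is consistent with its results tracking gene length differently from the two count-scaling methods.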

For ChIP-seq, total count (TC)-based scaling, SPP and NCIS were tested on Drosophila embryonic segmentation transcription factors. The authors found that TC-based scaling between a ChIP sample and its matched input raised false discovery rates. The final set of peaks called also depends on whether SPP or MACS is used as the peak caller. Both rely on the lag between reads mapped to the + and – strands flanking protein-bound DNA regions, which can be estimated by cross-correlation. Background models are then used to remove noise arising from the sample itself or from GC content and mappability, before peaks are finally called above a user-defined signal-to-noise ratio.
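The strand lag mentioned above can be illustrated with a small sketch. This is an assumption-laden toy, not the implementation in SPP or MACS: the function name, the simulated coverage vectors and the 150 bp shift are all hypothetical. It correlates the + strand read-start profile with the – strand profile at increasing offsets and reports the offset with the highest correlation.

```python
import numpy as np

def strand_cross_correlation(plus, minus, max_lag):
    """Pearson correlation between + strand coverage and - strand coverage
    shifted back by each lag from 0 to max_lag (inclusive)."""
    lags = np.arange(max_lag + 1)
    cc = np.empty(len(lags))
    n = len(plus)
    for i, lag in enumerate(lags):
        a = plus[:n - lag]   # + strand signal at position x
        b = minus[lag:]      # - strand signal at position x + lag
        cc[i] = np.corrcoef(a, b)[0, 1]
    return lags, cc

# Hypothetical simulated data: the - strand profile is a copy of the + strand
# profile displaced 150 bp downstream, mimicking reads from the two ends of
# fragments around a bound site.
rng = np.random.default_rng(0)
plus = rng.poisson(2.0, size=5000).astype(float)
minus = np.roll(plus, 150)

lags, cc = strand_cross_correlation(plus, minus, max_lag=300)
estimated_lag = int(lags[np.argmax(cc)])
```

With real data the correlation peak is broader and noisier than in this simulation, so the estimated lag is best treated as an approximation of the typical strand offset.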

Most importantly, before starting the analysis of any dataset, it’s good practice to examine the biases present and choose a normalization method that counters them. The article concludes that correct experimental design should be the first step in countering the biases inherent in a technique, a point we believe can’t be emphasized enough!

If you’re just getting started, connect with a service provider with expertise in data normalization.