Next-generation sequencing technology has been advancing at an incredibly rapid rate; what started as only genome sequencing now encompasses an incredible amount of RNA sequencing techniques as well. These range from standard RNA-seq, to miRNA-seq, Ribo-seq, to HITS-CLIP (high-throughput sequencing of RNA isolated by crosslinking immunoprecipiation). While these technological advances are now widely used (and have been invaluable to the scientific community), they are not fully mature technologies and we are still learning about potential artifacts that may arise and how to combat them; mispriming events are a significant and under-studied contributor to errors in sequencing data.
What is a mispriming event?
Reverse transcription is an important part of any RNA-sequencing technique. The RNA in question is first converted into cDNA, which is then PCR amplified and converted in a library from there (there are various methods for library preparation, depending on what kind of technique you are using). However, the conversion of RNA into cDNA by reverse transcriptase requires a DNA primer to start the process. This primer is complementary to the RNA, binding to it and allowing for reverse transcription to take place. A mispriming event is when this process occurs at a place where the DNA primer is not perfectly complementary to the RNA.
Two recent papers have highlighted how reverse transcription mispriming events can have a considerable impact on the library preparation process and result in error. Gurp, McIntyre and Verhoeven  conducted an RNA-seq experiment focusing on reads that mapped to ERCC spikes (artificial and known RNA fragments that are added to RNA-seq experiments as a control). Because the sequence of these ERCC spikes is already known, detecting mismatches in the sequences is relatively straightforward.
Their findings were striking: they found that 1) RNA-to-DNA mispriming events were the leading cause of deviations from the true sequence (as opposed to DNA-to-DNA mispriming events that can occur later on in the library preparation process), and 2) these mispriming events are non-random and indeed show specific and predictable patterns. For example, if the first nucleotide of an RNA-seq read starts with A or T, rA-dC and rU-dC mispriming events are common. In positions 2 – 6, rU-dG and rG-dT are also quite common, which lines up with the observation that these are the most stable mismatched pairs . Needless to say, these kind of mispriming events can cause huge issues for various type of downstream analysis, particularly identification of SNPs and RNA-editing sites; eliminating these biases will be extremely important for future experiments (Figure 1).
As of right now, we do not have good, sophisticated methods of eliminating these types of mispriming events from our datasets. Eliminating the first 10 bases of reads will solve the problem, but will also involve throwing out real data with the artifacts. Given the fact that these mispriming events do follow predictable patterns, it is possible that in the future, we could devise programs to identify and correct mispriming events, or even modify hexamer design to exclude ones that result in frequent mispriming.
Figure 1: Common base mismatches and their locations 
Frustratingly, mispriming events can occur even when the priming oligo is quite lengthy. HITS-CLIP has been greatly important in discovering many protein-RNA interactions ; however, a recent paper published by Gillen et al.  demonstrated that mispriming events even with a long DNA primer can create a significant artifact, creating read pileups that align to the genomic occurrences of the adaptor sequence, making it appear as though there are protein-RNA interactions occurring at that locus.
Part of HITS-CLIP library preparation involves attachment of a 3’ RNA adaptor to the protein bound RNA. A DNA oligo perfectly complementary to this RNA sequence serves as the primer for conversion of this RNA into cDNA, and it is this DNA oligo that leads to significant mispriming events. Although the DNA primer is long enough to be extremely specific, sequences that are complementary to only the last 6 nucleotides of the primer are still enough to result in a mispriming event, which converts alternative RNAs into cDNAs that eventually get amplified in the library.
Gillen et al. analyzed 44 experiments from 17 research groups, and showed that the adaptor sequence was overrepresented by 1.5-fold on average–and sometimes as high as 6-fold (Figure 2)!
Figure 2: Over-representation of DNA primer sequences can be found in multiple datasets from different groups, indicating the possibility of a widespread problem.
And since only 6 complementary nucleotides are needed to result in a mispriming event, how can we eliminate this artifactual data?
Gillen et al. devised an ingenious and simple method of reducing this artifact by using a nested reverse transcription primer (Figure 3). By ‘nested primer’, they are referring to a primer that is not perfectly complementary to the 3’ adaptor, but rather stops 3 nucleotides short of being fully flush with the adaptor. This, combined with a full-length PCR primer (that is, flush with the adaptor sequence) with a ‘protected’ final 3 nucleotides (note: in this instance, ‘protected’ mean usage of phosphorothioate bonds in the final 3 oligo bases, which prevents degradation by exonucleases. Without this protective bond, the mispriming artifact is simply shifted downstream 3 bases), is enough to almost completely eliminate mispriming artifacts. This allows for significantly improved library quality and increased sensitivity!
Figure 3: A nested reverse transcription primer combined with a protected PCR primer can eliminate sequencing artifacts almost entirely.
Although we have been working with sequencing technologies for many years now, we still have a lot to discover about hidden artifacts in the data. It’s becoming increasingly important to stay aware of emerging discoveries of these biases and make sure we are doing everything we can to eliminate this from our data.
Have you ever had an experience with sequencing artifacts in your data? Tell us in the comments!