Sanger Sequencing Turns 40: Retrospectives and Perspectives on DNA Sequencing Technologies

Retrospective: What have we accomplished with DNA sequencing so far?

Sanger wasn’t the first person to attempt sequencing, but before his classic method was invented, the process was painfully slow and cumbersome. Before Sanger, Gilbert and Maxam sequenced 24 bases of the lactose-repressor binding site by copying it into RNA and sequencing the RNA–which took a total of 2 years [1]!

Sanger’s method made the process much more efficient. Original Sanger sequencing took a ‘sequencing by synthesis’ approach, running four extension reactions, each with a different radioactively labelled chain-terminating nucleotide, to identify which base lay at each position along a DNA fragment. When he ran each of those reactions out on a gel, it became relatively simple to read off the sequence of the DNA fragment (see Figure 1) [2].


Figure 1: Gel from the paper that originally described Sanger sequencing.

Of course, refinements have been made to the process since then. We now label each of the nucleotides with a different fluorescent dye, which allows the same process to occur using only one extension reaction instead of four, greatly simplifying the protocol. Sanger received his second Nobel Prize for this method in 1980 (well-deserved, considering it is still used today).

An early version of the Human Genome Project (HGP) began not long after, in 1987. The project was created by the United States Department of Energy, which was interested in obtaining a better understanding of the human genome and how to protect it from the effects of radiation. A more formalized version of this project was approved by Congress in 1988, and a five-year plan was submitted in 1990 [3]. The basic protocol for the HGP emerged as follows: large DNA fragments were cloned into bacterial artificial chromosomes (BACs), which were then fragmented, size-selected, and sub-cloned. The purified DNA was then used for Sanger sequencing, and individual reads were assembled based on overlaps between them.
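The overlap-based assembly step above can be illustrated with a toy sketch: a greedy merge of reads by their longest suffix-prefix overlap. The reads below are made up, and the real HGP assembly pipeline was vastly more sophisticated; this only shows the principle.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` matching a prefix of `b`."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no overlaps left; nothing more can be merged
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Three overlapping reads reconstruct the original fragment
print(greedy_assemble(["ATGCGTAC", "GTACCTTA", "CTTAGGAC"]))
# ['ATGCGTACCTTAGGAC']
```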

Given how large the human genome is, and the limitations of Sanger sequencing, it quickly became apparent that more efficient and better technologies were necessary, and indeed, a significant part of the HGP was dedicated to creating these technologies. Several advancements in both wet-lab protocol and data analysis pipelines were made during this time, including the advent of paired-end sequencing and the automation of quality metrics for base calls.

Due to the relatively short length of the reads produced, the highly repetitive parts of the human genome (such as centromeres, telomeres and other areas of heterochromatin) remained intractable to this sequencing method. Despite this, a draft of the human genome was published in 2001, with a finished sequence following in 2004–all for the low, low cost of $2.7 billion.

Since then, there have been many advancements in the process of DNA sequencing, but the most important of these is multiplexing. Multiplexing involves tagging each sample with a specific DNA barcode, which allows us to sequence multiple samples in one reaction tube, vastly increasing the amount of data we can obtain per sequencing run. Interestingly, the most frequently used next-generation sequencing method today (the Illumina platforms) still uses the basics of Sanger sequencing (i.e., detection of fluorescently labelled nucleotides), combined with multiplexing and a process called bridge amplification, to sequence hundreds of millions of reads per run.
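Demultiplexing makes the barcoding idea concrete: after sequencing, each read is binned back to its sample by its leading barcode. The barcodes and reads below are hypothetical (real kits use curated barcode sets with large edit distances between them); this is just a minimal sketch of the principle.

```python
# Hypothetical 6-bp sample barcodes
BARCODES = {"ACGTAC": "sample_A", "TGCAGT": "sample_B", "GATCGA": "sample_C"}

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(reads, barcodes=BARCODES, max_mismatches=1):
    """Assign each read to a sample by its leading barcode, tolerating up
    to `max_mismatches` sequencing errors; everything else is binned as
    'undetermined'."""
    bins = {name: [] for name in barcodes.values()}
    bins["undetermined"] = []
    blen = len(next(iter(barcodes)))
    for read in reads:
        tag, insert = read[:blen], read[blen:]
        hits = [name for bc, name in barcodes.items()
                if hamming(tag, bc) <= max_mismatches]
        if len(hits) == 1:
            bins[hits[0]].append(insert)
        else:
            bins["undetermined"].append(insert)  # ambiguous or unmatched
    return bins

# ACGTAC is exact, ACGAAC has one error, CCCCCC matches nothing
bins = demultiplex(["ACGTACTTTT", "ACGAACGGGG", "CCCCCCAAAA"])
```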


Figure 2: Cost of WGS has decreased faster than we could have imagined.

Rapid advancements in genome sequencing since 2001 have greatly decreased the cost of sequencing, as you can see in Figure 2 [4]. We are quickly approaching sequencing of a human genome for less than $1,000.

What are we doing with sequencing today?

Since the creation of next-generation DNA sequencing, scientists have continued to utilize this technology in increasingly complex and exciting new ways. RNA-sequencing, which involves isolating RNA from an organism, converting it into cDNA, and then sequencing the resulting cDNA, was invented shortly after the advent of next-generation sequencing and has since become a staple of the molecular biology and genetics fields. ChIP-seq, Ribo-Seq, RIP-seq, and methyl-seq followed and have all become standard experimental protocols as well. In fact, as expertly put by Shendure et al. (2017), ‘DNA sequencers are increasingly to the molecular biologist what a microscope is to the cellular biologist–a basic and essential tool for making measurements. In the long run, this may prove to be the greatest impact of DNA sequencing.’ [5] In my own experience, utilizing these methods in ways that complement each other (like cross-referencing ChIP-seq or Ribo-Seq data with RNA-seq data) can produce some of the most exciting scientific discoveries.


Figure 3: Model of the MinION system.

Although Illumina sequencing still reigns supreme on the market, there are some up-and-coming competitor products as well. Of great interest is the MinION from Oxford Nanopore Technologies (ONT). The MinION offers a new type of sequencing that provides something the Illumina platforms lack–the ability to sequence long stretches of DNA, which is of enormous value when sequencing through highly repetitive regions. The MinION works via a process called nanopore sequencing, in which voltage is applied across hundreds of small protein pores. At the top of each pore sits an enzyme that processively unwinds DNA down through the pore, causing a disruption in the current that can be measured at the nucleotide level (see Figure 3) [6]. These reads can span thousands of base pairs, orders of magnitude longer than those of the Illumina platforms, which greatly simplifies genome assembly. Other new options for long-read sequencing include the PacBio system from Pacific Biosciences.

Like any new technology, there have been setbacks. The early accuracy of MinION flow cells was quite low compared with Illumina, and so was their output. And although these issues have largely been addressed, the MinION still trails the Illumina platforms in the market, which are seen as more reliable and better characterized. However, the MinION has several advantages that could eventually lead to wider use: for one, it literally fits in the palm of your hand, making it much more feasible for people like infectious disease researchers, who are in desperate need of sequencing capabilities in remote locales. It’s fast as well; in one example, a researcher in Australia was able to identify antibiotic resistance genes in cultured bacteria in 10 hours [7]–an absolutely incredible feat that couldn’t have been imagined until very recently. This kind of technology could easily be used in hospitals to assist in identifying appropriate patient treatments, hopefully within a few years.

Although we are not regularly able to utilize sequencing technology for medical treatments as of yet, there are a few areas where this is currently happening. Detecting Down’s syndrome in a fetus during pregnancy used to be a much more invasive process, but with improvements in sequencing technology, new screens have been invented that allow for the detection of chromosomal abnormalities circulating in the maternal blood [8]. Millions of women have already benefitted from this improved screen.

Perspective: What does the future of DNA sequencing hold?

As the Chinese philosopher Lao Tzu said, ‘Those who have knowledge, don’t predict’, and that’s as true as ever when it comes to DNA sequencing technology. We’re capable today of things we couldn’t even have dreamed of 40 years ago, so who knows where we’ll be in the next 40 years?

But as a scientist, I’ve always enjoyed making educated guesses, so here are some limited predictions about what the future might hold.

Clinical applications: I’ve never been a fan of the term personalized medicine, since it implies that one day doctors will be able to design individual treatments for each patient’s specific illness. I find this scenario unlikely (at least in the near future), because even though the cost and time of DNA sequencing have decreased by astonishing amounts, sequencing is still expensive and time-consuming enough that routine clinical use seems unlikely (to say nothing of the cost and time of developing new drug regimens). However, I have high hopes for the future of precision medicine, particularly in cancer treatment. Although we may never be able to design the perfect drug targeted to one individual’s cancer, we can certainly create drugs designed to interact with the mutations we frequently observe in cancers. This could allow for a more individualized drug regimen for patients. Given that cancer is a disease with such extremely wide variation, we will almost certainly need to start taking a more targeted approach to its treatment, and genome sequencing will be of great benefit to us in this regard.

A fully complete human genome: As I mentioned previously, one drawback of Illumina sequencing is that it is not capable of sequencing across highly repetitive regions, and unfortunately, large swaths of the human genome are highly repetitive. As such, while we have what is very close to a complete human genome, we do not have the full telomere-to-telomere sequence down as of yet. However, with the new long-read technologies that are currently being implemented, the day when this will be completed is likely not far off.

A complete tapestry of human genetic variation: Millions of people have already had their genomes sequenced to some degree (I’m one of them! Any others?), and millions more are sure to come. Widespread genome re-sequencing could one day allow us to have a full catalog of virtually every heterozygous gene variant in the world, which could allow for an even greater understanding of the connection between our genetics and specific traits.

Faster and better data analysis: Data analysis is probably the biggest bottleneck we’re currently experiencing when it comes to DNA sequencing. There is a seemingly infinite amount of data out there and, unfortunately, a finite number of people who are capable of and interested in analyzing it. As these technologies mature and become more established, new and better data analysis pipelines will be created, speeding up analysis time and increasing our understanding of the data. Hopefully, one day even scientists with only moderate technical savvy will be capable of performing their own data analysis.

I’m certain the future of DNA sequencing will also hold things that I can’t even imagine. It’s an amazing time to be a scientist right now, as researchers are continuously discovering new technologies, and finding ways to put our current technologies to even more interesting uses.

What do you think the next big thing in DNA sequencing will be? Tell us in the comments!

RIN Numbers: How they’re calculated, what they mean and why they’re important

High-quality starting material is an important part of ensuring that your sequencing data is reliable and replicable. For RNA-seq data, this means using RNA that has a high RIN (RNA Integrity Number), a standardized score on a scale from 1 to 10 that indicates to researchers the quality of their RNA, removing individual bias and interpretation from the process.

The RIN is a significant improvement over the way that RNA integrity was previously assessed: the 28S:18S ratio. Because 28S rRNA is approximately 5 kb and 18S rRNA is approximately 2 kb, the ideal 28S:18S ratio is 2.7:1–but the benchmark is considered to be about 2:1. However, this measurement relies on the assumption that the quality of rRNA (a very stable molecule) linearly reflects the quality of mRNA, which is actually much less stable and experiences higher turnover [1].
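As a quick illustration of the old approach, the 28S:18S ratio falls straight out of the electropherogram peak areas. The areas below are toy values, and the quality cutoff is only the rough 2:1 benchmark mentioned above:

```python
def rrna_ratio(area_28s, area_18s):
    """28S:18S mass ratio from electropherogram peak areas
    (areas assumed proportional to RNA mass)."""
    return area_28s / area_18s

def assess(ratio, benchmark=2.0, tolerance=0.3):
    """Crude call against the ~2:1 benchmark."""
    return "acceptable" if ratio >= benchmark - tolerance else "possibly degraded"

# Toy peak areas from a hypothetical trace
ratio = rrna_ratio(area_28s=410.0, area_18s=200.0)
print(f"28S:18S = {ratio:.2f}:1 -> {assess(ratio)}")
# 28S:18S = 2.05:1 -> acceptable
```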


Figure 1: RNA traces of RNA samples with different RIN values. Note the difference between high and low quality samples.

Fortunately, Agilent Technologies has developed a better method: the RIN value, calculated by a sophisticated algorithm. RIN is an improvement in that it takes into account the entirety of the RNA sample, not just the rRNA measurements, as you can see in Figure 1 [2].

The importance of RNA integrity for gene expression measurements was examined by Chen et al. [3] in 2014 by comparing RNA samples at four different RIN values (from 4.5 to 9.4) across 3 different library preparation methods (poly-A selected, rRNA-depleted, and total RNA), for a total of 12 samples. They then calculated the correlation coefficient of gene expression between the highest quality RNA and the more degraded samples for each library preparation method.
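The comparison boils down to a Pearson correlation between expression vectors from an intact and a degraded sample. A minimal sketch, using hypothetical log-expression values for six genes:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical log-expression values: an intact (RIN 9.4) sample
# vs. a degraded (RIN 4.5) one; the values track closely, so r ~ 1
intact   = [5.1, 2.3, 8.7, 4.0, 6.6, 3.2]
degraded = [4.9, 2.6, 8.1, 4.3, 6.2, 3.5]
print(round(pearson_r(intact, degraded), 3))
```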


Figure 2: Only poly-A selected RNA library preparations experience a decrease in data quality with a decrease in RIN value.

Fascinatingly, the only library preparation method that showed a significant decrease in the correlation between high quality and low quality RNA was the poly-A selected library preparation method. The other two library preparation methods still had correlation coefficients of greater than 0.95 even at low RINs (see Figure 2 [3])!

Chen et al. theorize that the reason behind this is that poly-A selection of degraded samples results in an increasingly 3′-biased library, so that valuable reads are lost from your data. Because the other methods involve either no treatment or rRNA removal (as opposed to selection), there is considerably less bias in the overall sample.

Even though it seems as though only the poly-A selected library preparation method suffers from a low RIN, providers still prefer to work with relatively high-quality RNA samples for all library preparation methods. However, if you do have important samples with lower RINs, it may still be worth discussing your options with a provider directly–and we at Genohub are more than happy to help facilitate your discussions! Please contact us if you have any further questions about sequencing samples with poor RIN.

How mispriming events could be creating artifacts in your library prep (and what you can do to prevent it)

Next-generation sequencing technology has been advancing at an incredibly rapid rate; what started as only genome sequencing now encompasses an incredible array of RNA sequencing techniques as well, ranging from standard RNA-seq and miRNA-seq to Ribo-seq and HITS-CLIP (high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation). While these technological advances are now widely used (and have been invaluable to the scientific community), they are not fully mature technologies, and we are still learning about potential artifacts that may arise and how to combat them; mispriming events are a significant and under-studied contributor to errors in sequencing data.

What is a mispriming event?

Reverse transcription is an important part of any RNA-sequencing technique. The RNA in question is first converted into cDNA, which is then PCR-amplified and converted into a library (the details of library preparation vary depending on the technique you are using). However, the conversion of RNA into cDNA by reverse transcriptase requires a DNA primer to start the process. This primer is complementary to the RNA, binding to it and allowing reverse transcription to take place. A mispriming event occurs when this process starts at a place where the DNA primer is not perfectly complementary to the RNA.

Two recent papers have highlighted how reverse transcription mispriming events can have a considerable impact on the library preparation process and introduce error. Van Gurp, McIntyre and Verhoeven [1] conducted an RNA-seq experiment focusing on reads that mapped to ERCC spike-ins (artificial RNA fragments of known sequence that are added to RNA-seq experiments as a control). Because the sequence of these ERCC spike-ins is already known, detecting mismatches in the reads is relatively straightforward.

Their findings were striking: they found that 1) RNA-to-DNA mispriming events were the leading cause of deviations from the true sequence (as opposed to DNA-to-DNA mispriming events that can occur later in the library preparation process), and 2) these mispriming events are non-random and indeed show specific and predictable patterns. For example, if the first nucleotide of an RNA-seq read starts with A or T, rA-dC and rU-dC mispriming events are common. In positions 2 to 6, rU-dG and rG-dT mispriming events are also quite common, which lines up with the observation that these are the most stable mismatched pairs [2]. Needless to say, these kinds of mispriming events can cause huge issues for various types of downstream analysis, particularly the identification of SNPs and RNA-editing sites; eliminating these biases will be extremely important for future experiments (Figure 1).
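Because the spike-in sequences are known, detecting such patterns reduces to comparing each read against its reference and tallying mismatch types by position. A minimal sketch with toy reads (a first-position A-to-C mismatch stands in for the rA-dC signature described above):

```python
from collections import Counter

def mismatch_profile(read_pairs):
    """Tally mismatch types (reference base -> read base) by read
    position. Each item pairs a read with the known reference bases at
    its mapped location; positions are 1-based, as in the text."""
    tally = Counter()
    for read, ref in read_pairs:
        for pos, (r, t) in enumerate(zip(read, ref), start=1):
            if r != t:
                tally[(pos, f"{t}->{r}")] += 1
    return tally

# Toy reads vs. a known spike-in reference: two reads carry a
# first-position A->C mismatch, the third is error-free
pairs = [("CGTTAGC", "AGTTAGC"),
         ("CGTTAGC", "AGTTAGC"),
         ("AGTTAGC", "AGTTAGC")]
profile = mismatch_profile(pairs)
print(profile[(1, "A->C")])  # 2
```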


Figure 1: Common base mismatches and their locations [1]

As of right now, we do not have good, sophisticated methods for eliminating these types of mispriming events from our datasets. Trimming the first 10 bases of each read will solve the problem, but at the cost of throwing out real data along with the artifacts. Given that these mispriming events follow predictable patterns, it is possible that in the future we could devise programs to identify and correct them, or even modify hexamer design to exclude hexamers that result in frequent mispriming.

Frustratingly, mispriming events can occur even when the priming oligo is quite lengthy. HITS-CLIP has been instrumental in discovering many protein-RNA interactions [3]; however, a recent paper by Gillen et al. [4] demonstrated that mispriming events can create a significant artifact even with a long DNA primer, producing read pileups that align to genomic occurrences of the adaptor sequence and making it appear as though protein-RNA interactions occur at those loci.

Part of HITS-CLIP library preparation involves attachment of a 3’ RNA adaptor to the protein bound RNA. A DNA oligo perfectly complementary to this RNA sequence serves as the primer for conversion of this RNA into cDNA, and it is this DNA oligo that leads to significant mispriming events. Although the DNA primer is long enough to be extremely specific, sequences that are complementary to only the last 6 nucleotides of the primer are still enough to result in a mispriming event, which converts alternative RNAs into cDNAs that eventually get amplified in the library.
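The six-nucleotide problem can be demonstrated with a simple scan: find every genomic position matching only the 3′ end of the primer, each one a candidate site for an artifactual pileup. Both sequences below are made up for illustration:

```python
def mispriming_sites(genome, primer, seed=6):
    """Find 0-based genomic positions matching only the last `seed`
    nucleotides of a long RT primer: candidate mispriming sites."""
    tail = primer[-seed:]
    sites = []
    start = genome.find(tail)
    while start != -1:
        sites.append(start)
        start = genome.find(tail, start + 1)
    return sites

# Hypothetical adaptor-complementary primer; only its 6-nt 3' end
# ('GAACTC') needs to match the genome for mispriming to occur
primer = "GATCGTCGGACTGTAGAACTC"
genome = "TTTAGAACTCGGGGAGAACTCAA"
print(mispriming_sites(genome, primer))  # [4, 15]
```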

Gillen et al. analyzed 44 experiments from 17 research groups, and showed that the adaptor sequence was overrepresented by 1.5-fold on average–and sometimes as high as 6-fold (Figure 2)!


Figure 2: Over-representation of DNA primer sequences can be found in multiple datasets from different groups, indicating the possibility of a widespread problem. 

And since only 6 complementary nucleotides are needed to result in a mispriming event, how can we eliminate this artifactual data?

Gillen et al. devised an ingenious yet simple method of reducing this artifact: a nested reverse transcription primer (Figure 3). By ‘nested primer’, they mean a primer that is not perfectly complementary to the 3’ adaptor, but instead stops 3 nucleotides short of being fully flush with it. This is combined with a full-length PCR primer (that is, one flush with the adaptor sequence) whose final 3 nucleotides are ‘protected’ by phosphorothioate bonds, which prevent degradation by exonucleases (without this protection, the mispriming artifact is simply shifted 3 bases downstream). Together, these changes almost completely eliminate mispriming artifacts, allowing for significantly improved library quality and increased sensitivity!


Figure 3: A nested reverse transcription primer combined with a protected PCR primer can eliminate sequencing artifacts almost entirely. 

Although we have been working with sequencing technologies for many years now, we still have a lot to discover about hidden artifacts in the data. It’s becoming increasingly important to stay aware of newly discovered biases and to make sure we are doing everything we can to eliminate them from our data.

Have you ever had an experience with sequencing artifacts in your data? Tell us in the comments!

Ribo-Seq: Understanding the effect of translational regulation on protein abundance in the cell

Examining changes in gene expression has become one of the most powerful tools in molecular biology today. However, the correlation between mRNA expression and protein levels is often poor. Thus, being able to identify precisely which transcripts are being actively translated, and the rate at which they are being translated, could be a huge boon to the field and give us more insight into which genes are carried through all the way from the mRNA to the protein level–and Ribo-seq (also known as ribosome profiling) technology gives us just that!

Historic nature of ribosome profiling

Ribo-seq is based upon the much older technique of in vitro ribosome footprinting, which stretches back nearly 50 years and was used by Joan Steitz and Marilyn Kozak in important early studies mapping the locations of translation initiation [1, 2]. Due to the technological limitations of the time, these experiments were performed with cell-free in vitro translation systems. These days, we can actually extract actively translating ribosomes from cells and directly observe their locations on the mRNAs they are translating!


So how does this innovative new technique work? The workflow is actually remarkably simple.

  1. We start by lysing the cells: they are first flash-frozen and then harvested in the presence of cycloheximide (see the explanation under ‘Drawbacks and complications’).
  2. Next, we treat the lysates with RNase I, which digests the parts of the mRNA not protected by a ribosome.
  3. The ribosomes are then separated using a sucrose cushion and centrifugation at very high speeds.
  4. RNA from the ribosome fraction obtained above is then purified with a miRNeasy kit and gel purified to obtain the 26 – 34 nt fraction. These are the ribosome footprints.
  5. From there, the RNA is dephosphorylated and the linker DNA is added.
  6. The hybrid molecule is then subjected to reverse transcription into cDNA.
  7. The cDNA is then circularized, PCR amplified, and then used for deep sequencing.
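The gel size selection in step 4 has a direct in silico counterpart when processing the resulting reads: keep only those whose length falls in the expected footprint range. A minimal sketch:

```python
def footprint_filter(reads, min_len=26, max_len=34):
    """Keep only reads whose length falls in the expected ribosome
    footprint range (26-34 nt), mirroring the gel size selection."""
    return [r for r in reads if min_len <= len(r) <= max_len]

# Toy reads of lengths 20, 28, 30 and 40 nt; only the middle two pass
reads = ["A" * 20, "C" * 28, "G" * 30, "T" * 40]
kept = footprint_filter(reads)
print([len(r) for r in kept])  # [28, 30]
```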

Ribo-seq vs. RNA-seq

Ribosome profiling as a next-generation sequencing technique was developed quite recently by Nicholas Ingolia and Jonathan Weissman [3, 4]. One of their most interesting findings was that there is a roughly 100-fold range of translation efficiency across the yeast transcriptome, meaning that just because an mRNA is very abundant does not mean it is highly translated. They concluded that translation efficiency, which cannot be measured by RNA-seq experiments, is a significant factor in whether or not a gene makes it all the way from mRNA to protein product.
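Translation efficiency in this sense is simply ribosome-footprint density divided by mRNA density for each gene. A minimal sketch using toy counts and hypothetical gene names (RPKM is used here for simplicity; real analyses apply more careful normalization):

```python
def rpkm(counts, lengths_kb, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    scale = total_reads / 1e6
    return {g: counts[g] / lengths_kb[g] / scale for g in counts}

def translation_efficiency(ribo_counts, rna_counts, lengths_kb):
    """TE = ribosome-footprint density / mRNA density, per gene."""
    ribo = rpkm(ribo_counts, lengths_kb, sum(ribo_counts.values()))
    rna = rpkm(rna_counts, lengths_kb, sum(rna_counts.values()))
    return {g: ribo[g] / rna[g] for g in ribo}

# Hypothetical genes: geneB is the more abundant mRNA but carries
# proportionally fewer footprints, so its TE comes out lower
lengths = {"geneA": 1.0, "geneB": 2.0}  # transcript lengths in kb
te = translation_efficiency({"geneA": 800, "geneB": 200},
                            {"geneA": 300, "geneB": 700},
                            lengths)
print(te["geneA"] > te["geneB"])  # True
```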

Additionally, they looked at the correlation between protein abundance (measured by mass spectrometry) and either Ribo-seq or RNA-seq data. They found that Ribo-seq measurements had a much higher correlation with protein abundance than RNA-seq (R² = 0.60 vs. 0.17), meaning that Ribo-seq can actually be a better measure of gene expression (depending on the type of experiment you’re interested in performing).

Of course, there are still significant advantages to RNA-seq over Ribo-seq–Ribo-seq will not be able to capture the expression of non-coding RNAs, for instance. Additionally, RNA-seq is considerably cheaper and easier to perform as of this moment. However, I believe that we are likely to see a trend towards ribosome profiling as this technique becomes more mature.

What else can we learn from ribosome profiling?

Ribosome profiling has already taught us many new things, including:

  • discovering that RNAs previously thought to be non-coding due to their short length are in fact translated, and indeed code for short peptide sequences, the exact functions of which remain unknown [5]
  • detection of previously unknown translated short upstream ORFs (uORFs), which often possess a non-AUG start codon. These uORFs likely help regulate their downstream protein-coding ORFs (as is the case for the GCN4 gene) [6], though it remains to be seen whether that is true for all uORFs or whether they have other, currently unknown functions.
  • determination of the approximate translation elongation rate (330 codons per minute)
  • examples of ribosome pausing or stalling at consecutive proline codons in yeast and other eukaryotes [7, 6]

But who knows what else we will learn in the future? This technique can teach us a lot about how gene expression is regulated at the translational level. Additionally, we can learn a lot about how translation affects various disease states, most notably cancer, since cellular stress will very likely affect both translation rate and regulation.

Drawbacks and complications 

While this technique is extremely powerful, there are a few drawbacks. The most prominent among them is that any attempt to lyse and harvest the cells for this procedure causes a change in the ribosome profile, making the technique particularly vulnerable to artefacts. Researchers often attempt to halt translation before harvesting with a 5-minute incubation with cycloheximide, a drug that blocks translation elongation, to prevent ribosome run-off; however, this can result in an enormous increase in ribosome signal at initiation sites, as ribosomes will still initiate translation and begin to pile up.

The best method of combatting these artefacts is to flash-freeze the cells prior to harvesting, lyse them over dry ice, and then continue the protocol in the presence of cycloheximide. This approach should give the best balance between preventing run-off and preventing excessive ribosome accumulation at initiation sites [8].


Our understanding of the mechanisms involved in the regulation of translation has long been limited by our inability to study it directly. Ribosome profiling now provides a method for us to do just that. We’ve already made huge strides in our understanding of many events in the translation process, including the discovery of hundreds of non-canonical translation initiation sites, as well as the realization that not all ‘non-coding’ RNAs are non-coding after all! I expect that we’ll continue to see this technique applied to new and innovative questions about translation and its role in the cell as the technology matures.

If you’re interested in Ribo-Seq services enter your basic project parameters on Genohub and send us a request. We’ll be happy to help.



6 QC methods post library construction for NGS

After nucleic acid extraction and sample QC, the next step in the NGS workflow is library preparation. NGS libraries are prepared to meet the platform requirements with respect to size, purity, concentration and efficient ligation of adaptors. Assessing the quality of a sequencing library before committing it to a full-scale sequencing run ensures maximum sequencing efficiency, leading to accurate sequencing data with more even coverage.

In this blog post, we list the various ways to QC libraries in order of most stringent to least stringent.

1. qPCR

qPCR is a method of quantifying DNA based on PCR. qPCR tracks target concentration as a function of PCR cycle number to derive a quantitative estimate of the initial template concentration in a sample. As with conventional PCR, it uses a polymerase, dNTPs, and two primers designed to match sequences within a template. For the QC protocol, the primers match sequences within the adapters flanking a sequencing library.

Therefore, qPCR is an ideal method for measuring libraries in advance of generating clusters, because it will only measure templates that carry an adaptor sequence at each end, and only such templates will subsequently form clusters on a flow cell. In addition, qPCR is a very sensitive method of measuring DNA, so dilute libraries with concentrations below the detection threshold of conventional spectrophotometric methods can still be quantified.
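The quantification itself follows from a standard curve: fit Cq against log10 concentration for a dilution series of known standards, then back-calculate the unknown library. The standards below are made-up, idealized values; the slope-to-efficiency relation E = 10^(-1/slope) - 1, with a slope near -3.32 indicating ~100% efficiency, is standard qPCR arithmetic.

```python
import math

def fit_standard_curve(log10_conc, cq):
    """Least-squares fit of Cq = slope * log10(conc) + intercept."""
    n = len(cq)
    mx, my = sum(log10_conc) / n, sum(cq) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(log10_conc, cq))
             / sum((x - mx) ** 2 for x in log10_conc))
    return slope, my - slope * mx

def quantify(cq, slope, intercept):
    """Back-calculate a library's concentration from its Cq."""
    return 10 ** ((cq - intercept) / slope)

# Hypothetical 10-fold dilution standards (pM) and their Cq values
standards = [(math.log10(c), cq) for c, cq in
             [(100.0, 10.0), (10.0, 13.32), (1.0, 16.64), (0.1, 19.96)]]
slope, intercept = fit_standard_curve(*zip(*standards))
efficiency = 10 ** (-1 / slope) - 1      # ~1.0, i.e. ~100% efficient
conc = quantify(15.0, slope, intercept)  # library concentration in pM
```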

KAPA Biosystems’ SYBR FAST Library Quantification Kit for Illumina Sequencing Platforms is commonly used for qPCR-based library QC. This kit measures the absolute number of molecules containing the Illumina adapter sequences, thus providing a highly accurate measurement of the amplifiable molecules available for cluster generation.

2. MiSeq

The MiSeq system uses the same library prep methods and proven sequencing-by-synthesis chemistry as the HiSeq system. Thus, it is ideal for analyzing prepared libraries prior to performing high-throughput sequencing. Performing library quality control (QC) on the MiSeq system before committing a library to a full-scale HiSeq run can save time and money while leading to better sequencing results.

Data generated by the MiSeq system is comparable to other Illumina next-generation sequencing platforms, ensuring a smooth transition from one instrument to another. Based on the individual experimental requirements, metrics obtained from performing simple QC can be used to streamline and improve your sequencing projects.

In only a single day, and using a single library prep method, the MiSeq system can generate detailed QC parameters including cluster density, library complexity, percent duplication, GC bias, and index representation. The MiSeq system can also perform paired-end (PE) sequencing to accurately assess insert size. Library cluster density can likewise be determined and used to predict HiSeq cluster density, maximizing yield and reducing rework.

3. Fluorometric method

Quantifying libraries with a fluorometric method relies on intercalating dyes that bind specifically to DNA or RNA. This method is very precise, as DNA dyes do not bind RNA and vice versa.

The Invitrogen™ Qubit™ Fluorometer is a popular fluorometer that accurately measures DNA, RNA, and protein using the highly sensitive Invitrogen™ Qubit™ quantitation assays. The concentration of the target molecule in the sample is reported by a fluorescent dye that emits a signal only when bound to the target, which minimizes the effects of contaminants—including degraded DNA or RNA—on the result.

4. Automated electrophoresis

Several automated electrophoretic instruments are useful for estimating the size of NGS libraries. The Agilent 2100 Bioanalyzer system provides sizing, quantitation, and purity assessments for DNA, RNA, and protein samples. The Agilent 2200 TapeStation system is a reliable tape-based electrophoresis platform for accurate sizing of generated libraries. The PerkinElmer LabChip GX can be used for DNA and RNA quantitation and sizing via automated capillary electrophoresis. The Qiagen QIAxcel Advanced system fully automates sensitive, high-resolution capillary electrophoresis of up to 96 samples per run and can be used for library QC as well. All of these instruments are accompanied by convenient analysis and data-documentation software that makes the library QC step faster and easier.

5. UV-Visible spectroscopy

A UV-Vis spectrophotometer can be used to analyze spectral absorbance to measure nucleic acid libraries and can differentiate between DNA, RNA, and other absorbing contaminants. However, this method is less accurate than the other QC methods described here and should be paired with one of them to ensure high-quality libraries. Several UV-Vis spectrophotometers are currently available, such as the Thermo Scientific™ NanoDrop™ UV-Vis spectrophotometer, the Qiagen QIAxpert system, and the Shimadzu BioSpec-nano.

6. Bead normalization

Bead-based normalization is the preferred QC method if < 12 libraries are to be QCed; if library yields are less than 15 nM, highly variable, or unpredictable; or if users are working with uncharacterized genomes and are inexperienced with the Nextera XT DNA Library Prep Kit protocol.

During bead-based normalization, DNA is bound to normalization beads and eluted off the beads at approximately the same concentration for each sample. Bead-based normalization enables scientists to bypass time-consuming library quantitation measurements and manual pipetting steps before loading libraries onto the sequencer. Bead-based normalization can provide significant cost and time savings for researchers processing many samples, or for researchers without access to any of the QC instruments listed in the above methods.





Top 3 Sample QC steps prior to library preparation for NGS

Before beginning library preparation for next-generation sequencing, it is highly recommended to perform sample quality control (QC) to check the nucleic acid quantity, purity and integrity. The starting material for NGS library construction might be any type of nucleic acid that is or can be converted into double-stranded DNA (dsDNA). These materials, often gDNA, RNA, PCR amplicons, and ChIP samples, must have high purity and integrity and sufficient concentration for the sequencing reaction.

1. Nucleic Acid Quantification

Measuring the concentration of nucleic acid samples is a key QC step to determine the suitability and amount of nucleic acid available for further processing.

  • Absorbance Method:

A UV-Vis spectrophotometer can be used to analyze spectral absorbance to measure the whole nucleic acid profile and can differentiate between DNA, RNA, and other absorbing contaminants. Different molecules, such as nucleic acids, proteins, and chemical contaminants, absorb light in their own characteristic pattern. By measuring the amount of light absorbed at a defined wavelength, the concentration of the molecules of interest can be calculated. Most laboratories are equipped with a UV-Vis spectrophotometer to quantify nucleic acids or proteins for their day-to-day experiments. Customers can choose from several spectrophotometers currently available, such as the Thermo Scientific™ NanoDrop™ UV-Vis spectrophotometer, the Qiagen QIAxpert system, and the Shimadzu BioSpec-nano.

  • Fluorescence Method:

Fluorescence methods are more sensitive than absorbance, particularly for low-concentration samples, and the use of DNA-binding dyes allows more specific measurement of DNA than spectrophotometric methods. Fluorescence measurements are set at excitation and emission values that vary depending on the dye chosen (Hoechst bis-benzimidazole dyes, PicoGreen® or QuantiFluor™ dsDNA dyes). The concentration of unknown samples is calculated based on comparison to a standard curve generated from samples of known DNA concentration.
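The standard-curve arithmetic described above can be sketched in a few lines. This is a minimal illustration with made-up fluorescence readings; a perfectly linear dye response is assumed, which real assays only approximate:

```python
# Sketch of fluorometric quantitation: fit a linear standard curve
# (signal vs. known DNA concentration), then read unknowns off it.
# All numbers are illustrative, not from any instrument.

def fit_standard_curve(concentrations, signals):
    """Least-squares fit of signal = slope * conc + intercept."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(signals) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(concentrations, signals))
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def concentration_from_signal(signal, slope, intercept):
    return (signal - intercept) / slope

# Hypothetical standards: 0, 5, 10, 20 ng/uL -> signals 2, 52, 102, 202 (a.u.)
slope, intercept = fit_standard_curve([0, 5, 10, 20], [2, 52, 102, 202])
print(round(concentration_from_signal(122, slope, intercept), 2))  # 12.0 ng/uL
```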

The availability of single-tube and microplate fluorometers gives flexibility for reading samples in PCR tubes, cuvettes or multiwell plates and makes fluorescence measurement a convenient modern alternative to the more traditional absorbance methods. The Thermo Scientific (Invitrogen) Qubit™ Fluorometer is one of the most commonly used fluorometers and accurately measures low-concentration DNA, RNA, and protein.


2. Nucleic Acid Purity

Nucleic acid samples can become contaminated by other molecules with which they were co-extracted and eluted during the purification process or by chemicals from upstream applications. Purification methods involving phenol extraction, ethanol precipitation or salting-out may not completely remove all contaminants or chemicals from the final eluates. The resulting impurities can significantly decrease the sensitivity and efficiency of your downstream enzymatic reactions.

  • UV spectrophotometry measurements enable calculation of nucleic acid concentrations based on the sample’s absorbance at 260 nm. The absorbance at 280 nm and 230 nm can be used to assess the level of contaminating proteins or chemicals, respectively. The absorbance ratio of nucleic acids to contaminants provides an estimation of the sample purity, and this number can be used as acceptance criteria for inclusion or exclusion of samples in downstream applications.
  • Contaminants such as RNA, proteins or chemicals can interfere with library preparation and the sequencing reactions. When sequencing DNA, an RNA removal step is highly recommended, and when sequencing RNA, a gDNA removal step is recommended. Sample purity can be assessed following nucleic acid extraction and throughout the library preparation workflow using UV/Vis spectrophotometry. For DNA and RNA samples, the relative abundance of proteins in the sample can be assessed by determining the A260/A280 ratio, which should be between 1.8 and 2.0. Contamination by organic compounds can be assessed using the A260/A230 ratio, which should be higher than 2.0 for DNA and higher than 1.5 for RNA. Next-generation spectrophotometry with the Qiagen QIAxpert system enables spectral content profiling, which can discriminate DNA and RNA from sample contaminants without using a dye.
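As a loose illustration, the acceptance criteria above can be turned into a simple screening function. The thresholds follow the text (A260/A280 of 1.8–2.0; A260/A230 above 2.0 for DNA or 1.5 for RNA), while the example absorbance readings are hypothetical:

```python
# Sketch of a purity screen based on the absorbance ratios described above.
# Thresholds come from the text; sample readings are invented.

def purity_check(a260, a280, a230, nucleic_acid="DNA"):
    ratio_280 = a260 / a280  # protein contamination indicator
    ratio_230 = a260 / a230  # organic-compound contamination indicator
    protein_ok = 1.8 <= ratio_280 <= 2.0
    organics_ok = ratio_230 > (2.0 if nucleic_acid == "DNA" else 1.5)
    return {"A260/A280": round(ratio_280, 2),
            "A260/A230": round(ratio_230, 2),
            "pass": protein_ok and organics_ok}

print(purity_check(1.0, 0.54, 0.45))  # clean DNA: ~1.85 and ~2.22, passes
print(purity_check(1.0, 0.70, 0.45))  # low A260/A280 suggests protein, fails
```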


  • qPCR:

Quantitative PCR, or real-time PCR (qPCR), uses the kinetics of DNA amplification to determine absolute or relative quantities of a known sequence in a sample. By using a fluorescent reporter in the reaction, it is possible to measure DNA generation during the qPCR assay. In qPCR, DNA amplification is monitored at each cycle of PCR. When the DNA is in the log-linear phase of amplification, the amount of fluorescence increases above the background. The point at which the fluorescence becomes measurable is called the threshold cycle (CT) or crossing point. By using multiple dilutions of a known amount of standard DNA, a standard curve can be generated of log concentration against CT. The amount of DNA or cDNA in an unknown sample can then be calculated from its CT value.

qPCR-based assays can accurately qualify and quantify amplifiable DNA in challenging samples. For example, DNA derived from formalin-fixed, paraffin-embedded (FFPE) tissue samples is oftentimes highly fragmented, cross-linked with protein, and has a high proportion of single-stranded DNA, making library preparation steps challenging. For FFPE samples, the Agilent NGS FFPE QC kit enables functional DNA quality assessment of input DNA.
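The standard-curve calculation described above can be sketched as follows. The dilution series and CT values are invented for illustration; a slope near −3.32 corresponds to roughly 100% amplification efficiency:

```python
# Sketch of qPCR absolute quantitation: fit log10(concentration) vs. CT
# for a dilution series of standards, then read an unknown off its CT.
import math

def fit_ct_curve(concs, cts):
    """Least-squares fit of CT = slope * log10(conc) + intercept."""
    xs = [math.log10(c) for c in concs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(cts) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, cts))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def quantify(ct, slope, intercept):
    return 10 ** ((ct - intercept) / slope)

# Hypothetical ten-fold dilution series: 1e6..1e3 copies -> CTs stepping by 3.32
slope, intercept = fit_ct_curve([1e6, 1e5, 1e4, 1e3], [15.0, 18.32, 21.64, 24.96])
print(round(quantify(18.32, slope, intercept)))  # ~1e5 copies
```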

3. Nucleic Acid Integrity (Size distribution)

Along with quantity and purity, size distribution is a critical QC parameter that provides valuable insight into sample quality. Analyzing nucleic acid size informs you about your sample’s integrity and indicates whether the samples are fragmented or contaminated by other DNA or RNA products. Various electrophoretic methods can be used to assess the size distribution of your sample.

  • Agarose Gel Electrophoresis

In this method, a horizontal gel electrophoresis tank with an external power supply, analytical-grade agarose, an appropriate running buffer (e.g., 1X TAE) and an intercalating DNA dye along with appropriately sized DNA standards are required. A sample of the isolated DNA is loaded into a well of the agarose gel and then exposed to an electric field. The negatively charged DNA backbone migrates toward the anode. Since small DNA fragments migrate faster, the DNA is separated by size. The percentage of agarose in the gel will determine what size range of DNA will be resolved with the greatest clarity. Any RNA, nucleotides, and protein in the sample migrate at different rates compared to the DNA so the band(s) containing the DNA will be distinct.
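Band sizing against a ladder can be approximated numerically, since log10(fragment size) is roughly linear in migration distance over a gel's resolving range. A sketch with a hypothetical three-point ladder:

```python
# Sketch: interpolate an unknown band's size from its migration distance
# using log-linear interpolation between ladder bands. Ladder values are
# hypothetical.
import math

def estimate_size(distance, ladder):
    """ladder: list of (migration_distance_mm, size_bp), sorted by distance."""
    for (d1, s1), (d2, s2) in zip(ladder, ladder[1:]):
        if d1 <= distance <= d2:
            frac = (distance - d1) / (d2 - d1)
            # linear interpolation in log10(size)
            log_size = math.log10(s1) + frac * (math.log10(s2) - math.log10(s1))
            return 10 ** log_size
    raise ValueError("distance outside ladder range")

# Hypothetical ladder: 10 mm -> 3000 bp, 20 mm -> 1000 bp, 30 mm -> 300 bp
ladder = [(10, 3000), (20, 1000), (30, 300)]
print(round(estimate_size(25, ladder)))  # ~548 bp
```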


Analyzing PCR amplicons or RFLP fragments confirms the presence of the expected size fragments and alerts you to the presence of any non-specific amplicons. Electrophoresis also helps you assess the ligation efficiency yield for plasmid cloning procedures as well as the efficiency of removal of primer–dimers or other unspecific fragments during sample cleanup.

For complex samples such as genomic DNA (gDNA) or total RNA, the shape and position of the smear from electrophoresis analysis directly correlates with the integrity of the samples. Nucleic acid species of larger size tend to be degraded first and provide degradation products of lower molecular weight. Samples of poor integrity generally have a higher abundance of shorter fragments, while high-quality samples contain intact nucleic acid molecules with higher molecular size.

Eukaryotic RNA samples have unique electrophoretic signatures, which consist of a smear with major fragments corresponding to 28S, 18S and 5S ribosomal RNA (rRNA). These electrophoretic patterns correlate with the integrity of the RNA samples. RNA integrity can be assessed either manually or with automation that employs a dedicated algorithm such as the RNA Integrity Number (RIN), which gives an objective integrity grade to RNA samples on a scale of 1–10. RNA samples of highest quality usually have a score of 8 or above.
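The RIN algorithm itself is proprietary, but a classical manual proxy for integrity is the 28S/18S rRNA peak-area ratio, with a ratio near 2.0 usually taken to indicate intact eukaryotic RNA. A deliberately crude sketch (not a RIN calculation; peak areas and threshold are illustrative):

```python
# Crude intactness check on the 28S/18S rRNA peak-area ratio alone.
# This is a simplification and NOT the RIN algorithm.

def rrna_ratio(peak_area_28s, peak_area_18s):
    return peak_area_28s / peak_area_18s

def looks_intact(peak_area_28s, peak_area_18s, threshold=1.8):
    """Call a sample intact-looking if the 28S/18S ratio clears a threshold."""
    return rrna_ratio(peak_area_28s, peak_area_18s) >= threshold

print(looks_intact(200.0, 100.0))  # True: ratio 2.0, classical intact profile
print(looks_intact(90.0, 100.0))   # False: degraded-looking profile
```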

  • Capillary Electrophoresis

In this method, charged DNA or RNA molecules are injected into a capillary and are resolved during migration through a gel-like matrix. Nucleic acids are detected as they pass a detector that captures signals at specific absorbance wavelengths. Results are presented in the form of an electropherogram, which is a plot of signal intensity against migration time. Fragment sizes are precisely determined using a size marker consisting of fragments of known size. This method provides highly resolving and sensitive nucleic acid analysis that is faster and safer than traditional slab-gel electrophoresis.



Hybrid Read Sequencing: Applications and Tools

Next-generation sequencing (Illumina) and long-read sequencing (PacBio/Oxford Nanopore) platforms each have their own strengths and weaknesses. Recent advances in single-molecule real-time (SMRT) and nanopore sequencing technologies have enabled high-quality assemblies from long but inaccurate reads. However, these approaches require high coverage by long reads and remain expensive. On the other hand, the inexpensive short-read technologies produce accurate but fragmented assemblies. The combination of these techniques has led to an improved approach known as hybrid sequencing.

Hybrid sequencing methods use high-throughput, high-accuracy short-read data to correct errors in long reads. This approach reduces the amount of costlier long-read sequence data required and results in more complete assemblies, including the repetitive regions. Moreover, PacBio long reads can provide reliable alignments, scaffolds, and rough detection of genomic variants, while short reads refine the alignments, assemblies, and detections to single-nucleotide resolution. The high coverage of short-read sequencing data can also be utilized in downstream quantitative analysis [1].
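The core correction idea, stripped of the alignment and indel handling that real tools such as PBcR or Nanocorr must implement, is a per-position vote by accurate short reads over a noisy long read. A toy sketch, with short-read offsets assumed already known (i.e. alignment is already done):

```python
# Toy illustration of hybrid error correction (NOT any published tool):
# short reads aligned to a noisy long read vote per position, and the
# majority base replaces the long-read call. Real correctors must also
# handle indels, which dominate long-read error profiles.
from collections import Counter

def correct_long_read(long_read, aligned_short_reads):
    """aligned_short_reads: list of (offset, sequence) placements."""
    votes = [Counter() for _ in long_read]
    for offset, seq in aligned_short_reads:
        for i, base in enumerate(seq):
            if 0 <= offset + i < len(long_read):
                votes[offset + i][base] += 1
    corrected = []
    for base, counter in zip(long_read, votes):
        # keep the original call where no short read covers the position
        corrected.append(counter.most_common(1)[0][0] if counter else base)
    return "".join(corrected)

noisy = "ACGTTGCA"  # true sequence ACGTAGCA with one substitution at position 4
shorts = [(0, "ACGTA"), (2, "GTAGC"), (3, "TAGCA")]
print(correct_long_read(noisy, shorts))  # ACGTAGCA
```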


De novo sequencing

As alternatives to using PacBio sequencing alone for eukaryotic de novo assemblies, error correction strategies using hybrid sequencing have also been developed.

  • Koren et al. developed the PacBio corrected Reads (PBcR) approach for using short reads to correct the errors in long reads [2]. PBcR has been applied to reads generated by a PacBio RS instrument from phage, prokaryotic, and eukaryotic whole genomes, including the previously unsequenced parrot (Melopsittacus undulatus). The long-read correction approach achieved >99.9% base-call accuracy, leading to substantially better assemblies than non-hybrid sequencing strategies.
  • Bashir et al. used hybrid sequencing data to assemble the two-chromosome genome of a Haitian cholera outbreak strain at >99.9% accuracy in two nearly finished contigs, completely resolving complex regions with clinically relevant structures [3].
  • More recently, Goodwin et al. developed Nanocorr, an open-source error correction algorithm designed specifically for hybrid error correction of Oxford Nanopore reads. They used this error correction method with complementary MiSeq data to produce a highly contiguous and accurate de novo assembly of the Saccharomyces cerevisiae genome. The contig N50 length was more than ten times greater than that of an Illumina-only assembly, with >99.88% consensus identity when compared to the reference. Additionally, this assembly offered a complete representation of the features of the genome, with correctly assembled gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly [4].
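Since contig N50 is the headline metric in the comparisons above, here is a minimal reference implementation: N50 is the length L such that contigs of length ≥ L account for at least half of the total assembly size.

```python
# Minimal N50 computation: walk contigs from longest to shortest and
# return the length at which the running total reaches half the assembly.

def n50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([100, 200, 300, 400, 500]))  # 400
```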

Transcript structure and gene isoform identification

Besides genome assembly, hybrid sequencing can also be applied to the error correction of PacBio long reads of transcripts. Moreover, it can improve gene isoform identification and abundance estimation.

  • Along with genome assembly, Koren et al. used the PBcR method to identify and confirm full-length transcripts and gene isoforms. As the length of single-molecule PacBio reads from RNA-Seq experiments is within the size distribution of most transcripts, many PacBio reads represent near full-length transcripts. These long reads can therefore greatly reduce the need for transcript assembly, which requires complex algorithms for short reads, and can confidently detect alternatively spliced isoforms. However, the predominance of indel errors makes analysis of the raw reads challenging. Both sets of PacBio reads (before and after error correction) were aligned to the reference genome to determine which ones matched the exon structure over the entire length of the annotated transcripts. Before correction, only 41 (0.1%) of the PacBio reads exactly matched the annotated exon structure, a figure that rose to 12,065 (24.1%) after correction.
  • Au et al. developed a computational tool called LSC for the correction of raw PacBio reads by short reads [5]. Applying this tool to 100,000 human brain cerebellum PacBio subreads and 64 million 75-bp Illumina short reads, they reduced the error rate of the long reads by more than 3-fold. In order to identify and quantify full-length gene isoforms, they also developed an Isoform Detection and Prediction tool (IDP), which makes use of third-generation sequencing (TGS) long reads and second-generation sequencing (SGS) short reads [6]. Applying LSC and IDP to PacBio long reads and Illumina short reads of the human embryonic stem cell transcriptome, they detected several thousand RefSeq-annotated gene isoforms at full length. IDP-fusion has also been released for the identification of fusion genes, fusion sites, and fusion gene isoforms from cancer transcriptomes [7].
  • Ning et al. developed an analysis method, HySeMaFi, to decipher gene splicing and estimate gene isoform abundance [8]. First, the method establishes the mapping relationship between the error-corrected long reads and the longest assembled contig for each corresponding gene. From the mapping data, the true splicing pattern of the genes is detected, followed by quantification of the isoforms.

Personal transcriptomes

Personal transcriptomes are expected to have applications in understanding individual biology and disease, but short-read sequencing has been shown to be insufficiently accurate for the identification and quantification of an individual’s genetic variants and gene isoforms [9].

  • Using a hybrid sequencing strategy combining PacBio long reads and Illumina short reads, Tilgner et al. sequenced the lymphoblastoid transcriptomes of three family members in order to produce and quantify an enhanced personalized genome annotation. Around 711,000 CCS reads were used to identify novel isoforms, and ∼100 million Illumina paired-end reads were used to quantify the personalized annotation, which cannot be accomplished with the relatively small number of long reads alone. This method produced reads representing all splice sites of a transcript for most sufficiently expressed genes shorter than 3 kb, and it provides a de novo approach for determining single-nucleotide variations, which could be used to improve RNA haplotype inference [10].

Epigenetics research

  • Beckmann et al. demonstrated the ability of PacBio sequencing to recover previously discovered epigenetic motifs with m6A and m4C modifications in both low-coverage and high-contamination scenarios [11]. They were also able to recover many motifs from three mixed strains (E. coli, G. metallireducens, and C. salexigens), even when the motif sequences of the genomes of interest overlap substantially, suggesting that PacBio sequencing is applicable to metagenomics. Their results suggest that hybrid sequencing would be more cost-effective than using PacBio sequencing alone to detect and accurately define k-mers for low-proportion genomes.

Hybrid assembly tools

Several algorithms have been developed that can help in the single molecule de novo assembly of genomes along with hybrid error correction using the short, high-fidelity sequences.

  • Jabba is a hybrid method that corrects long third-generation reads by mapping them onto a corrected de Bruijn graph constructed from second-generation data. It uses a pseudo-alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds [12]. The tool is available here:
  • HALC is a high-throughput algorithm for long-read error correction. HALC aligns the long reads to short-read contigs from the same species with a relatively low identity requirement and constructs a contig graph. This tool was applied to E. coli, A. thaliana, and Maylandia zebra data sets and has been shown to achieve up to 41% higher throughput than other existing algorithms while maintaining comparable accuracy [13]. HALC can be downloaded here:
  • The hybridSPAdes algorithm was developed for assembling short and long reads and benchmarked on several bacterial assembly projects. hybridSPAdes generated accurate assemblies (even in projects with relatively low coverage by long reads), thus reducing the overall cost of genome sequencing. This method was used to demonstrate the first complete circular chromosome assembly of a genome from single cells of Candidate Phylum TM6 using SMRT reads [14]. The tool is publicly available on this page:

Due to the constant development of new long-read error correction tools, La et al. have recently published an open-source pipeline, LRCstats, that evaluates the accuracy of these different algorithms [15]. LRCstats analyzed the accuracy of four hybrid correction methods for PacBio long reads over three data sets and can be downloaded here:

Sović et al. evaluated different non-hybrid and hybrid methods for de novo assembly using nanopore reads [16]. They benchmarked five non-hybrid assembly pipelines and two hybrid assemblers that use nanopore sequencing data to scaffold Illumina assemblies. Their results showed that hybrid methods are highly dependent on the quality of NGS data, but much less so on the quality and coverage of nanopore data, and that they performed relatively well at lower nanopore coverages. The implementation of this DNA assembly benchmark is available here:


  1. Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics 13, 278–289 (2015).
  2. Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotech 30, 693–700 (2012).
  3. Bashir, A. et al. A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol 30, (2012).
  4. Goodwin, S. et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res 25, (2015).
  5. Au, K. F., Underwood, J. G., Lee, L. & Wong, W. H. Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS One 7, e46679 (2012).
  6. Au, K. F. et al. Characterization of the human ESC transcriptome by hybrid sequencing. Proc. Natl. Acad. Sci. 110, E4821–E4830 (2013).
  7. Weirather, J. L. et al. Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Res. 43, e116 (2015).
  8. Ning, G. et al. Hybrid sequencing and map finding (HySeMaFi): optional strategies for extensively deciphering gene splicing and expression in organisms without reference genome. 7, 43793 (2017).
  9. Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods 10, 1177 (2013).
  10. Tilgner, H., Grubert, F., Sharon, D. & Snyder, M. P. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl. Acad. Sci. 111, 9869–9874 (2014).
  11. Beckmann, N. D., Karri, S., Fang, G. & Bashir, A. Detecting epigenetic motifs in low coverage and metagenomics settings. BMC Bioinformatics 15, S16 (2014).
  12. Miclotte, G. et al. Jabba: hybrid error correction for long sequencing reads. Algorithms Mol. Biol. 11, 10 (2016).
  13. Bao, E. & Lan, L. HALC: High throughput algorithm for long read error correction. BMC Bioinformatics 18, 204 (2017).
  14. Antipov, D., Korobeynikov, A., McLean, J. S. & Pevzner, P. A. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2016).
  15. La, S., Haghshenas, E. & Chauve, C. LRCstats, a tool for evaluating long reads correction methods. Bioinformatics (2017). doi:10.1093/bioinformatics/btx489
  16. Sović, I., Križanović, K., Skala, K. & Šikić, M. Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads. Bioinformatics 32, 2582–2589 (2016).