Nanopore Sequencing: The Future of NGS?

As I mentioned in my previous post, nanopore sequencing using the MinION instrument is one of the hottest new sequencing techniques currently available. It has several benefits over the current generation of short-read sequencing instruments, including the ability to measure epigenetic DNA modifications and to generate ultra-long reads, which allow for improved coverage of difficult-to-sequence regions.

It does have a few drawbacks, however, including a fairly low output, which mostly relegates it to sequencing microbial genomes. However, a recent paper by Jain et al. from UCSC [1] used the minuscule MinION instrument to sequence the human genome and compare it to the current reference genome.

There were several items of note in this paper, not the least of which is that this is the most contiguous human genome to date, getting us closer and closer to a telomere-to-telomere sequence. Additionally, they were able to close 12 gaps, each of which was more than 50 kb in length, significantly improving completion of the genome.

Amazingly, since nanopore sequencing does not utilize PCR amplification, epigenetic modifications are maintained and are actually measurable by the MinION. The instrument is capable of detecting 5-methylcytosine modifications, and this data showed good concordance with whole genome bisulfite sequencing performed in the past.

Furthermore, they were able to map several of their ultra-long reads with telomeric repeats to specific chromosomal regions. They were then able to identify the start of the telomeric repeats and calculate the length of the repeat sequence. Overall, they found evidence for repeat regions that span 2 – 11 kb.

Long and ultra-long reads are absolutely critical when it comes to annotating these highly repetitive regions. There are other sequencers, including the PacBio SMRT Sequel sequencing system, that allow for very long reads compared to the Illumina instruments. But Jain et al. were able to obtain reads that were up to a staggering 882 kb in length.

Jain et al. were able to effectively show that the MinION system is capable of being used to sequence something as complex as a human genome. Interestingly, they theorized that the MinION system may have no intrinsic limit to read length–meaning that this protocol can be improved even further by finding methods of purifying high molecular weight DNA without fragmentation. Additionally, MinION reads are still considerably less accurate than Illumina sequencing, so this aspect could be improved as well. Nonetheless, this is a truly astonishing accomplishment that indicates what the future of DNA sequencing holds in store.

If you’re interested in finding a provider of nanopore sequencing, please send us an email at projects@genohub.com and we’d love to help you with your project!

 

Sanger Sequencing Turns 40: Retrospectives and Perspectives on DNA Sequencing Technologies

Retrospective: What have we accomplished with DNA sequencing so far?

Sanger wasn’t the first person to attempt sequencing, but before his classic method was invented, the process was painfully slow and cumbersome. Before Sanger, Gilbert and Maxam sequenced 24 bases of the lactose-repressor binding site by copying it into RNA and sequencing the RNA–which took a total of 2 years [1]!

Sanger’s method made the process much more efficient. Original Sanger sequencing took a ‘sequencing by synthesis’ approach, creating 4 extension reactions, each with a different radioactive chain-terminating nucleotide to identify what base lay at each position along a DNA fragment. When he ran each of those reactions out on a gel, it became relatively simple to identify the sequence of the DNA fragment (see Figure 1) [2].

Figure 1

Figure 1: Gel from the paper that originally described Sanger sequencing.

Of course, refinements have been made to the process since then. We now label each of the nucleotides with a different fluorescent dye, which allows for the same process to occur using only one extension reaction instead of 4, greatly simplifying the protocol. Sanger received his second Nobel Prize for this discovery in 1980 (well-deserved, considering it is still used today).

An early version of the Human Genome Project (HGP) began not long after, in 1987. The project was created by the United States Department of Energy, which was interested in obtaining a better understanding of the human genome and how to protect it from the effects of radiation. A more formalized version of the project was approved by Congress in 1988, and a five-year plan was submitted in 1990 [3]. The basic protocol for the HGP emerged as follows: large DNA fragments were cloned into bacterial artificial chromosomes (BACs), which were then fragmented, size-selected, and sub-cloned. The purified DNA was used for Sanger sequencing, and individual reads were assembled based on overlaps between them.

Given how large the human genome is, and the limitations of Sanger sequencing, it quickly became apparent that more efficient and better technologies were necessary, and indeed, a significant part of the HGP was dedicated to creating these technologies. Several advancements in both wet-lab protocol and data analysis pipelines were made during this time, including the advent of paired-end sequencing and the automation of quality metrics for base calls.

Due to the relatively short length of the reads produced, the highly repetitive parts of the human genome (such as centromeres, telomeres and other areas of heterochromatin) remained intractable to this sequencing method. Despite this, a draft of the HGP was submitted in 2001, with a finished sequence submitted in 2004–all for the low, low cost of $2.7 billion.

Since then, there have been many advancements to the process of DNA sequencing, but the most important of these is called multiplexing. Multiplexing involves tagging different samples with a specific DNA barcode, which allows us to sequence multiple samples in one reaction tube, vastly increasing the amount of data we can obtain per sequencing run. Interestingly, the most frequently used next-generation sequencing method today (the Illumina platforms–check them out here) still uses the basics of Sanger sequencing (i.e., detection of fluorescently labelled nucleotides), combined with multiplexing and a process called bridge amplification to sequence hundreds of millions of reads per run.
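
To make the idea of multiplexing concrete, here is a minimal Python sketch of demultiplexing: binning reads by a barcode found at the start of each read. The barcode table, barcode length, and one-mismatch tolerance are illustrative assumptions, not the behavior of any particular vendor's pipeline.

```python
# Minimal demultiplexing sketch: assign reads to samples by a leading barcode.
# The barcode table, barcode length and 1-mismatch tolerance are illustrative
# assumptions, not the behavior of any particular vendor's software.

SAMPLE_BARCODES = {
    "ACGTAC": "sample_A",
    "TGCAGT": "sample_B",
}
BARCODE_LEN = 6

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(read_seq, max_mismatches=1):
    """Return the sample whose barcode best matches the read prefix, or None."""
    prefix = read_seq[:BARCODE_LEN]
    best = min(SAMPLE_BARCODES, key=lambda bc: hamming(bc, prefix))
    if hamming(best, prefix) <= max_mismatches:
        return SAMPLE_BARCODES[best]
    return None  # undetermined read

for read in ["ACGTACGGATTACA", "TGCAGTTTGGCCAA", "NNNNNNGGGGGGGG"]:
    print(read[:BARCODE_LEN], "->", assign_sample(read))
```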

Figure 2

Figure 2: Cost of WGS has decreased faster than we could have imagined.

Rapid advancements in genome sequencing since 2001 have greatly decreased the cost of sequencing, as you can see in Figure 2 [4]. We are quickly approaching sequencing of the human genome for less than $1,000–which you can see here on our website.

What are we doing with sequencing today?

Since the creation of next-generation DNA sequencing, scientists have continued to utilize this technology in increasingly complex and exciting new ways. RNA-sequencing, which involves isolating the RNA from an organism, converting it into cDNA, and then sequencing the resulting cDNA, was invented shortly after the advent of next-generation sequencing and has since become a staple of the molecular biology and genetics fields. ChIP-seq, Ribo-Seq, RIP-seq, and methyl-seq followed and have all become standard experimental protocols as well. In fact, as expertly put by Shendure et al. (2017), ‘DNA sequencers are increasingly to the molecular biologist what a microscope is to the cellular biologist–a basic and essential tool for making measurements. In the long run, this may prove to be the greatest impact of DNA sequencing.’ [5] In my own experience, utilizing these methods in ways that complement each other (like cross-referencing ChIP-seq or Ribo-Seq data with RNA-seq data) can produce some of the most exciting scientific discoveries.

Figure 3

Figure 3: Model of the MinION system.

Although Illumina sequencing still reigns supreme on the market, there are some up-and-coming competitor products as well. Of great interest is the MinION from Oxford Nanopore Technologies (ONT) (see more about them here). The MinION offers a new type of sequencing that provides something the Illumina platforms lack–the ability to sequence long regions of DNA, which is of enormous value when sequencing through highly repetitive regions. The MinION works via a process called nanopore sequencing, in which voltage is applied across hundreds of small protein pores. At the top of each pore sits an enzyme that processively unwinds DNA down through the pore, causing a disruption in the ionic current that can be measured at the nucleotide level (see Figure 3) [6]. These reads can span thousands of base pairs, orders of magnitude longer than those from the Illumina platforms, which greatly simplifies genome assembly. Other new options for long-read sequencing include the PacBio system from Pacific Biosciences (look for pricing options for this service here).

Like any new technology, there have been setbacks. Early MinION flow cells had low accuracy compared with Illumina, and their output was low as well. And although these issues have mainly been addressed, the MinION still trails the Illumina platforms in the market, as they are seen as more reliable and better characterized. However, the MinION has several advantages that could eventually lead to it being more commonly used in the future: for one, it literally fits in the palm of your hand, making it much more feasible for people like infectious disease researchers, who are in desperate need of sequencing capabilities in remote locales. It’s fast as well; in one example, a researcher in Australia was able to identify antibiotic resistance genes in cultured bacteria in 10 hours [7]–an absolutely incredible feat that couldn’t have been imagined until very recently. This kind of technology could easily be used in hospitals to assist in identifying appropriate patient treatments, hopefully within a few years.

Although we are not regularly able to utilize sequencing technology for medical treatments as of yet, there are a few areas where this is currently happening. Detecting Down’s syndrome in a fetus during pregnancy used to be a much more invasive process, but with improvements in sequencing technology, new screens have been invented that allow for the detection of chromosomal abnormalities circulating in the maternal blood [8]. Millions of women have already benefitted from this improved screen.

Perspective: What does the future of DNA sequencing hold?

As the Chinese philosopher Lao Tzu said, ‘Those who have knowledge, don’t predict’, and that’s as true as ever when it comes to DNA sequencing technology. We’re capable today of things we couldn’t even have dreamed of 40 years ago, so who knows where we’ll be in the next 40 years?

But as a scientist, I’ve always enjoyed making educated guesses, so here are a few predictions about what the future might hold.

Clinical applications: I’ve never been a fan of the term personalized medicine, since it implies that one day doctors will be able to design individual treatments for each patient’s specific illness. I find this scenario unlikely (at least in the near future), because even though the cost and time of DNA sequencing have decreased by astonishing amounts, it is still expensive and time-consuming enough that it doesn’t seem likely to be of great use for routine clinical applications (to say nothing of the cost and time of developing new drug regimens). However, I have high hopes for the future of precision medicine, particularly in cancer treatment. Although we may never be able to design the perfect drug to target one individual’s cancer, we can certainly create drugs that are designed to interact with the mutations we frequently observe in cancers. This could allow for a more individualized drug regimen for patients. Given that cancer is a disease with such extremely wide variation, we will almost certainly need to start taking a more targeted approach to its treatment, and genome sequencing will be of great benefit to us in this regard.

A fully complete human genome: As I mentioned previously, one drawback of Illumina sequencing is that it is not capable of sequencing across highly repetitive regions, and unfortunately, large swaths of the human genome are highly repetitive. As such, while we have what is very close to a complete human genome, we do not have the full telomere-to-telomere sequence down as of yet. However, with the new long-read technologies that are currently being implemented, the day when this will be completed is likely not far off.

A complete tapestry of human genetic variation: Millions of people have already had their genomes sequenced to some degree (I’m one of them! Any others?), and millions more are sure to come. Widespread genome re-sequencing could one day allow us to have a full catalog of virtually every heterozygous gene variant in the world, which could allow for an even greater understanding of the connection between our genetics and specific traits.

Faster and better data analysis: Data analysis is probably the biggest bottleneck we’re currently experiencing when it comes to DNA sequencing. There is what seems like an infinite amount of data out there and, unfortunately, a finite number of people who are capable of and interested in analyzing it. As these technologies mature and become more established, new and better data analysis pipelines will be created, speeding up analysis time and increasing our understanding of the data. Hopefully, one day even scientists with only moderate technical savvy will be capable of performing their own data analysis.

I’m certain the future of DNA sequencing will also hold things that I can’t even imagine. It’s an amazing time to be a scientist right now, as researchers are continuously discovering new technologies, and finding ways to put our current technologies to even more interesting uses.

What do you think the next big thing in DNA sequencing will be? Tell us in the comments!

RIN Numbers: How they’re calculated, what they mean and why they’re important

High-quality sequencing data is an important part of ensuring that your results are reliable and replicable, and obtaining high-quality sequencing data means using high-quality starting material. For RNA-seq, this means using RNA with a high RIN (RNA Integrity Number), a standardized 1–10 scale that indicates the quality of an RNA sample and removes individual bias and interpretation from the process.

The RIN is a significant improvement over the way that RNA integrity was previously assessed: the 28S:18S ratio. Because the 28S rRNA is approximately 5 kb and the 18S rRNA is approximately 2 kb, the ideal 28S:18S ratio is 2.7:1–but the benchmark is considered to be about 2:1. However, this measurement relies on the assumption that the quality of rRNA (a very stable molecule) linearly reflects the quality of mRNA, which is actually much less stable and experiences higher turnover [1].
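
For illustration, here is a minimal sketch of how a 28S:18S ratio could be computed from electropherogram peak areas. The peak areas are made-up numbers, and a real RIN is produced by Agilent's algorithm from the whole trace, not from this simple ratio.

```python
# Toy calculation of a 28S:18S rRNA ratio from electropherogram peak areas.
# The peak areas below are made-up numbers purely for illustration; a real
# RIN is produced by Agilent's algorithm from the whole trace, not from this
# simple ratio.

peak_areas = {"18S": 120.0, "28S": 230.0}  # arbitrary fluorescence units

ratio = peak_areas["28S"] / peak_areas["18S"]
print(f"28S:18S ratio = {ratio:.2f}:1")

# Rule of thumb mentioned above: a ratio near 2:1 suggests intact rRNA, but it
# says little about mRNA integrity, which is why the RIN was introduced.
if ratio >= 2.0:
    print("rRNA peaks look intact (ratio-based check only)")
else:
    print("possible rRNA degradation (ratio-based check only)")
```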

Figure 1

Figure 1: RNA traces of RNA samples with different RIN values. Note the difference between high and low quality samples.

Fortunately, Agilent Technologies has developed a better method: the RIN value, calculated by a sophisticated algorithm that represents a considerable improvement over the 28S:18S ratio. The RIN takes into account the entirety of the RNA sample, not just the rRNA peaks, as you can see in Figure 1 [2].

The importance of RNA integrity for gene expression measurements was examined by Chen et al. [3] in 2014 by comparing RNA samples with 4 different RIN values (from 4.5 to 9.4) across 3 different library preparation methods (poly-A selected, rRNA-depleted, and total RNA), for a total of 12 samples. They then calculated the correlation coefficient of gene expression between the highest quality RNA and the more degraded samples for each library preparation method.
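
Here is a minimal sketch of the kind of comparison described above: a Pearson correlation between the expression values of a high-RIN sample and a degraded one. The gene names and values are invented; Chen et al.'s actual analysis used genome-wide expression estimates.

```python
# Sketch: Pearson correlation of gene expression between a high-RIN and a
# low-RIN sample. Gene names and expression values are invented; the published
# analysis used genome-wide expression estimates.

import math

high_rin = {"GENE1": 520.0, "GENE2": 88.0, "GENE3": 1500.0, "GENE4": 12.0}
low_rin  = {"GENE1": 480.0, "GENE2": 95.0, "GENE3": 1320.0, "GENE4": 9.0}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

genes = sorted(high_rin)
x = [math.log2(high_rin[g] + 1) for g in genes]  # log-transform expression
y = [math.log2(low_rin[g] + 1) for g in genes]
print(f"correlation (high vs. low RIN sample): {pearson(x, y):.3f}")
```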

Figure 2

Figure 2: Only poly-A selected RNA library preparations experience a decrease in data quality with a decrease in RIN value.

Fascinatingly, the only library preparation method that showed a significant decrease in the correlation between high-quality and low-quality RNA was poly-A selection. The other two library preparation methods still had correlation coefficients greater than 0.95 even at low RINs (see Figure 2 [3])!

Chen et al. theorize that the reason behind this is that degraded samples that are poly-A selected will result in an increasingly 3′ biased library preparation, and that therefore you will lose valuable reads from your data. Because the other methods involve either no treatment or rRNA removal (as opposed to selection), there will be considerably less bias in the overall sample.

Even though only the poly-A selected library preparation method appears to suffer from a low RIN, providers still prefer to work with relatively high-quality RNA samples for all library preparation methods. However, if you do have important samples with lower RINs, it may still be worth discussing your options with a provider directly–and we at Genohub are more than happy to help facilitate your discussions! Please contact us here if you have any further questions about sequencing samples with poor RINs.

How mispriming events could be creating artifacts in your library prep (and what you can do to prevent it)

Next-generation sequencing technology has been advancing at an incredibly rapid rate; what started as only genome sequencing now encompasses an incredible array of RNA sequencing techniques as well. These range from standard RNA-seq to miRNA-seq, Ribo-seq, and HITS-CLIP (high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation). While these technological advances are now widely used (and have been invaluable to the scientific community), they are not fully mature technologies, and we are still learning about potential artifacts that may arise and how to combat them; mispriming events are a significant and under-studied contributor to errors in sequencing data.

What is a mispriming event?

Reverse transcription is an important part of any RNA-sequencing technique. The RNA in question is first converted into cDNA, which is then PCR amplified and converted into a library (the exact library preparation method varies depending on which technique you are using). However, the conversion of RNA into cDNA by reverse transcriptase requires a DNA primer to start the process. This primer is complementary to the RNA, binding to it and allowing reverse transcription to take place. A mispriming event occurs when this process starts at a site where the DNA primer is not perfectly complementary to the RNA.

Two recent papers have highlighted how reverse transcription mispriming events can have a considerable impact on the library preparation process and result in error. Van Gurp, McIntyre and Verhoeven [1] conducted an RNA-seq experiment focusing on reads that mapped to ERCC spike-ins (artificial RNA fragments of known sequence that are added to RNA-seq experiments as a control). Because the sequence of these ERCC spike-ins is already known, detecting mismatches in the reads is relatively straightforward.

Their findings were striking: they found that 1) RNA-to-DNA mispriming events were the leading cause of deviations from the true sequence (as opposed to DNA-to-DNA mispriming events that can occur later in the library preparation process), and 2) these mispriming events are non-random and show specific, predictable patterns. For example, if the first nucleotide of an RNA-seq read is A or T, rA-dC and rU-dC mispriming events are common. In positions 2–6, rU-dG and rG-dT are also quite common, which lines up with the observation that these are the most stable mismatched pairs [2]. Needless to say, these kinds of mispriming events can cause huge issues for various types of downstream analysis, particularly identification of SNPs and RNA-editing sites; eliminating these biases will be extremely important for future experiments (Figure 1).
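
To make the idea of position-specific mispriming signatures concrete, here is a minimal sketch that tallies reference-vs-read mismatch types by read position, for reads whose true sequence is known (as with ERCC spike-ins). The reference and reads are invented; the published analysis worked from full alignments to the ERCC sequences.

```python
# Sketch: tally mismatch types (reference base -> read base) by read position,
# given reads whose true sequence is known (e.g. ERCC spike-ins). The
# reference and reads below are invented for illustration.

from collections import Counter

reference = "ATGCGTACGTTAGCCATGA"
reads_with_offsets = [
    ("CTGCGTACGT", 0),  # mismatch at read position 1 (A -> C)
    ("GTACGTTAGC", 4),  # perfect match starting at reference offset 4
    ("ACGTTCGCCA", 6),  # mismatch at read position 6 (A -> C)
]

mismatches = Counter()
for read, offset in reads_with_offsets:
    for pos, base in enumerate(read):
        ref_base = reference[offset + pos]
        if base != ref_base:
            # Key: (1-based read position, reference base, observed base)
            mismatches[(pos + 1, ref_base, base)] += 1

for (pos, ref_base, obs), count in sorted(mismatches.items()):
    print(f"read position {pos}: {ref_base} -> {obs} x{count}")
```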

Figure 1

Figure 1: Common base mismatches and their locations [1]

As of right now, we do not have sophisticated methods for eliminating these types of mispriming events from our datasets. Trimming the first 10 bases of every read will solve the problem, but it also throws out real data along with the artifacts. Given that these mispriming events follow predictable patterns, it is possible that in the future we could devise programs to identify and correct mispriming events, or even modify hexamer design to exclude hexamers that cause frequent mispriming.
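
As a rough illustration of that blunt fix, here is a minimal sketch that hard-trims a fixed number of leading bases from each FASTQ record. The file names and trim length are placeholder assumptions; in practice you would normally use a dedicated trimmer such as cutadapt or Trimmomatic.

```python
# Sketch: hard-trim the first N bases (and matching quality values) from every
# read in a FASTQ file. File names and TRIM_LEN are placeholder assumptions;
# in practice a dedicated trimmer would normally be used instead.

TRIM_LEN = 10

def trim_fastq(in_path, out_path, n=TRIM_LEN):
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            plus = fin.readline()
            qual = fin.readline().rstrip("\n")
            fout.write(header)
            fout.write(seq[n:] + "\n")   # drop the first n bases
            fout.write(plus)
            fout.write(qual[n:] + "\n")  # drop the matching quality values

if __name__ == "__main__":
    trim_fastq("reads.fastq", "reads.trimmed.fastq")
```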

Frustratingly, mispriming events can occur even when the priming oligo is quite long. HITS-CLIP has been instrumental in discovering many protein-RNA interactions [3]; however, a recent paper by Gillen et al. [4] demonstrated that mispriming events can create a significant artifact even with a long DNA primer, producing read pileups that align to genomic occurrences of the adaptor sequence and making it appear as though protein-RNA interactions occur at those loci.

Part of HITS-CLIP library preparation involves attachment of a 3’ RNA adaptor to the protein-bound RNA. A DNA oligo perfectly complementary to this RNA sequence serves as the primer for conversion of this RNA into cDNA, and it is this DNA oligo that leads to significant mispriming events. Although the DNA primer is long enough to be extremely specific, sequences that are complementary to only the last 6 nucleotides of the primer are still enough to result in a mispriming event, which converts alternative RNAs into cDNAs that eventually get amplified in the library.

Gillen et al. analyzed 44 experiments from 17 research groups, and showed that the adaptor sequence was overrepresented by 1.5-fold on average–and sometimes as high as 6-fold (Figure 2)!
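
As a rough illustration of how this kind of over-representation can be quantified, here is a sketch that compares the observed frequency of an adaptor-derived 6-mer at read starts against a naive uniform-background expectation. The adaptor 6-mer and reads are invented, and this is not the genomic-alignment approach Gillen et al. actually used.

```python
# Sketch: estimate fold over-representation of an adaptor-derived 6-mer at the
# start of reads relative to a naive uniform-background expectation (0.25^6).
# The adaptor k-mer and reads below are invented for illustration only; this
# is not the genomic-alignment analysis used in the paper.

ADAPTOR_KMER = "TGGAAT"  # pretend 6-mer taken from the 3' adaptor
reads = [
    "TGGAATCCGTAGCTAG",
    "ACGTACGTTAGGCCAT",
    "TGGAATTTTAGCGGCA",
    "GGCTAGCATCGATCGA",
]

k = len(ADAPTOR_KMER)
observed = sum(r.startswith(ADAPTOR_KMER) for r in reads) / len(reads)
expected = 0.25 ** k  # chance of one specific 6-mer at a fixed position
print(f"observed frequency at read starts: {observed:.3f}")
print(f"expected by chance:                {expected:.6f}")
print(f"fold over-representation:          {observed / expected:.0f}x")
```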

Figure 2

Figure 2: Over-representation of DNA primer sequences can be found in multiple datasets from different groups, indicating the possibility of a widespread problem. 

Since only 6 complementary nucleotides are needed to cause a mispriming event, how can we eliminate this artifactual data?

Gillen et al. devised an ingenious and simple method of reducing this artifact by using a nested reverse transcription primer (Figure 3). By ‘nested primer’, they mean a primer that is not perfectly complementary to the 3’ adaptor, but rather stops 3 nucleotides short of being fully flush with the adaptor. This, combined with a full-length PCR primer (that is, one flush with the adaptor sequence) whose final 3 nucleotides are ‘protected’ (here, ‘protected’ means using phosphorothioate bonds for the final 3 bases of the oligo, which prevents degradation by exonucleases; without this protection, the mispriming artifact is simply shifted 3 bases downstream), is enough to almost completely eliminate mispriming artifacts. This allows for significantly improved library quality and increased sensitivity!

Figure 3

Figure 3: A nested reverse transcription primer combined with a protected PCR primer can eliminate sequencing artifacts almost entirely. 

Although we have been working with sequencing technologies for many years now, we still have a lot to discover about hidden artifacts in the data. It’s becoming increasingly important to stay aware of emerging reports of these biases and to make sure we are doing everything we can to eliminate them from our data.

Have you ever had an experience with sequencing artifacts in your data? Tell us in the comments!

International biological material shipment information for various countries

Many scientific researchers prefer to outsource their next generation sequencing projects to commercial service providers to get access to the latest instruments and scientific expertise.

However, there are some countries in the world that do not allow the export of biological samples (tissue samples, DNA, RNA etc.) or require several formal agreements and multi-level clearance.

In this post, we’ll highlight some general information about shipping samples out of several major countries, primarily to the US. Some of this is based on our experience working with many international researchers who use Genohub to outsource their sequencing.

China

China, for example, does not allow the import or export of biological samples, as confirmed by multiple courier service agents [1]. Major Chinese service providers require biological samples to be shipped to their Hong Kong addresses to avoid delay or loss of samples [2,3].

In a rare situation, a Chinese group of researchers was able to ship DNA samples to the US using FedEx. They have detailed their experience and offer some advice regarding sample shipment that may be useful to other groups hoping to do the same [4].

Brazil

To export biological material from Brazil, several documents, such as a Material Transfer Agreement and an institutional invoice for the exported specimens, are required for customs clearance. A detailed cover letter in both Portuguese and English that can help customs officials in Brazil (IBAMA) and the USA (USFWS) properly assess the authorization to export and import specimens is also required [5]. It can take several weeks to obtain these documents, so researchers need to plan their work in advance.

India

Until 2016, the Indian Council of Medical Research made decisions on the shipment of biological samples on a case-by-case basis [6]. However, these regulations were lifted in August 2016, and researchers now have to follow several guidelines for biological materials to qualify for transport to foreign countries for research purposes [7].

According to a FedEx India employee, a non-infectious certificate from an authentic laboratory and a detailed description of the included biological samples are sufficient for customs clearance from India. Pathogenic material is not allowed to be shipped internationally.

Europe

We haven’t come across any issues shipping samples from European countries and generally, a properly declared biological shipment can be exported without any hassles.

The current Universal Postal Union regulations for shipping biological material have been comprehensively summarized in an official document. This document also lists the countries that allow or ban the import/export of biological substances [8].

Please consult our shipping guide for more details on how to prepare and ship your samples to the USA – https://genohub.com/dna-rna-shipping-for-ngs/#USA.

If you know of any countries that require a lot of formal paperwork for export of biological substances for research or sequencing purposes, feel free to comment below. I’ll update the blog with this information.

References:

[1] China Country Snapshot. https://smallbusiness.fedex.com/international/country-snapshots/china.html

[2] Sample Preparation; Shipping – Novogene. https://en.novogene.com/support/sample-preparation/

[3] Sample submission guidelines – BGI. http://www.bgisample.com/yangbenjianyi/BGI-TS-03-12-01-001 Suggestions for Sample Delivery(NGS) B0.pdf

[4] Community/ZJU-China Letter about Shipping DNA – 2015.igem.org. http://2015.igem.org/Community/ZJU-China_Letter_about_Shipping_DNA

[5] Shipping and Customs. http://symbiont.ansp.org/ixingu/shipping/index.html

[6] Centre removes ICMR approval for import/export of human biological samples. http://www.dnaindia.com/india/report-centre-removes-icmr-approval-for-importexport-of-human-biological-samples-2245910

[7] Indian Council of Medical Research. http://icmr.nic.in/ihd/ihd.htm

[8] WFCC Regulations. http://www.wfcc.info/pdf/wfcc_regulations.pdf

Sequencing trends in early 2017

Every month, ~5,000 unique queries for sequencing are submitted using Genohub’s NGS project matching engine: https://genohub.com/ngs/. Briefly, a user chooses the NGS application they are interested in (e.g., exome-seq, RNA-Seq), the number of reads or coverage they’d like to achieve, and the number of samples they plan on sequencing. Genohub’s matching engine takes this input, calculates the sequencing output required to meet the desired coverage, and recommends services, filterable by sequencing instrument, read length, and library preparation kit. Results can be sorted by price or turnaround time and selected for immediate ordering.
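
To make the coverage arithmetic concrete, here is a minimal sketch of the kind of calculation such a matching engine performs, translating a desired coverage into total bases, reads, and lanes. The genome size, read length, per-lane yield, and sample count are illustrative assumptions, not Genohub's actual parameters.

```python
# Back-of-the-envelope calculation behind a coverage-based matching engine:
# desired coverage -> total bases -> reads -> lanes. All parameters below are
# illustrative assumptions, not the actual values any particular engine uses.

import math

genome_size_bp = 3.1e9     # approximate human genome size
target_coverage = 30       # desired mean depth
read_length_bp = 150       # single-read length
reads_per_lane = 400e6     # example per-lane read yield
num_samples = 4

bases_per_sample = genome_size_bp * target_coverage
reads_per_sample = bases_per_sample / read_length_bp
total_reads = reads_per_sample * num_samples
lanes_needed = math.ceil(total_reads / reads_per_lane)

print(f"bases needed per sample: {bases_per_sample:.2e}")
print(f"reads needed per sample: {reads_per_sample:.2e}")
print(f"lanes needed for {num_samples} samples: {lanes_needed}")
```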

Every query that’s submitted is recorded, giving us a unique perspective into what types of NGS services researchers are actually interested in.

DNA-Seq

First, it’s important to note that DNA-seq is our default option in the matching engine: https://genohub.com/ngs/. Due to this bias, it can’t really be compared with the other services being ordered, so it’s best to set this data point aside. Of the DNA-seq services that are actually ordered, requests break down into whole human genome sequencing, re-sequencing, and metagenomics sequencing. The most frequently used instruments for this service are currently the HiSeq X, HiSeq 3000/4000 and NextSeq. With PacBio’s release of the Sequel, requests have increased significantly this quarter compared with PacBio service requests in the previous 4 quarters. We expect this trend to continue through 2017.

RNA-Seq

The pie chart above breaks down the types of RNA-seq services requested in the first three months of 2017. Total RNA-seq represents all applications where rRNA is depleted prior to library preparation, whereas mRNA-seq represents all applications where mRNA is enriched. In 2016, the number of Total RNA-seq projects was half that of this year. We attribute this to a growing interest in non-coding RNA and the availability of higher throughput sequencing runs. As sequencing costs drop and rRNA depletion becomes more affordable, researchers are asking for more biological information. Today, the NextSeq and HiSeq 3000/4000 are the most commonly used instruments for any RNA-seq application. Counting applications continue to dominate, although requests for de novo transcriptome assembly are steadily rising over the previous year. Whereas in the past 1×50 and 1×75 were the most frequently requested read lengths for RNA counting applications, roughly twice as many researchers are requesting paired-end sequencing compared with last year.

Methylation analysis

Compared to last year, there is increased interest in WGBS relative to RRBS and MeDIP. With the advent of the HiSeq X and its compatibility with WGBS applications, more researchers are finding whole genome based applications easier and more informative than reduced representation bisulfite sequencing.

Instrument trends

By far the biggest trend this year is the number of long-read requests on the PacBio Sequel. Whereas in the past mate-pair library prep was more popular, we’re starting to see this service decline and long-read sequencing being ordered more frequently. Hybrid Illumina/PacBio reads are also being ordered more frequently to improve the quality of assemblies. Long reads are being requested to detect functional elements in human genomes that are missed by short-read sequencing. We should add that requests for 10X Genomics services have started to increase, although they are too few right now to make any meaningful comments. We currently don’t have providers offering Oxford Nanopore services on Genohub, so we can’t comment here either.

This month NovaSeq services are expected to be available on Genohub. We expect there to be a lag phase as kinks are worked out, before this becomes a popular instrument request.

The future

Having spent the last 4 years receiving sequencing requests and performing consultations, we can see clearly that new technology does influence behavior. With reduced sequencing costs, we see clients not only including more controls and replicates, but also looking at RNA-seq from a more global perspective and becoming more interested in long reads. Clients that previously only performed exome-seq are now turning to whole genome sequencing on the HiSeq X. Researchers that normally only look at coding RNAs are starting to show interest in long non-coding and small RNAs. Overall, faster and cheaper sequencing does tend to promote better science. Gone are the n=1 days of sequencing.

Beginner’s Handbook to High Throughput Sequencing


As sequencing becomes more ubiquitous, we find researchers struggling with concepts like ‘paired-end’, designing a custom sequencing primer, cluster density, and technical library prep details, such as why small RNA and mRNA can’t both be prepared in the same library and sequenced. This is partially the fault of industry (e.g., do 100M ‘paired-end reads’ comprise 200M, 100M or 50M single reads? We like to denote this as 100M paired-end reads, i.e., 50M reads in each direction), and partially due to all the moving parts: new sequencing and library prep chemistries, technology jargon and complexities in data analysis.
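
To make the counting convention explicit, here is a tiny sketch of the interpretation we use: ‘100M paired-end reads’ means 50M fragments, each sequenced from both ends (50M reads in each direction).

```python
# Tiny illustration of the paired-end counting convention described above:
# "100M paired-end reads" = 50M fragments, each sequenced from both ends.

def describe_paired_end(total_reads_million):
    fragments = total_reads_million / 2  # number of read pairs (fragments)
    return (f"{total_reads_million}M paired-end reads = "
            f"{fragments:.0f}M fragments x 2 reads "
            f"({fragments:.0f}M reads in each direction)")

print(describe_paired_end(100))
```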

Seeing first time researchers struggle (on hundreds of sequencing projects), we sought to put together a guide to help the sequencing novice get a strong foothold on starting a sequencing project. This guide is called our Beginner’s Handbook to Next Generation Sequencing.

The guide is broken up into four main sections:

  1. Sequencing instruments and design of a sequencing project
  2. Library prep
  3. Sample isolation
  4. Providers we recommend you contact for analyzing your data

Whether you are new to NGS or an experienced NGS user, we recommend you check it out and ask questions. We’ll be updating the guide on a regular basis, so if you have recommendations, please post them here. Thanks!

 

 

RNA-Seq considerations when working with nucleic acid derived from FFPE

RNA-seq from FFPE samples

Millions of formalin-fixed paraffin-embedded (FFPE) tissue sections are stored in oncology tissue banks and pathology laboratories around the world. Formalin fixation followed by embedding in paraffin has historically been a popular preservation method in histological studies, as the morphological features of the original tissue remain intact. However, for RNA-seq or other gene expression methods, formalin fixation and paraffin embedding can degrade and modify RNA, complicating retrospective analysis of samples preserved with this commonly used archival method.

During the fixation and embedding process RNA is affected in the following ways:

  1. Degradation of RNA to short ~100 base fragments as a result of sample treatment during fixation or long term storage in paraffin.
  2. Formaldehyde modification of RNA. Formaldehyde modification can block base pairing and can cause cross-linking to other macromolecules. These RNA modifications include hydroxymethyl and methylene bridge cross-links on amine moieties of adenine bases.
  3. High variability in the degree of RNA degradation and modification in FFPE samples precludes transcriptomic similarity and gene expression correlation studies, or simply forces researchers to exclude certain samples.
  4. Oligo-dT approaches are not recommended when amplifying RNA, as most RNA fragments derived from FFPE no longer contain a poly(A) tail, making rRNA depletion a necessary first step prior to RNA-seq.

If formalin fixation and paraffin embedding can’t be avoided, Ahlfen et al. nicely summarize best practices for improving RNA quality and yield from FFPE samples. These include:

  1. Starting fixation promptly and cutting samples into thin pieces to avoid tissue autolysis.
  2. Reduction of fixation time (< 24 hours) to reduce irreversible cross-linking and RNA fragmentation during storage of FFPE blocks.
  3. Utilizing a method to reverse cross-linking during RNA isolation, such as heating the RNA to remove some formaldehyde cross-links; however, reactions of formaldehyde with amino groups in bases and proteins are largely irreversible and inhibit cDNA synthesis.
  4. Use of an rRNA depletion step and random priming as opposed to oligo-dT based reverse transcription.
  5. RNA QC methods, such as a measurement of RNA integrity or one of several RT-PCR based kits, to qualify a sample prior to RNA-seq.

Despite these challenges, FFPE samples are frequently used in transcriptomic studies and in many cases correlate nicely with fresh frozen samples (Hedegaard et al., 2014; Li et al., 2014; Zhao et al., 2014). The study of somatic mutations remains a challenge in FFPE tissue due to fragmentation and the presence of artifacts. Nevertheless, RNA from FFPE samples is being used regularly to investigate both non-coding and coding parts of the genome.

If you have FFPE blocks or total RNA and would like to perform gene expression analysis by RNA-Seq, we recommend you start with a NGS service provider who has specific experience with FFPE RNA isolation, QC, library preparation, sequencing and data analysis. Providers with this experience can be found using this search on Genohub: https://genohub.com/ngs/?r=mt3789#q=4c5f2d036f.

 

Accurate measurement of error rate and base quality in Illumina sequencing runs

With new instrumentation, cluster chemistries, software updates, and continuously updated library preparation reagents, accurately monitoring sequencing run quality has become increasingly difficult. In a recent paper, Manley et al. (2016) develop an open-source tool called the Percent Perfect Reads (PPR) plot to monitor base quality.

PPR uses alignment to PhiX and calculates the percentage of reads with 0–4 mismatches. A PPR plot is a cycle-by-cycle representation of the percentage of reads with mismatches. PPR was originally introduced with the original Genome Analyzer and was retired in 2014.
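
To illustrate the PPR concept, here is a minimal sketch that computes, for each cycle, the percentage of PhiX-aligned reads with no mismatches up to that cycle. The mismatch positions are invented and this is only a conceptual illustration, not the published PPR program (which is written in Perl and R).

```python
# Conceptual sketch of a Percent Perfect Reads (PPR) calculation: for each
# sequencing cycle, the percentage of PhiX-aligned reads with no mismatches up
# to that cycle. Mismatch positions below are invented; this is not the
# published PPR program (which is implemented in Perl and R).

READ_LENGTH = 10

# For each read, the 1-based cycles at which it mismatches the PhiX reference.
read_mismatch_cycles = [
    [],        # perfect read
    [7],       # a single late error
    [3, 9],    # two errors
    [],        # perfect read
]

for cycle in range(1, READ_LENGTH + 1):
    perfect = sum(
        all(pos > cycle for pos in mismatches)
        for mismatches in read_mismatch_cycles
    )
    ppr = 100.0 * perfect / len(read_mismatch_cycles)
    print(f"cycle {cycle:2d}: {ppr:5.1f}% perfect reads")
```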

PPR was developed as an alternative to the Phred-like Q score for determining run quality and has the following advantages:

  1. PPR is independently calculated, unlike Illumina’s Q score, which is calculated with instrument-dependent variables (they vary by instrument, chemistry and software)
  2. PPR is a direct measure of error, unlike Q scores, which rely on a table of data generated under ideal sequencing circumstances
  3. Q scores tend to overestimate quality
  4. Unlike with Q scores, PPR allows the user to identify the source of sequencing error

By examining a PPR profile, the following issues are distinguishable:

  1. Adapter read-through (the number of sequencing cycles exceeds the library insert length, so the run reads through the adapter sequence)
  2. Repetitive or low diversity sequences
  3. Imaging problems
  4. Over/under clustering
  5. Chemistry problems (cluster reagents are not working properly)

The PPR plot program is compatible with HiSeq 2000/2500, NextSeq 500, and MiSeq instruments. It’s written in Perl and R, and accepts FASTQ files as input. The PPR software package is available at http://openwetware.org/wiki/BioMicroCenter:PPR_Program (BioMicro Center, Massachusetts Institute of Technology, Cambridge, MA, USA).

 

Illumina unveils NovaSeq 5000 and 6000

Illumina NovaSeq

Today, at the annual J.P. Morgan Healthcare Conference, Illumina announced the release of a new series of instruments called NovaSeq. Continuing the use of ExAmp cluster amplification and the patterned nano-wells that form the basis of HiSeq 3000/4000, HiSeq X Ten and HiSeq X Five flow cell technology, Illumina has further reduced the spacing between nanowells to increase cluster density and data output. In the end, this promises to produce ~2-3x more reads than a single 8-lane HiSeq X flow cell.

Here are the specs available on day 1 of launch:

Number of instruments being launched: 2; NovaSeq 5000 and 6000

Non-technical application based restrictions: No; unlike the HiSeq X Ten or HiSeq X Five, these instruments will not have application based restrictions. Illumina plans to continue restricting HiSeq X instruments to WGS applications (1).

Potential technical based restrictions: Notable is the absence of Nextera based DNA or Nextera Exome in the list of compatible library preparation kits. Mate-pair based Nextera kits are however listed as compatible (2). This may indicate there are template (library) size restrictions on this instrument (similar to HiSeq 3000/4000 and HiSeq X).

Instrument availability: NovaSeq 6000 will begin shipping in March 2017 and NovaSeq 5000 will begin shipping mid-2017.

Anticipated availability on Genohub: In April 2017, researchers will be able to order NovaSeq based sequencing. This hinges on on-time instrument delivery to our partnering service providers.

Instrument cost: NovaSeq 5000 and 6000 Systems are priced at $850,000 and $985,000 respectively

Target Market: Research labs that cannot afford the capital cost of a HiSeq X Five or HiSeq X Ten System and don’t want to deal with the restrictions. HiSeq X Five and Ten systems are restricted from running RNA-seq or exome based libraries.

Other updates: RFID added to make sure loading is done properly, reduction in the number of steps in a sequencing workflow (from 38 to 8) (1) and flow cell loading is automated.

Cbot or onboard clustering: onboard

Tunable output: 4 flow cells are available. NovaSeq S1 and S2 flow cells are compatible with both NovaSeq 5000 and 6000 systems, while NovaSeq S3 and S4 are exclusive to NovaSeq 6000 instruments.

Two color or Four color chemistry: Two color, like the NextSeq 500

Number of lanes: S1 and S2 have two lanes whereas S3 and S4 have four lanes

Available read lengths: 2×50, 2×100 and 2×150

Run times: < 19, 29 and 40 hours for 2×50, 2×100 and 2×150 bp read lengths respectively

Output: 

Instrument and flow cell | Reads per flow cell (billion)* | Output from 2×150 bp run (Gb)*
NovaSeq 5000/6000 S1     | 1.6                            | 500
NovaSeq 5000/6000 S2     | 3.3                            | 1000
NovaSeq 6000 S3          | 6.6                            | 2000
NovaSeq 6000 S4          | 10                             | 3000

*Output and read numbers based on a single flow cell

Number of flow cells that can be run at once: 1 or 2 flow cells can be run at a time on either the NovaSeq 5000 or 6000

So what does this mean for the sequencing industry? Clearly the NovaSeq was launched to target research labs that can’t afford the capital costs of the HiSeq X series but want to upgrade from their current HiSeq instruments. NovaSeq S3 and S4 flow cells promise to produce 2-3x more reads than a single 8-lane HiSeq X flow cell (2.6-3 billion reads). Of course, if NovaSeq runs are priced 2-3x higher than a HiSeq X flow cell, the cost to sequence a genome will be the same. When reagent pricing is available, this will become clearer.
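
To make the cost-per-genome reasoning explicit, here is a quick back-of-the-envelope sketch. The flow cell prices are placeholders (reagent pricing wasn't public at launch), the HiSeq X per-flow-cell output of ~900 Gb is approximate, and the NovaSeq S4 output comes from the spec table above.

```python
# Back-of-the-envelope cost-per-genome comparison. The flow cell prices below
# are placeholders chosen only to illustrate the point (reagent pricing wasn't
# public at launch); HiSeq X output is approximate, NovaSeq S4 output is from
# the spec table above.

GENOME_OUTPUT_GB = 90  # ~30x coverage of a ~3 Gb human genome

flow_cells = {
    # name: (approx. output per flow cell in Gb, placeholder price in USD)
    "HiSeq X (8-lane)": (900, 10_000),
    "NovaSeq S4":       (3000, 33_000),  # ~3.3x the output at ~3.3x the price
}

for name, (output_gb, price) in flow_cells.items():
    genomes_per_flow_cell = output_gb / GENOME_OUTPUT_GB
    cost_per_genome = price / genomes_per_flow_cell
    print(f"{name}: ~{genomes_per_flow_cell:.0f} genomes per flow cell, "
          f"~${cost_per_genome:,.0f} per genome (placeholder pricing)")

# With proportional pricing, the cost per genome comes out roughly the same,
# which is the point made above: more output only lowers the price per genome
# if the flow cell isn't proportionally more expensive.
```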

2016 was a tough year for Illumina as it lost one third of its value. As Illumina launches another instrument geared toward the research market, much continues to hinge on federally funded research grants to fuel growth. A focus on developing clinical applications, insurance-reimbursable tests and a global shift toward diagnostics is going to be required for sustained growth. ‘Market generation’ activities, like the Helix and Grail initiatives, are steps in this direction.