Sequencing trends in early 2017

Every month, ~5,000 unique queries for sequencing are submitted using Genohub’s NGS project matching engine: https://genohub.com/ngs/. Briefly, a user chooses the NGS application they are interested in (e.g. exome, RNA-Seq), the number of reads or coverage they’d like to achieve, and the number of samples they plan on sequencing. Genohub’s matching engine takes this input, calculates the sequencing output required to meet the desired coverage, and recommends services, filterable by sequencing instrument, read length, and library preparation kit. Results can be sorted by price or turnaround time and selected for immediate ordering.
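The coverage-to-output step can be illustrated with a little arithmetic. The sketch below is not Genohub’s actual implementation; the function names and the round 3 Gb human genome size are assumptions made for illustration only.

```python
import math


def required_output_gb(genome_size_bp, coverage, n_samples=1):
    """Total sequenced bases (in Gb) needed to reach `coverage` on every sample."""
    return genome_size_bp * coverage * n_samples / 1e9


def required_read_pairs(genome_size_bp, coverage, read_length_bp, paired=True):
    """Fragments (read pairs if paired-end) needed for one sample."""
    bases_per_fragment = read_length_bp * (2 if paired else 1)
    return math.ceil(genome_size_bp * coverage / bases_per_fragment)


# Assumed example: a 3 Gb genome at 30x coverage, sequenced 2x150.
HUMAN_GENOME_BP = 3_000_000_000  # round-number assumption
print(required_output_gb(HUMAN_GENOME_BP, 30))        # 90.0 Gb
print(required_read_pairs(HUMAN_GENOME_BP, 30, 150))  # 300000000 read pairs
```

A real matching engine would also have to account for duplicates, on-target rates and instrument-specific yields, which this sketch ignores.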

Every query that’s submitted is recorded, giving us a unique perspective on what types of NGS services researchers are actually interested in.

DNA-Seq

First, it’s important to note that DNA-seq is the default option in the matching engine: https://genohub.com/ngs/. Because of this bias, DNA-seq query counts can’t really be compared with those for other services, so we set this data point aside. DNA-seq services that are actually ordered break down into whole human genome sequencing, re-sequencing, and metagenomics sequencing. The most frequently used instruments for this service are currently the HiSeq X, HiSeq 3000/4000 and NextSeq. With PacBio’s release of the Sequel, requests have significantly increased this quarter compared to PacBio service requests in the last 4 quarters. We expect this trend to continue through 2017.

RNA-Seq

The pie chart above breaks down the types of RNA-seq services requested in the first three months of 2017. Total RNA-seq represents all applications where rRNA is depleted prior to library preparation, whereas mRNA-seq represents all applications where mRNA is enriched. In 2016, the number of Total RNA-seq projects was half that of this year. We attribute this to a growing interest in non-coding RNA and the availability of higher throughput sequencing runs. As sequencing costs drop and rRNA depletion becomes more affordable, researchers are asking for more biological information. Today, the NextSeq and HiSeq 3000/4000 are the most commonly used instruments for any RNA-seq application. Counting applications continue to dominate, although requests for de novo transcriptome alignments have been steadily rising compared with the previous year. Whereas in the past 1×50 and 1×75 were the most frequently requested read lengths for RNA counting applications, around twice as many researchers are requesting paired-end sequencing as last year.

Methylation analysis

Compared to last year, there is increased interest in WGBS relative to RRBS and MeDIP. With the advent of the HiSeq X and its compatibility with WGBS applications, more researchers are finding whole genome based applications easier and more informative than reduced representation bisulfite sequencing.

Instrument trends

By far the biggest trend this year was the number of long read requests on the PacBio Sequel. Whereas in the past mate-pair library prep was more popular, we’re starting to see this service decline and long read sequencing being ordered more frequently. Hybrid Illumina/PacBio reads are also being ordered more frequently to improve the quality of assemblies. Long reads are being requested to detect functional elements in human genomes that are missed by short-read sequencing. We should add that requests for 10X Genomics services have started to increase, although the numbers are too small right now to make any meaningful comments. We currently don’t have providers offering Oxford Nanopore services on Genohub, so we can’t comment here either.

This month NovaSeq services are expected to be available on Genohub. We expect there to be a lag phase as kinks are worked out, before this becomes a popular instrument request.

The future

Having spent the last 4 years receiving sequencing requests and performing consultation, it’s clear to us that new technology does influence behavior. With reduced sequencing costs, we see clients not only including more controls and replicates, but also looking at RNA-seq from a more global perspective and becoming more interested in long reads. Clients that previously only performed exome-seq are now turning to whole genome sequencing on the HiSeq X. Researchers that normally only look at coding RNAs are starting to show interest in long non-coding and small RNAs. Overall, faster and cheaper sequencing does tend to promote better science. Gone are the n=1 days of sequencing.

Beginner’s Handbook to High Throughput Sequencing

As sequencing becomes more ubiquitous, we find researchers struggling with concepts like ‘paired-end’, designing a custom sequencing primer, cluster density, and technical library prep details, such as why small RNA and mRNA can’t both be prepared in the same library and sequenced. This is partially the fault of industry (e.g. are 100M ‘paired-end reads’ comprised of 200M, 100M or 50M single reads? We like to denote this as 100M paired-end reads, 50M reads in each direction), and partially due to all the moving parts: new sequencing and library prep chemistries, technology jargon and complexities in data analysis.
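To make the read-counting convention concrete, here is a minimal sketch that spells out what “100M paired-end reads (50M reads in each direction)” implies. The function name and dictionary keys are hypothetical, chosen only for this illustration.

```python
def describe_paired_end(total_reads, read_length_bp):
    """Break a paired-end read count down using the convention that
    `total_reads` counts both directions together."""
    if total_reads % 2:
        raise ValueError("a paired-end read count should be even")
    per_direction = total_reads // 2
    return {
        "total_reads": total_reads,
        "reads_per_direction": per_direction,
        # each sequenced fragment (cluster) yields one forward and one reverse read
        "fragments": per_direction,
        "total_bases": total_reads * read_length_bp,
    }


summary = describe_paired_end(100_000_000, 150)
print(summary["reads_per_direction"])  # 50000000
print(summary["total_bases"])          # 15000000000
```

Other vendors count fragments rather than individual reads, so it is worth confirming which convention a quote uses before comparing prices.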

Seeing first-time researchers struggle (on hundreds of sequencing projects), we put together a guide to help the sequencing novice get a strong foothold when starting a sequencing project. This guide is called our Beginner’s Handbook to Next Generation Sequencing.

The guide is broken up into four main sections:

  1. Sequencing instruments and design of a sequencing project
  2. Library prep
  3. Sample isolation
  4. Providers we recommend you contact for analyzing your data

Whether you are new to NGS or an experienced NGS user, we recommend you check it out and ask questions. We’ll be updating the guide on a regular basis, so if you have recommendations, please post them here. Thanks!

 

 

RNA-Seq considerations when working with nucleic acid derived from FFPE

RNA-seq from FFPE samples

Millions of formalin-fixed paraffin-embedded (FFPE) tissue sections are stored in oncology tissue banks and pathology laboratories around the world. Formalin fixation followed by paraffin embedding has historically been a popular preservation method in histological studies, as morphological features of the original tissue remain intact. However, for RNA-seq or other gene expression methods, formalin fixation and paraffin embedding can degrade and modify RNA, complicating retrospective analysis of samples archived this way.

During the fixation and embedding process RNA is affected in the following ways:

  1. Degradation of RNA to short ~100 base fragments as a result of sample treatment during fixation or long term storage in paraffin.
  2. Formaldehyde modification of RNA. Formaldehyde modification can block base pairing and can cause cross-linking to other macromolecules. These RNA modifications include hydroxymethyl and methylene bridge cross-links on amine moieties of adenine bases.
  3. High variability in the degree of RNA degradation and modification in FFPE samples precludes transcriptomic similarity and gene expression correlation studies, or simply forces researchers to exclude certain samples.
  4. Oligo-dT approaches are not recommended when amplifying RNA, as most RNA fragments derived from FFPE no longer contain a poly(A) tail, making rRNA depletion a necessary first step prior to RNA-seq.

If formalin fixation and paraffin embedding can’t be avoided, Ahlfen et al. nicely summarize best practices for improving RNA quality and yield from FFPE samples. These include:

  1. Starting fixation promptly and cutting samples into thin pieces to avoid tissue autolysis.
  2. Reduction of fixation time (< 24 hours) to reduce irreversible cross-linking and RNA fragmentation during storage of FFPE blocks.
  3. Utilizing a method to reverse cross-linking during RNA isolation, such as heating RNA to remove some formaldehyde cross-links. Reactions of formaldehyde with amino groups in bases and proteins are largely irreversible and inhibit cDNA synthesis.
  4. Use of an rRNA depletion step and random priming as opposed to oligo-dT based reverse transcription.
  5. RNA QC methods such as a measurement of RNA integrity or one of several RT-PCR based kits to qualify a sample prior to RNA-seq.

Despite these challenges, FFPE samples are frequently used in transcriptomic studies and in many cases correlate nicely with fresh frozen samples (Hedegaard et al., 2014; Li et al., 2014; Zhao et al., 2014). The study of somatic mutations continues to remain a challenge in FFPE tissue due to fragmentation and the presence of artifacts. Nevertheless, RNA molecules from FFPE are being used regularly for investigating both non-coding and coding parts of the genome.

If you have FFPE blocks or total RNA and would like to perform gene expression analysis by RNA-Seq, we recommend you start with a NGS service provider who has specific experience with FFPE RNA isolation, QC, library preparation, sequencing and data analysis. Providers with this experience can be found using this search on Genohub: https://genohub.com/ngs/?r=mt3789#q=4c5f2d036f.

 

Accurate measurement of error rate and base quality in Illumina sequencing runs

With new instrumentation, cluster chemistries, software updates and continuously updated library preparation reagents, accurately monitoring sequencing run quality has become increasingly difficult. In a recent paper, Manley et al. (2016) develop an open source tool called the Percent Perfect Reads (PPR) plot to monitor base quality.

PPR uses PhiX alignment and calculates percent of reads with 0–4 mismatches.  A PPR plot contains a cycle-by-cycle representation of the percentage of reads with mismatches. PPR was originally introduced with the original Genome Analyzer and retired in 2014.
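The idea behind a PPR table can be sketched in a few lines. This is not the authors’ Perl/R program: it assumes reads have already been aligned to PhiX and paired with the matching reference segment, and it assumes mismatches are counted cumulatively up to each cycle. The function name `ppr_table` is hypothetical.

```python
def ppr_table(aligned, max_mismatches=4):
    """aligned: list of (read, reference_segment) string pairs of equal length.

    Returns table[cycle][k] = percent of reads with exactly k mismatches
    accumulated from the first cycle through `cycle` (0-based),
    for k in 0..max_mismatches.
    """
    if not aligned:
        raise ValueError("no aligned reads given")
    n_cycles = len(aligned[0][0])
    counts = [[0] * (max_mismatches + 1) for _ in range(n_cycles)]
    for read, ref in aligned:
        if len(read) != n_cycles or len(ref) != n_cycles:
            raise ValueError("all reads and reference segments must share one length")
        mismatches = 0
        for cycle, (read_base, ref_base) in enumerate(zip(read, ref)):
            if read_base != ref_base:
                mismatches += 1
            # reads with more than max_mismatches errors drop out of the table
            if mismatches <= max_mismatches:
                counts[cycle][mismatches] += 1
    total = len(aligned)
    return [[100.0 * c / total for c in row] for row in counts]


# Toy example: four 4-cycle reads against the same reference segment.
toy = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("TCGT", "ACGT"), ("ACGT", "ACGT")]
table = ppr_table(toy)
print(table[0][0])  # 75.0 percent of reads are perfect after cycle 1
print(table[3][0])  # 50.0 percent are still perfect after cycle 4
```

Plotting column 0 of the table against cycle number gives the “percent perfect” curve; a sudden drop at one cycle points to a cycle-specific problem, while a steady decline is closer to normal error accumulation.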

PPR was developed as an alternative to the Phred-like Q score for determining run quality and has the following advantages:

  1. PPR is independently calculated, unlike Illumina’s Q Score which is calculated with instrument dependent variables (vary by instrument, chemistry, software)
  2. PPR is a direct measure of error, unlike Q scores, which rely on a table of data generated under ideal sequencing circumstances
  3. Q scores tend to overestimate quality
  4. Unlike with Q scores, PPR allows the user to identify the source of sequencing error

By examining a PPR profile, the following issues are distinguishable:

  1. Adapter read through (sequencing cycles are longer than the library insert and the run reads through the adapter sequence)
  2. Repetitive or low diversity sequences
  3. Imaging problems
  4. Over/under clustering
  5. Chemistry problems (cluster reagents are not working properly)

The PPR plot program is compatible with HiSeq 2000/2500, NextSeq 500, and MiSeq instruments. It’s written in Perl and R, and accepts FASTQ files as input. The PPR software package is available at http://openwetware.org/wiki/BioMicroCenter:PPR_Program (BioMicro Center, Massachusetts Institute of Technology, Cambridge, MA, USA).

 

Illumina unveils NovaSeq 5000 and 6000

Illumina NovaSeq

Today, at the annual J.P. Morgan Healthcare Conference, Illumina announced the release of a new series of instruments called NovaSeq. Continuing the use of ExAmp cluster amplification and patterned nanowells that form the basis of HiSeq 3000/4000, HiSeq X Ten and HiSeq X Five flow cell technology, Illumina further reduced the spacing between nanowells to increase cluster density and data output. In the end, this promises to produce ~2-3x more reads than a single 8 lane HiSeq X flow cell.

Here are the specs available on day 1 of launch:

Number of instruments being launched: 2; NovaSeq 5000 and 6000

Non-technical application based restrictions: No. Unlike the HiSeq X Ten or HiSeq X Five, these instruments will not have application based restrictions. Illumina plans to continue restricting HiSeq X instruments to WGS applications (1).

Potential technical based restrictions: Notable is the absence of Nextera based DNA or Nextera Exome in the list of compatible library preparation kits. Mate-pair based Nextera kits are however listed as compatible (2). This may indicate there are template (library) size restrictions on this instrument (similar to HiSeq 3000/4000 and HiSeq X).

Instrument availability: NovaSeq 6000 will begin shipping in March 2017 and NovaSeq 5000 will begin shipping mid-2017.

Anticipated availability on Genohub: In April 2017, researchers will be able to order NovaSeq based sequencing. This hinges on on-time instrument delivery to our partnering service providers.

Instrument cost: NovaSeq 5000 and 6000 Systems are priced at $850,000 and $985,000 respectively

Target Market: Research labs that cannot afford the capital cost of a HiSeq X Five or HiSeq X Ten System and don’t want to deal with the restrictions. HiSeq X Five and Ten systems are restricted from running RNA-seq or exome based libraries.

Other updates: RFID added to make sure loading is done properly, reduction in the number of steps in a sequencing workflow (from 38 to 8) (1) and flow cell loading is automated.

Cbot or onboard clustering: onboard

Tunable output: 4 flow cells are available. NovaSeq S1 and S2 flow cells are compatible with both NovaSeq 5000 and 6000 systems, while NovaSeq S3 and S4 are exclusive to NovaSeq 6000 instruments.

Two color or Four color chemistry: Two color, like the NextSeq 500

Number of lanes: S1 and S2 have two lanes whereas S3 and S4 have four lanes

Available read lengths: 2×50, 2×100 and 2×150

Run times: < 19, 29 and 40 hours for 2×50, 2×100 and 2×150 bp read lengths respectively

Output: 

Instrument and flow cell | Reads per flow cell (billion)* | Output from 2×150 bp run (Gb)*
NovaSeq 5000/6000 S1 | 1.6 | 500
NovaSeq 5000/6000 S2 | 3.3 | 1000
NovaSeq 6000 S3 | 6.6 | 2000
NovaSeq 6000 S4 | 10 | 3000

*Output and read numbers based on a single flow cell
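The relationship between the two columns can be checked with simple arithmetic. The sketch below assumes that “reads per flow cell” counts clusters, each read from both ends in a 2×150 run; under that assumption the computed output matches the listed figures to within rounding. The function names, the 3 Gb genome size and the 30x target are illustrative assumptions.

```python
# Reads per flow cell in billions, copied from the launch specs above.
FLOW_CELL_READS_BILLION = {"S1": 1.6, "S2": 3.3, "S3": 6.6, "S4": 10.0}


def output_gb(reads_billion, read_length_bp=150, paired=True):
    """Approximate output in Gb: billions of clusters x bases read per cluster."""
    return reads_billion * read_length_bp * (2 if paired else 1)


def genomes_per_flow_cell(reads_billion, genome_size_gb=3.0, coverage=30):
    """Whole genomes per flow cell at `coverage`, ignoring duplicates and QC loss."""
    return int(output_gb(reads_billion) // (genome_size_gb * coverage))


for name, reads in FLOW_CELL_READS_BILLION.items():
    print(name, round(output_gb(reads)), "Gb,", genomes_per_flow_cell(reads), "genomes at 30x")
```

For the S4 flow cell this gives 3000 Gb, in line with the table; the smaller flow cells come out slightly under the rounded values listed (for example 480 Gb versus 500 Gb for S1).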

Number of flow cells that can be run at once: 1 or 2 flow cells can be run on either the NovaSeq 5000 or 6000

So what does this mean for the sequencing industry? Clearly the NovaSeq was launched to target research labs that can’t afford the capital costs of the HiSeq X series but want to upgrade from their current HiSeq instruments. NovaSeq S3 and S4 flow cells promise to produce 2-3x more reads than a single 8 lane HiSeq X flow cell (2.6-3 billion reads). Of course, if a NovaSeq run is priced 2-3x higher than a HiSeq X flow cell, the cost of sequencing a genome will be the same. This will become clearer when reagent pricing is available.

2016 was a tough year for Illumina as it lost one third of its value. As Illumina launches another instrument geared for the research market, much continues to hinge on federally funded research grants to fuel growth. A focus on developing clinical applications and insurance-reimbursable tests, along with a global shift toward diagnostics, is going to be required for sustained growth. ‘Market generation’ activities, such as the Helix and Grail initiatives, are steps in this direction.

Clustering densities for standard and non-standard library preparation applications

[Figure: Illumina cluster generation]

Illumina sequencing follows three very simple steps:

  1. Libraries are prepared from DNA or RNA samples
  2. Single DNA template molecules are bridge amplified to form clonal clusters inside a flow cell
  3. Clusters are sequenced by massively parallel sequencing by synthesis

Template molecules are immobilized on a flow cell surface and amplified by isothermal bridge amplification to create individual, dense clonal clusters containing ~2,000 molecules each (see figure above).

The exact density of these clusters can influence:

  1. Run quality
  2. Reads passing filter
  3. Q30 scores
  4. Total number of reads

This makes proper loading of an Illumina flow cell crucial to the success of a sequencing run.

In a recent guide, we review recommended loading concentrations and cluster densities for each Illumina instrument. See a summary in Table 1 below:

Illumina flow cell loading recommendations by instrument

While this table includes recommendations for standard library applications where libraries are sufficiently diverse, researchers shouldn’t follow these recommendations for libraries that have poor diversity. Sequence diversity refers to the balance of nucleotides (A, T, G, C) at each position of a template library. Applications where you should load a library at a concentration below Illumina’s standard recommendations include:

  1. Any amplicon based library where primers are included in the read insert
  2. GBS or RAD-seq libraries that start with a similar restriction site
  3. 16S or 18S libraries that start with the same primer or variable domain sequence
  4. MeDIP or other low diversity targeting approach
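Sequence diversity as defined above can be measured directly from a handful of reads. The sketch below tallies base composition at each read position and flags positions dominated by one nucleotide. The function names and the 0.6 cutoff are arbitrary assumptions for illustration, not an Illumina specification.

```python
from collections import Counter


def base_fractions_by_position(reads):
    """Fraction of A, C, G and T at each position across `reads`
    (positions beyond the shortest read are ignored)."""
    if not reads:
        raise ValueError("no reads given")
    length = min(len(r) for r in reads)
    fractions = []
    for pos in range(length):
        counts = Counter(r[pos].upper() for r in reads)
        total = sum(counts[b] for b in "ACGT")  # N calls are left out
        fractions.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return fractions


def low_diversity_positions(reads, max_fraction=0.6):
    """Positions where a single base exceeds `max_fraction` of calls."""
    return [
        pos
        for pos, frac in enumerate(base_fractions_by_position(reads))
        if max(frac.values()) > max_fraction
    ]


# Toy amplicon-like reads: the first three bases are a shared primer.
reads = ["ACGTA", "ACGTC", "ACGAG", "ACGCT"]
print(low_diversity_positions(reads))  # [0, 1, 2]
```

In an amplicon, GBS or 16S library the flagged positions sit at the start of the read, which is exactly where the instrument needs balanced signal to register clusters.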

If you’re working with a non-standard library preparation application or one where libraries have poor sequence diversity, submit a request here: genohub.com/ngs and a scientist will recommend flow cell loading concentrations.

 

 

Assessing CLIA / CAP Certified Next Generation Sequencing Facilities

According to the Centers for Medicare and Medicaid Services (CMS), Clinical Laboratory Improvement Amendments (CLIA) registration is required for entities that perform even a single test on “materials derived from the human body for the purpose of providing information for the diagnosis, prevention or treatment of any disease or impairment of, or the assessment of the health of, human beings”.

To date, only two next generation sequencing (NGS) instruments/tests have been approved or cleared by the FDA. All other NGS based tests are developed in house as laboratory developed tests (LDTs), and are regulated under CLIA. CLIA regulations are required to certify the validity of a test. Validity is established by measuring:

  1. Accuracy
  2. Precision
  3. Analytical sensitivity and specificity
  4. Reportable reference range or interval

For next generation sequencing tests this means several sequencing based metrics are required:

Assessment Test | Next Generation Sequencing Specification | Sample Material
Accuracy | Coverage and Quality or Phred Scores | Known variants (SNP, indel) in targeted region
Precision | Sequence replication and coverage distribution between different operators and instruments | Reference with known variants
Specificity | False positive rate, degree with which a false variant is identified at a specific coverage threshold | Several samples with well characterized targets
Sensitivity | Likelihood test detects known variant | Several samples with well characterized targets
Reportable Range | Intron buffer and exon region of one or more genes | Target material with repeat regions, indels, allele drop outs
Reference interval | Sequence variation background measurement | Derived from an unaffected population, same as patient
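For the sensitivity and specificity rows, the underlying calculation is the standard confusion-matrix arithmetic. The sketch below uses made-up counts and a hypothetical function name; it is not a CLIA or CAP prescribed procedure. Note that “precision” in the table refers to reproducibility across operators and instruments, which is a different quantity from the positive predictive value computed here.

```python
def variant_call_metrics(tp, fp, fn, tn):
    """Standard metrics from counts of true/false positive/negative variant calls
    made on samples with well characterized targets."""
    return {
        "sensitivity": tp / (tp + fn),          # known variants that were detected
        "specificity": tn / (tn + fp),          # non-variant sites correctly called
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "positive_predictive_value": tp / (tp + fp),
    }


# Made-up validation counts, for illustration only.
metrics = variant_call_metrics(tp=95, fp=2, fn=5, tn=998)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Repeating the calculation at several coverage thresholds shows how the false positive rate changes with depth, which is what the specificity row of the table asks a lab to document.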

In addition to CLIA, the College of American Pathologists (CAP) has several specific guidelines for NGS labs. These include consideration for validated sample extraction, library preparation, barcoding, pooling and target enrichment. Each protocol has specific quality metrics associated with it. In addition to the wet lab, bioinformatics pipelines must be validated and tested for how precisely and sensitively variants are called.

Clinical regulation of NGS based tests is undergoing rapid change as new NGS tests enter the clinic and older ones are improved. As these changes happen, both CAP and CLIA requirements for NGS are updated on a yearly basis.

The most common NGS based assays or tests performed in a CLIA/CAP setting today include:

  1. Exome sequencing
  2. NGS gene panel sequencing
  3. Whole genome sequencing
  4. Cell free DNA sequencing
  5. Metagenomic sequencing

Genohub has existing relationships with 7 service providers offering nucleic acid extraction, library preparation, sequencing and data analysis under CLIA and CAP. To obtain NGS services under CLIA/CAP accreditation, submit a request here: https://genohub.com/ngs.

Isolation of cell free / circulating tumor DNA from plasma

Biomarkers that indicate the presence of disease are highly sought after. Non-invasive methods to measure those biomarkers are even more valuable. By extracting and measuring cell-free DNA, scientists can satisfy both.

Cell free DNA (cfDNA) consists of degraded DNA fragments released into plasma. Elevated levels of cfDNA are found in cancer states, making assessment of somatic genomic alterations from tumors possible using sequencing. Cell free fetal DNA (cffDNA) can be found as early as 7 weeks gestation, and analysis of cffDNA is already being used in non-invasive prenatal diagnostics. cfDNA in blood was first described by Mandel and Metais in 1948 [1] but only recently has been identified as having utility for prenatal testing and disease diagnostics and monitoring.

Unlike mutations that are passed from a parent to child and are in every cell of your body, somatic mutations form during a person’s life. These somatic mutations are present in tumor cell DNA and are an excellent biomarker if they can be measured and monitored.

Acquiring tumor DNA often requires a biopsy, a potentially risky and invasive procedure. In many cases, access to the tumor or the ability to biopsy it is not even an option for the patient. During tumor turnover and progression, apoptotic and necrotic cells release small pieces of their DNA (cfDNA) into the bloodstream. The amount of cfDNA in the bloodstream is influenced by clearance and filtering of the blood and lymphatic circulation.

Detecting cfDNA in plasma is called a ‘liquid biopsy’ and is already a popular method for obtaining clinical samples for prenatal testing, disease diagnostics and monitoring. The challenges of liquid biopsies include standardizing the isolation procedure and maintaining uniform specificity and sensitivity. Extraction of cfDNA can be carried out using magnetic beads or silica matrices along with chaotropic salts, such as guanidine thiocyanate. While several commercial approaches (Table 1) exist, none have undergone rigorous, large-scale patient studies. Once more information is known, universal standardization should allow greater clinical utility.

Commercial kits for extraction of cfDNA need to be designed to extract uniform DNA copies from varying biopsy volumes. Scalability and adaptability for cell free fetal and ctDNA are important considerations. Below we highlight current kits available in the market. In a future blog post we’ll discuss isolation and sequencing standardizations required for broader use of cfDNA liquid biopsy.

Table 1.

Kit | Company | Method | Digestion | Prep Time (min) | Plasma Volume (mL) | Elution (uL) | DNA sizes (bp)
NextPrep-Mag | Bioo Scientific | Mag Beads | Proteinase K (optional) | 30 | 1 – >5 | 12 | >50
Chemagic cfNA | Chemagen | Mag Beads | Proteinase K | 120 | 2 – 10 | 60 | >100
MagMAX Cell Free DNA Kit | Thermo Fisher | Mag Beads | Proteinase K (optional) | 40 | 1 – >5 | 15 | >50
QIAamp | Qiagen | Column | Proteinase K | 120 | 1 – 5 | 20 | >70
Quick-cfDNA | Zymo Research | Column | Proteinase K | 60 | 3 – 10 | 35 | >100

Targeted gene panels vs. whole exome sequencing

One frequent question we hear on Genohub is, ‘should I make a custom panel for this gene set, or not bother and do whole exome sequencing?’. While whole genome sequencing approaches can capture all possible mutations, whole exome or targeted gene panel sequencing are cost-effective approaches for capturing phenotype altering mutations. We go into the advantages of WGS vs. WES in an earlier blog post. A remaining question, however, is which targeting approach is best. We attempt to address this here:

Advantages of targeting all exons – whole exome sequencing (WES)

If your study is discovery based, in other words you don’t know what genes you need to target, WES is the obvious choice.

  • Better for discovery based applications where you’re not sure what genes you should be targeting.
  • Exome panels are commercially available, they don’t need to be customized or designed.
  • Exome sequencing services are fairly standard, costs range between $550-800 for 100-150x mean on target coverage.

Advantages of targeted gene panels (amplicon-seq or targeted hybridization methods)

Targeted gene panels are ideal for analyzing specific mutations or genes that have suspected associations with disease.

  • Focusing on individual genes or gene regions allows you to sequence at a much higher depth than exome-seq, e.g. 2,000-10,000x as opposed to 200x which is typical with exome-seq.
  • High depth sequencing enables the identification of rare variants
  • Can be customized for different sample types, e.g. FFPE, cf/ctDNA, degraded samples.
  • Lower input amounts can be used with targeted gene panels (1 ng vs. 100 ng with whole exome sequencing).
  • Gene panels can be customized to only include genomic regions of interest. Why sequence everything when you don’t need that extra information?
  • Panels can be easily designed for non-human species. Designing a non-human exome is much more laborious.
  • Gene panel workflows are a lot simpler and time to results is often as little as 1-2 days.
  • You can process thousands of samples on a single sequencing run. Targeted gene panels can be run at a higher throughput and are often more cost-effective than whole exome sequencing.

By focusing on genes likely to be involved with disease, you can reduce expense and focus sequencing resources on your targeted region. However, if you only have a few samples that you need to sequence at a low depth of coverage, consider whether it’s worth designing a panel vs. performing whole exome sequencing using an existing commercial panel.
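The depth advantage of a small panel follows directly from the ratio of sequenced bases to target size. The sketch below uses illustrative numbers only: the 50 Mb exome target, 0.5 Mb panel, 70% on-target rate and 20M read pairs are assumptions, not figures from any specific kit or provider.

```python
def mean_on_target_depth(read_pairs, read_length_bp, target_size_bp,
                         on_target_fraction, paired=True):
    """Mean depth over the target = on-target bases / target size."""
    bases = read_pairs * read_length_bp * (2 if paired else 1)
    return bases * on_target_fraction / target_size_bp


READ_PAIRS = 20_000_000  # same sequencing budget for both designs (assumed)
exome = mean_on_target_depth(READ_PAIRS, 100, 50_000_000, 0.7)
panel = mean_on_target_depth(READ_PAIRS, 100, 500_000, 0.7)
print(f"exome: ~{exome:.0f}x, panel: ~{panel:.0f}x")
```

With the same number of reads, a target 100 times smaller is covered 100 times deeper. The same arithmetic run in reverse shows why many panel samples can share one run: each needs only a small slice of the reads to reach a given depth.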

If you’re interested in designing a custom gene panel or already have an existing panel you’d like to sequence, submit a request describing your project or view several of the existing commercially available panels here.

Guides to improve your next generation sequencing project

read length, output and instrument recommendations for next generation sequencing

If you’re new to next generation sequencing or if you’re simply looking for tips to improve your next project, we recommend you take some time to look at the guides available on Genohub. As researchers order sequencing services it’s completely normal for there to be numerous questions related to nucleic acid extraction, library prep and best practices for loading a sequencing instrument. Over the years, we’ve curated these questions and published guides to help those embarking on their next NGS project. Topics covered include: library prep applications, batch effects, optimal cluster densities, read lengths and instrument output.

Next generation sequencing is a tool that can be applied to answer any number of questions related to the genome, transcriptome or epigenome. Regardless of the organism being sequenced or the library method used to prepare nucleic acid from that organism, the fundamentals of how a sequencing platform works are similar across all samples. There are currently four main sequencing platforms that researchers regularly use: Illumina, Ion, PacBio and Oxford Nanopore. The guides below tend to be Illumina focused because that’s the platform most people are using today. Despite that, we review the read throughput of each available instrument and discuss hybrid methodologies where short and long reads from two different instruments are combined to improve assemblies.

Guides for sequencing

  1. Designing a sequencing project
  2. Recommended coverage by library preparation application
  3. Comparison of instrument read lengths and read outputs

Guides for preparing your samples

  1. Best practices for shipping tissue and nucleic acid 
  2. Library preparation kits and tips

Guides by application

  1. Transcriptome and mRNA-Seq
  2. Genome sequencing and re-sequencing
  3. Exome
  4. Metagenomics 
  5. Small RNA (microRNA)
  6. WGS vs. WES

Tips and considerations for commonly used sequencing instruments

  1. HiSeq X
  2. HiSeq 3000/4000
  3. NextSeq and low diversity

These are evolving guides, meaning our goal is to continuously improve them. If you have any feedback or would like to contribute please send us a message. We hope these guides will be helpful in designing your next NGS run. If you have technical questions related to an upcoming NGS project, feel free to submit them on our consultation page.