Accurate measurement of error rate and base quality in Illumina sequencing runs

With new instrumentation, cluster chemistries, software updates and continuously updated library preparation reagents, accurately monitoring sequencing run quality has become increasingly difficult. In a recent paper, Manley et al. (2016) develop an open source tool, the Percent Perfect Reads (PPR) plot, to monitor base quality.

PPR uses alignment to PhiX and calculates the percentage of reads with 0–4 mismatches. A PPR plot shows this percentage cycle by cycle. The PPR metric was first introduced with the original Genome Analyzer and retired in 2014.
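To make the calculation concrete, here is a minimal Python sketch of a cycle-by-cycle PPR table. It is not the authors' Perl/R implementation. The input format (a list of mismatch cycles per aligned PhiX read) and the function name `ppr_by_cycle` are hypothetical, and the sketch assumes that "reads with k mismatches" means reads with at most k mismatches up to a given cycle.

```python
from collections import Counter


def ppr_by_cycle(mismatch_cycles_per_read, n_cycles, max_mismatches=4):
    """Return {k: [percent of reads with <= k mismatches through each cycle]}.

    mismatch_cycles_per_read: one list per aligned read, holding the 1-based
    cycles at which the read differs from the PhiX reference. This input
    format is an assumption made for illustration.
    """
    n_reads = len(mismatch_cycles_per_read)
    if n_reads == 0:
        raise ValueError("need at least one aligned read")

    result = {k: [] for k in range(max_mismatches + 1)}
    for cycle in range(1, n_cycles + 1):
        # Count how many reads have accumulated m mismatches by this cycle.
        counts = Counter(
            sum(1 for c in cycles if c <= cycle)
            for cycles in mismatch_cycles_per_read
        )
        for k in range(max_mismatches + 1):
            within = sum(n for m, n in counts.items() if m <= k)
            result[k].append(100.0 * within / n_reads)
    return result


# Four toy reads over three cycles: two perfect reads, one with a mismatch
# at cycle 3, and one with mismatches at cycles 1 and 2.
toy = ppr_by_cycle([[], [], [3], [1, 2]], n_cycles=3)
print(toy[0])  # percent of perfect reads at cycles 1, 2, 3
```

Plotting each `result[k]` against cycle number gives curves in the spirit of a PPR plot; a sudden drop at a particular cycle points to where errors start accumulating.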

PPR was developed as an alternative to the Phred-like Q score for determining run quality and has the following advantages:

  1. PPR is calculated independently, unlike Illumina’s Q score, which depends on instrument-specific variables (these vary by instrument, chemistry and software)
  2. PPR is a direct measure of error, unlike Q scores, which rely on a table of data generated under ideal sequencing conditions
  3. Q scores tend to overestimate quality
  4. Unlike Q scores, PPR allows the user to identify the source of sequencing error
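For context, a Phred-scaled Q score encodes an error probability P as Q = -10 · log10(P), so Q30 corresponds to one expected error in 1,000 bases. The sketch below converts an observed mismatch rate from a PhiX alignment into an empirical Q value that can be compared with the instrument-reported score. The function names and the example counts are illustrative assumptions, not part of the PPR package.

```python
import math


def phred_from_error_rate(p_error):
    """Phred-scaled quality for an error probability: Q = -10 * log10(P)."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def empirical_q(mismatches, aligned_bases):
    """Empirical Q from observed mismatches against a known reference."""
    return phred_from_error_rate(mismatches / aligned_bases)


# Hypothetical counts: 5 mismatches in 1,000 aligned PhiX bases.
print(round(empirical_q(5, 1000), 1))
```

If the reported Q score for those bases is well above the empirical value, the reported score is overestimating quality.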

By examining a PPR profile, the following issues are distinguishable:

  1. Adapter read through (the number of sequencing cycles exceeds the library insert length, so the run reads into the adapter sequence)
  2. Repetitive or low diversity sequences
  3. Imaging problems
  4. Over/under clustering
  5. Chemistry problems (cluster reagents are not working properly)

The PPR plot program is compatible with HiSeq 2000/2500, NextSeq 500, and MiSeq instruments. It’s written in Perl and R, and accepts FASTQ files as input. The PPR software package is available at http://openwetware.org/wiki/BioMicroCenter:PPR_Program (BioMicro Center, Massachusetts Institute of Technology, Cambridge, MA, USA).

 

Illumina unveils NovaSeq 5000 and 6000

Today, at the annual J.P. Morgan Healthcare Conference, Illumina announced the release of a new series of instruments called NovaSeq. Continuing the use of ExAmp cluster amplification and the patterned nanowells that form the basis of HiSeq 3000/4000, HiSeq X Ten and HiSeq X Five flow cell technology, Illumina further reduced the spacing between nanowells to increase cluster density and data output. In the end, this promises to produce ~2-3x more reads than a single 8-lane HiSeq X flow cell.

Here are the specs available on day 1 of launch:

Number of instruments being launched: 2; NovaSeq 5000 and 6000

Non-technical application based restrictions: No. Unlike the HiSeq X Ten or HiSeq X Five, these instruments will not have application based restrictions. Illumina plans to continue restricting HiSeq X instruments to WGS applications (1).

Potential technical restrictions: Nextera based DNA and Nextera Exome kits are notably absent from the list of compatible library preparation kits. Mate-pair based Nextera kits are, however, listed as compatible (2). This may indicate there are template (library) size restrictions on this instrument (similar to HiSeq 3000/4000 and HiSeq X).

Instrument availability: NovaSeq 6000 will begin shipping in March 2017 and NovaSeq 5000 will begin shipping mid-2017.

Anticipated availability on Genohub: In April 2017, researchers will be able to order NovaSeq based sequencing. This hinges on on-time instrument delivery to our partnering service providers.

Instrument cost: NovaSeq 5000 and 6000 Systems are priced at $850,000 and $985,000 respectively

Target Market: Research labs that cannot afford the capital cost of a HiSeq X Five or HiSeq X Ten System and don’t want to deal with the restrictions. HiSeq X Five and Ten systems are restricted from running RNA-seq or exome based libraries.

Other updates: RFID has been added to make sure loading is done properly, the number of steps in a sequencing workflow has been reduced from 38 to 8 (1), and flow cell loading is automated.

cBot or onboard clustering: onboard

Tunable output: 4 flow cells are available. NovaSeq S1 and S2 flow cells are compatible with both NovaSeq 5000 and 6000 systems, while NovaSeq S3 and S4 are exclusive to NovaSeq 6000 instruments.

Two color or Four color chemistry: Two color, like the NextSeq 500

Number of lanes: S1 and S2 have two lanes whereas S3 and S4 have four lanes

Available read lengths: 2×50, 2×100 and 2×150

Run times: < 19, 29 and 40 hours for 2×50, 2×100 and 2×150 bp read lengths respectively

Output: 

Instrument and flow cell    Reads per flow cell (billion)*    Output from 2×150 bp run (Gb)*
NovaSeq 5000/6000 S1        1.6                               500
NovaSeq 5000/6000 S2        3.3                               1000
NovaSeq 6000 S3             6.6                               2000
NovaSeq 6000 S4             10                                3000

*Output and read numbers based on a single flow cell
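
The output column follows from the read counts by simple arithmetic. The sketch below assumes that "reads per flow cell" counts clusters (read pairs), which is the interpretation that reproduces the listed outputs after rounding; the function name `output_gb` is illustrative.

```python
def output_gb(clusters_billion, read_length_bp, paired=True):
    """Approximate output in Gb: clusters (billions) x reads per cluster x bp."""
    reads_per_cluster = 2 if paired else 1
    return clusters_billion * reads_per_cluster * read_length_bp


# Flow cell read counts taken from the table above (billions).
for name, clusters in [("S1", 1.6), ("S2", 3.3), ("S3", 6.6), ("S4", 10)]:
    print(name, round(output_gb(clusters, 150)), "Gb from a 2x150 bp run")
```

Under this assumption S1 works out to roughly 480 Gb and S2 to roughly 990 Gb, consistent with the rounded 500 and 1000 Gb figures.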

Number of flow cells that can be run at once: 1 or 2 flow cells can be run on either the NovaSeq 5000 or 6000

So what does this mean for the sequencing industry? Clearly the NovaSeq was launched to target research labs that can’t afford the capital costs of the HiSeq X series but want to upgrade from their current HiSeq instruments. NovaSeq S3 and S4 flow cells promise to produce 2-3x more reads than a single 8-lane HiSeq X flow cell (2.6-3 billion reads). Of course, if a NovaSeq flow cell is priced 2-3x higher than a HiSeq X flow cell, the cost of sequencing a genome will be the same. This will become clearer when reagent pricing is available.
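
The break-even point in that argument can be written as a ratio. The sketch below works purely in relative units, because reagent pricing has not been announced; the ratios passed in are hypothetical inputs, not Illumina figures.

```python
def relative_cost_per_genome(read_ratio, price_ratio):
    """Cost per genome on the new flow cell relative to the reference one.

    read_ratio:  reads per new flow cell / reads per reference flow cell
    price_ratio: price of new flow cell / price of reference flow cell
    A result of 1.0 means the cost per genome is unchanged.
    """
    if read_ratio <= 0:
        raise ValueError("read_ratio must be positive")
    return price_ratio / read_ratio


# Hypothetical scenarios: 3x the reads at 3x the price, and at 2x the price.
print(relative_cost_per_genome(3, 3))
print(relative_cost_per_genome(3, 2))
```

Any value below 1.0 means the new flow cell sequences a genome more cheaply than the reference.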

2016 was a tough year for Illumina, as it lost one third of its value. As Illumina launches another instrument geared toward the research market, much continues to hinge on federally funded research grants to fuel growth. A focus on developing clinical applications and insurance-reimbursable tests, along with a global shift toward diagnostics, is going to be required for sustained growth. ‘Market generation’ activities, such as the Helix and Grail initiatives, are steps in this direction.

Clustering densities for standard and non-standard library preparation applications

[Figure: Illumina cluster generation]

Illumina sequencing follows three very simple steps:

  1. Libraries are prepared from DNA or RNA samples
  2. Single-molecule DNA templates are bridge amplified to form clonal clusters inside a flow cell
  3. Clusters are sequenced by massively parallel sequencing by synthesis

Template molecules are immobilized on a flow cell surface and amplified by isothermal bridge amplification to create individual, dense clonal clusters containing ~2,000 molecules each (see figure above).

The exact density of these clusters can influence:

  1. Run quality
  2. Reads passing filter
  3. Q30 scores
  4. Total number of reads

This makes proper loading of an Illumina flow cell crucial to the success of a sequencing run.

In a recent guide, we review recommended loading concentrations and cluster densities for each Illumina instrument. See the summary in Table 1 below:

[Table 1: Illumina flow cell loading recommendations by instrument]

While this table includes recommendations for standard library applications where libraries are sufficiently diverse, researchers shouldn’t follow these recommendations for libraries that have poor diversity. Sequence diversity refers to the balance of nucleotides (A, T, G, C) at each position of a template library. Applications where you should load a library at a concentration below Illumina’s standard recommendations include:

  1. Any amplicon based library where primers are included in the read insert
  2. GBS or RAD-seq libraries that start with a similar restriction site
  3. 16S or 18S libraries that start with the same primer or variable domain sequence
  4. MeDIP or other low-diversity targeting approaches
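
Sequence diversity as defined above can be checked directly from a sample of reads. The following is a minimal sketch that tabulates the fraction of each nucleotide at every read position and flags positions dominated by one base. The function names and the 0.6 cutoff are illustrative assumptions, not an Illumina threshold.

```python
from collections import Counter


def base_composition(reads, n_positions):
    """Fraction of A, C, G and T at each of the first n_positions cycles."""
    rows = []
    for pos in range(n_positions):
        counts = Counter(r[pos] for r in reads if len(r) > pos)
        total = sum(counts[b] for b in "ACGT")
        rows.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return rows


def low_diversity_positions(reads, n_positions, max_fraction=0.6):
    """0-based positions where one base exceeds max_fraction (assumed cutoff)."""
    composition = base_composition(reads, n_positions)
    return [i for i, row in enumerate(composition)
            if max(row.values()) > max_fraction]


# Toy amplicon-like reads that all start with the same base.
reads = ["ACGT", "AGCT", "ATTA", "AACG"]
print(low_diversity_positions(reads, n_positions=4))
```

In a balanced library each base sits near 25% at every position; amplicon, GBS/RAD-seq and 16S libraries typically show one base dominating the first cycles, which is the pattern this check picks up.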

If you’re working with a non-standard library preparation application or one where libraries have poor sequence diversity, submit a request here: genohub.com/ngs and a scientist will recommend flow cell loading concentrations.