Illumina HiSeq v4 Sequencing Services Yielding 1Tb Data / Week Now Available on Genohub

Read output per Illumina Lane

Latest Illumina Chemistry – Reads/Lane

In line with our efforts to democratize the latest high throughput sequencing technology, we’re pleased to announce the availability of HiSeq sequencing services with v4 chemistry. Any researcher, anywhere in the world can now order this sequencing service in a matter of minutes using

The new Illumina HiSeq version 4 chemistry allows for sequencing runs with 25% greater read length (2×125 for high-output runs) and 33% more clusters.  Running two full flow cells, users can expect to generate up to 1Tb of data per week, or 167 Gb per day. At an output of 250M reads per lane, users will need at least 2 lanes to achieve ~35x coverage of the human genome. While this isn’t the most efficient tech for sequencing whole human genomes (HiSeq X Ten only requires a single lane), it certainly is for exome, transcriptome and re-sequencing applications.  Earlier this year we announced the availability of NextSeq 500 and HiSeq X Ten services on
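
The lane and coverage figures above can be checked with a rough calculation. The sketch below is only a back-of-the-envelope estimate: the 3.1 Gb genome size, the reading of "250M reads per lane" as 250M clusters sequenced from both ends, and the function names are our assumptions, and it ignores reads lost to duplicates or failed mapping.

```python
# Back-of-the-envelope coverage estimate for a HiSeq v4 high-output lane.
# Assumptions (not stated in the post): a 3.1 Gb human genome, "250M reads"
# meaning 250M clusters read from both ends (2x125), and no reads lost to
# duplicates or failed mapping.
HUMAN_GENOME_BP = 3.1e9


def lane_yield_bp(clusters_per_lane, read_length, paired=True):
    """Raw bases produced by one lane."""
    reads_per_cluster = 2 if paired else 1
    return clusters_per_lane * reads_per_cluster * read_length


def raw_coverage(clusters_per_lane, read_length, lanes, genome_bp=HUMAN_GENOME_BP):
    """Average raw depth across the genome for a number of paired-end lanes."""
    return lanes * lane_yield_bp(clusters_per_lane, read_length) / genome_bp


one_lane = raw_coverage(250e6, 125, lanes=1)
two_lanes = raw_coverage(250e6, 125, lanes=2)
print(f"1 lane: {one_lane:.1f}x raw, 2 lanes: {two_lanes:.1f}x raw")
```

Under these assumptions one lane gives roughly 20x raw depth and two lanes roughly 40x, which is in line with needing at least 2 lanes for ~35x once some reads are discarded.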

The new HiSeq v4 chemistry not only improves output but also reduces the time a sequencing run takes to complete. With run times of only 6 days, we expect several of our HiSeq Rapid Run users to switch over and take advantage of outputs of 250M reads / lane in one week instead of 150M reads / lane in a single day. That said, it’s important to note that HiSeq Rapid read lengths have also increased to 2×250.

If you’re not sure exactly what platform, chemistry or read length is the most efficient for your application, use our Shop by Project page and enter the number of reads or coverage you need. We’ll display all your options!

Illumina’s Next Big Pivot

President of Illumina

In a recent article in MIT Technology Review, Francis de Souza, president of Illumina, is quoted as saying 228,000 human genomes will be sequenced this year (2014). He further estimates that this number will double every 12 months to reach 1.6 million genomes by 2017. In a March blog post we extrapolated 400,000 genomes in 2015 by estimating the throughput of Illumina instruments in the market, HiSeq X Ten projects initiated on Genohub and large population sequencing projects starting in the UK and other countries. That is pretty close to de Souza’s latest numbers.

80% of the genomes sequenced this year will be part of scientific research projects, making one wonder when ‘clinical genomes’ will be ready. To get there we’re going to need either greater throughput, higher coverage or lower costs. However, instead of focusing on reducing costs, Illumina is betting on simplified, targeted sequencing. According to de Souza, “It’s not clear you can get another order of magnitude out of this…people are saying the price is not the issue”. Rather than focusing on selling complex instruments, Illumina wants to become an everyday brand in hospitals. Illumina is actually in the process of simplifying its instruments and developing clinically relevant, targeted panels to be sold as FDA approved kits.

While targeted panels for research purposes are available today, most are not regulated. Illumina believes regulation is a necessary step the FDA will have to take in order for targeted sequencing to become more popular in the clinic. A fast-track way to get there is to work with pharmaceutical companies who are in the business of getting approval from the FDA. Last month, Illumina said it was developing a universal NGS-based oncology test with AstraZeneca, Janssen Biotech, and Sanofi for use as a companion diagnostic on its MiSeqDx platform. Today, Thermo Fisher announced plans to develop NGS-based tests for solid tumors on its Ion PGM Dx platform with Pfizer and GSK. At least in the short term, it looks like targeted re-sequencing will be a mainstay in the clinic while research-based WGS will guide targeted panel design.

Mycoplasma Contamination in your Sequencing Data

mycoplasma contamination

Mycoplasma, the bane of any cell culture lab’s existence, is a genus of bacteria characterized by the lack of a cell wall. With a relatively small genome, mycoplasma have limited biosynthetic capabilities, requiring a host to efficiently replicate. Inspired by a bout of mycoplasma contamination in their own lab, Anthony O Olarerin-George and John B Hogenesch from the University of Pennsylvania recently set out to determine how widespread mycoplasma contamination was in other labs by screening RNA-seq data deposited in the NCBI Sequence Read Archive (1). Their study estimates that ~11% of NCBI’s Gene Expression Omnibus (GEO) projects between 2012 and 2013 contain ≥ 100 reads / million reads mapping to mycoplasma’s small 0.6 Mb genome. They also reference a recent study (2) which suggests that 7% of the samples from the 1,000 Genomes project are contaminated. That is bad news if you’ve recently completed a large study and are wondering why you have so many unmapped reads. While most unmapped reads are likely from regions of the genome that haven’t been sequenced, reads mapping to mycoplasma should be taken seriously, as contamination can affect the expression of thousands of genes and slow cellular growth.

Preventing contamination in the first place along with routine monitoring is essential, but if you’ve already completed the sequencing end of your project you can start aligning your data to several completed mycoplasma genomes.
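
Once reads have been aligned to mycoplasma genomes, the 100 reads / million threshold used in the study can be applied with a few lines of code. This is a minimal sketch: the sample names and read counts are made up for illustration, and the function names are ours rather than part of any published pipeline.

```python
def reads_per_million(mapped_reads, total_reads):
    """Reads mapping to a target, scaled per million total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads * 1_000_000


def flag_contaminated(samples, threshold_rpm=100.0):
    """Return {sample: rpm} for samples at or above the threshold.

    `samples` maps a sample name to (mycoplasma_reads, total_reads).
    """
    flagged = {}
    for name, (myco_reads, total_reads) in samples.items():
        rpm = reads_per_million(myco_reads, total_reads)
        if rpm >= threshold_rpm:
            flagged[name] = rpm
    return flagged


# Hypothetical counts, for illustration only.
samples = {
    "sample_A": (4_200, 30_000_000),  # 140 reads / million
    "sample_B": (150, 28_000_000),    # about 5 reads / million
}
flagged = flag_contaminated(samples)
for name, rpm in flagged.items():
    print(f"{name}: {rpm:.0f} mycoplasma reads per million")
```

Only sample_A crosses the threshold here. A handful of mycoplasma reads per million may be alignment noise, so the cutoff you choose matters.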

With recent drops in cost, routine sequencing of cell culture samples has become more prevalent. If you’re interested in testing your cultures, start by searching for sequencing services and providers on Genohub

1) Assessing the prevalence of mycoplasma contamination in cell culture via a survey of NCBI’s RNA-seq archive. Anthony O Olarerin-George, John B Hogenesch doi:

2) Mycoplasma contamination in the 1000 Genomes Project. William B Langdon

Beginner’s Guide to Exome Sequencing

Exome Capture Kit Comparison

With decreasing costs to sequence whole human genomes (currently $1,550 for 35X coverage), we frequently hear researchers ask, “Why should I only sequence protein coding genes?”

First, WGS of entire populations is still quite expensive. These types of projects are currently only being performed by large centers or government entities, like Genomics England, a company owned by the UK’s Department of Health, which announced that it would sequence 100,000 whole genomes by 2017. At Genohub’s rate of $1,550/genome, 100,000 genomes would cost $155 million USD. This $155 million figure only includes sequencing costs and does not take into account labor, data storage and analysis, which are likely several fold greater.

Second, the exome, all ~180,000 exons, comprises less than 2% of all sequence in the human genome but contains 85-90% of all known disease-causing variants. A more focused dataset makes interpretation and analysis a lot easier.

Let’s assume you’ve decided to proceed with exome sequencing. The next step is to either find a service provider to perform your exome capture, sequencing and analysis or do it yourself. Genohub has made it easy to find and directly order sequencing services from providers around the world. Several of our providers offer exome library prep and sequencing services. If you’re only looking for someone to help with your data analysis, you can contact one of our providers offering exome bioinformatics services. Whether you decide to send your samples to a provider or make libraries yourself, you’ll need to decide on what capture technology to use, the number of reads you’ll need and what type of read length is most appropriate for your exome-seq project.

There are currently three main capture technologies available: Agilent SureSelect, Illumina Nextera Rapid Capture and Roche NimbleGen SeqCap EZ Exome. All three are in-solution based and utilize biotinylated DNA or RNA probes (baits) that are complementary to exons. These probes are added to genomic fragment libraries and, after a period of hybridization, magnetic streptavidin beads are used to pull down and enrich for fragmented exons. These three exome capture technologies are compared in a detailed table. Each kit has a different number of probes, probe length, target region, input DNA requirement and hybridization time. Researchers planning on exome sequencing should first determine whether the technology they’re considering covers their regions of interest. Only 26.2 Mb of total targeted bases are in common, and only small portions of the CCDS Exome are uniquely covered by each technology (Chilamakuri, 2014).

Our Exome Guide breaks down the steps you’ll need to determine how much sequencing and what read length is appropriate for your exome capture sequencing project.

rRNA Depletion / Poly-A Selection Responsible for Coverage Bias in RNA-seq

Using a pool of 1,062 in vitro transcribed (IVT) human cDNA plasmids, a group from the University of Pennsylvania sought to characterize coverage biases in RNA-seq experiments. Their paper, titled “IVT-seq reveals extreme bias in RNA-sequencing”, was published last week.

The authors cleverly use a carefully controlled set of IVT cDNA clones whose base composition and expression levels are known. Mixing the IVT set with mouse total RNA, they found > 2-fold differences in transcript coverage in 50% of their transcripts, with 10% showing up to 10-fold differences. When IVT cDNA clones are sequenced alone, in the absence of a complex genomic milieu, the authors acknowledge biases that arise from random priming, adapter ligation and amplification, but identify polyA selection and ribosomal depletion as the main causes of RNA coverage bias. In their experiment, they consider hexamer entropy, GC-content and sequence similarity to rRNA, and measure coverage variability as an indicator of coverage bias along with depth of coverage as measured by FPKM. They demonstrate a significant correlation between transcript similarity to rRNA and greater differences in coverage between libraries that undergo rRNA depletion and those that do not.

Overall, their method demonstrates that library preparation does introduce significant biases in RNA-seq data and that developing carefully controlled synthetic test transcripts allows users to accurately measure this bias. Development of these controlled sets will allow for further refinement of current library preparation practices.


Next Generation Sequencing Data Normalization

Chip-seq peak score normalization

A recent review, “Beyond library size: a field guide to NGS Normalization”, published last week, nicely summarizes the effect the choice of normalization technique can have on the number of genes called in differential expression experiments and peaks called in ChIP-seq. The article emphasizes a point we frequently try to convey to researchers beginning the analysis of their NGS data sets, namely that normalization methods depend on the data being analyzed, experimental conditions and how the experiment was performed. This article highlights several points researchers must take into consideration during normalization:

  1. Library size
  2. Technical variation amongst samples
  3. Biases during sequencing, e.g. longer fragments are sampled more frequently
  4. Preferential enrichment of specific sequences (ChIP-seq)

The challenge is that these considerations must be made while being careful not to mask real biological differences between samples.

Three popular normalization methods for RNA-seq data were compared: RPKM, library size (total count) and DESeq scaling factors. In terms of differentially expressed genes, DESeq and library size normalization identified >90% of the same genes, while RPKM identified a smaller number of genes. The RPKM method, however, as expected more closely matched the genome distribution, while DESeq and library size normalization were biased toward longer genes.
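
To make the three approaches concrete, here is a small pure-Python sketch of each. It is an illustration under our own assumptions, not the code used in the review: the toy counts and gene lengths are invented, the function names are ours, and the DESeq-style factor follows the median-of-ratios idea (median of log ratios to a per-gene geometric mean) rather than calling the DESeq package itself.

```python
import math
from statistics import median


def rpkm(gene_counts, gene_lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(gene_counts)
    return [c * 1e9 / (total * length)
            for c, length in zip(gene_counts, gene_lengths_bp)]


def total_count_factors(counts):
    """Library-size scaling: each sample's total relative to the mean total."""
    lib_sizes = {s: sum(c) for s, c in counts.items()}
    mean_size = sum(lib_sizes.values()) / len(lib_sizes)
    return {s: size / mean_size for s, size in lib_sizes.items()}


def median_of_ratios_factors(counts):
    """DESeq-style size factors; genes with a zero in any sample are skipped."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_geo_means = {}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[g] = sum(math.log(v) for v in values) / len(values)
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lgm
                      for g, lgm in log_geo_means.items()]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy data, for illustration only: sample B is sequenced twice as deeply
# as sample A for the first three genes.
counts = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
gene_lengths_bp = [1000, 2000, 3000, 500]

rpkm_a = rpkm(counts["A"], gene_lengths_bp)
tc = total_count_factors(counts)
sf = median_of_ratios_factors(counts)
print("RPKM (A):", [round(x) for x in rpkm_a])
print("Total-count factors:", {s: round(f, 3) for s, f in tc.items()})
print("Median-of-ratios factors:", {s: round(f, 3) for s, f in sf.items()})
```

In this toy example the first three genes have raw counts proportional to their length, so RPKM reports them as equally expressed, while the two scaling approaches leave the length effect in place. That is the kind of long-gene bias the review describes.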

For ChIP-seq, TC-based scaling, SPP and NCIS were tested on Drosophila embryonic segmentation transcription factors. The authors found that TC-based scaling between a ChIP and matched input raised false discovery rates. The final set of peaks called depends on whether SPP or MACS is used as a peak caller. Both use cross-correlation to find the lag between reads mapped to the + and – strands of DNA-protein regions. Background models are used to remove noise arising from the sample or from GC content and mappability before peaks are finally called above a user-defined signal-to-noise ratio.

Most importantly, before starting the analysis on any dataset it’s good to examine the biases present and choose a normalization method to counter them. The article concludes that correct experimental design should be the first step in countering biases inherent in the technique, a point we believe can’t be emphasized enough!

If you’re just getting started, connect with a service provider with expertise in data normalization. 

Ask a Bioinformatician

In the last 2 years, next-gen sequencing instrument output has significantly increased; labs are now sequencing more samples with greater depth than ever before. Demand for the analysis of next generation sequencing data is also growing at an arguably even higher rate. To help accommodate this higher demand, Genohub now allows researchers to quickly find and connect directly with service providers who have specific data analysis expertise:

Whether it’s a simple question about gluing pieces together in a pipeline or a request to have your transcriptome annotated, researchers can quickly choose a bioinformatics provider based on their expertise and post queries and project requests. Services that bioinformaticians offer on Genohub are broken down into primary, secondary and tertiary data analysis:

Primary – Involves the quality analysis of raw sequence data from a sequencing platform. Primary analysis solutions are typically provided by the platform after the sequencing phase is complete. This often results in a FASTQ file, which is a combination of sequence data and Phred quality scores for each base.

Secondary – Encompasses sequence alignment, assembly and variant calling of aligned reads. Analysis is usually resource intensive, requiring a significant amount of data and compute resources. This type of analysis requires a set of algorithms that can often be automated into a pipeline. While the simplest pipelines can be a matter of gluing together publicly available tools, a certain level of expertise is required to maintain and optimize the analysis flow for a particular project.

Tertiary – Annotation, variant call validation, data aggregation and sample or population based statistical analysis are all components of tertiary data analysis. This type of analysis is typically performed to answer a specific biologically relevant question or generate a series of new hypotheses that need testing.
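
As a concrete look at what primary analysis hands over, the sketch below parses a single four-line FASTQ record and decodes its quality string. It assumes the common Phred+33 encoding (quality = ASCII code minus 33); the record itself and the function names are made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (name, sequence, phred_scores).

    Assumes Phred+33 encoding: quality score = ord(character) - 33.
    """
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], seq, [ord(ch) - 33 for ch in qual]


def error_probability(phred):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Hypothetical record, for illustration only.
record = ["@read_1\n", "GATTACA\n", "+\n", "IIIIII#\n"]
name, seq, quals = parse_fastq_record(record)
for base, q in zip(seq, quals):
    print(f"{name} {base} Q{q} p(error)={error_probability(q):.4f}")
```

Here the first six bases decode to Q40 (1 error in 10,000 calls) and the last to Q2, the kind of low-quality tail that trimming tools remove before secondary analysis.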

Researchers who still need library prep, sequencing and data analysis services can still search for, find and begin projects as before using our Shop by Project page. What’s new is that researchers who only need data analysis services can now directly search for and contact a bioinformatics service provider to request a quote:

Whether you plan on performing a portion of your sequencing data analysis yourself or intend to take on the challenge of putting together your own pipeline, consultation with a seasoned expert saves time and ensures you’re on the way to successfully completing your project. By adding this new service, we’re trying to make it easier to search for and identify the right provider for your analysis requirements.

If you’re a service provider and would like your services to be listed on Genohub, you can sign up for a Service Provider Account or contact us to discuss the screening and approval process.


PEG Precipitation of DNA Libraries – How Ampure or SPRIselect works


One question we’ve been asked, and one that our NGS providers are frequently asked, is how, in principle, PEG precipitates DNA during next generation sequencing library preparation cleanup. We usually hear the question presented as: how do Agencourt’s Ampure XP or SPRIselect beads precipitate DNA? The answer has to do with the chemical properties of DNA, polyethylene glycol (PEG), the beads being used and water.

Polystyrene-magnetite beads (Ampure) are coated with a layer of negatively charged carboxyl groups. DNA’s highly charged phosphate backbone makes it polar, allowing it to readily dissolve in water (also polar). When PEG [ H-(O-CH2-CH2)n-OH ] is added to a DNA solution under saturating conditions, DNA forms large random coils. Adding this hydrophilic molecule with the right concentration of salt (Na+) causes DNA to aggregate and precipitate out of solution from lack of solvation (1, 2). Too much salt and you’ll have a lot of salty DNA; too little will result in poor recovery. The Na+ ions shield the negative phosphate backbones, causing DNA to stick together and to anything else in the near vicinity (including carboxylated beads). Once you’re ready to elute your DNA and put it back into solution (after you’ve done your size selection or removal of enzymes, nucleotides, etc.), an aqueous solution (TE or water) is added back, fully hydrating the DNA and moving it from an aggregated state back into solution. The negative charge of the carboxyl beads now repels the DNA, allowing the user to recover it in the supernatant.

Changing the amount of PEG and the salt concentration can aid in size selecting DNA (2). This is a common method in NGS library preparation where the user is interested in selecting fragments of a particular size, and it’s often used to replace gel steps in NGS library prep. There is already a wealth of literature out there on conditions to size select DNA; just do a simple Google search. The first article we’ve found that describes this selection is referenced below (3).

Updated 7/18/2016

Since this publication on May 7th, 2014, there are several more commercial, Ampure-like size selection beads on the market:

  • MagJet – ThermoFisher
  • Mag-Bind – Omega Biotek
  • Promega Beads – Promega
  • Kapa Pure Beads – Kapa Biosystems

While we haven’t explored each one of these yet, we suspect the chemistry behind precipitation and selection is very similar. If you’d like to share information about these beads, please leave us a comment or send us an email at

If you’d like help in constructing your NGS library contact us, and we’d be happy to consult with you on your sequencing project:

If you’re looking for a NGS service provider, check out our NGS Service Matching Engine:

(1)     A Transition to a Compact Form of DNA in Polymer Solutions:

(2)    DNA Condensation by Multivalent Cations:

(3)    Size fractionation of double-stranded DNA by precipitation with polyethylene glycol:


100 Gb of Data per Day – NextSeq 500 Sequencing Services Now Available on Genohub

NextSeq 500, Genohub

Find NextSeq 500 service providers on

Access to the NextSeq 500, Illumina’s first high throughput desktop sequencing instrument, is now available on Genohub. While not the highest throughput instrument on the market, it is one of the fastest, with up to a 6x increase in bases read per hour (compared to HiSeq). The instrument is ideally suited for those who need a moderate amount of sequencing data (more than a MiSeq run, less than a HiSeq run) in a short amount of time. We expect the highest interest to be centered around targeted sequencing (exome or custom regions) and fast RNA profiling. For exome studies, you can run between 1-12 samples in a single run and get back 4 Gb at 2×75 or 5 Gb at 2×100 read length. If you’re interested in RNA profiling at 10M reads per sample, you can multiplex between 12-36 samples together in a single run. A 1×75 cycle run takes as few as 11 hours to complete and 2×150 runs take ~29 hours.

You can order NextSeq 500 sequencing services today and expect to receive data back in 3-4 days! Prices for 1 lane start at $2,250. Start your search here and use our helpful filters to narrow down your choices:

After you’ve identified the service you need, communicate your questions directly to the service provider. We’ll make sure you get a fast response. Genohub also takes care of billing and invoicing, making domestic & international ordering a breeze. We also have an easy to use project management interface to keep communication and sample specification data in one place.

If you’re not familiar with NextSeq technology or how best this instrument can be applied to your samples, take advantage of our complimentary consultation service. We can help with your sequencing project design and make recommendations as to what sequencing service would be best suited for your experiment.

Last month we announced the availability of HiSeq X Ten services on Genohub:

As an efficient online market for NGS services, Genohub increases your access to the latest instrumentation and technology. You don’t have to shell out $250K or $10M for a NextSeq or HiSeq X Ten when access to professional services is right at your fingertips!




Yearly Demand for Whole Human Genome Sequencing – 400K New Genomes in 2015?


(Figure courtesy of the National Human Genome Research Institute)

Advances since the Human Genome Project ended in 2003 have been significant. With new Illumina sequencing instruments becoming operational in April, large facilities will be able to generate 18,000 whole human genomes per year (18,000 30x genomes / HiSeq X Ten, a set of 10 HiSeq X Systems). As of today, these facilities include: the Broad Institute, Garvan Research Foundation, Macrogen, New York Genome Center, Novogene and WuXi PharmTech. At a rate of 1 genome / lane, this raises the question of how many 30x human genomes will be sequenced in the next 3 years. Let’s estimate that each facility will churn out around 25,000 30x genomes/year (some of the facilities above have purchased multiple HiSeq X Tens, others have more than 10 daisy chained together). In 2015, yield from these six facilities alone (assuming no one else purchases a machine) would be ~150,000 genomes. Optimistically doubling that to account for new HiSeq X Ten purchases between now and 2015 would give an estimate of ~300,000 genomes in 2015, and that’s only on the HiSeq X Ten. Assuming this year there will already be 60,000 30x (non-HiSeq X Ten) genomes sequenced, 20% growth in that number brings the total closer to ~400,000 genomes in 2015. While this figure certainly does not account for delays, instrument breakdowns, or data analysis, storage and library prep bottlenecks, it represents optimistic potential for 2015.
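
The arithmetic behind this estimate is easy to reproduce and to rerun with different assumptions. The sketch below simply encodes the numbers from the paragraph above; the function and parameter names are ours.

```python
def estimate_2015_genomes(facilities=6, genomes_per_facility=25_000,
                          x_ten_growth=2, other_2014=60_000,
                          other_growth_pct=20):
    """Reproduce the post's optimistic 2015 whole-genome estimate."""
    x_ten_installed = facilities * genomes_per_facility  # current X Ten sites
    x_ten_2015 = x_ten_installed * x_ten_growth          # doubled for new purchases
    other_2015 = other_2014 * (100 + other_growth_pct) // 100  # non-X Ten growth
    return {
        "x_ten_installed": x_ten_installed,
        "x_ten_2015": x_ten_2015,
        "other_2015": other_2015,
        "total_2015": x_ten_2015 + other_2015,
    }


estimate = estimate_2015_genomes()
for label, value in estimate.items():
    print(f"{label}: {value:,}")
```

With the defaults the total comes to 372,000 genomes, which the post rounds up to ~400,000. Changing the number of facilities or the growth assumptions shows how sensitive that figure is.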

The next question is who’s going to supply all the DNA? Several new initiatives to sequence whole populations are quickly popping up. With £100m earmarked, the UK is planning on sequencing the genomes of up to 100,000 NHS patients by 2017 (instrument platform likely Illumina). Saudi Arabia also plans to map 100,000 of its citizens, with the Ion Proton in line ready to do all the heavy lifting. Craig Venter’s recently launched company, Human Longevity, plans to start sequencing 40,000 genomes, with plans to “rapidly scale to 100,000 human genomes / year”.

Everything described above pertains to whole human genome sequencing and is not meant to undercut the significantly higher number of other species that will be sequenced between now and 2015. Our focus at Genohub is to make it easy for researchers interested in next generation sequencing services to access all the latest sequencing technology, including the HiSeq X Ten. Anyone can search for, find and order sequencing, library prep and analysis services, so owning an actual sequencing instrument is not a requirement for getting access to good quality data.