Accurate de novo assembly from one long-insert library

Long repeats, extreme GC or AT content, and palindromic regions are among the reasons for gaps in draft genome assemblies. Using a non-hybrid pre-assembly method that corrects errors in long reads, Chin and colleagues (1) demonstrated, in an article published in the June issue of Nature Methods, accurate assembly of microbial genomes using data from just one long-read SMRT shotgun library.

Typically, in hybrid strategies, short reads from one instrument (HiSeq or SOLiD) are used to correct errors in long reads from another (PacBio, Roche 454). This approach relies on the long reads spanning regions that the short-read technology left uncovered (owing to AT/GC content or the other reasons described above). It also requires the construction of at least two different libraries and several sequencing runs on different platforms. Chin and colleagues' new method, the “Hierarchical genome assembly process” (HGAP), consists of:

  • Choosing the longest sequencing reads as a seeding data set
  • Recruiting shorter reads and pre-assembling them using a consensus method
  • Assembly of the pre-assembled reads
  • Refinement using the initial read data to generate a final consensus

Details of each step are described in the Online Methods of the paper.
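To make the first step more concrete, here is a minimal Python sketch of seed selection. It is an illustration only, not the authors' implementation: the function name `choose_seed_reads`, the `seed_coverage` parameter, and the rule of taking the longest reads until they reach a target coverage are all assumptions made for this example, and the read lengths are toy values.

```python
from typing import List, Tuple


def choose_seed_reads(
    read_lengths: List[int],
    genome_size: int,
    seed_coverage: float,
) -> Tuple[int, List[int]]:
    """Pick the longest reads until they cover the genome `seed_coverage` times.

    Returns (length_cutoff, seed_lengths). The stopping rule and all
    parameters are hypothetical, chosen only to illustrate the idea.
    """
    if genome_size <= 0 or seed_coverage <= 0:
        raise ValueError("genome_size and seed_coverage must be positive")

    target_bases = genome_size * seed_coverage
    seeds: List[int] = []
    total = 0
    # Walk the reads from longest to shortest, stopping once the
    # accumulated bases reach the target seed coverage.
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        seeds.append(length)
        total += length

    # The shortest selected seed defines the effective length cutoff.
    cutoff = seeds[-1] if seeds else 0
    return cutoff, seeds


if __name__ == "__main__":
    # Toy read lengths (bp) and a toy 10 kb "genome".
    lengths = [9000, 1200, 7500, 3000, 6100, 800, 6800]
    cutoff, seeds = choose_seed_reads(lengths, genome_size=10_000, seed_coverage=2.0)
    print(cutoff, seeds)  # 6800 [9000, 7500, 6800]
```

In this sketch, the reads that fall below the cutoff would be the "shorter reads" recruited to the seeds in the second step.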

Briefly, they examined the length and accuracy of seed reads by aligning each read longer than 6 kb to a reference sequence, E. coli K-12 MG1655. Seed reads were converted into pre-assembled reads with a mean length of 5,777 bp. These were passed to the Celera Assembler, which yielded a single 4.6 million bp contig. The same technique was also applied to Meiothermus ruber DSM1279 and a bacterial artificial chromosome. Using the HGAP approach, the group generated de novo assemblies from large-insert template libraries at 80-100x coverage that were comparable to those from Sanger sequencing or hybrid approaches. While we still expect hybrid approaches to remain in use for finishing high-quality genome assemblies, HGAP combined with SMRT sequencing looks like a promising method for unfinished and potentially eukaryotic genomes.
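As a rough back-of-the-envelope check on what 80-100x coverage means for a genome of this size, the short Python sketch below multiplies genome size by coverage. It assumes the 4.6 million bp contig approximates the full E. coli genome; the function name `bases_for_coverage` is made up for this example.

```python
def bases_for_coverage(genome_size_bp: int, coverage: int) -> int:
    """Total sequenced bases needed to reach a given average coverage."""
    if genome_size_bp <= 0 or coverage <= 0:
        raise ValueError("genome_size_bp and coverage must be positive")
    return genome_size_bp * coverage


if __name__ == "__main__":
    # Assumption: the single 4.6 Mbp contig stands in for the genome size.
    genome_size_bp = 4_600_000
    for coverage in (80, 100):
        total = bases_for_coverage(genome_size_bp, coverage)
        print(f"{coverage}x -> {total / 1e6:.0f} Mbp")
    # 80x -> 368 Mbp
    # 100x -> 460 Mbp
```

So, under that assumption, a library in this coverage range corresponds to roughly 370 to 460 million sequenced bases.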


  • Chen-Shan Chin, David H Alexander, Patrick Marks, Aaron A Klammer, James Drake, Cheryl Heiner, Alicia Clum, Alex Copeland, John Huddleston, Evan E Eichler, Stephen W Turner & Jonas Korlach. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods, 10, 563-569 (2013).
