Amplicon Sequencing – Short vs. Long Reads

Amplicon sequencing is a type of targeted sequencing that can be used for various purposes. Some common types of amplicon sequencing are 16S and ITS sequencing, which are used in phylogeny and taxonomy studies for the identification of bacteria and fungi, respectively. When there is a need to explore the genome more generally, amplicon sequencing can be used to discover rare somatic mutations, detect and characterize variants, and identify germline single nucleotide polymorphisms (SNPs), insertions/deletions (INDELs), and known fusions [1, 2]. Targeted gene sequencing panel projects are another example of amplicon sequencing, where these panels include genes that are often associated with a certain disease or phenotype-of-interest [3].

In this article, we will go over what amplicon sequencing is, describe the advantages and disadvantages of short- and long-read sequencing, and then explain how Genohub can help support your project.

Amplicon Sequencing

Amplicon sequencing is targeted sequencing that involves specific primer design in order to achieve high on-target rates. It’s called amplicon sequencing, because a crucial step of the process is polymerase chain reaction (PCR), which is a method that amplifies specific DNA sequences based on the primers used. Primers are small DNA oligos that are specifically designed to target only the genes/regions-of-interest. When the amplification part of PCR occurs, only these specific genes are multiplied. The final products of PCR are called amplicons, hence amplicon sequencing [1].
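The way primers define an amplicon can be illustrated with a small in-silico PCR sketch. It assumes exact primer matches on a single linear template (real primers tolerate some mismatches), and the template and primer sequences are hypothetical toy values chosen only for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def in_silico_pcr(template: str, forward: str, reverse: str):
    """Return the amplicon bounded by the two primers, or None if a primer has no exact match.

    The forward primer is matched on the given strand. The reverse primer
    anneals to the opposite strand, so we search for its reverse complement.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    rev_site = reverse_complement(reverse)
    end = template.find(rev_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


# Hypothetical toy template and primers, for illustration only.
template = "TTTTACGTACGGATTACAGGCCTTAGCATCGGAAAA"
forward = "ACGTACG"
reverse = "CCGATGC"  # its reverse complement, GCATCGG, appears on the template strand
amplicon = in_silico_pcr(template, forward, reverse)
print(amplicon)
```

Only the region flanked by the two primer sites is returned, which mirrors why on-target rates depend so heavily on primer design.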

It’s important to think about what type of sequencing (short vs. long read) needs to be done for your specific project, because in order to sequence amplicon samples, the appropriate adapters need to be added to help them adhere to sequencing flow cells [2]. These adapters will differ depending on the flow cell, and in some cases, it may even be more cost-effective to send DNA samples and have one of our NGS partners perform all the library prep themselves.

Short read sequencing (Illumina)

Short-read amplicon sequencing is done with Illumina platforms, often the MiSeq, and has been the standard for 16S, ITS and other microbial profiling projects for many years. Being the standard for so long has advantages, as there are many targeted gene panels created and validated already for use with Illumina sequencing, which can make the workflow much easier on researchers who are new to targeted sequencing. There is also an abundance of literature with Illumina sequencing, so it’s easy for researchers to compare their findings to those of other groups. The biggest advantage is that researchers can sequence hundreds of genes in a single run, which lowers sequencing costs and turnaround time, especially if the researcher is interested in many different genes [1].

A disadvantage of short-read sequencing is that its resolution may not be as high as that of long-read sequencing. A comparison of short-read and long-read 16S amplicon sequencing showed that only long-read sequencing could provide strain-level community resolution and insight into novel taxa. In the metagenomics portion of the same comparison, more bacterial metagenome-assembled genomes (MAGs), and more complete ones, were recovered from the long-read data [4].

Long read sequencing (PacBio and Nanopore)

Long-read amplicon sequencing is done with either the PacBio or Oxford Nanopore platforms. Both offer complete, contiguous, uniform, and non-biased coverage across long amplicons up to 10 kb. The advantage of this type of long-read amplicon sequencing is that it’s more efficient, accurate and sensitive than short-read sequencing.

PacBio sequencing can obtain up to 99.999% single-molecule base calling accuracy and has been used to sequence full-length 16S and ITS sequences with very high accuracy as well [3].

Nanopore sequencing can provide accurate variant calling as well as robust coverage of larger targeted regions, which can help enhance the analysis of repetitive regions and improve taxonomic assignment [5]. Nanopore sequencing also tends to allow a bit more flexibility than PacBio sequencing when it comes to scaling amplicon projects at a cost-effective price [6].

The disadvantage of using long-read sequencing for amplicon projects is that it tends to be much more expensive and time-consuming than short-read sequencing, and long reads may not even be needed if the targeted amplicons themselves are already very short.

How can Genohub help you?

Genohub’s amplicon sequencing partners are experts in every step of the amplicon sequencing process, including extraction, PCR amplification, adapter ligation, library prep and data analysis. Our partners have experience extracting from many different types of environmental and biological samples, but they can work just as well with your DNA or amplicons if you prefer to extract and/or perform PCR in your own lab. From our experience, it’s more cost-effective to send DNA samples rather than amplicons, unless you can attach Illumina adapters yourself.

We know that each research project is unique, so we have partners who are also open to working with your custom primers, custom gene panels and custom bioinformatics needs! Get started today by letting us know about your amplicon sequencing project here: https://genohub.com/ngs/ .

Fungal Sequencing – ITS vs. 18S

Studying the kingdom Fungi is important because fungi fill so many different ecological roles, including decomposers, symbionts and parasites. There are also more than 1 million different species of fungi, so researchers need high-throughput methods to explore this diversity [1]. One such method is next-generation sequencing.

In this blog, we’ll go over why and how researchers sequence for fungi, what the ITS and 18S genes are, how to choose between them and how Genohub can help with your fungal sequencing project.

Why perform sequencing for fungal community analysis?

Fungal sequencing can be used to discover novel fungal species, quantify known fungi, explore the structure of fungal communities, and determine the roles of fungi in nature. In addition, it’s important to study these communities for human health, as there are some fungi that are resistant to antifungal drugs and others that are involved in plant diseases [2]. Thus, sequencing for fungi is relevant for multiple fields, including environmental conservation, agriculture, and microbiology.

Both ITS and 18S sequencing are well-established methods for studying fungal communities, as focusing on these genes is a simple way to identify fungi within complex microbiomes or environments that would otherwise be difficult to study [3]. For example, this type of specific amplicon sequencing enables the analysis of the fungal community within very mixed environmental samples, such as soil or water.

What are ITS and 18S?

The internal transcribed spacer (ITS) region and the 18S ribosomal RNA gene are used as biomarkers to classify fungi.

Figure 1. Picture of the ITS region as spacers between the ribosomal subunit sequences.

As seen in Figure 1, the ITS region includes ITS1 and ITS2, the spacer regions located between the small-subunit rRNA and large-subunit rRNA genes. Generally, the ITS1/ITS4 primers are used for amplification of the ITS region, although they can be substituted with the universal primers ITS2, ITS3, and ITS5 [4].

The 18S ribosomal RNA (18S rRNA) gene codes for a component of the small 40S eukaryotic ribosomal subunit and has both conserved and variable regions. The conserved regions can reveal the family relationship among species, whereas the variable regions show the disparities in their sequences. The 18S rRNA gene has a total of nine variable regions, V1-V9. The regions V2, V4 and V9 together are useful for identifying samples at both the family and order levels, while V9 seems to have a higher resolution at the genus level [5].

How to choose between ITS and 18S?

Although both ITS and 18S rRNA have proven useful for assessing fungal diversity in environmental samples, there are enough differences between them that researchers may choose to focus on only one, although sequencing for both is an option as well.

There was relatively low evolutionary pressure for the ITS1 and ITS2 sequences to remain conserved, so the ITS region tends to be hypervariable between fungal species while remaining moderately unchanged among individuals from the same species. It is therefore very well suited as a marker for species identification in the classification of fungi and is often used to study the relative abundance of fungi as well [2]. This can be useful if you need to perform a survey of genetic diversity at the species level or even within a species.

On the other hand, there was significant evolutionary pressure for the 18S rRNA gene to remain highly conserved as a component of the small eukaryotic 40S ribosomal subunit, an essential part of all eukaryotic cells. Due to this pressure, 18S is considered a potential biomarker for fungi classification above the species level and is often used in wide phylogenetic analyses and environmental biodiversity screenings [5].

In summary, the ITS region is mainly used for species-level identification and fungal diversity studies, while 18S rRNA is mainly used for taxonomic and phylogenetic studies of fungi above the species level.

How can Genohub help?

Genohub’s ITS and 18S sequencing partners are experts in every step of the amplicon sequencing process, including extraction, PCR amplification and library preparation using validated primers based on the literature, and data analysis, including taxonomic assignment, diversity and richness analysis, comparative analysis, and evolutionary analysis. Our partners have experience extracting from many different types of environmental and biological samples, including soil, water, sludge, feces, and plant and animal tissue, but they can work just as well with DNA samples that you extract yourself.

We know that each research project is unique, so we have partners who are also open to working with your custom primers or your custom analysis needs! Get started today by letting us know about your ITS or 18S sequencing project here: https://genohub.com/ngs/ .

6 QC methods post library construction for NGS

After nucleic acid extraction and sample QC, the next step in the NGS workflow is library preparation. NGS libraries are prepared to meet the platform requirements with respect to size, purity, concentration and efficient ligation of adaptors. Assessing the quality of a sequencing library before committing it to a full-scale sequencing run ensures maximum sequencing efficiency, leading to accurate sequencing data with more even coverage.

In this blog post, we list the various ways to QC libraries in order of most stringent to least stringent.

1. qPCR

qPCR is a method of quantifying DNA based on PCR. qPCR tracks target concentration as a function of PCR cycle number to derive a quantitative estimate of the initial template concentration in a sample. As with conventional PCR, it uses a polymerase, dNTPs, and two primers designed to match sequences within a template. For the QC protocol, the primers match sequences within the adapters flanking a sequencing library.

Therefore, qPCR is an ideal method for measuring libraries in advance of generating clusters, because it measures only templates that have adapter sequences on both ends, which are the templates that will subsequently form clusters on a flow cell. In addition, qPCR is a very sensitive method of measuring DNA, so dilute libraries with concentrations below the detection threshold of conventional spectrophotometric methods can still be quantified by qPCR.

The KAPA Biosystems SYBR FAST Library Quantification Kit for Illumina Sequencing Platforms is commonly used for qPCR-based library quantification. This kit measures absolute numbers of molecules containing the Illumina adapter sequences, thus providing a highly accurate measurement of amplifiable molecules available for cluster generation.

2. MiSeq

The MiSeq system uses the same library prep methods and proven sequencing-by-synthesis chemistry as the HiSeq system. Thus, it is ideal for analyzing prepared libraries prior to performing high-throughput sequencing. Performing library quality control (QC) using the MiSeq system before committing a library to a full-scale HiSeq run can save time and money while leading to better sequencing results.

Data generated by the MiSeq system is comparable to other Illumina next-generation sequencing platforms, ensuring a smooth transition from one instrument to another. Based on the individual experimental requirements, metrics obtained from performing simple QC can be used to streamline and improve your sequencing projects.

Using a single library prep method and taking only a single day, detailed QC parameters, including cluster density, library complexity, percent duplication, GC bias, and index representation can be generated on the MiSeq system. The MiSeq system has the unique ability to do paired-end (PE) sequencing for accurately assessing insert size. Library cluster density can also be determined and used to predict HiSeq cluster density, maximizing yield and reducing rework.

3. Fluorometric method

Quantifying DNA libraries with a fluorometric method, which uses intercalating dyes that bind specifically to DNA or RNA, is highly useful. This method is very specific because DNA dyes do not bind to RNA and vice versa.

The Invitrogen™ Qubit™ Fluorometer is a popular fluorometer that accurately measures DNA, RNA, and protein using the highly sensitive Invitrogen™ Qubit™ quantitation assays. The concentration of the target molecule in the sample is reported by a fluorescent dye that emits a signal only when bound to the target, which minimizes the effects of contaminants, including degraded DNA or RNA, on the result.

4. Automated electrophoresis

Several automated electrophoretic instruments are useful in estimating the size of NGS libraries. The Agilent 2100 Bioanalyzer system provides sizing, quantitation, and purity assessments for DNA, RNA, and protein samples. The Agilent 2200 TapeStation system is a tape-based electrophoresis platform for reliable and accurate sizing of generated libraries. The PerkinElmer LabChip GX can be used for DNA and RNA quantitation and sizing using automated capillary electrophoresis separation. The Qiagen QIAxcel Advanced system fully automates sensitive, high-resolution capillary electrophoresis of up to 96 samples per run and can be used for library QC as well. All these instruments are accompanied by convenient analysis and data documentation software that makes the library QC step faster and easier.
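Sequencers are usually loaded by molarity, so a mass concentration (for example from a fluorometer) is often combined with the average fragment size from electrophoresis. The sketch below shows that conversion. It assumes the commonly used average mass of about 660 g/mol per base pair of double-stranded DNA, and the function name and example values are hypothetical.

```python
import math


def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float,
                        bp_mass_g_per_mol: float = 660.0) -> float:
    """Convert a dsDNA library concentration from ng/uL to nM.

    nM = (ng/uL) / (g/mol per bp * mean fragment length in bp) * 1e6
    The 660 g/mol per bp default is an approximation for double-stranded DNA.
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment length > 0")
    return conc_ng_per_ul / (bp_mass_g_per_mol * mean_fragment_bp) * 1e6


# Hypothetical library: 2 ng/uL with a 400 bp average fragment size.
molarity = library_molarity_nm(2.0, 400)
print(f"{molarity:.2f} nM")
```

Because the result scales inversely with fragment length, an inaccurate size estimate shifts the molarity by the same proportion, which is one reason sizing and quantification are usually done together.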

5. UV-Visible Spectroscopy

A UV-Vis spectrophotometer can be used to analyze spectral absorbance to measure nucleic acid libraries and can differentiate between DNA, RNA and other absorbing contaminants. However, this method is less accurate than the others listed here and should be paired with one of the other QC methods to ensure high-quality libraries. Several UV-Vis spectrophotometers are currently available, such as the Thermo Scientific™ NanoDrop™ UV-Vis spectrophotometer, the Qiagen QIAxpert System and the Shimadzu BioSpec-nano.

6. Bead normalization

This is the preferred QC method if fewer than 12 libraries are to be QCed, if library yields are less than 15 nM or are highly variable and unpredictable, or if users are working with uncharacterized genomes and are inexperienced with the Nextera XT DNA Library Prep Kit protocol.

During bead-based normalization, DNA is bound to normalization beads and eluted off the beads at approximately the same concentration for each sample. Bead-based normalization enables scientists to bypass time-consuming library quantitation measurements and manual pipetting steps before loading libraries onto the sequencer. Bead-based normalization can provide significant cost and time savings for researchers processing many samples, or for researchers without access to any of the QC instruments listed in the above methods.

Top 3 Sample QC steps prior to library preparation for NGS

Before beginning library preparation for next-generation sequencing, it is highly recommended to perform sample quality control (QC) to check the nucleic acid quantity, purity and integrity. The starting material for NGS library construction might be any type of nucleic acid that is or can be converted into double-stranded DNA (dsDNA). These materials, often gDNA, RNA, PCR amplicons, and ChIP samples, must have high purity and integrity and sufficient concentration for the sequencing reaction.

1. Nucleic Acid Quantification

Measuring the concentration of nucleic acid samples is a key QC step to determine the fit and amount of nucleic acid available for further processing.

  • Absorbance Method:

A UV-Vis spectrophotometer can be used to analyze spectral absorbance to measure the whole nucleic acid profile and can differentiate between DNA, RNA and other absorbing contaminants. Different molecules such as nucleic acids, proteins, and chemical contaminants absorb light in their own pattern. By measuring the amount of light absorbed at a defined wavelength, the concentration of the molecules of interest can be calculated. Most laboratories are equipped with a UV-Vis spectrophotometer to quantify nucleic acids or proteins for their day-to-day experiments. Several spectrophotometers are currently available, such as the Thermo Scientific™ NanoDrop™ UV-Vis spectrophotometer, the Qiagen QIAxpert System and the Shimadzu BioSpec-nano.

  • Fluorescence Method:

Fluorescence methods are more sensitive than absorbance, particularly for low-concentration samples, and the use of DNA-binding dyes allows more specific measurement of DNA than spectrophotometric methods. Fluorescence measurements are set at excitation and emission values that vary depending on the dye chosen (Hoechst bis-benzimidazole dyes, PicoGreen® or QuantiFluor™ dsDNA dyes). The concentration of unknown samples is calculated based on comparison to a standard curve generated from samples of known DNA concentration.

The availability of single-tube and microplate fluorometers gives flexibility for reading samples in PCR tubes, cuvettes or multiwell plates and makes fluorescence measurement a convenient modern alternative to the more traditional absorbance methods. Thermo Scientific (Invitrogen) Qubit™ Fluorometer is one of the most commonly used fluorometers that accurately measure low concentration DNA, RNA, and protein.

2. Nucleic Acid Purity

Nucleic acid samples can become contaminated by other molecules with which they were co-extracted and eluted during the purification process or by chemicals from upstream applications. Purification methods involving phenol extraction, ethanol precipitation or salting-out may not completely remove all contaminants or chemicals from the final eluates. The resulting impurities can significantly decrease the sensitivity and efficiency of your downstream enzymatic reactions.

  • UV spectrophotometry measurements enable calculation of nucleic acid concentrations based on the sample’s absorbance at 260 nm. The absorbance at 280 nm and 230 nm can be used to assess the level of contaminating proteins or chemicals, respectively. The absorbance ratio of nucleic acids to contaminants provides an estimation of the sample purity, and this number can be used as acceptance criteria for inclusion or exclusion of samples in downstream applications.
  • Contaminants such as RNA, proteins or chemicals can interfere with library preparation and the sequencing reactions. When sequencing DNA, an RNA removal step is highly recommended, and when sequencing RNA, a gDNA removal step is recommended. Sample purity can be assessed following nucleic acid extraction and throughout the library preparation workflow using UV/Vis spectrophotometry. For DNA and RNA samples, the relative abundance of proteins in the sample can be assessed by determining the A260/A280 ratio, which should be between 1.8 and 2.0. Contamination by organic compounds can be assessed using the A260/A230 ratio, which should be higher than 2.0 for DNA and higher than 1.5 for RNA. Next-generation spectrophotometry with the Qiagen QIAxpert system enables spectral content profiling, which can discriminate DNA and RNA from sample contaminants without using a dye.
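The purity ratios described above can be checked with a few lines of code. This is a minimal sketch that uses the thresholds quoted in the text as acceptance criteria; the function name, dictionary keys and absorbance readings are hypothetical.

```python
def purity_report(a230: float, a260: float, a280: float, sample_type: str = "DNA") -> dict:
    """Compute A260/A280 and A260/A230 and compare them with the thresholds in the text.

    A260/A280 should fall between 1.8 and 2.0 (protein contamination check).
    A260/A230 should be above 2.0 for DNA or above 1.5 for RNA (organic compounds check).
    """
    if sample_type not in ("DNA", "RNA"):
        raise ValueError("sample_type must be 'DNA' or 'RNA'")
    ratio_260_280 = a260 / a280
    ratio_260_230 = a260 / a230
    min_260_230 = 2.0 if sample_type == "DNA" else 1.5
    return {
        "A260/A280": ratio_260_280,
        "A260/A230": ratio_260_230,
        "protein_ok": 1.8 <= ratio_260_280 <= 2.0,
        "organics_ok": ratio_260_230 > min_260_230,
    }


# Hypothetical absorbance readings for a DNA sample.
report = purity_report(a230=0.45, a260=1.0, a280=0.54, sample_type="DNA")
print(report)
```

Whether a sample that fails one check is excluded or cleaned up again is a project-level decision; the sketch only reports the two ratios and flags.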

  • qPCR:

Quantitative PCR, or real-time PCR, (qPCR) uses the linearity of DNA amplification to determine absolute or relative quantities of a known sequence in a sample. By using a fluorescent reporter in the reaction, it is possible to measure DNA generation in the qPCR assay. In qPCR, DNA amplification is monitored at each cycle of PCR. When the DNA is in the log-linear phase of amplification, the amount of fluorescence increases above the background. The point at which the fluorescence becomes measurable is called the threshold cycle (CT) or crossing point. By using multiple dilutions of a known amount of standard DNA, a standard curve can be generated of log concentration against CT. The amount of DNA or cDNA in an unknown sample can then be calculated from its CT value.
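The standard-curve calculation described above can be sketched as a simple least-squares fit of CT against log10 concentration. The dilution series and CT values below are hypothetical, and the function names are illustrative rather than taken from any qPCR software.

```python
import math


def fit_standard_curve(concentrations, ct_values):
    """Fit CT = slope * log10(concentration) + intercept by ordinary least squares."""
    xs = [math.log10(c) for c in concentrations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the concentration of an unknown sample."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical 10-fold dilution series of a DNA standard (arbitrary units) and its CT values.
standards = [10.0, 1.0, 0.1, 0.01]
cts = [10.0, 13.3, 16.6, 19.9]
slope, intercept = fit_standard_curve(standards, cts)
unknown = concentration_from_ct(14.95, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, unknown={unknown:.3f}")
```

An unknown sample is only reliably quantified when its CT falls within the range covered by the standards, so the dilution series should bracket the expected concentrations.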

qPCR-based assays can accurately qualify and quantify amplifiable DNA in challenging samples. For example, DNA derived from formalin-fixed paraffin-embedded (FFPE) tissue samples is often highly fragmented, cross-linked with protein and rich in single-stranded DNA, which makes library preparation challenging. For FFPE samples, the Agilent NGS FFPE QC kit enables functional quality assessment of input DNA.

3. Nucleic Acid Integrity (Size distribution)

Along with quantity and purity, size distribution is a critical QC parameter that provides valuable insight into sample quality. Analyzing nucleic acid size informs you about your sample’s integrity and indicates whether the samples are fragmented or contaminated by other DNA or RNA products. Various electrophoretic methods can be used to assess the size distribution of your sample.

  • Agarose Gel Electrophoresis

In this method, a horizontal gel electrophoresis tank with an external power supply, analytical-grade agarose, an appropriate running buffer (e.g., 1X TAE) and an intercalating DNA dye along with appropriately sized DNA standards are required. A sample of the isolated DNA is loaded into a well of the agarose gel and then exposed to an electric field. The negatively charged DNA backbone migrates toward the anode. Since small DNA fragments migrate faster, the DNA is separated by size. The percentage of agarose in the gel will determine what size range of DNA will be resolved with the greatest clarity. Any RNA, nucleotides, and protein in the sample migrate at different rates compared to the DNA so the band(s) containing the DNA will be distinct.

Analyzing PCR amplicons or RFLP fragments confirms the presence of the expected size fragments and alerts you to the presence of any non-specific amplicons. Electrophoresis also helps you assess the ligation efficiency yield for plasmid cloning procedures as well as the efficiency of removal of primer–dimers or other unspecific fragments during sample cleanup.

For complex samples such as genomic DNA (gDNA) or total RNA, the shape and position of the smear from electrophoresis analysis correlate directly with the integrity of the samples. Larger nucleic acid species tend to be degraded first and yield degradation products of lower molecular weight. Samples of poor integrity generally have a higher abundance of shorter fragments, while high-quality samples contain intact nucleic acid molecules of higher molecular size.

Eukaryotic RNA samples have unique electrophoretic signatures, which consist of a smear with major fragments corresponding to 28S, 18S and 5S ribosomal RNA (rRNA). These electrophoretic patterns correlate with the integrity of the RNA samples. The RNA integrity can either be assessed manually or with automation that employs a dedicated algorithm such as the RNA Integrity Number (RIN) that gives an objective integrity grade to RNA samples ranging from 1–10. RNA samples of highest quality usually have a score of 8 or above.

  • Capillary Electrophoresis

In this method, charged DNA or RNA molecules are injected into a capillary and are resolved during migration through a gel-like matrix. Nucleic acids are detected as they pass by a detector that captures signals of specific absorbance. Results are presented in the form of an electropherogram, which is a plot of signal intensity against migration time. The fragment sizes are precisely determined using a size marker consisting of fragments of known size. This method provides highly resolving and sensitive nucleic acid analysis that is faster and safer.

Hybrid Read Sequencing: Applications and Tools

Next-generation sequencing (Illumina) and long-read sequencing (PacBio/Oxford Nanopore) platforms each have their own strengths and weaknesses. Recent advances in single molecule real-time (SMRT) and nanopore sequencing technologies have enabled high-quality assemblies from long but inaccurate reads. However, these approaches require high coverage by long reads and remain expensive. On the other hand, inexpensive short-read technologies produce accurate but fragmented assemblies. Combining these techniques has led to an improved approach known as hybrid sequencing.

Hybrid sequencing methods use the high-throughput, high-accuracy short-read data to correct errors in the long reads. This approach reduces the required amount of costlier long-read sequence data and results in more complete assemblies, including the repetitive regions. Moreover, PacBio long reads can provide reliable alignments, scaffolds, and rough detections of genomic variants, while short reads refine the alignments, assemblies, and detections to single-nucleotide resolution. The high coverage of short-read sequencing data can also be used in downstream quantitative analysis [1].

Applications

De novo sequencing

As alternatives to using PacBio sequencing alone for eukaryotic de novo assemblies, error correction strategies using hybrid sequencing have also been developed.

  • Koren et al. developed the PacBio corrected Reads (PBcR) approach for using short reads to correct the errors in long reads [2]. PBcR has been applied to reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced parrot (Melopsittacus undulatus). The long-read correction approach has achieved >99.9% base-call accuracy, leading to substantially better assemblies than non-hybrid sequencing strategies.
  • Also, Bashir et al. used hybrid sequencing data to assemble the two-chromosome genome of a Haitian cholera outbreak strain at >99.9% accuracy in two nearly finished contigs, completely resolving complex regions with clinically relevant structures [3].
  • More recently, Goodwin et al. developed an open-source error correction algorithm, Nanocorr, specifically for hybrid error correction of Oxford Nanopore reads. They used this error correction method with complementary MiSeq data to produce a highly contiguous and accurate de novo assembly of the Saccharomyces cerevisiae genome. The contig N50 length was more than ten times greater than that of an Illumina-only assembly, with >99.88% consensus identity when compared to the reference. Additionally, this assembly offered a complete representation of the features of the genome, with correctly assembled gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly [4].
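The contig N50 mentioned above is a standard contiguity metric: the contig length at which the contigs of that length or longer cover at least half of the total assembly. A minimal sketch of the calculation, using hypothetical contig lengths:

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical assemblies of the same 300 kb genome (lengths in kb).
contiguous = [100, 80, 50, 30, 20, 10, 10]
fragmented = [30] * 10
print(n50(contiguous), n50(fragmented))
```

A higher N50 indicates that more of the assembly sits in long contigs, which is why the metric is used to compare hybrid and short-read-only assemblies.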

Transcript structure and Gene isoform identification

Besides genome assembly, hybrid sequencing can also be applied to the error correction of PacBio long reads of transcripts. Moreover, it could improve gene isoform identification and abundance estimation.

  • Along with genome assembly, Koren et al. used the PBcR method to identify and confirm full-length transcripts and gene isoforms. As the length of the single-molecule PacBio reads from RNA-Seq experiments is within the size distribution of most transcripts, many PacBio reads represent near full-length transcripts. These long reads can therefore greatly reduce the need for transcript assembly, which requires complex algorithms for short reads, and can confidently detect alternatively spliced isoforms. However, the predominance of indel errors makes analysis of the raw reads challenging. Both sets of PacBio reads (before and after error correction) were aligned to the reference genome to determine which ones matched the exon structure over the entire length of the annotated transcripts. Before correction, only 41 (0.1%) of the PacBio reads exactly matched the annotated exon structure; after correction, this rose to 12,065 (24.1%).
  • Au et al. developed a computational tool called LSC for the correction of raw PacBio reads by short reads [5]. Applying this tool to 100,000 human brain cerebellum PacBio subreads and 64 million 75-bp Illumina short reads, they reduced the error rate of the long reads by more than 3-fold. In order to identify and quantify full-length gene isoforms, they also developed an Isoform Detection and Prediction tool (IDP), which makes use of third-generation sequencing (TGS) long reads and second-generation sequencing (SGS) short reads [6]. Applying LSC and IDP to PacBio long reads and Illumina short reads of the human embryonic stem cell transcriptome, they detected several thousand RefSeq-annotated gene isoforms at full length. IDP-fusion has also been released for the identification of fusion genes, fusion sites, and fusion gene isoforms from cancer transcriptomes [7].
  • Ning et al. developed an analysis method, HySeMaFi, to decipher gene splicing and estimate gene isoform abundance [8]. First, the method establishes the mapping relationship between the error-corrected long reads and the longest assembled contig in every corresponding gene. According to the mapping data, the true splicing pattern of the genes is detected, followed by quantification of the isoforms.

Personal transcriptomes

Personal transcriptomes are expected to have applications in understanding individual biology and disease, but short read sequencing has been shown to be insufficiently accurate for the identification and quantification of an individual’s genetic variants and gene isoforms [9].

  • Using a hybrid sequencing strategy combining PacBio long reads and Illumina short reads, Tilgner et al. sequenced the lymphoblastoid transcriptomes of three family members in order to produce and quantify an enhanced personalized genome annotation. Around 711,000 CCS reads were used to identify novel isoforms, and ∼100 million Illumina paired-end reads were used to quantify the personalized annotation, which cannot be accomplished by the relatively small number of long reads alone. This method produced reads representing all splice sites of a transcript for most sufficiently expressed genes shorter than 3 kb. It provides a de novo approach for determining single-nucleotide variations, which could be used to improve RNA haplotype inference [10].

Epigenetics research

  • Beckmann et al. demonstrated the ability of PacBio sequencing to recover previously discovered epigenetic motifs with m6A and m4C modifications in both low-coverage and high-contamination scenarios [11]. They were also able to recover many motifs from three mixed strains (E. coli, G. metallireducens, and C. salexigens), even when the motif sequences of the genomes of interest overlap substantially, suggesting that PacBio sequencing is applicable to metagenomics. Their results suggest that hybrid sequencing would be more cost-effective than using PacBio sequencing alone to detect and accurately define k-mers for low-proportion genomes.

Hybrid assembly tools

Several algorithms have been developed that can help in the single molecule de novo assembly of genomes along with hybrid error correction using the short, high-fidelity sequences.

  • Jabba is a hybrid method that corrects long third-generation reads by mapping them onto a corrected de Bruijn graph constructed from second-generation data. It uses a pseudo-alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds [12]. The tool is available here: https://github.com/biointec/jabba.
  • HALC is a high-throughput algorithm for long-read error correction. HALC aligns the long reads to short-read contigs from the same species with a relatively low identity requirement and constructs a contig graph. The tool was applied to E. coli, A. thaliana and Maylandia zebra data sets and has been shown to achieve up to 41% higher throughput than other existing algorithms while maintaining comparable accuracy [13]. HALC can be downloaded here: https://github.com/lanl001/halc.
  • The HYBRIDSPADES algorithm was developed for assembling short and long reads and benchmarked on several bacterial assembly projects. HYBRIDSPADES generated accurate assemblies (even in projects with relatively low coverage by long reads), thus reducing the overall cost of genome sequencing. This method was used to demonstrate the first complete circular chromosome assembly of a genome from single cells of Candidate Phylum TM6 using SMRT reads [14]. The tool is publicly available on this page: http://bioinf.spbau.ru/en/spades.
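
To make the idea behind graph-based hybrid correction more concrete, here is a minimal Python sketch. It is not the Jabba, HALC or hybridSPAdes implementation; it only assumes that accurate short reads can be cut into k-mers to build a small de Bruijn graph, and that a long read whose k-mers are missing from that graph probably contains errors. The function names, the toy reads and the value of k are all hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(short_reads, k):
    """Build a toy de Bruijn graph: each (k-1)-mer maps to the (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def supported_fraction(long_read, graph, k):
    """Return the fraction of the long read's k-mers that exist as edges in the graph."""
    total = len(long_read) - k + 1
    if total <= 0:
        return 0.0
    hits = 0
    for i in range(total):
        kmer = long_read[i:i + k]
        # .get avoids adding new keys to the defaultdict while querying
        if kmer[1:] in graph.get(kmer[:-1], ()):
            hits += 1
    return hits / total


# Hypothetical toy data: two overlapping, error-free short reads
short_reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(short_reads, k=4)

clean_long_read = "ACGTACGG"   # agrees with the short reads
noisy_long_read = "ACGTTCGG"   # one substitution (A -> T) in the middle

print(supported_fraction(clean_long_read, graph, k=4))
print(supported_fraction(noisy_long_read, graph, k=4))
```

A real corrector would go further and replace the unsupported stretch with a path through the graph; this sketch only flags how much of a long read is backed by short-read data.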

Due to the constant development of new long-read error correction tools, La et al. have recently published an open-source pipeline that evaluates the accuracy of these different algorithms [15]. LRCstats analyzed the accuracy of four hybrid correction methods for PacBio long reads over three data sets and can be downloaded here: https://github.com/cchauve/lrcstats.

Sović et al. evaluated the different non-hybrid and hybrid assembly methods for de novo assembly using nanopore reads [16]. They benchmarked five non-hybrid assembly pipelines and two hybrid assemblers that use nanopore sequencing data to scaffold Illumina assemblies. Their results showed that hybrid methods are highly dependent on the quality of NGS data, but much less on the quality and coverage of nanopore data, and performed relatively well at lower nanopore coverage. The implementation of this DNA Assembly benchmark is available here: https://github.com/kkrizanovic/NanoMark.

References:

  1. Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics, Proteomics Bioinforma. 13, 278–289 (2015).
  2. Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotech 30, 693–700 (2012).
  3. Bashir, A. et al. A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol 30, (2012).
  4. Goodwin, S. et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res 25, (2015).
  5. Au, K. F., Underwood, J. G., Lee, L. & Wong, W. H. Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS One 7, e46679 (2012).
  6. Au, K. F. et al. Characterization of the human ESC transcriptome by hybrid sequencing. Proc. Natl. Acad. Sci. 110, E4821–E4830 (2013).
  7. Weirather, J. L. et al. Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Res. 43, e116 (2015).
  8. Ning, G. et al. Hybrid sequencing and map finding (HySeMaFi): optional strategies for extensively deciphering gene splicing and expression in organisms without reference genome. Sci. Rep. 7, 43793 (2017).
  9. Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods 10, 1177 (2013).
  10. Tilgner, H., Grubert, F., Sharon, D. & Snyder, M. P. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl. Acad. Sci. 111, 9869–9874 (2014).
  11. Beckmann, N. D., Karri, S., Fang, G. & Bashir, A. Detecting epigenetic motifs in low coverage and metagenomics settings. BMC Bioinformatics 15, S16 (2014).
  12. Miclotte, G. et al. Jabba: hybrid error correction for long sequencing reads. Algorithms Mol. Biol. 11, 10 (2016).
  13. Bao, E. & Lan, L. HALC: High throughput algorithm for long read error correction. BMC Bioinformatics 18, 204 (2017).
  14. Antipov, D., Korobeynikov, A., McLean, J. S. & Pevzner, P. A. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2016).
  15. La, S., Haghshenas, E. & Chauve, C. LRCstats, a tool for evaluating long reads correction methods. Bioinformatics (2017). doi:10.1093/bioinformatics/btx489
  16. Sović, I., Križanović, K., Skala, K. & Šikić, M. Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads. Bioinformatics 32, 2582–2589 (2016).

 

PacBio vs. Oxford Nanopore sequencing

Long-read sequencing technologies developed by Pacific Biosciences and Oxford Nanopore overcome many of the limitations researchers face with short reads. Long reads improve de novo assembly and transcriptome analysis (gene isoform identification), and play an important role in the field of metagenomics. Longer reads are also useful when assembling genomes that include large stretches of repetitive regions.

Currently, there are two long-read sequencing platforms. To help researchers choose the platform with greater utility for their application, we compare the overall instrument specifications offered by PacBio and Oxford Nanopore, as well as published applications in the next-generation sequencing space.

[Table: PacBio and Oxford Nanopore instrument specifications]

a Oxford Nanopore charges an access fee that gives users one MinION/PromethION instrument, a starter pack of consumables, certain data services, and community-based support

* Insufficient data

Although both PacBio and Oxford Nanopore generate longer reads than short-read Illumina or Ion sequencing, the higher error rate of both the PacBio and Oxford Nanopore sequencers remains an issue that needs addressing. Whereas PacBio reads a molecule multiple times to generate high-quality consensus data, Oxford Nanopore can only sequence a molecule twice. As a result, PacBio generates data with lower error rates than Oxford Nanopore. PacBio has a slightly better overall performance for applications such as the discovery of transcriptome complexity and sensitive identification of isoforms. On the other hand, MinION provides higher throughput, as nanopores can sequence multiple molecules simultaneously. Hence, it is best suited for applications that require a larger amount of data [10].
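
The benefit of reading the same molecule several times can be illustrated with a small back-of-the-envelope calculation in Python. This is a simplified sketch, not a model of either platform's actual consensus algorithm: it assumes that each pass is wrong at a given base independently with the same probability, and that a majority-vote consensus is wrong only when more than half of the passes are wrong. The per-pass error rate used below is a hypothetical value chosen for illustration.

```python
from math import comb


def consensus_error(per_pass_error, passes):
    """Probability that more than half of the independent passes are wrong at one base."""
    needed = passes // 2 + 1  # smallest number of wrong passes that forms a majority
    return sum(
        comb(passes, wrong)
        * per_pass_error ** wrong
        * (1 - per_pass_error) ** (passes - wrong)
        for wrong in range(needed, passes + 1)
    )


# Hypothetical per-pass error rate of 10%; odd pass counts avoid tied votes
for passes in (1, 3, 5, 9):
    print(passes, consensus_error(0.10, passes))
```

Under these assumptions the consensus error falls quickly as passes are added, which is the intuition behind multi-pass consensus reads; real errors are not fully independent, so actual gains will differ.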

As long reads can provide large scaffolds, de novo assembly is one of the main applications of PacBio sequencing [6]. Though the error rate of PacBio data is higher than that of short-read Illumina or Ion sequencing, increased coverage or hybrid sequencing can greatly improve the accuracy of genome assembly. PacBio sequencing has been successfully used to finish the 100-contig draft genome of Clostridium autoethanogenum DSM 10061, a Class III genome, the most complex classification in terms of repeat content and repeat type. The genome has a 31.1% GC content and contains repeats, prophage, and nine copies of rRNA gene operons. Using a single PacBio library sequenced on two SMRT cells, the entire genome could be assembled de novo into a single contig. When short-read Illumina or Ion sequencing was used alone on the same genome, more than 22 contigs were needed, and each of those assemblies contained at least four collapsed repeat regions, whereas the PacBio assemblies had none [11].

PacBio sequencing has also been used to assemble the chloroplast genome of Potentilla micrantha [12], as well as the genomes of Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster, using fewer contigs and less CPU time than assemblies based on Illumina sequencing [13].

PacBio sequencing of PCR products can be used to improve the quality of current draft genomes by closing gaps and sequencing through hairpin structures and areas of high GC content more efficiently than Sanger sequencing [14].

Pacific Biosciences has developed a protocol, Iso-Seq, for transcript sequencing. This includes library construction, size selection, sequencing data collection, and data processing. Iso-Seq allows direct sequencing of transcripts up to 10 kb without the use of a reference genome. Iso-Seq has been used to characterize alternative splicing events involved in the formation of blood cellular components [15]. This is essential for interpreting the effects of mutations leading to inherited disorders and blood cancers, and can be applied to design strategies to advance transplantation and regenerative medicine.

Another major application of PacBio sequencing is in epigenetics research. Recent studies demonstrate that PacBio sequencing, by directly detecting modifications in single molecules, facilitates the investigation of intercellular heterogeneity in previously undetectable genomic DNA modifications (such as m6A and m4C) [16].

Compared to PacBio, the Oxford Nanopore MinION is small (the size of a USB thumb drive), affordable, uses a simple library prep and is field portable [17]. This is useful in situations such as a virus outbreak, where a mobile diagnostic laboratory can be set up using MinIONs. In remote regions such as parts of Brazil and Africa, where there are logistical issues associated with shipping samples for sequencing, MinION can provide immediate, real-time data to scientific investigators. The most notable clinical use of MinION has been the analysis of Ebola samples on-site during the viral outbreak in West Africa [18, 19].

The low cost of sequencing and portability of the MinION sequencer also make it a useful tool for teaching. It has been used to provide hands-on experience to students, most recently at Columbia University and the University of California Santa Cruz, where every student performed their own MinION sequencing [20].

Perhaps the most ambitious MinION application is its potential to detect and identify bacteria and viruses on manned space flights. In a proof-of-concept experiment, Castro-Wallace et al. demonstrated successful sequencing and de novo assembly of a lambda phage genome, an E. coli genome, and a mouse mitochondrial genome. They observed no significant difference in the quality of sequence data generated on the International Space Station and in control experiments performed in parallel on Earth [21].

Recently, Oxford Nanopore developed a bench-top instrument, PromethION, that provides high-throughput sequencing and is modular in design. It contains 48 flow cells that can be run individually or in parallel. The PromethION flow cells contain 3000 channels each, and produce up to 40 Gb of data.

 

References:

  1. Pacific Biosciences – AllSeq. Available at: http://allseq.com/knowledge-bank/sequencing-platforms/pacific-biosciences/.
  2. Jain, M., Olsen, H. E., Paten, B. & Akeson, M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 17, 239 (2016).
  3. Lu, H., Giordano, F. & Ning, Z. Oxford Nanopore MinION Sequencing and Genome Assembly. Genomics, Proteomics Bioinforma. 14, 265–279 (2016).
  4. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. bioRxiv (2017).
  5. Jain, M. et al. MinION Analysis and Reference Consortium: Phase 2 data release and analysis of R9.0 chemistry [version 1; referees: awaiting peer review]. F1000Research 6, (2017).
  6. Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics, Proteomics Bioinforma. 13, 278–289 (2015).
  7. MinION. Available at: https://nanoporetech.com/products/minion.
  8. PromethION Early Access Programme. Available at: https://nanoporetech.com/community/promethion-early-access-programme.
  9. Oxford Nanopore in 2016. Available at: http://blog.booleanbiotech.com/nanopore_2016.html.
  10. Weirather, J. L. et al. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Research 6, 100 (2017).
  11. Brown, S. D. et al. Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia. Biotechnol. Biofuels 7, 40 (2014).
  12. Ferrarini, M. et al. An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics 14, 670 (2013).
  13. Berlin, K. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotech 33, 623–630 (2015).
  14. Zhang, X. et al. Improving genome assemblies by sequencing PCR products with PacBio. Biotechniques 53, 61–62 (2012).
  15. Chen, L. et al. Transcriptional diversity during lineage commitment of human blood progenitors. Science 345, (2014).
  16. Feng, Z., Li, J., Zhang, J.-R. & Zhang, X. qDNAmod: a statistical model-based tool to reveal intercellular heterogeneity of DNA modification from SMRT sequencing data. Nucleic Acids Res. 42, 13488–13499 (2014).
  17. Jain, M., Olsen, H. E., Paten, B. & Akeson, M. Erratum to: The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 17, 256 (2016).
  18. Quick, J. et al. Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228–232 (2016).
  19. Hoenen, T. et al. Nanopore sequencing as a rapidly deployable Ebola outbreak tool. Emerg. Infect. Dis. 22, 331–334 (2016).
  20. Citizen Sequencers: Taking Oxford Nanopore’s MinION to the Classroom and Beyond – Bio-IT World. Available at: http://www.bio-itworld.com/2015/12/9/citizen-sequencers-taking-oxford-nanopores-minion-classroom-beyond.html.
  21. Castro-Wallace, S. L. et al. Nanopore DNA Sequencing and Genome Assembly on the International Space Station. bioRxiv (2016).

AGBT 2014 – Summary of Day 1

The first day of the Advances in Genome Biology & Technology (AGBT) meeting kicked off with an introduction by Eric Green, Director of the National Human Genome Research Institute. He announced that this 15th annual meeting was the largest ever, with 850 people expected to attend. The opening plenary session certainly did not look as if 850 people were in attendance. Winter Storm Pax wreaked havoc on flights coming in from Atlanta and other cities, resulting in several speaker and general attendee cancellations.

The plenary session began with scheduled talks by Aviv Regev, Jeanne Lawrence, Wendy Winckler and Valerie Schneider. Jeanne Lawrence couldn’t make it, which was a shame, particularly since she gave a brilliant talk at ASHG on using a single gene, XIST, to shut down the extra copy of chromosome 21 in Down syndrome. This work was nicely summarized in a publication that came out this summer titled: Translating dosage compensation to trisomy 21.

Aviv Regev and Wendy Winckler’s talks were subject to a blog/tweet embargo (it was unclear whether Regev’s talk was completely under embargo or only the last half, so we’re playing it safe and not discussing it here), leaving Valerie Schneider’s presentation as the only one that was tweeted or written about. This instantly created great angst among those attending the lectures, those stuck in airports en route to AGBT and those at home waiting for in-depth coverage.

Single-cell sequencing, considered the “method of the year” by Nature Methods, was the basis of the opening lecture. Aviv Regev offered an excellent view of the dendritic cell network based on cyclical perturbations and variations between single cells. The first half of Regev’s presentation, titled “Harnessing Variation Between Single Cells to Decipher Intra and Intercellular Circuits in Immune Cells”, was largely covered by her publication in April, “Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells”.

The second talk, by Wendy Winckler, was not allowed to be discussed or tweeted, according to Winckler, courtesy of Novartis’s communications department. The title of her presentation, “Next Generation Diagnostics for Precision Cancer Medicine”, wasn’t revealing either. To get an idea of what she’s up to and the direction of her lecture, you can read these recent publications.

The final talk, by Valerie Schneider, titled “Taking advantage of GRCh38”, began with an analogy to an unwanted pair of socks one receives for Christmas that ends up being used and finally really liked. “It was time for an update….whether or not it was on your wish list”. We were reminded that centromeres are specialized chromatin structures important for cell division, but because of their repetitive regions, they are not represented in reference assemblies. Previous versions of the human reference assembly had centromeres represented by a 3 Mb gap. The latest assembly, GRCh38, incorporates centromere models generated using whole genome shotgun reads from the Venter sequencing project. Since there are two copies of each centromere for each autosome, these centromere models represent an average of the two copies. She concluded her presentation by urging users to switch now: http://www.ncbi.nlm.nih.gov/genome/tools/remap.

After a short break from the talks, the closing reception sponsored by Roche began outside. Halfway through, there was a brief yet sudden Florida thundershower that sent the entire AGBT community scurrying indoors for shelter. That was okay, though, because the conversations just continued indoors. Looking forward to tomorrow morning’s lectures. Several of the ones we’ve highlighted will be up.

 

3 Top Factors Researchers Consider When Selecting an NGS Provider

At Genohub, we not only seek feedback from researchers; our development methodology is almost entirely based on it. We receive this feedback via website forms as well as routine one-on-one conversations with some of the top researchers using next generation sequencing for their projects. Through this data and interaction, certain trends have begun to emerge which may be useful to an NGS provider seeking additional projects. This list is not based on a controlled experiment; however, countless conversations indicate that these factors are extremely important:

  1. Turnaround time – this one is a toss-up when compared with price, but we typically find turnaround time to be among the leading factors in a researcher’s decision to select an NGS provider. We have heard quite a few stories of researchers seeing turnaround times of several months for library prep and sequencing.
  2. Price – while this is one of the biggest factors for researchers, it must be qualified with established trust, which is the next major factor.
  3. Trust – this one is a biggie for many researchers and often a non-starter if not established. The main reason is that researchers are hesitant to ship their precious samples (e.g., human brain tissue) to an NGS provider for what is often costly sequencing if they are not confident in the provider’s abilities. Researchers have told us some of the things they look for that help build their confidence:
    • Referrals & Reviews – researchers seek out colleagues who have done similar projects and look for recommendations. Word of mouth is one of the biggest methods researchers rely on to select an NGS provider.
    • Publications – providers who are listed in publications involving similar projects.
    • What kind of QC will be run on the sample.
    • Overall experience indicators such as time in business and volume of samples regularly handled.
    • Data and sample security.
    • Location – this factor is considerably important if previous trust is not established. Some researchers have absolutely no problem shipping samples across the globe, while others might physically drive their samples to a local provider to ensure sample integrity.

We would love to hear your feedback on this topic, whether you are an NGS provider or a researcher actively using next generation sequencing. What other decision-driving criteria have you found as a provider, or what are some other factors important to you as a researcher?

In a Nutshell: Life Tech Exome Certified Service Provider Program

Life Technologies announced yesterday that they launched the Ion AmpliSeq Exome Certified Service Provider Program.

What the program is in a nutshell:

  • Goals: Offer a network of next gen sequencing providers able to help researchers get a high quality exome sequence at a reduced cost with fast turnaround times and low amounts of input material
  • Exome sequencing inputs: as little as 50ng of customer DNA
  • Library kit used: Ion AmpliSeq Exome kit
  • NGS Instrument used: Ion Proton
  • Exome sequencing outputs: high quality data, which of course can be used with Ion Reporter Software for mutation validation, annotation, and reporting

The Service Provider Program is intended to fill exome sequencing market demand, which Life Tech argues has been under-serviced, with exome sequencing currently going for $1,000+, turnaround times of up to 8 weeks, and input requirements of up to 3 µg of DNA. Dr. Candace Johnson, Deputy Director and the Wallace Chair of Translational Research at Roswell Park Cancer Institute, states, “Exome sequencing will be central to discoveries made in clinical research”. If the Exome CSP delivers as promised, it could have a major impact in accelerating discoveries made in clinical research.

For more information on the Life Tech Provider Program please see the entire press release.

Targeted Resequencing (TPS/WES) Tops Next Gen Sequencing Survey

Oxford Gene Technology (an NGS provider currently listed on Genohub) recently presented the results of their next gen sequencing survey, which showed targeted resequencing to be the top use for next generation sequencing. The results are based on a survey of 596 researchers who responded regarding their current and expected use of NGS services. Compared with whole genome sequencing, the popularity of targeted resequencing is possibly attributable mostly to its lower cost. This infographic depicts the results:

OGT NGS Survey Results

Other interesting results point to a general data problem, with 38% of respondents saying they lack trust in bioinformatics data. Bioinformatics also led the field when researchers were asked about the biggest barrier to NGS usage (see below).

Barriers to NGS Usage

Undoubtedly, this presents an immense opportunity for the bioinformatics sector to increase confidence in data accuracy and interpretation, which could have a positive impact on the use of next gen sequencing as a whole.

You can find many more interesting survey results on the excellent infographic titled Oxford Gene Technology – NGS Survey 2013.