Amplicon Sequencing – Short vs. Long Reads

Amplicon sequencing is a type of targeted sequencing that can be used for various purposes. Some common types of amplicon sequencing are 16S and ITS sequencing, which are used in phylogeny and taxonomy studies for the identification of bacteria and fungi, respectively. When there is a need to explore the genome more generally, amplicon sequencing can be used to discover rare somatic mutations, detect and characterize variants, and identify germline single nucleotide polymorphisms (SNPs), insertions/deletions (INDELs), and known fusions [1, 2]. Targeted gene sequencing panel projects are another example of amplicon sequencing, where these panels include genes that are often associated with a certain disease or phenotype-of-interest [3].

In this article, we will go over what amplicon sequencing is, describe the advantages and disadvantages of short- and long-read sequencing, and then explain how Genohub can help support your project.

Amplicon Sequencing

Amplicon sequencing is targeted sequencing that involves specific primer design in order to achieve high on-target rates. It’s called amplicon sequencing, because a crucial step of the process is polymerase chain reaction (PCR), which is a method that amplifies specific DNA sequences based on the primers used. Primers are small DNA oligos that are specifically designed to target only the genes/regions-of-interest. When the amplification part of PCR occurs, only these specific genes are multiplied. The final products of PCR are called amplicons, hence amplicon sequencing [1].
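To make the role of the primers concrete, here is a minimal in silico PCR sketch in Python. It is an illustration under simplifying assumptions, not a primer design tool: it requires exact primer matches, looks only at the given strand, and ignores melting temperature and mismatches. The function names and the toy sequences are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def in_silico_pcr(template: str, forward_primer: str, reverse_primer: str):
    """Return the amplicon bounded by the two primers, or None if either is missing.

    Assumes exact matches only. The reverse primer anneals to the opposite
    strand, so we search the template for its reverse complement.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Toy example: only the region between the primers is "amplified".
template = "TTTTACGTACGGATTACAGGCATCGAAAA"
amplicon = in_silico_pcr(template, "ACGTACG", "CGATGCC")
print(amplicon)
```

The flanking sequence outside the two primer sites is not part of the returned amplicon, which mirrors why well-designed primers give high on-target rates.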

It’s important to think about what type of sequencing (short vs. long read) needs to be done for your specific project, because in order to sequence amplicon samples, the appropriate adapters need to be added to help them adhere to sequencing flow cells [2]. These adapters will differ depending on the flow cell, and in some cases, it may even be more cost-effective to send DNA samples and have one of our NGS partners perform all the library prep themselves.

Short read sequencing (Illumina)

Short-read amplicon sequencing is done with Illumina platforms, often the MiSeq, and has been the standard for 16S, ITS and other microbial profiling projects for many years. Being the standard for so long has advantages, as there are many targeted gene panels created and validated already for use with Illumina sequencing, which can make the workflow much easier on researchers who are new to targeted sequencing. There is also an abundance of literature with Illumina sequencing, so it’s easy for researchers to compare their findings to those of other groups. The biggest advantage is that researchers can sequence hundreds of genes in a single run, which lowers sequencing costs and turnaround time, especially if the researcher is interested in many different genes [1].

A disadvantage of short-read sequencing is that its resolution may not be as high as that of long-read sequencing. A comparison of short-read and long-read 16S amplicon sequencing showed that only long-read sequencing could provide strain-level community resolution and insight into novel taxa. In the metagenomics portion of the same comparison, the long-read data also yielded more, and more complete, bacterial metagenome-assembled genomes (MAGs) [4].

Long read sequencing (PacBio and Nanopore)

Long-read amplicon sequencing is done with either the PacBio or Oxford Nanopore platforms. They both offer complete, contiguous, uniform, and non-biased coverage across long amplicons up to 10 kb. The advantages of this type of long-read amplicon sequencing are that it’s more efficient, accurate and sensitive than short-read sequencing.

PacBio sequencing can obtain up to 99.999% single-molecule base calling accuracy and has been used to sequence full-length 16S and ITS sequences with very high accuracy as well [3].

Nanopore sequencing can provide accurate variant calling as well as robust coverage of larger targeted regions, which can help enhance the analysis of repetitive regions and improve taxonomic assignment [5]. Nanopore sequencing also tends to allow a bit more flexibility than PacBio sequencing when it comes to scaling amplicon projects at a cost-effective price [6].

The disadvantages of using long-read sequencing for amplicon projects are that it tends to be much more expensive and time-consuming than short-read sequencing, and that long reads may not even be needed if the targeted amplicons themselves are already very short.

How can Genohub help you?

Genohub’s amplicon sequencing partners are experts in every step of the amplicon sequencing process, including extraction, PCR amplification, adapter ligation, library prep and data analysis. Our partners have experience extracting from many different types of environmental and biological samples, but they can work just as well with your DNA or amplicons if you prefer to extract and/or perform PCR in your own lab. From our experience, it’s more cost-effective to send DNA samples rather than amplicons, unless you can attach Illumina adapters yourself.

We know that each research project is unique, so we have partners who are also open to working with your custom primers, custom gene panels and custom bioinformatics needs! Get started today by letting us know about your amplicon sequencing project here: https://genohub.com/ngs/ .

Fungal Sequencing – ITS vs. 18S

Studying the kingdom Fungi is important because fungi fill so many different ecological roles, including decomposers, symbionts and parasites. There are also more than 1 million different species of fungi, so researchers need to have high-throughput methods to explore this diversity [1]. One such method is next-generation sequencing.

In this blog, we’ll go over why and how researchers sequence for fungi, what the ITS and 18S genes are, how to choose between them and how Genohub can help with your fungal sequencing project.

Why perform sequencing for fungal community analysis?

Fungal sequencing can be used to discover novel fungal species, quantify known fungi, explore the structure of fungal communities, and determine the roles of fungi in nature. In addition, it’s important to study these communities for human health, as there are some fungi that are resistant to antifungal drugs and others that are involved in plant diseases [2]. Thus, sequencing for fungi is relevant for multiple fields, including environmental conservation, agriculture, and microbiology.

Both ITS and 18S sequencing are well-established methods for studying fungal communities, as focusing on these genes is a simple way to identify fungi within complex microbiomes or environments that would otherwise be difficult to study [3]. For example, this type of specific amplicon sequencing enables the analysis of the fungal community within very mixed environmental samples, such as soil or water.

What are ITS and 18S?

The internal transcribed spacer (ITS) region and the 18S ribosomal RNA gene are used as biomarkers to classify fungi.

Figure 1. Picture of the ITS region as spacers between the ribosomal subunit sequences.

As seen in Figure 1, the ITS region includes ITS1 and ITS2, the spacer regions located between the small-subunit rRNA and large-subunit rRNA genes. Generally, the ITS1/ITS4 primers are used for amplification of the ITS region, although they can be substituted with the universal primers ITS2, ITS3, and ITS5 [4].

The 18S ribosomal RNA (18S rRNA) gene codes for a component of the small 40S eukaryotic ribosomal subunit and has both conserved and variable regions. The conserved regions can reveal the family relationship among species, whereas the variable regions show the disparities in their sequences. The 18S rRNA gene has a total of nine variable regions, V1-V9. The regions V2, V4 and V9 together are useful for identifying samples at both the family and order levels, while V9 seems to have a higher resolution at the genus level [5].

How to choose between ITS and 18S?

Although both ITS and 18S rRNA have proven useful for assessing fungal diversity in environmental samples, there are enough differences between them that researchers may choose to focus on only one, although sequencing for both is an option as well.

There was relatively low evolutionary pressure for the ITS1 and ITS2 sequences to remain conserved, so the ITS region tends to be hypervariable between fungal species while remaining moderately unchanged among individuals from the same species. It is therefore very well suited as a marker for species identification in the classification of fungi and is often used to study the relative abundance of fungi as well [2]. This can be useful if you need to perform a survey for genetic diversity at the species level or even within a species.

On the other hand, there was significant evolutionary pressure for the 18S rRNA gene to remain highly conserved as a component of the small eukaryotic 40S ribosomal subunit, an essential part of all eukaryotic cells. Due to this pressure, 18S is considered a potential biomarker for fungi classification above the species level and is often used in wide phylogenetic analyses and environmental biodiversity screenings [5].

In summary, the ITS region is mainly used for species-level fungal diversity studies, while 18S rRNA is mainly used for broader taxonomic and phylogenetic studies of fungi above the species level.

How can Genohub help?

Genohub’s ITS and 18S sequencing partners are experts in every step of the amplicon sequencing process, including extraction, PCR amplification and library preparation using validated primers based on the literature, and data analysis, including taxonomic assignment, diversity and richness analysis, comparative analysis, and evolutionary analysis. Our partners have experience extracting from many different types of environmental and biological samples, including soil, water, sludge, feces, and plant and animal tissue, but they can work just as well with DNA samples that you extract yourself.

We know that each research project is unique, so we have partners who are also open to working with your custom primers or your custom analysis needs! Get started today by letting us know about your ITS or 18S sequencing project here: https://genohub.com/ngs/ .

Illumina Unveils NextSeq 1000 & NextSeq 2000

Last week at the J.P. Morgan Healthcare Conference, Illumina presented their new sequencers, the NextSeq 1000 and NextSeq 2000. 

Strengths: The NextSeq 1000 and 2000 use patterned flow cells similar to the NovaSeq 6000 System that offer the highest cluster density flow cell of any on-market NGS system. To take full advantage of these higher density flow cells, they feature a novel super resolution optics system that is optimized to increase cluster brightness, reduce channel cross-talk, and improve signal-to-noise ratio. This should increase the output and reduce the cost per run compared to the previous NextSeq model (1). The system uses fluors, which both excite and emit with blue and green wavelengths. 

The major difference between the NextSeq 1000 and 2000 capacities is that only the 2000 will be able to handle the larger P3 flowcell. To compare the P2 and P3 flowcells at the 2×150 read length, the P2 flowcell will yield a similar number of clusters to the NextSeq 550 High Output kit for a similar runtime. The P3 flowcell will yield a number of clusters that is between the NovaSeq’s SP and S1 flowcells, although the run time is longer, which is likely due to the new super resolution technology. According to Illumina, the NextSeq 2000 will have a $20 per Gb cost, and the NextSeq 1000 will have a $30 per Gb cost (2).
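As a rough illustration of what a per-Gb price means for a project, the sketch below converts a genome size and target coverage into gigabases and multiplies by the quoted price. The function name and the 5 Mb example genome are hypothetical, and the estimate ignores library prep, wasted capacity on a partly filled flowcell, and instrument time.

```python
def sequencing_cost_usd(genome_size_bp: int, coverage: float, price_per_gb: float) -> float:
    """Estimate raw sequencing cost from genome size, coverage and a per-Gb price.

    Gb needed = genome size (bp) x coverage / 1e9.
    """
    gigabases = genome_size_bp * coverage / 1e9
    return gigabases * price_per_gb


# Hypothetical 5 Mb bacterial genome at 100x coverage,
# using the $20 and $30 per Gb figures quoted above.
for instrument, price in (("NextSeq 2000", 20.0), ("NextSeq 1000", 30.0)):
    cost = sequencing_cost_usd(5_000_000, 100, price)
    print(f"{instrument}: ${cost:.2f} per sample")
```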

Regarding downstream data analysis, these new sequencers also come with the DRAGEN system, which is both on-board and cloud-based. The DRAGEN (Dynamic Read Analysis for GENomics) Bio-IT Platform will enable our providers to automate a variety of genomic analyses, including BCL conversion, mapping, alignment, sorting, duplicate marking, and variant calling. According to Illumina, results can be generated in as little as 2 hours (1).

On the wet bench side of things, the NextSeq 1000 and 2000 reagents will also reduce the volume of the sequencing reactions. This volume reduction should decrease waste and minimize physical storage requirements. For example, one cartridge includes all reagents, fluidics and the waste holder (1), which will simplify library loading and instrument use. This should increase efficiency, reduce the chance of user error, lower the sequencing costs, improve recyclability and minimize waste volume. Ideally, these cost savings will then be passed on to our clients. 

Applications: According to Illumina, the new applications available on the NextSeq 1000 and 2000 are small whole-genome sequencing, whole exome sequencing and single-cell RNA-Seq (1), applications which are useful for research in oncology, genetic disease, reproductive health, agrigenomics, etc. 

As some analysis examples, the new DRAGEN Enrichment Pipeline can be applied to whole exome sequencing and targeted resequencing with alignment, small variant calling, somatic variant calling, SV/CNV calling and custom manifest files. The DRAGEN RNA Pipeline can be applied to whole transcriptome gene expression and gene fusion detection with alignment, fusion detection and gene expression. Other standardized DRAGEN pipelines include DRAGEN-GATK, DNA/RNA targeted panels and single-cell sequencing. A more complete list is available here.

Release Date: The NextSeq 2000 is available for order now, but both the NextSeq 2000 and 1000 will only be available for shipment in Q4 2020. The NextSeq 1000 has a list price of $210,000 and the NextSeq 2000 has a list price of $335,000 (2). We have already added the instrument specifications to our database, so providers can start listing their NextSeq 1000 and 2000 services as soon as they are ready.  

Overall, the new NextSeq 1000 and 2000 seem like solid desktop upgrades and also a good testing ground for the new super resolution technology. If it goes well, there may be an upgraded version of the NovaSeq unveiled in the future.

Hybrid Read Sequencing: Applications and Tools

Next-generation sequencing (Illumina) and long-read sequencing (PacBio/Oxford Nanopore) platforms each have their own strengths and weaknesses. Recent advances in single molecule real-time (SMRT) and nanopore sequencing technologies have enabled high-quality assemblies from long and inaccurate reads. However, these approaches require high coverage by long reads and remain expensive. On the other hand, the inexpensive short-read technologies produce accurate but fragmented assemblies. Combining these techniques has led to an improved approach known as hybrid sequencing.

The hybrid sequencing methods utilize the high-throughput and high-accuracy short read data to correct errors in the long reads. This approach reduces the required amount of costlier long-read sequence data and results in more complete assemblies, including the repetitive regions. Moreover, PacBio long reads can provide reliable alignments, scaffolds, and rough detections of genomic variants, while short reads refine the alignments, assemblies, and detections to single-nucleotide resolution. The high coverage of short read sequencing data output can also be utilized in downstream quantitative analysis [1].

Applications

De novo sequencing

As alternatives to using PacBio sequencing alone for eukaryotic de novo assemblies, error correction strategies using hybrid sequencing have also been developed.

  • Koren et al. developed the PacBio corrected Reads (PBcR) approach for using short reads to correct the errors in long reads [2]. PBcR has been applied to reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced parrot (Melopsittacus undulatus). The long-read correction approach has achieved >99.9% base-call accuracy, leading to substantially better assemblies than non-hybrid sequencing strategies.
  • Also, Bashir et al. used hybrid sequencing data to assemble the two-chromosome genome of a Haitian cholera outbreak strain at >99.9% accuracy in two nearly finished contigs, completely resolving complex regions with clinically relevant structures [3].
  • More recently, Goodwin et al. developed an open-source error correction algorithm, Nanocorr, specifically for hybrid error correction of Oxford Nanopore reads. They used this error correction method with complementary MiSeq data to produce a highly contiguous and accurate de novo assembly of the Saccharomyces cerevisiae genome. The contig N50 length was more than ten times greater than that of an Illumina-only assembly, with >99.88% consensus identity when compared to the reference. Additionally, this assembly offered a complete representation of the features of the genome, with correctly assembled gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly [4].
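Since contig N50 is the contiguity metric quoted above, a short sketch of how it is commonly computed may help. N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are illustrative only.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Sort contigs from longest to shortest and walk down the list until the
    running total reaches at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy comparison: a fragmented assembly versus a more contiguous one
# covering the same total length (290 units each).
fragmented = [80, 70, 50, 40, 30, 20]
contiguous = [200, 60, 30]
print(n50(fragmented), n50(contiguous))
```

A higher N50 for the same total length means that more of the assembly sits in long contigs, which is the sense in which the hybrid assembly was more contiguous than the Illumina-only one.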

Transcript structure and Gene isoform identification

Besides genome assembly, hybrid sequencing can also be applied to the error correction of PacBio long reads of transcripts. Moreover, it could improve gene isoform identification and abundance estimation.

  • Along with genome assembly, Koren et al. used the PBcR method to identify and confirm full-length transcripts and gene isoforms. As the length of the single-molecule PacBio reads from RNA-Seq experiments is within the size distribution of most transcripts, many PacBio reads represent near full-length transcripts. These long reads can therefore greatly reduce the need for transcript assembly, which requires complex algorithms for short reads, and can confidently detect alternatively spliced isoforms. However, the predominance of indel errors makes analysis of the raw reads challenging. Both sets of PacBio reads (before and after error correction) were aligned to the reference genome to determine which ones matched the exon structure over the entire length of the annotated transcripts. Before correction, only 41 (0.1%) of the PacBio reads exactly matched the annotated exon structure; after correction, this rose to 12,065 (24.1%).
  • Au et al. developed a computational tool called LSC for the correction of raw PacBio reads by short reads [5]. Applying this tool to 100,000 human brain cerebellum PacBio subreads and 64 million 75-bp Illumina short reads, they reduced the error rate of the long reads by more than 3-fold. In order to identify and quantify full-length gene isoforms, they also developed an Isoform Detection and Prediction tool (IDP), which makes use of third-generation sequencing (TGS) long reads and second-generation sequencing (SGS) short reads [6]. Applying LSC and IDP to PacBio long reads and Illumina short reads of the human embryonic stem cell transcriptome, they detected several thousand RefSeq-annotated gene isoforms at full length. IDP-fusion has also been released for the identification of fusion genes, fusion sites, and fusion gene isoforms from cancer transcriptomes [7].
  • Ning et al. developed an analysis method, HySeMaFi, to decipher gene splicing and estimate gene isoform abundance [8]. First, the method establishes the mapping relationship between the error-corrected long reads and the longest assembled contig in every corresponding gene. According to the mapping data, the true splicing pattern of the genes is detected, followed by quantification of the isoforms.

Personal transcriptomes

Personal transcriptomes are expected to have applications in understanding individual biology and disease, but short read sequencing has been shown to be insufficiently accurate for the identification and quantification of an individual’s genetic variants and gene isoforms [9].

  • Using a hybrid sequencing strategy combining PacBio long reads and Illumina short reads, Tilgner et al. sequenced the lymphoblastoid transcriptomes of three family members in order to produce and quantify an enhanced personalized genome annotation. Around 711,000 CCS reads were used to identify novel isoforms, and ∼100 million Illumina paired-end reads were used to quantify the personalized annotation, which cannot be accomplished by the relatively small number of long reads alone. This method produced reads representing all splice sites of a transcript for most sufficiently expressed genes shorter than 3 kb. It provides a de novo approach for determining single-nucleotide variations, which could be used to improve RNA haplotype inference [10].

Epigenetics research

  • Beckmann et al. demonstrated the ability of PacBio sequencing to recover previously discovered epigenetic motifs with m6A and m4C modifications in both low-coverage and high-contamination scenarios [11]. They were also able to recover many motifs from three mixed strains (E. coli, G. metallireducens, and C. salexigens), even when the motif sequences of the genomes of interest overlap substantially, suggesting that PacBio sequencing is applicable to metagenomics. Their studies suggest that hybrid sequencing would be more cost-effective than using PacBio sequencing alone to detect and accurately define k-mers for low proportion genomes.

Hybrid assembly tools

Several algorithms have been developed that can help in the single molecule de novo assembly of genomes along with hybrid error correction using the short, high-fidelity sequences.

  • Jabba is a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. It uses a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds [12]. The tool is available here: https://github.com/biointec/jabba.
  • HALC is a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement and constructs a contig graph. This tool was applied to E. coli, A. thaliana and Maylandia zebra data sets and has been shown to achieve up to 41% higher throughput than other existing algorithms while maintaining comparable accuracy [13]. HALC can be downloaded here: https://github.com/lanl001/halc.
  • The HYBRIDSPADES algorithm was developed for assembling short and long reads and benchmarked on several bacterial assembly projects. HYBRIDSPADES generated accurate assemblies (even in projects with relatively low coverage by long reads), thus reducing the overall cost of genome sequencing. This method was used to demonstrate the first complete circular chromosome assembly of a genome from single cells of Candidate Phylum TM6 using SMRT reads [14]. The tool is publicly available on this page: http://bioinf.spbau.ru/en/spades.
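Several of these tools build on a de Bruijn graph constructed from the short reads. The sketch below shows only that underlying data structure, not how Jabba or any other tool listed above implements it: each k-mer in the short reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. Graph correction, MEM seeding and the alignment of long reads are all omitted, and the function name and toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a simple de Bruijn graph from reads.

    Nodes are (k-1)-mers; every k-mer seen in a read adds an edge from its
    prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads collapse into a single path AC -> CG -> GT -> TA.
graph = build_de_bruijn(["ACGT", "CGTA"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Because overlapping short reads share k-mers, they merge into the same path in the graph, which is what gives a hybrid corrector an accurate backbone to map noisy long reads against.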

Due to the constant development of new long read error correction tools, La et al. have recently published an open-source pipeline that evaluates the accuracy of these different algorithms [15]. LRCstats analyzed the accuracy of four hybrid correction methods for PacBio long reads over three data sets and can be downloaded here: https://github.com/cchauve/lrcstats.

Sović et al. evaluated different non-hybrid and hybrid assembly methods for de novo assembly using nanopore reads [16]. They benchmarked five non-hybrid assembly pipelines and two hybrid assemblers that use nanopore sequencing data to scaffold Illumina assemblies. Their results showed that hybrid methods are highly dependent on the quality of the NGS data but much less dependent on the quality and coverage of the nanopore data, and that they performed relatively well at lower nanopore coverages. The implementation of this DNA Assembly benchmark is available here: https://github.com/kkrizanovic/NanoMark.

References:

  1. Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics, Proteomics Bioinforma. 13, 278–289 (2015).
  2. Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotech 30, 693–700 (2012).
  3. Bashir, A. et al. A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol 30, (2012).
  4. Goodwin, S. et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res 25, (2015).
  5. Au, K. F., Underwood, J. G., Lee, L. & Wong, W. H. Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS One 7, e46679 (2012).
  6. Au, K. F. et al. Characterization of the human ESC transcriptome by hybrid sequencing. Proc. Natl. Acad. Sci. 110, E4821–E4830 (2013).
  7. Weirather, J. L. et al. Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Res. 43, e116 (2015).
  8. Ning, G. et al. Hybrid sequencing and map finding (HySeMaFi): optional strategies for extensively deciphering gene splicing and expression in organisms without reference genome. 7, 43793 (2017).
  9. Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods 10, 1177 (2013).
  10. Tilgner, H., Grubert, F., Sharon, D. & Snyder, M. P. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl. Acad. Sci. 111, 9869–9874 (2014).
  11. Beckmann, N. D., Karri, S., Fang, G. & Bashir, A. Detecting epigenetic motifs in low coverage and metagenomics settings. BMC Bioinformatics 15, S16 (2014).
  12. Miclotte, G. et al. Jabba: hybrid error correction for long sequencing reads. Algorithms Mol. Biol. 11, 10 (2016).
  13. Bao, E. & Lan, L. HALC: High throughput algorithm for long read error correction. BMC Bioinformatics 18, 204 (2017).
  14. Antipov, D., Korobeynikov, A., McLean, J. S. & Pevzner, P. A. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2016).
  15. La, S., Haghshenas, E. & Chauve, C. LRCstats, a tool for evaluating long reads correction methods. Bioinformatics (2017). doi:10.1093/bioinformatics/btx489
  16. Sović, I., Križanović, K., Skala, K. & Šikić, M. Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads. Bioinformatics 32, 2582–2589 (2016).

 

Choosing the Right NGS Instrument for Your Research

If you’re about to embark on a high throughput sequencing project, choosing the right sequencing instrument is an important consideration. Perhaps you’re replicating a published study or repeating an experiment from previous work, and the instrument you plan to use is already known. If not, the choice of instrument should be based on the sequencing goal you are trying to achieve. Instrument features to take into consideration include: number of reads per run, read length, read type (paired or single end), error type, turnaround time and price. Using Genohub’s Shop by Project page, you can enter the number of required reads or coverage you need and instantly compare instruments, filtering by read length and sorting by turnaround time and price. To give a better idea of the differences between NGS instruments, we’ve generated the following comparison: Table 1.
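If you know your target coverage but not the number of reads, the two are related by a standard back-of-the-envelope formula: reads ≈ genome size × coverage / bases per read. The sketch below is a rough planning aid under that formula, not a description of how the Shop by Project page calculates anything. The function name and the example values are hypothetical, and for paired-end runs the result counts read pairs (one pair per sequenced fragment).

```python
import math


def reads_required(genome_size_bp: int, coverage: float, read_length_bp: int,
                   paired_end: bool = False) -> int:
    """Estimate the reads (or read pairs, if paired_end) needed for a target coverage.

    Uses reads = genome size x coverage / bases per read, where a paired-end
    fragment contributes two reads of read_length_bp each.
    """
    bases_per_fragment = read_length_bp * (2 if paired_end else 1)
    return math.ceil(genome_size_bp * coverage / bases_per_fragment)


# Hypothetical 5 Mb bacterial genome at 30x coverage with 150 bp reads.
print(reads_required(5_000_000, 30, 150))                    # single-end reads
print(reads_required(5_000_000, 30, 150, paired_end=True))   # read pairs
```

In practice it is worth padding the estimate to allow for duplicates, low-quality reads and uneven coverage.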

Certain instruments are ideally suited to specific applications. Illumina instruments are versatile and ideal for a variety of sequencing applications, including: de novo assembly, resequencing, transcriptome, SNP detection and metagenomic studies. The HiSeq and GAIIx instruments are both suited for analyzing large animal or plant genomes. High-level multiplexing of samples is possible when analyzing species with a smaller genome size. While the Illumina MiSeq outputs significantly fewer reads (Table 1), its read lengths are significantly longer, making it ideal for small genomes, sequencing long variable domains or targeted regions within a genome. The only real limitation of the Illumina platform is its relatively short reads compared to other platforms (Roche 454 and PacBio).

The Ion PGM (Ion Torrent) is ideal for amplicons, small genomes or targeting of small regions within a genome. Its low throughput makes it ideal for smaller sized studies. The Ion Proton, however, is capable of generating significantly larger outputs (Table 1), making sequencing of transcriptomes, exomes and medium sized genomes possible.

The PacBio RS/RS II breaks the mold of other short-read, high throughput sequencing instruments by focusing on length. The reads, averaging ~4.6 kb, are significantly longer than those of other sequencing platforms, making it ideal for sequencing small genomes such as those of bacteria or viruses. Other advantages include its ability to sequence regions of high G/C content and determine the status of modified bases (methylation, hydroxymethylation) without the need for chemical conversion during library preparation. The instrument’s low output of reads prevents it from being useful for assembly of medium to large genomes.

The Roche 454 FLX+ is typically used in studies where read length is critical. These include de novo assemblies of microbial genomes, BACs and plastids. Its long read length has made it a favorite of those examining 16S variable regions and other targeted amplicon sequences. The lower output of the FLX and FLX+ instruments makes them less cost-effective for transcriptome or larger genome studies. Roche has announced that it will stop producing the 454 in 2015 and end servicing in mid-2016.

The SOLiD series of instruments are high throughput, generating a large number of short reads. De novo sequencing, differential transcript expression and resequencing are all viable applications of the SOLiD platform. The weakness of the platform is its short reads, which make assembly very difficult.

If you’re still not sure about what NGS instrument to choose for your work, feel free to contact us for our complimentary sequencing project consultation.