Ribo-Seq: Understanding the effect of translational regulation on protein abundance in the cell

Examining changes in gene expression has become one of the most powerful tools in molecular biology today. However, the correlation between mRNA expression and protein levels is often poor. Thus, being able to identify precisely which transcripts are being actively translated, and the rate at which they are being translated, could be a huge boon to the field and give us more insight into which genes are carried through all the way from the mRNA to the protein level–and Ribo-seq (also known as ribosome profiling) technology gives us just that!

Historic nature of ribosome profiling

Ribo-seq is based upon the much older technique of in vitro ribosome footprinting, which stretches back nearly 50 years ago and was used by Joan Steitz and Marilyn Kozak in important studies to map the locations of translation initiation [1, 2]. Due to the technological limitations of the time, these experiments were performed with cell-free in vitro translational systems. These days, we can actually extract actively translating ribosomes from cells and directly observe their locations on the mRNAs they are translating!

Method

So how does this innovative new technique work? The workflow is actually remarkably simple.

  1. We start by lysing the cells by first flash-freezing them, and then harvesting them in the presence of cyclohexamide (see explanation for this under ‘Drawbacks and complications’).
  2. Next, we treat the lysates with RNase 1, which digests the part of the mRNA not protected by the ribosome.
  3. The ribosomes are then separated using a sucrose cushion and centrifugation at very high speeds.
  4. RNA from the ribosome fraction obtained above are then purified with a miRNeasy kit and then gel purified to obtain the 26 – 34 nt region. These are the ribosome footprints.
  5. From there, the RNA is dephosphorylated and the linker DNA is added.
  6. The hybrid molecule is then subjected to reverse transcription into cDNA.
  7. The cDNA is then circularized, PCR amplified, and then used for deep sequencing.

Ribo-seq vs. RNA-seq

Ribosome profiling as a next-generation sequencing technique was developed quite recently by Nicholas Ingolia and Jonathan Weissman [3, 4]. One of their most interesting findings was that there is a roughly 100-fold range of translation efficiency across the yeast transcriptome, meaning that just because an mRNA is very abundant, that does not mean that it is highly translated. They concluded that translation efficiency, which cannot be measured by RNA-seq experiments, is a significant factor in whether or not a gene makes it all the way from an mRNA to a protein product.

Additionally, they looked at the correlation between the abundance of proteins (measured by mass spectrometry) and either the data obtained from Ribo-seq or RNA-seq. They found that Ribo-seq measurements had a much higher correlation with protein abundance than RNA-seq ( = 0.60 vs. = 0.17), meaning that Ribo-seq is actually a better measurement of gene expression analysis (depending on the type of experiment you’re interested in performing).

Of course, there are still significant advantages to RNA-seq over Ribo-seq–Ribo-seq will not be able to capture the expression of non-coding RNAs, for instance. Additionally, RNA-seq is considerably cheaper and easier to perform as of this moment. However, I believe that we are likely to see a trend towards ribosome profiling as this technique becomes more mature.

What else can we learn from ribosome profiling?

Ribosome profiling has already taught us many new things, including:

  • discovering that RNAs that had previously been thought to be non-coding RNAs due to their short length are actually translated, and indeed code for short peptide sequences, the exact functions of which remain unknown. [5]
  • detection of previously unknown translated short upstream ORFs (uORFs), which often possess a non-AUG start codon. These uORFs are likely responsible for regulating the protein-coding ORFs (which is true of the GCN4 gene)[6], though it remains to be seen if that is true for all uORFs or if they have other currently unknown functionalities.
  • determination of the approximate translation elongation rate (330 codons per minute)
  • examples of ribosome pausing or stalling at consecutive proline codons in eukaryotes and yeast [7, 6]

But who knows what else we will learn in the future? This technique can teach us a lot about how gene expression can be regulatated at the translational level. Additionally, we can learn a lot about how translation affects various diseases states, most notably cancer, since cellular stress will very likely affect both translation rate and regulation.

Drawbacks and complications 

While this technique is extremely powerful, there are a few drawbacks. The most prominent amongst them is that any attempt to lyse and harvest the cells for this procedure causes a change in the ribosome profile, making this technique particularly vulnerable to artefacts. Researchers often attempt to halt translation before harvesting with a 5 minute incubation with cyclohexamide, a drug that blocks translation elongation, to prevent ribosome run-off; however, this can result in an enormous increase in ribosome signal at the initiation sites, as ribosomes will still initiate translation and begin to pile up.

The best method of combatting these artefacts is to flash-freeze the cells prior to harvesting, lyse over them dry ice, and then continue the protocol in the presence of cyclohexamide. This technique should result in the best balance between prevention of run-off, and prevention of excessive ribosome accumulation at the initiation site [8].

Conclusions

Our understanding of the mechanisms involved in regulation of translation has been sorely limited by our lack of ability to study it directly. Ribosome profiling now provides a method for us to do just that. We’ve already made huge strides in our understanding of many events in the translation process, including the discovery of hundreds of non-canonical translation initiation sites as well the realization that not all ‘non-coding’ RNAs are non-coding after all! I expect that we’ll continue to see this technique put to new and innovate questions about translation and its role in the cell as the technology matures.

If you’re interested in Ribo-Seq services enter your basic project parameters on Genohub and send us a request. We’ll be happy to help.

 

 

6 QC methods post library construction for NGS

After nucleic acid extraction and sample QC, the next step in the NGS workflow is library preparation. NGS libraries are prepared to meet the platform requirements with respect to size, purity, concentration and efficient ligation of adaptors. Assessing the quality of a sequencing library before committing it to a full-scale sequencing run ensures maximum sequencing efficiency, leading to accurate sequencing data with more even coverage.

In this blog post, we list the various ways to QC libraries in order of most stringent to least stringent.

1. qPCR

qPCR is a method of quantifying DNA based on PCR. qPCR tracks target concentration as a function of PCR cycle number to derive a quantitative estimate of the initial template concentration in a sample. As with conventional PCR, it uses a polymerase, dNTPs, and two primers designed to match sequences within a template. For the QC protocol, the primers match sequences within the adapters flanking a sequencing library.

Therefore, qPCR is an ideal method for measuring libraries in advance of generating clusters, because it will only measure templates that have both adaptor sequences on either end which will subsequently form clusters on a flow cell. In addition, qPCR is a very sensitive method of measuring DNA and therefore dilute libraries with concentrations below the threshold of detection of conventional spectrophotometric methods can be quantified by qPCR.

KAPA Biosystems SYBR FAST ‘Library Quantification Kit for Illumina Sequencing Platforms is commonly used with qPCR. This kit measures absolute numbers of molecules containing the Illumina adapter sequences, thus providing a highly accurate measurement of amplifiable molecules available for cluster generation.

2. MiSeq

The MiSeq system uses the same library prep methods and proven sequencing by synthesis chemistry as the HiSeq system. Thus, it is ideal for analyzing prepared libraries prior to performing high-throughput sequencing. Performing library quality control (QC) using the MiSeq system before committing it to a fullscale HiSeq run can save time and money while leading to better sequencing results.

Data generated by the MiSeq system is comparable to other Illumina next-generation sequencing platforms, ensuring a smooth transition from one instrument to another. Based on the individual experimental requirements, metrics obtained from performing simple QC can be used to streamline and improve your sequencing projects.

Using a single library prep method and taking only a single day, detailed QC parameters, including cluster density, library complexity, percent duplication, GC bias, and index representation can be generated on the MiSeq system. The MiSeq system has the unique ability to do paired-end (PE) sequencing for accurately assessing insert size. Library cluster density can also be determined and used to predict HiSeq cluster density, maximizing yield and reducing rework.

3. Fluorometric method

Quantifying DNA libraries using a fluorometric method that involves intercalating dyes specifically binding to DNA or RNA is highly useful. This method is very precise as DNA dyes do not bind to RNA and vice versa.

The Invitrogen™ Qubit™ Fluorometer a popular fluorometer that accurately measures DNA, RNA, and protein using the highly sensitive Invitrogen™ Qubit™ quantitation assays. The concentration of the target molecule in the sample is reported by a fluorescent dye that emits a signal only when bound to the target, which minimizes the effects of contaminants—including degraded DNA or RNA—on the result.

4. Automated electrophoresis

Several automated electrophoretic instruments are useful in estimating the size of the NGS libraries. The Agilent 2100 Bioanalyzer system provides sizing, quantitation, and purity assessments for DNA, RNA, and protein samples. The Agilent 2200 TapeStation system is a tape-based platform for reliable electrophoresis platform for accurate size selection of generated libraries. PerkinElmer LabChip GX can be used for DNA and RNA quantitation and sizing using automated capillary electrophoresis separation. The Qiagen QIAxcel Advanced system fully automates sensitive, high-resolution capillary electrophoresis of up to 96 samples per run that can be used for library QC as well. All these instruments are accompanied by convenient analysis and data documentation software that make the library QC step faster and easier.

5. UV-Visible Spectroscopy

A UV-Vis spectrophotometer can be used to analyze spectral absorbance to measure the nucleic acid libraries and can differentiate between DNA, RNA and other absorbing contaminants. However, this method is not super accurate and should be paired with one of the other QC methods to ensure high-quality libraries. There are several US-Vis spectrophotometers currently available, such as currently available such as Thermo Scientific™ NanoDrop™ UV-Vis spectrophotometer, Qiagen QIAExpert System, Shimadzu Biospec-nano etc.

6. Bead normalization

This is the preferred QC method if < 12 libraries are to be QCed or if library yields are less than 15 nM, highly variable and unpredictable or Users are working with uncharacterized genomes and are inexperienced with the Nextera XT DNA Library Prep Kit protocol.

During bead-based normalization, DNA is bound to normalization beads and eluted off the beads at approximately the same concentration for each sample. Bead-based normalization enables scientists to bypass time-consuming library quantitation measurements and manual pipetting steps before loading libraries onto the sequencer. Bead-based normalization can provide significant cost and time savings for researchers processing many samples, or for researchers without access to any of the QC  instruments listed in the above methods.

 

 

 

 

Top 3 Sample QC steps prior to library preparation for NGS

Before beginning library preparation for next-generation sequencing, it is highly recommended to perform sample quality control (QC) to check the nucleic acid quantity, purity and integrity. The starting material for NGS library construction might be any type of nucleic acid that is or can be converted into double-stranded DNA (dsDNA). These materials, often gDNA, RNA, PCR amplicons, and ChIP samples, must have high purity and integrity and sufficient concentration for the sequencing reaction.

1. Nucleic Acid Quantification

Measuring the concentration of nucleic acid samples is a key QC step to determine the fit and amount of nucleic acid available for further processing.

  • Absorbance Method:

A UV-Vis spectrophotometer can be used to analyze spectral absorbance to measure the whole nucleic acid profile and can differentiate between DNA, RNA and other absorbing contaminants. Different molecules such as nucleic acids, proteins, and chemical contaminants absorb light in their own pattern. By measuring the amount of light absorbed at a defined wavelength, the concentration of the molecules of interest can be calculated. Most laboratories are equipped with a US-Vis spectrophotometer to quantify nucleic acids or proteins for their day-to-day experiments. Customers can choose from several spectrophotometers currently available such as Thermo Scientific™ NanoDrop™ UV-Vis spectrophotometer, Qiagen QIAExpert System, Shimadzu Biospec-nano etc.

  • Fluorescence Method:

Fluorescence methods are more sensitive than absorbance, particularly for low-concentration samples, and the use of DNA-binding dyes allows more specific measurement of DNA than spectrophotometric methods. Fluorescence measurements are set at excitation and emission values that vary depending on the dye chosen (Hoechst bis-benzimidazole dyes, PicoGreen® or QuantiFluor™ dsDNA dyes). The concentration of unknown samples is calculated based on comparison to a standard curve generated from samples of known DNA concentration.

The availability of single-tube and microplate fluorometers gives flexibility for reading samples in PCR tubes, cuvettes or multiwell plates and makes fluorescence measurement a convenient modern alternative to the more traditional absorbance methods. Thermo Scientific (Invitrogen) Qubit™ Fluorometer is one of the most commonly used fluorometers that accurately measure low concentration DNA, RNA, and protein.

sho-qubit-instrument

2. Nucleic Acid Purity

Nucleic acid samples can become contaminated by other molecules with which they were co-extracted and eluted during the purification process or by chemicals from upstream applications. Purification methods involving phenol extraction, ethanol precipitation or salting-out may not completely remove all contaminants or chemicals from the final eluates. The resulting impurities can significantly decrease the sensitivity and efficiency of your downstream enzymatic reactions.

  • UV spectrophotometry measurements enable calculation of nucleic acid concentrations based on the sample’s absorbance at 260 nm. The absorbance at 280 nm and 230 nm can be used to assess the level of contaminating proteins or chemicals, respectively. The absorbance ratio of nucleic acids to contaminants provides an estimation of the sample purity, and this number can be used as acceptance criteria for inclusion or exclusion of samples in downstream applications.
  • Contaminants such as RNA, proteins or chemicals can interfere with library preparation and the sequencing reactions. When sequencing DNA, an RNA removal step is highly recommended, and when sequencing RNA, a gDNA removal step is recommended. Sample purity can be assessed following nucleic acid extraction and throughout the library preparation workflow using UV/Vis spectrophotometry. For DNA and RNA samples the relative abundance of proteins in the sample can be assessed by determining the A260/A280ratio, which should be between 1.8–2.0. Contamination by organic compounds can be assessed using the A260/A230 ratio, which should be higher than 2.0 for DNA and higher than 1.5 for RNA. Next-generation spectrophotometry with the Qiagen QIAxpert system enables spectral content profiling, which can discriminate DNA and RNA from sample contaminants without using a dye.

19647-15451

  • qPCR:

Quantitative PCR, or real-time PCR, (qPCR) uses the linearity of DNA amplification to determine absolute or relative quantities of a known sequence in a sample. By using a fluorescent reporter in the reaction, it is possible to measure DNA generation in the qPCR assay. In qPCR, DNA amplification is monitored at each cycle of PCR. When the DNA is in the log-linear phase of amplification, the amount of fluorescence increases above the background. The point at which the fluorescence becomes measurable is called the threshold cycle (CT) or crossing point. By using multiple dilutions of a known amount of standard DNA, a standard curve can be generated of log concentration against CT. The amount of DNA or cDNA in an unknown sample can then be calculated from its CT value.

qPCR-based assays can accurately qualify and quantify amplifiable DNA in challenging samples. For example, DNA derived from Formalin-fixed paraffin-embedded tissue samples, is oftentimes highly fragmented, cross-linked with protein and has a high proportion of single-stranded DNA making it challenging to perform library preparation steps. For FFPE samples, the Agilent NGS FFPE QC kit enables functional DNA quality assessment of input DNA.

3. Nucleic Acid Integrity (Size distribution)

Along with quantity and purity, size distribution is a critical QC parameter that provides valuable insight into sample quality. Analyzing nucleic acid size informs you about your sample’s integrity and indicates whether the samples are fragmented or contaminated by other DNA or RNA products. Various electrophoretic methods can be used to assess the size distribution of your sample.

  • Agarose Gel Electrophoresis

In this method, a horizontal gel electrophoresis tank with an external power supply, analytical-grade agarose, an appropriate running buffer (e.g., 1X TAE) and an intercalating DNA dye along with appropriately sized DNA standards are required. A sample of the isolated DNA is loaded into a well of the agarose gel and then exposed to an electric field. The negatively charged DNA backbone migrates toward the anode. Since small DNA fragments migrate faster, the DNA is separated by size. The percentage of agarose in the gel will determine what size range of DNA will be resolved with the greatest clarity. Any RNA, nucleotides, and protein in the sample migrate at different rates compared to the DNA so the band(s) containing the DNA will be distinct.

gel_electrophoresis_dna_bands_yourgenome

Analyzing PCR amplicons or RFLP fragments confirms the presence of the expected size fragments and alerts you to the presence of any non-specific amplicons. Electrophoresis also helps you assess the ligation efficiency yield for plasmid cloning procedures as well as the efficiency of removal of primer–dimers or other unspecific fragments during sample cleanup.

For complex samples such as genomic DNA (gDNA) or total RNA, the shape and position of the smear from electrophoresis analysis directly correlates with the integrity of the samples. Nucleic acid species of larger size tend to be degraded first and provide degradation products of lower molecular weight. Samples of poor integrity generally have a higher abundance of shorter fragments, while high-quality samples contain intact nucleic acid molecules with higher molecular size.

Eukaryotic RNA samples have unique electrophoretic signatures, which consist of a smear with major fragments corresponding to 28S, 18S and 5S ribosomal RNA (rRNA). These electrophoretic patterns correlate with the integrity of the RNA samples. The RNA integrity can either be assessed manually or with automation that employs a dedicated algorithm such as the RNA Integrity Score (RIS) that gives an objective integrity grade to RNA samples ranging from 1–10. RNA samples of highest quality usually have a score of 8 or above.

  • Capillary Electrophoresis

In this method, charged DNA or RNA molecules are injected into a capillary and are resolved during migration through a gel-like matrix. Nucleic acids are detected as they pass by a detector that captures signals of specific absorbance. Results are presented in the form of an electropherogram, which is a plot of signal intensity against migration time. The fragment sizes are precisely determined using a size marker consisting of fragments of known size. This method provides highly resolving and sensitive nucleic acid analysis that is faster and safer.

 

 

Hybrid Read Sequencing: Applications and Tools

Next-generation sequencing (Illumina) and long read sequencing (PacBio/Oxford Nanopore) platforms each have their own strengths and weaknesses. Recent advances in single molecule real-time (SMRT) and nanopore sequencing technologies have enabled high-quality assemblies from long and inaccurate reads. However, these approaches require high coverage by long reads and remain expensive. On the other hand, the inexpensive short reads technologies produce accurate but fragmented assemblies. Thus, the combination of these techniques led to a new improved approach known as hybrid sequencing.

The hybrid sequencing methods utilize the high-throughput and high-accuracy short read data to correct errors in the long reads. This approach reduces the required amount of costlier long-read sequence data as well as results in more complete assemblies including the repetitive regions. Moreover, PacBio long reads can provide reliable alignments, scaffolds, and rough detections of genomic variants, while short reads refine the alignments, assemblies, and detections to single-nucleotide resolution. The high coverage of short read sequencing data output can also be utilized in downstream quantitative analysis1.

Applications

De novo sequencing

As alternatives to using PacBio sequencing alone for eukaryotic de novo assemblies, error correction strategies using hybrid sequencing have also been developed.

  • Koren et al. developed the PacBio corrected Reads (PBcR) approach for using short reads to correct the errors in long reads2. PBcR has been applied to reads generated by a PacBio RS instrument from phage, prokaryotic and eukaryotic whole genomes, including the previously unsequenced parrot (Melopsittacus undulates) The long-read correction approach, has achieved >99.9% base-call accuracy, leading to substantially better assemblies than non-hybrid sequencing strategies.
  • Also, Bashir et al. used hybrid sequencing data to assemble the two-chromosome genome of a Haitian cholera outbreak strain at >99.9% accuracy in two nearly finished contigs, completely resolving complex regions with clinically relevant structures3.
  • More recently, Goodwin et al. developed an open-source error correction algorithm Nanocorr, specifically for hybrid error correction of Oxford Nanopore reads. They used this error correction method with complementary MiSeq data to produce a highly contiguous and accurate de novo assembly of the Saccharomyces cerevisiae The contig N50 length was more than ten times greater than an Illumina-only assembly with >99.88% consensus identity when compared to the reference. Additionally, this assembly offered a complete representation of the features of the genome with correctly assembled gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly4.

Transcript structure and Gene isoform identification

Besides genome assembly, hybrid sequencing can also be applied to the error correction of PacBio long reads of transcripts. Moreover, it could improve gene isoform identification and abundance estimation.

  • Along with genome assembly, Koren et al. used the PBcR method to identify and confirm full-length transcripts and gene isoforms. As the length of the single-molecule PacBio reads from RNA-Seq experiments is within the size distribution of most transcripts, many PacBio reads represent near full-length transcripts. These long reads can therefore greatly reduce the need for transcript assembly, which requires complex algorithms for short reads and confidently detect alternatively spliced isoforms. However, the predominance of indel errors makes analysis of the raw reads challenging. Both sets of PacBio reads (before and after error-correction) were aligned to the reference genome to determine the ones that matched the exon structure over the entire length of the annotated transcripts. Before correction, only 41 (0.1%) of the PacBio reads exactly matched the annotated exon structure that rose to 12, 065 (24.1%) after correction.
  • Au et al. developed a computational tool called LSC for the correction of raw PacBio reads by short reads5. Applying this tool to 100,000 human brain cerebellum PacBio subreads and 64 million 75-bp Illumina short reads, they reduced the error rate of the long reads by more than 3-fold. In order to identify and quantify full-length gene isoforms, they also developed an Isoform Detection and Prediction tool (IDP), which makes use of TGS long reads and SGS short reads6. Applying LSC and IDP to PacBio long reads and Illumina short reads of the human embryonic stem cell transcriptome, they detected several thousand RefSeq-annotated gene isoforms at full-length. IDP-fusion has also been released for the identification of fusion genes, fusion sites, and fusion gene isoforms from cancer transcriptomes7.
  • Ning et al. developed an analysis method HySeMaFi to decipher gene splicing and estimate the gene isoforms abundance8. Firstly, the method establishes the mapping relationship between the error-corrected long reads and the longest assembled contig in every corresponding gene. According to the mapping data, the true splicing pattern of the genes is detected, followed by quantification of the isoforms.

Personal transcriptomes

Personal transcriptomes are expected to have applications in understanding individual biology and disease, but short read sequencing has been shown to be insufficiently accurate for the identification and quantification of an individual’s genetic variants and gene isoforms9.

  • Using a hybrid sequencing strategy combining PacBio long reads and Illumina short reads, Tilgner et al. sequenced the lymphoblastoid transcriptomes of three family members in order to produce and quantify an enhanced personalized genome annotation. Around 711,000 CCS reads were used to identify novel isoforms, and ∼100 million Illumina paired-end reads were used to quantify the personalized annotation, which cannot be accomplished by the relatively small number of long reads alone. This method produced reads representing all splice sites of a transcript for most sufficiently expressed genes shorter than 3 kb. It provides a de novo approach for determining single-nucleotide variations, which could be used to improve RNA haplotype inference10.

Epigenetics research

  • Beckmann et al. demonstrated the ability of PacBio sequencing to recover previously-discovered epigenetic motifs with m6A and m4C modifications in both low-coverage and high-contamination scenarios11. They were also able to recover many motifs from three mixed strains ( E. coliG. metallireducens, and C. salexigens), even when the motif sequences of the genomes of interest overlap substantially, suggesting that PacBio sequencing is applicable to metagenomics. Their studies infer that hybrid sequencing would be more cost-effective than using PacBio sequencing alone to detect and accurately define k-mers for low proportion genomes.

Hybrid assembly tools

Several algorithms have been developed that can help in the single molecule de novo assembly of genomes along with hybrid error correction using the short, high-fidelity sequences.

  • Jabba is a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. It uses a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds12. The tool is available here: https://github.com/biointec/jabba.
  • HALC is a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement and constructs a contig graph. This tool was applied on E. coliA. thaliana and Maylandia zebra data sets and has been showed to achieve up to 41 % higher throughput than other existing algorithms while maintaining comparable accuracy13. HALC can be downloaded here:  https://github.com/lanl001/halc.
  • The HYBRIDSPADES algorithm was developed for assembling short and long reads and benchmarked on several bacterial assembly projects. HYBRIDSPADES generated accurate assemblies (even in projects with relatively low coverage by long reads), thus reducing the overall cost of genome sequencing. This method was used to demonstrate the first complete circular chromosome assembly of a genome from single cells of Candidate Phylum TM6using SMRT reads14. The tool is publicly available on this page: http://bioinf.spbau.ru/en/spades.

Due to the constant development of new long read error correction tools, La et al. have recently published an open-source pipeline that evaluates the accuracy of these different algorithms15. LRCstats analyzed the accuracy of four hybrid correction methods for PacBio long reads over three data sets and can be downloaded here: https://github.com/cchauve/lrcstats.

Sović et al. evaluated the different non-hybrid and hybrid assembly methods for de novo assembly using nanopore reads16. They benchmarked five non-hybrid assembly pipelines and two hybrid assemblers that use nanopore sequencing data to scaffold Illumina assemblies. Their results showed that hybrid methods are highly dependent on the quality of NGS data, but much less on the quality and coverage of nanopore data and performed relatively well on lower nanopore coverages. The implementation of this DNA Assembly benchmark is available here: https://github.com/kkrizanovic/NanoMark.

References:

  1. Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics, Proteomics Bioinforma. 13, 278–289 (2015).
  2. Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotech 30, 693–700 (2012).
  3. Bashir, A. et al. A hybrid approach for the automated finishing of bacterial genomes. Nat Biotechnol 30, (2012).
  4. Goodwin, S. et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res 25, (2015).
  5. Au, K. F., Underwood, J. G., Lee, L. & Wong, W. H. Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS One 7, e46679 (2012).
  6. Au, K. F. et al. Characterization of the human ESC transcriptome by hybrid sequencing. Proc. Natl. Acad. Sci. 110, E4821–E4830 (2013).
  7. Weirather, J. L. et al. Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Res. 43, e116 (2015).
  8. Ning, G. et al. Hybrid sequencing and map finding (HySeMaFi): optional strategies for extensively deciphering gene splicing and expression in organisms without reference genome. 7, 43793 (2017).
  9. Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq.(ANALYSIS OPEN)(Report). Nat. Methods 10, 1177 (2013).
  10. Tilgner, H., Grubert, F., Sharon, D. & Snyder, M. P. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl. Acad. Sci. 111, 9869–9874 (2014).
  11. Beckmann, N. D., Karri, S., Fang, G. & Bashir, A. Detecting epigenetic motifs in low coverage and metagenomics settings. BMC Bioinformatics 15, S16 (2014).
  12. Miclotte, G. et al. Jabba: hybrid error correction for long sequencing reads. Algorithms Mol. Biol. 11, 10 (2016).
  13. Bao, E. & Lan, L. HALC: High throughput algorithm for long read error correction. BMC Bioinformatics 18, 204 (2017).
  14. Antipov, D., Korobeynikov, A., McLean, J. S. & Pevzner, P. A. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2016).
  15. La, S., Haghshenas, E. & Chauve, C. LRCstats, a tool for evaluating long reads correction methods. Bioinformatics (2017). doi:10.1093/bioinformatics/btx489
  16. Sović, I., Križanović, K., Skala, K. & Šikić, M. Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads . Bioinformatics 32, 2582–2589 (2016).

 

International biological material shipment information for various countries

Many scientific researchers prefer to outsource their next generation sequencing projects to commercial service providers to get access to the latest instruments and scientific expertise.

However, there are some countries in the world that do not allow the export of biological samples (tissue samples, DNA, RNA etc.) or require several formal agreements and multi-level clearance.

In this post, we’ll highlight some general information about shipping samples out of several major countries, primarily to the US. Some of this is based on our experience working with many international researchers who use Genohub to outsource their sequencing.

China

China, for example, does not allow the import or export of biological samples, as confirmed by multiple courier service agents1. Major Chinese service providers require biological samples to be shipped to their Hong Kong address to avoid delay or loss of samples2,3.

In a rare situation, a Chinese group of researchers was able to ship DNA samples to the US using FedEx. They have also detailed their experience and have some advice regarding sample shipment that can be potentially useful to other groups willing to do the same4.

Brazil

To export biological material from Brazil, several documents such as Material Transfer Agreement and Institutional invoice of specimen exported, are required for customs clearance. A detailed cover letter in both Portuguese and English that can help Customs officials in Brazil (IBAMA) and the USA (USFWS) properly assess the authorization to export and import specimens is also required5. It could take several weeks to obtain these documents so researchers need to plan their work in advance.

India

Until 2016, The Indian Council of Medical Research made decisions on shipment of biological samples on a case-by-case basis6. However, these regulations have since been lifted since August 2016 and researchers have to follow several guidelines for biological materials to qualify for transport to foreign countries for research purposes7.

According to a FedEx India employee, a non-infectious certificate from an authentic laboratory and a detailed description of the included biological samples is sufficient for customs clearance from India. Any pathogenic material is not allowed to be shipped internationally.

Europe

We haven’t come across any issues shipping samples from European countries and generally, a properly declared biological shipment can be exported without any hassles.

The current Universal Postal Union regulations for shipping biological material have been comprehensively summarized in an official document. This document also lists the countries that allow or ban the import/export of biological substances8.

Please consult our shipping guide for more details on how to prepare your shipment to ship samples to USA – https://genohub.com/dna-rna-shipping-for-ngs/#USA.

If you know of any countries that require a lot of formal paperwork for export of biological substances for research or sequencing purposes, feel free to comment below. I’ll update the blog with this information.

References:

(1)     China Country Snapshot https://smallbusiness.fedex.com/international/country-snapshots/china.html.

(2)     Sample Preparation; Shipping – Novogene https://en.novogene.com/support/sample-preparation/.

(3)     Sample submission guidelines – BGI http://www.bgisample.com/yangbenjianyi/BGI-TS-03-12-01-001 Suggestions for Sample Delivery(NGS) B0.pdf.

(4)     Community/ZJU-China Letter about Shipping DNA – 2015.igem.org http://2015.igem.org/Community/ZJU-China_Letter_about_Shipping_DNA.

(5)     Shipping and Customs http://symbiont.ansp.org/ixingu/shipping/index.html.

(6)    Centre removes ICMR approval for import/export of human biological samples http://www.dnaindia.com/india/report-centre-removes-icmr-approval-for-importexport-of-human-biological-samples-2245910.

(7)     Indian Council of Medical Research http://icmr.nic.in/ihd/ihd.htm.

(8)     WFCC Regulations http://www.wfcc.info/pdf/wfcc_regulations.pdf

PacBio vs. Oxford Nanopore sequencing

Long-read sequencing developed by Pacific Biosciences and Oxford Nanopore overcome many of the limitations researchers face with short reads. Long reads improve de novo assembly, transcriptome analysis (gene isoform identification) and play an important role in the field of metagenomics. Longer reads are also useful when assembling genomes that include large stretches of repetitive regions.

Currently, there are two long read sequencing platforms. To help a researcher choose between which platform has greater utility for their application, we compare overall instrument specifications offered by PacBio and Oxford Nanopore, and published applications in the next-generation sequencing space.

Capturea Oxford Nanopore charges an access fee that gives users one MinION/PromethIon instrument, a starter pack of consumables, certain data services, and community-based support

* Insufficient data

Although both PacBio and Oxford Nanopore generate longer reads compared to short read Illumina or Ion sequencing, the higher error rate of both the PacBio and Oxford Nanopore sequencers remain an issue needs addressing. Whereas PacBio reads a molecule multiple times to generate high-quality consensus data, Oxford Nanopore can only sequence a molecule twice. As a result, PacBio generates data with lower error rates compared to Oxford Nanopore. PacBio has a slightly better overall performance for applications such as the discovery of transcriptome complexity and sensitive identification of isoforms. On the other hand, MinION provides higher throughput as nanopores can sequence multiple molecules simultaneously. Hence, it is best suited for applications that require a larger amount of data9

As long reads can provide large scaffolds, de novo assembly is one of the main applications of PacBio sequencing5. Though the error rate of PacBio data is higher than that of short read Illumina or Ion sequencing, increased coverage or hybrid sequencing can greatly improve the accuracy of genome assembly. PacBio sequencing has been successfully used to finish the 100-contig draft genome of Clostridium autoethanogenum DSM 10061, a Class III, the most complex genome classification in terms of repeat content and repeat type. It has a 31.1% GC content and contains repeats, prophage, and nine copies of rRNA gene operons. Using a single PacBio library and sequencing it with two SMRT cells, an entire genome can be assembled de novo with a single contig. When short read Illumina or Ion sequencing was used alone with the same genome, >22 contigs were needed, and each of the assemblies contained at least four collapsed repeat regions, PacBio assemblies had none10.

PacBio sequencing has also been used to assemble the chloroplast genome of Potentilla micrantha11, Saccharomyces cerevisiae, Aradopsis thaliana and Drosophila melanogaster using fewer contigs and CPU time for assembly compared to assemblies using Illumina sequencers12.

PacBio sequencing of PCR products can be used to improve the quality of current draft genomes by closing gaps and sequencing through hairpin structures and areas of high GC content more efficiently than Sanger sequencing13.

Pacific Biosciences has developed a protocol, Iso-Seq, for transcript sequencing. This includes library construction, size selection, sequencing data collection, and data processing. Iso-Seq allows direct sequencing of transcripts up to 10 kb without the use of a reference genome. Iso-Seq has been used to characterize alternative splicing events involved in the formation of blood cellular components14. This is essential for interpreting the effects of mutations leading to inherited disorders and blood cancers, and can be applied to design strategies to advance transplantation and regenerative medicine.

Another major application of PacBio sequencing is in epigenetics research. Recent studies demonstrate that investigation of intercellular heterogeneity in previously undetectable genome DNA modifications (such as m6A and m4C) is facilitated by the direct detection of modifications in single molecules by PacBio sequencing15.

Compared to PacBio, the Oxford Nanopore MinION is small (size of a USB thumb drive), affordable, utilizes a simple library prep and is field portable16. This is useful in situations such as a virus outbreak where a mobile diagnostic laboratory can be set up using MinIONS. In remote regions such as parts of Brazil and Africa where there are logistical issues associated with shipping samples for sequencing, MinION can provide immediate and real-time data to scientific investigators. The most notable clinical use of MinION has been the analysis of Ebola samples on-site during the viral outbreak in West Africa17,18.

The low cost of sequencing and portability of the MinION sequencer also make it a useful tool for teaching. It has been used to provide hands-on experience to students, most recently at Columbia University and the University of California Santa Cruz, where every student performed their own MinION sequencing19.

Perhaps the most ambitious MinION application is its potential to detect and identify bacteria and viruses on manned space flights. In a proof-of-concept experiment, Castro-Wallace et al. demonstrated successful sequencing and de novo assembly of a lambda phage genome, an E. coli genome, and a mouse mitochondrial genome. They observed that there was no significant difference in the quality of sequence data generated on the International Space Station and in control experiments that were performed in parallel on Earth22.

Recently, Oxford Nanopore developed a bench-top instrument, PromethION, that provides high-throughput sequencing and is modular in design. It contains 48 flow cells that can be run individually or in parallel. The PromethION flow cells contain 3000 channels each, and produce up to 40 Gb of data.

 

References:

  1. Pacific Biosciences – AllSeq. Available at: http://allseq.com/knowledge-bank/sequencing-platforms/pacific-biosciences/.
  2. Jain, M., Olsen, H. E., Paten, B. & Akeson, M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 17, 239 (2016).
  3. Lu, H., Giordano, F. & Ning, Z. Oxford Nanopore MinION Sequencing and Genome Assembly. Genomics. Proteomics Bioinformatics 14, 265–279 (2016).
  4. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. bioRxiv (2017).
  5. Jain, M. et al. MinION Analysis and Reference Consortium: Phase 2 data release and analysis of R9.0 chemistry [version 1; referees: awaiting peer review]. F1000Research 6, (2017).
  6. Rhoads, A. & Au, K. F. PacBio Sequencing and Its Applications. Genomics, Proteomics Bioinforma. 13, 278–289 (2015).
  7. MinION. Available at: https://nanoporetech.com/products/minion.
  8. PromethION Early Access Programme. Available at: https://nanoporetech.com/community/promethion-early-access-programme.
  9. Oxford Nanopore in 2016. Available at: http://blog.booleanbiotech.com/nanopore_2016.html.
  10. Weirather, J. L. et al. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Research 6, 100 (2017).
  11. Brown, S. D. et al. Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia. Biotechnol. Biofuels 7, 40 (2014).
  12. Ferrarini, M. et al. An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics 14, 670 (2013).
  13. Berlin, K. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotech 33, 623–630 (2015).
  14. Zhang, X. et al. Improving genome assemblies by sequencing PCR products with PacBio. Biotechniques 53, 61–62 (2012).
  15. Chen, L. et al. Transcriptional diversity during lineage commitment of human blood progenitors. Science (80-. ). 345, (2014).
  16. Feng, Z., Li, J., Zhang, J.-R. & Zhang, X. qDNAmod: a statistical model-based tool to reveal intercellular heterogeneity of DNA modification from SMRT sequencing data. Nucleic Acids Res. 42, 13488–13499 (2014).
  17. Jain, M., Olsen, H. E., Paten, B. & Akeson, M. Erratum to: The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 17, 256 (2016).
  18. Quick, J. et al. Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228–232 (2016).
  19. Hoenen, T. et al. Nanopore sequencing as a rapidly deployable Ebola outbreak tool. Emerg. Infect. Dis. 22, 331–334 (2016).
  20. Citizen Sequencers: Taking Oxford Nanopore’s MinION to the Classroom and Beyond – Bio-IT World. Available at: http://www.bio-itworld.com/2015/12/9/citizen-sequencers-taking-oxford-nanopores-minion-classroom-beyond.html.
  21. Castro-Wallace, S. L. et al. Nanopore DNA Sequencing and Genome Assembly on the International Space Station. bioRxiv (2016).

Sequencing trends in early 2017

Every month, ~5,000 unique queries for sequencing are submitted using Genohub’s NGS project matching engine: https://genohub.com/ngs/. Briefly, a user chooses the NGS application they are interested in (e.g. exome, RNA-Seq), the number of reads or coverage they’d like to achieve and the number of samples they plan on sequencing. Genohub’s matching engine, takes this input calculates the sequencing output required to meet the desired coverage and recommends services, filterable by sequencing instrument, read length, and library preparation kit. Results can be sorted by price, turnaround time and selected for immediate ordering.

Every query that’s submitted is recorded giving us a unique perspective into what types of NGS services researchers are actually interested in.

DNA-Seq

First, it’s important to note that DNA-seq is our default option in the matching engine: https://genohub.com/ngs/. Due to this bias, you can’t really compare it to other services being ordered so it’s a good idea to just throw away this data point. Of DNA-seq services that are actually ordered, this breaks down into: whole human genome sequencing, re-sequencing, and metagenomics sequencing. The most frequently used instruments for this service are currently the HiSeq X, HiSeq 3000/4000 and NextSeq. With PacBio’s release of the Sequel, requests have significantly increased this quarter compared to PacBio service requests in the last 4 quarters. We expect this trend to continue through 2017.

RNA-Seq

The pie chart above breaks down the types of RNA-seq services requested in the first three months of 2017. Total RNA-seq represents all applications where rRNA is depleted prior to library preparation, whereas mRNA-seq represents all applications where mRNA is enriched. In 2016, the number of Total RNA-seq projects was half that of this year. We attribute this to a growing interest in non-coding RNA and the availability of higher throughput sequencing runs. As sequencing costs drop and rRNA depletion becomes more affordable, researchers are asking for more biological information.  Today, the Nextseq and HiSeq 3000/4000 are the most commonly used instruments for any RNA-seq application. Counting applications continue to dominate, although requests for de novo transcriptome alignments are steady rising over the previous year. Whereas in the past, 1×50 and 1×75 were the most frequently requested read length for RNA counting applications, around 2x more researchers are requesting paired-end sequencing versus last year.

Methylation analysis

Compared to last year, there is an increased interest in WGBS as compared to RRBS and MeDIP. With the advent of the HiSeq X and it’s compatibility with WGBS applications, more researchers are finding whole genome based applications easier and more informative than reduced representation bisulfite sequencing.

Instrument trends

By far the biggest trend this year was the number of long read requests on the PacBio Sequel. Whereas in the past, Mate-pair library prep was more popular, we’re starting to see this service decline, and long read sequencing be ordered more frequently. Hybrid Ilumina/PacBio reads are also being more frequently ordered to improve the quality of assemblies. Long-reads are being requested to detect functional elements in human genomes that are missed by short-read sequencing. We should add that requests for 10X Genomics services have started to increase, although they are too small right now to make any meaningful comments. We currently don’t have providers offering Oxford Nanopore services on Genohub, so can’t comment here either.

This month NovaSeq services are expected to be available on Genohub. We expect there to be a lag phase as kinks are worked out, before this becomes a popular instrument request.

The future

Having spent the last 4 years receiving sequencing requests and performing consultation, it’s clear that new technology does influence behavior. With reduced sequencing costs, we see clients not only including more control and duplicates, but also looking at RNA-seq from a more global perspective, and beginning to become more interested in long reads. Clients that previously only performed exome-seq are now turning to whole genome sequencing on the HiSeq X. Researchers that normally only look at coding RNA’s are starting to show interest in long non-coding and small RNAs. Overall, faster and cheaper sequencing does tend to promote better science. Gone are the n=1 days of sequencing.