Mycoplasma Contamination in your Sequencing Data


Mycoplasma, the bane of any cell culture lab’s existence, is a genus of bacteria characterized by the lack of a cell wall. With a relatively small genome, mycoplasma have limited biosynthetic capabilities and require a host to replicate efficiently. Inspired by a bout of mycoplasma contamination in their own lab, Anthony O Olarerin-George and John B Hogenesch from the University of Pennsylvania recently set out to determine how widespread mycoplasma contamination was in other labs by screening RNA-seq data deposited in the NCBI Sequence Read Archive (1). Their study estimates that ~11% of NCBI’s Gene Expression Omnibus (GEO) projects between 2012 and 2013 contain at least 100 reads per million reads mapping to mycoplasma’s small 0.6 Mb genome. They also reference a recent study (2) which suggests that 7% of the samples from the 1,000 Genomes project are contaminated. This is bad news if you’ve recently completed a large study and are wondering why you have so many unmapped reads. While most unmapped reads are likely from regions of the genome that haven’t been sequenced, reads mapping to mycoplasma should be taken seriously, because contamination can affect the expression of thousands of genes and slow cellular growth.

Preventing contamination in the first place, along with routine monitoring, is essential. If you’ve already completed the sequencing end of your project, you can check for contamination by aligning your data to several completed mycoplasma genomes.
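Once reads have been aligned, the 100 reads per million threshold described above is simple to check. The sketch below is only an illustration: the sample names and read counts are hypothetical, and it assumes the denominator is the total number of reads in the sample, which the text does not spell out.

```python
def mycoplasma_rpm(mycoplasma_reads: int, total_reads: int) -> float:
    """Reads mapping to mycoplasma per million total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mycoplasma_reads * 1_000_000 / total_reads


def is_flagged(mycoplasma_reads: int, total_reads: int,
               threshold_rpm: float = 100.0) -> bool:
    """True if the sample meets or exceeds the reads-per-million threshold."""
    return mycoplasma_rpm(mycoplasma_reads, total_reads) >= threshold_rpm


# Hypothetical counts: (reads mapped to mycoplasma, total reads)
samples = {
    "sample_A": (4_500, 30_000_000),
    "sample_B": (600, 25_000_000),
}

for name, (myco, total) in samples.items():
    rpm = mycoplasma_rpm(myco, total)
    print(f"{name}: {rpm:.1f} reads/million, flagged={is_flagged(myco, total)}")
```

With these made-up numbers, sample_A (150 reads per million) is flagged and sample_B (24 reads per million) is not.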

With recent drops in cost, routine sequencing of cell culture samples has become more prevalent. If you’re interested in testing your cultures, start by searching for sequencing services and providers on Genohub.

1) Assessing the prevalence of mycoplasma contamination in cell culture via a survey of NCBI’s RNA-seq archive. Anthony O Olarerin-George, John B Hogenesch doi:

2) Mycoplasma contamination in the 1000 Genomes Project. William B Langdon

Beginner’s Guide to Exome Sequencing

Exome Capture Kit Comparison

With decreasing costs to sequence whole human genomes (currently $1,550 for 35X coverage), we frequently hear researchers ask, “Why should I only sequence protein coding genes?”

First, WGS of entire populations is still quite expensive. These types of projects are currently only being performed by large centers or government entities, like Genomics England, a company owned by the UK’s Department of Health, which announced that it would sequence 100,000 whole genomes by 2017. At Genohub’s rate of $1,550/genome, 100,000 genomes would cost $155 million USD. This $155 million figure only includes sequencing costs and does not take into account labor, data storage and analysis, which are likely several fold greater.
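The arithmetic behind that figure can be reproduced in a couple of lines. Both inputs come from the text; the variable names are just for illustration.

```python
COST_PER_GENOME_USD = 1_550   # quoted price for one genome at 35X coverage
N_GENOMES = 100_000           # size of the announced project

# Sequencing cost only; labor, storage and analysis are not included.
sequencing_cost = COST_PER_GENOME_USD * N_GENOMES
print(f"${sequencing_cost:,}")  # $155,000,000
```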

Second, the exome (all ~180,000 exons) comprises less than 2% of the sequence in the human genome but contains 85-90% of all known disease-causing variants. A more focused dataset makes interpretation and analysis a lot easier.

Let’s assume you’ve decided to proceed with exome sequencing. The next step is to either find a service provider to perform your exome capture, sequencing and analysis or do it yourself. Genohub has made it easy to find and directly order sequencing services from providers around the world. Several of our providers offer exome library prep and sequencing services. If you’re only looking for someone to help with your data analysis, you can contact one of our providers offering exome bioinformatics services. Whether you decide to send your samples to a provider or make libraries yourself, you’ll need to decide on what capture technology to use, the number of reads you’ll need and what type of read length is most appropriate for your exome-seq project.

There are currently three main capture technologies available: Agilent SureSelect, Illumina Nextera Rapid Capture, and Roche Nimblegen SeqCap EZ Exome. All three are in-solution based and use biotinylated DNA or RNA probes (baits) that are complementary to exons. These probes are added to genomic fragment libraries and, after a period of hybridization, magnetic streptavidin beads are used to pull down and enrich for fragmented exons. The three exome capture technologies are compared in a detailed table. The kits differ in number of probes, probe length, target region, input DNA requirements and hybridization time. Researchers planning on exome sequencing should first determine whether the technology they’re considering covers their regions of interest. Only 26.2 Mb of total targeted bases are common to all three, and only small portions of the CCDS Exome are uniquely covered by each technology (Chilamakuri, 2014).
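One way to check whether a kit covers your regions of interest is to intersect them with the kit’s target intervals. The sketch below is a minimal pure-Python illustration: it assumes targets are available as BED-style half-open (chrom, start, end) tuples, and the coordinates shown are hypothetical rather than taken from any vendor’s design file.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(region, targets):
    """Fraction of `region` (chrom, start, end) covered by `targets`."""
    chrom, r_start, r_end = region
    same_chrom = [(s, e) for c, s, e in targets if c == chrom]
    covered = 0
    for s, e in merge_intervals(same_chrom):
        overlap = min(e, r_end) - max(s, r_start)
        if overlap > 0:
            covered += overlap
    return covered / (r_end - r_start)


# Hypothetical capture targets and region of interest
kit_targets = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 120)]
region_of_interest = ("chr1", 250, 350)

print(covered_fraction(region_of_interest, kit_targets))  # 0.5
```

Merging the targets first avoids double-counting bases where probes overlap. For real design files, a dedicated interval tool is likely a better choice than this toy version.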

Our Exome Guide breaks down the steps you’ll need to determine how much sequencing and what read length is appropriate for your exome capture sequencing project.
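As a rough back-of-the-envelope illustration (not necessarily the method used in the Exome Guide), the number of reads needed can be estimated from the target size, the desired mean coverage, the read length and the fraction of reads that land on target. All input values below are hypothetical, and the estimate ignores duplicate reads and uneven coverage.

```python
import math


def reads_needed(target_bp: int, mean_coverage: float,
                 read_length_bp: int, on_target_fraction: float) -> int:
    """Estimate reads required to reach a mean on-target coverage."""
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    usable_bases_per_read = read_length_bp * on_target_fraction
    return math.ceil(target_bp * mean_coverage / usable_bases_per_read)


# Hypothetical inputs: 50 Mb target, 100X coverage, 100 bp reads, 50% on target
estimate = reads_needed(50_000_000, 100, 100, 0.5)
print(f"{estimate:,} reads")  # 100,000,000 reads
```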

rRNA Depletion / Poly-A Selection Responsible for Coverage Bias in RNA-seq

Using a pool of 1,062 in vitro transcribed (IVT) human cDNA plasmids, a group from the University of Pennsylvania sought to characterize coverage biases in RNA-seq experiments. Their paper, titled “IVT-seq reveals extreme bias in RNA-sequencing”, was published last week.

The authors cleverly use a carefully controlled set of IVT cDNA clones whose base composition and expression levels are known. Mixing the IVT set with mouse total RNA, they found greater than 2-fold differences in transcript coverage among 50% of their transcripts, with 10% showing up to 10-fold differences. When IVT cDNA clones are sequenced alone, in the absence of a complex genomic milieu, the authors acknowledge biases that arise from random priming, adapter ligation, and amplification, but identify polyA selection and ribosomal depletion as the main causes of RNA coverage bias. In their experiment, they consider hexamer entropy, GC-content and sequence similarity to rRNA, and they use coverage variability as an indicator of coverage bias, along with depth of coverage as measured by FPKM. They demonstrate a significant correlation between a transcript’s similarity to rRNA and larger differences in coverage between libraries that undergo rRNA depletion and those that do not.
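To make the quantities mentioned above more concrete, here is a minimal sketch of how they might be computed. FPKM follows its standard definition. The coefficient of variation as a measure of coverage variability and Shannon entropy over hexamer frequencies are assumptions made for illustration, since the exact definitions used in the paper are not given here. All example values are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import mean, pstdev


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def hexamer_entropy(seq: str) -> float:
    """Shannon entropy (bits) of overlapping 6-mer frequencies (assumed definition)."""
    seq = seq.upper()
    kmers = [seq[i:i + 6] for i in range(len(seq) - 5)]
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def coverage_cv(per_base_coverage) -> float:
    """Coefficient of variation (std / mean) of per-base coverage (assumed metric)."""
    mu = mean(per_base_coverage)
    if mu == 0:
        return 0.0
    return pstdev(per_base_coverage) / mu


def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 500 fragments, 2,000 bp long, 10 million mapped fragments
print(fpkm(500, 2_000, 10_000_000))     # 25.0
print(gc_content("ATGC"))               # 0.5
print(coverage_cv([10, 10, 10, 10]))    # 0.0 (perfectly even coverage)
```

A transcript with perfectly even coverage has a coefficient of variation of 0, and larger values indicate more uneven coverage along the transcript.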

Overall, their method demonstrates that library preparation introduces significant biases in RNA-seq data and that carefully controlled synthetic test transcripts allow users to measure this bias accurately. Developing such controlled sets should enable further refinement of current library preparation practices.