Gas station without pumps

2011 July 5

More on the banana slug mitochondrion

Last week I thought I was done with assembling the mitochondrial genome of UCSC’s beloved banana slug mascot Ariolimax dolichophallus. All that was left was finding some wet-lab volunteers to do some PCR to disambiguate a repeat region.  I even blogged about it.  Despite that, I’ve spent the past week still working on the genome.

First, I had to come up with primers for the wet-lab person to order to do the long-range PCR.  That took me longer than it should have, because I’d never designed primers before, and used the wrong set of parameters for Primer3. That is, I used the default set, but was later told that the SantaLucia 1998 set is preferable.  I also had to find out the favorite melting temperature for the people doing the wet-lab work (they preferred 55°C).  Even after that, the high AT content of the banana-slug mitochondrial genome required some fussing to get adequate GC content in the primer and a GC clamp at the 3′ end (all this was new to me, so it took me over a day to get primers that satisfied everyone).  I believe that one or two pairs of primers have now been ordered for doing the long-range PCR.  When that is working, I’ll have to design more primers for doing Sanger sequencing.

After doing the primer design, I decided to use my new “look-for-exit” Python program to see if the Illumina reads that I had identified as mitochondrial suggested any other variants on the assembled genome.  I’d already used the program to find lots of variants of the 615-long repeat, but I’d not applied it to the whole mitochondrial genome.  When I did, I found another region about 150 base pairs upstream of the long repeats that seemed to be another repeat region.  So I ended up spending another day or two extracting all the variants of that repeat (which seems to be about 20 copies of a 185-long sequence).  I also designed primers for doing long-range PCR over the short repeat region also, though that went much faster than my first attempt.

When I had all variants I could find of both the short and the long repeats, I did another mapping of all the Illumina reads with BWA, to identify the mitochondrial ones.  I expected the number of reads to go up slightly, as reads that matched the new repeat blocks were kept.  Instead the number went down!  Something was wrong. The single-ended merged reads behaved as expected, but fewer paired-end reads were being kept, even though I was asking BWA for read pairs in which either read mapped.  It turns out that if both ends map, but BWA doesn’t like the separation, it throws the pair out.  Because I now had more copies of the repeats, but probably in the wrong order, BWA sometimes found a better mapping, but then didn’t keep the pair, because the separation was too big.  So I had to redo the mapping setting the maximum insert size to be bigger than the length of the genome.

Now I have lots of reads, and look-for-exit does not find any more variants supported by more than 7 reads (there is one C->T change that has 7 reads supporting it, but over 500 supporting the C, so that may just be sequencing error or a rare somatic mutation due to deamination).

Notes on this stuff have been put on the mitochondrion page of the banana slug genomics wiki.

So I’m ready to put the mitochondrial genome aside and do some work on another project, right? 

Well, I thought so, but I woke up in the middle of the night with more things to do.

First I tried to figure out how to annotate the genes of the mitochondrial genome.  There does not seem to be a high-quality pipeline like RAST, which I’ve used for bacterial annotation.  There is a crude web tool called dogma, but it produced essentially unusable outputs, without even the option of saving the results as a GENBANK record.  It did suggest that I had all the standard protein genes except ATP8, which I already knew from doing BLAST searches.  (It turns out that losing ATP8 is quite common in mollusks, as well as some other clades, so losing it is no big surprise.)

I’ll probably have to write my own ORF finder to nail down the ends of the protein genes, since no one bothers with ORF finders for the 14 or so different mitochondrial genetic codes.  Most people apparently blast the genome to find the protein genes, then adjust the ends of the ORFs by hand—I might end up doing that myself, though my computer science soul cringes a bit at doing things by hand.  With only 12 or 13 proteins, though, it may not be worth the trouble of programming (unless I decide to do a population study that requires annotating lots of mitochondrial genomes).

I also ran the draft genome (with the repeat variants present but still not correctly ordered) through tRNAscan-SE to find the tRNAs.  The dogma server had found hundreds of tRNAs (the long repeat includes some tRNA genes), but tRNAscan-SE only found 7 tRNA genes.  I looked in mitochondrion genome papers and found that was a common result for mollusk mitochondria—people were using BLAST to find the tRNA genes (there are usually around 20).  I talked with Todd Lowe, the author of tRNAscan-SE, and he admitted that they had never developed good models for mitochondrial tRNAs.  Slugs and snails may be particularly difficult, as it seems there is some post-transcriptional modification to make the aminoacyl tails base pair.  I’ve put the tRNA search aside for now (though I’ll probably look for the tRNAs using BLAST), but Todd and I will look at it together later this summer, perhaps putting together some mitochondrial tRNA models to improve tRNAscan-SE.  There are lots of annotated mitchondrial genomes out there, so it should not be too hard to put together some decent covariance models of the tRNAs.

So the annotation is mostly on hold now.  Can I get some sleep or work on something else?

No, I can’t do that, because I’m still being bugged by the fact that I haven’t quite used all the data I have to try to order the repeats.  I’m now working on a new Python program to try mapping the paired-ends to the genome to try to come up with better ordering of the repeats.  BWA is not much use to me for this, as I’ve not figured out a way to get it to report multiple hits (one read is in repeat S04, and its partner is later in the genome in S02 or S08, for example).  So I’m doing crude mapping using suffix arrays and trying to figure out how to get the right balance between tolerating sequencing error and identifying subtly different repeat blocks.  I’m also trying to figure out how much I should use the quality information (if at all).  Maybe when I’ve got this program working, I’ll be able to sleep through the night and go back to working on a different project.

3 Comments »

  1. Repeats seem to be a big issues with this genome. Fortunately for me, the genome is having very few repeats and closely related annotated mitochondrial genome to compare to.

    Soap denovo also gave the complete circular assembly in the form of one contig and another scaffold which taken together give a circular genome.

    What is the kind of coverage you obtained across the genome? I have been trying to find some quantitative value of the copy number for the different cell types. It seems the coverage for data from different tissue types is very different.

    Comment by Nagarjun — 2011 July 9 @ 02:39 | Reply

    • Repeats are a big problem in almost every genome I’ve worked with (except Vibrio cholerae). In one genome (Helicobacter pylori) the repeat is in one of the most important genes of the pathenogicity island.

      Repeats have been a problem in other mitochondrial sequencing projects. I understand from Ed Green that the dog mitochondria have a repeat region that they decided to leave unresolved (as resolving it was not useful for determining pedigree).

      My coverage on the mitochondrial genome is pretty good. I get different estimates with different methods, because of the heavy repeat content, but the range is about 259x to 275x (see https://banana-slug.soe.ucsc.edu/computer_resources:assemblies:mitochondrion#finishing_the_genome ) for a genome length of 21k to 21.5k bases. (The higher coverage estimate is from uncorrected reads, the lower coverage from quake-corrected reads.)

      What mitochondrial genome are you working on that has no repeats and closely related other mitochondria? How are you doing the tRNA finding and gene finding? If you haven’t been using jellyfish and quake to correct the reads with 19-mer counts, try it. I also found Abyss to produce slightly longer contigs on the high coverage data than SOAPdenovo, though it is much more of a hassle to install and run. Abyss did walk cheerfully through one of the repeat regions, simplifying it down to one and half copies, while SOAPdenovo more correctly stopped, so the longer contig was not necessarily a win.

      Comment by gasstationwithoutpumps — 2011 July 9 @ 07:18 | Reply

  2. […] More on the banana slug mitochondrion […]

    Pingback by Banana Slug genome crowd funding | Gas station without pumps — 2014 October 22 @ 21:20 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: