Gas station without pumps

2011 July 5

More on the banana slug mitochondrion

Last week I thought I was done with assembling the mitochondrial genome of UCSC’s beloved banana slug mascot Ariolimax dolichophallus. All that was left was finding some wet-lab volunteers to do some PCR to disambiguate a repeat region.  I even blogged about it.  Despite that, I’ve spent the past week still working on the genome.

First, I had to come up with primers for the wet-lab person to order to do the long-range PCR.  That took me longer than it should have, because I’d never designed primers before, and used the wrong set of parameters for Primer3. That is, I used the default set, but was later told that the SantaLucia 1998 set is preferable.  I also had to find out the favorite melting temperature for the people doing the wet-lab work (they preferred 55°C).  Even after that, the high AT content of the banana-slug mitochondrial genome required some fussing to get adequate GC content in the primer and a GC clamp at the 3′ end (all this was new to me, so it took me over a day to get primers that satisfied everyone).  I believe that one or two pairs of primers have now been ordered for doing the long-range PCR.  When that is working, I’ll have to design more primers for doing Sanger sequencing.

After doing the primer design, I decided to use my new “look-for-exit” Python program to see if the Illumina reads that I had identified as mitochondrial suggested any other variants on the assembled genome.  I’d already used the program to find lots of variants of the 615-long repeat, but I’d not applied it to the whole mitochondrial genome.  When I did, I found another region about 150 base pairs upstream of the long repeats that seemed to be another repeat region.  So I ended up spending another day or two extracting all the variants of that repeat (which seems to be about 20 copies of a 185-long sequence).  I also designed primers for doing long-range PCR over the short repeat region also, though that went much faster than my first attempt.

When I had all variants I could find of both the short and the long repeats, I did another mapping of all the Illumina reads with BWA, to identify the mitochondrial ones.  I expected the number of reads to go up slightly, as reads that matched the new repeat blocks were kept.  Instead the number went down!  Something was wrong. The single-ended merged reads behaved as expected, but fewer paired-end reads were being kept, even though I was asking BWA for read pairs in which either read mapped.  It turns out that if both ends map, but BWA doesn’t like the separation, it throws the pair out.  Because I now had more copies of the repeats, but probably in the wrong order, BWA sometimes found a better mapping, but then didn’t keep the pair, because the separation was too big.  So I had to redo the mapping setting the maximum insert size to be bigger than the length of the genome.

Now I have lots of reads, and look-for-exit does not find any more variants supported by more than 7 reads (there is one C->T change that has 7 reads supporting it, but over 500 supporting the C, so that may just be sequencing error or a rare somatic mutation due to deamination).

Notes on this stuff have been put on the mitochondrion page of the banana slug genomics wiki.

So I’m ready to put the mitochondrial genome aside and do some work on another project, right? 

Well, I thought so, but I woke up in the middle of the night with more things to do.

First I tried to figure out how to annotate the genes of the mitochondrial genome.  There does not seem to be a high-quality pipeline like RAST, which I’ve used for bacterial annotation.  There is a crude web tool called dogma, but it produced essentially unusable outputs, without even the option of saving the results as a GENBANK record.  It did suggest that I had all the standard protein genes except ATP8, which I already knew from doing BLAST searches.  (It turns out that losing ATP8 is quite common in mollusks, as well as some other clades, so losing it is no big surprise.)

I’ll probably have to write my own ORF finder to nail down the ends of the protein genes, since no one bothers with ORF finders for the 14 or so different mitochondrial genetic codes.  Most people apparently blast the genome to find the protein genes, then adjust the ends of the ORFs by hand—I might end up doing that myself, though my computer science soul cringes a bit at doing things by hand.  With only 12 or 13 proteins, though, it may not be worth the trouble of programming (unless I decide to do a population study that requires annotating lots of mitochondrial genomes).

I also ran the draft genome (with the repeat variants present but still not correctly ordered) through tRNAscan-SE to find the tRNAs.  The dogma server had found hundreds of tRNAs (the long repeat includes some tRNA genes), but tRNAscan-SE only found 7 tRNA genes.  I looked in mitochondrion genome papers and found that was a common result for mollusk mitochondria—people were using BLAST to find the tRNA genes (there are usually around 20).  I talked with Todd Lowe, the author of tRNAscan-SE, and he admitted that they had never developed good models for mitochondrial tRNAs.  Slugs and snails may be particularly difficult, as it seems there is some post-transcriptional modification to make the aminoacyl tails base pair.  I’ve put the tRNA search aside for now (though I’ll probably look for the tRNAs using BLAST), but Todd and I will look at it together later this summer, perhaps putting together some mitochondrial tRNA models to improve tRNAscan-SE.  There are lots of annotated mitchondrial genomes out there, so it should not be too hard to put together some decent covariance models of the tRNAs.

So the annotation is mostly on hold now.  Can I get some sleep or work on something else?

No, I can’t do that, because I’m still being bugged by the fact that I haven’t quite used all the data I have to try to order the repeats.  I’m now working on a new Python program to try mapping the paired-ends to the genome to try to come up with better ordering of the repeats.  BWA is not much use to me for this, as I’ve not figured out a way to get it to report multiple hits (one read is in repeat S04, and its partner is later in the genome in S02 or S08, for example).  So I’m doing crude mapping using suffix arrays and trying to figure out how to get the right balance between tolerating sequencing error and identifying subtly different repeat blocks.  I’m also trying to figure out how much I should use the quality information (if at all).  Maybe when I’ve got this program working, I’ll be able to sleep through the night and go back to working on a different project.

2011 June 21

Banana slug mitochondrial genome done (almost)

Filed under: Uncategorized — gasstationwithoutpumps @ 19:12
Tags: , , ,
Ariolimax dolichophallus at UCSC

Banana slug on UCSC campus (same species, but not same individual as the one being sequenced). Image via Wikipedia

Today I released a draft sequence for the banana slug (Ariolimax dolichophallus) mitochondrial genome on the Banana Slug Genomics wiki.  Some rough notes about the assembly are on the wiki on a page set aside for the mitochondrion and there is a link there to the fasta sequence file.

This unfunded project was a spinoff of the larger (also unfunded) project to sequence the entire genome for the banana slug.  We had one run of Illumina paired-end data, with a rather small fragment length donated by the UCSC sequencing center (it was a test or training run for a new technician or a new machine, I believe).  They have donated some other data for the project, but most has been too low coverage to be of any real use.

I have twice taught a class on assembling the banana slug genome, learning the material myself along with the grad students.  We have no where near enough data (particularly, no data from large DNA fragments) to assemble the whole genome: there are about 2Gbases in the genome and we’re getting an N50 of about 232 base pairs—less than the read length in some technologies!

The mitochondrion, however, is small (about 20k bases) and over represented in the data (probably about 175x coverage), so I thought it would be easy to pull out the mitochondrial reads and assemble them.

It was doable, but nowhere near as easy as I expected.  The tiny DNA fragments we had were not long enough to span even a single copy of a repeat in a nasty repeat region in the mitochondrial genome, and I had to write special-purpose software just to close the circle and get all the mitochondrial reads.  The closest previously sequenced mitochondria are so dissimilar that using them to select reads was not going to help.

After more than 40 attempts, I finally got a complete genome, with (I think) all the variants of the repeat sequence, but with no data to use to order the repeats.  I’ll be setting this project aside now, unless some wet-lab person volunteers to do buy some primers and do some PCR to disambiguate the repeats.

Update:

The closest sequenced mitochondrial genomes at NCBI are

NC_010220.1     Biomphalaria tenagophila mitochondrion, complete genome
>gb|EF433576.1| Biomphalaria tenagophila strain Taim-RS mitochondrion, complete genome

NC_005439.1 Biomphalaria glabrata mitochondrion, complete genome
>gb|AY380531.1| Biomphalaria glabrata strain 1742 mitochondrion, complete genome

name max score total score query coverage E-value max identity
Biomphalaria tenagophila 2679     4235     51%     0.0     71%
Biomphalaria glabrata 2562    3934     52%     0.0     69%

%d bloggers like this: