Gas station without pumps

2019 March 27

Comparing 23andme and Dante Labs data

Filed under: Uncategorized — gasstationwithoutpumps @ 06:22
Tags: , , , ,

I got the grading for the last lab of winter quarter done yesterday (I took me several days longer than I expected, even allowing for an hour per paper—they took me more like 2 hours each). I have to turn in grades today, and I just found out last night that the graders had not finished grading homework 11, so I need to grade that also.

Before I found out that I have unexpectedly even more grading to do, I had taken an hour to write a short Python program to compare my data from 23andme with my data from Dante Labs. The two seem very concordant, and so I now believe I have gotten good data from Dante Labs:

23_and_me                              | vcf from Dante Labs
638531 genotype_sites                  | 3499617 genotype_sites
  chr   no-call haploid diploid matches|   chr   no-call haploid diploid mismatches
  chrM    306    3995       0      0   |   chrM      0      21       0      2   
  chr1   1177       0   48337  11756   |   chr1      0       0  267787     65   
  chr2   1121       0   50654  12303   |   chr2      0       0  287274     59   
  chr3   1013       0   42011  10184   |   chr3      0       0  240607     43   
  chr4   1006       0   38468   9719   |   chr4      0       0  261070     56   
  chr5    885       0   36147   8733   |   chr5      0       0  212325     41   
  chr6    956       0   43067   8505   |   chr6      0       0  204102     53   
  chr7    857       0   33500   8348   |   chr7      0       0  203647     44   
  chr8    626       0   31057   7690   |   chr8      0       0  183159     21   
  chr9    700       0   25746   6431   |   chr9      0       0  148637     65   
  chr10   656       0   29869   7494   |   chr10     0       0  177729     35   
  chr11   605       0   30337   7523   |   chr11     0       0  174926     28   
  chr12   677       0   28755   7366   |   chr12     0       0  165987     26   
  chr13   392       0   21688   5571   |   chr13     0       0  133928     30   
  chr14   472       0   19489   4871   |   chr14     0       0  111553     23   
  chr15   452       0   18554   4757   |   chr15     0       0  103635     16   
  chr16   504       0   19893   5019   |   chr16     0       0  107064     20   
  chr17   510       0   18891   4370   |   chr17     0       0   89744     24   
  chr18   307       0   17368   4591   |   chr18     0       0  100640     12   
  chr19   551       0   14366   3554   |   chr19     0       0   75073     29   
  chr20   295       0   14486   3603   |   chr20     0       0   73566     19   
  chr21   227       0    8380   2261   |   chr21     0       0   52060     18   
  chr22   244       0    8671   2073   |   chr22     0       0   45153     12   
  chrX   1033   14970     527   3663   |   chrX      0   74099    1801     36   
  chrY    506    3226       1    161   |   chrY      0    1393    2637      2   

  total 16078   22191  600262 150546   |   total     0   75513 3424104    779                     

Count of types of genotype
   CT   41086                          |    CT  696904
   AG   40444                          |    AG  696669
   CC  142899                          |    CC  358678
   GG  142411                          |    GG  358891
   TT  104122                          |    TT  320081
   AA  104648                          |    AA  319335
   GT    9760                          |    GT  173342
   AC    9810                          |    AC  172678
   CG     321                          |    CG  178718
   AT     215                          |    AT  148808
   C     5797                          |    C    18824
   G     5495                          |    G    19064
   T     5343                          |    T    18873
   A     5272                          |    A    18752
   --   16078                          |    --       0
   II    3245                          |    II       0
   DD    1259                          |    DD       0
   I      195                          |    I        0
   D       89                          |    D        0
   DI      42                          |    DI       0

There are only 779 sites where both 23andme and DanteLabs call a variant and disagree about what it is—a 0.5% disagreement, which is lower than I would have expected given the differences in the technology and the error rates of DNA chips. I think that 23andme is being fairly conservative and not calling many of the low-quality hybridization reads.

The biggest difference seems to be that Dante Labs does not cover the mitochondrion—the very small number of variant calls there could be mismapping of reads from homologous regions of the nuclear genome. Of course, 23andme does extremely thorough coverage of the mitochondrion, in order to get as much maternal haplotype data as feasible. If you are looking for maternal ancestry information or mitochondrial variants related to disease, the Dante Labs whole-genome sequencing is not the way to go.

The 23andme data also has a lot of coverage of the Y chromosome, in an attempt to get as much paternal haplotype information as possible, but the VCF file has few calls on the Y chromosome, and many of them are diploid calls, probably from homology to the X chromosome (the 23andme sites appear to be carefully chosen to avoid the homologous regions of the X and Y chromosomes, which may or may not be reasonable, depending on what is going on in those regions). Again, if you are mainly interested in ancestry information, the Dante Labs whole-genome sequencing is probably not the way to go.

The Dante Labs vcf file does not include deletion and insertion genotypes (the I and D codes in the 23andme data), but I think that the full data Dante Labs sent me on disk may have that information in a different VCF file. It may be a while before I have time to examine that more detailed data.

There are about 5.5 times as many SNPs in the VCF file as in the 23andme file, but only about a quarter of the 23andme sites are matched by the Dante Labs variants—the rest may be places where I am homozygous for the reference allele, which the VCF file does not report, or they may be places where Dante Labs had insufficient coverage to do a variant call. It will take a lot more work for me to analyze the Dante Labs data to figure out which is correct. The 23andme genotype data has a lot more homozygous calls than heterozygous ones, so I suspect that the bulk of the differences will be just that I am homozygous for the reference allele.

The most common SNP variants in the Dante Labs VCF file are CT (or the equivalent on the other strand AG), which is to be expected, as C⇒T conversion is common in DNA, because of C⇒U deamination and subsequent treatment of U as T in replication.

The Dante Labs data shows a lot higher proportion of CG and AT variants than the 23andme data—I don’t know how to interpret that. Perhaps when I get the fullgenomes data, which uses a different sequencing technology, I’ll be able to compare VCF files and see if there is technology effect here.

I clearly have a lot more work to do to interpret the data, but this preliminary look convinces me that I have good data from Dante Labs.

I retract my former claim that Dante Labs is a scam with apologies to them—it appears that they just had very bad delivery times and poor customer service. If they are now delivering data, they may actually be a good deal, as their prices are much lower than other whole-genome sequencing services. (Of course, it is still possible that they are only delivering data to a fraction of their customers, but I have no information about that—only that the data they eventually sent me seems to be good.)

2011 July 5

More on the banana slug mitochondrion

Last week I thought I was done with assembling the mitochondrial genome of UCSC’s beloved banana slug mascot Ariolimax dolichophallus. All that was left was finding some wet-lab volunteers to do some PCR to disambiguate a repeat region.  I even blogged about it.  Despite that, I’ve spent the past week still working on the genome.

First, I had to come up with primers for the wet-lab person to order to do the long-range PCR.  That took me longer than it should have, because I’d never designed primers before, and used the wrong set of parameters for Primer3. That is, I used the default set, but was later told that the SantaLucia 1998 set is preferable.  I also had to find out the favorite melting temperature for the people doing the wet-lab work (they preferred 55°C).  Even after that, the high AT content of the banana-slug mitochondrial genome required some fussing to get adequate GC content in the primer and a GC clamp at the 3′ end (all this was new to me, so it took me over a day to get primers that satisfied everyone).  I believe that one or two pairs of primers have now been ordered for doing the long-range PCR.  When that is working, I’ll have to design more primers for doing Sanger sequencing.

After doing the primer design, I decided to use my new “look-for-exit” Python program to see if the Illumina reads that I had identified as mitochondrial suggested any other variants on the assembled genome.  I’d already used the program to find lots of variants of the 615-long repeat, but I’d not applied it to the whole mitochondrial genome.  When I did, I found another region about 150 base pairs upstream of the long repeats that seemed to be another repeat region.  So I ended up spending another day or two extracting all the variants of that repeat (which seems to be about 20 copies of a 185-long sequence).  I also designed primers for doing long-range PCR over the short repeat region also, though that went much faster than my first attempt.

When I had all variants I could find of both the short and the long repeats, I did another mapping of all the Illumina reads with BWA, to identify the mitochondrial ones.  I expected the number of reads to go up slightly, as reads that matched the new repeat blocks were kept.  Instead the number went down!  Something was wrong. The single-ended merged reads behaved as expected, but fewer paired-end reads were being kept, even though I was asking BWA for read pairs in which either read mapped.  It turns out that if both ends map, but BWA doesn’t like the separation, it throws the pair out.  Because I now had more copies of the repeats, but probably in the wrong order, BWA sometimes found a better mapping, but then didn’t keep the pair, because the separation was too big.  So I had to redo the mapping setting the maximum insert size to be bigger than the length of the genome.

Now I have lots of reads, and look-for-exit does not find any more variants supported by more than 7 reads (there is one C->T change that has 7 reads supporting it, but over 500 supporting the C, so that may just be sequencing error or a rare somatic mutation due to deamination).

Notes on this stuff have been put on the mitochondrion page of the banana slug genomics wiki.

So I’m ready to put the mitochondrial genome aside and do some work on another project, right? 

Well, I thought so, but I woke up in the middle of the night with more things to do.

First I tried to figure out how to annotate the genes of the mitochondrial genome.  There does not seem to be a high-quality pipeline like RAST, which I’ve used for bacterial annotation.  There is a crude web tool called dogma, but it produced essentially unusable outputs, without even the option of saving the results as a GENBANK record.  It did suggest that I had all the standard protein genes except ATP8, which I already knew from doing BLAST searches.  (It turns out that losing ATP8 is quite common in mollusks, as well as some other clades, so losing it is no big surprise.)

I’ll probably have to write my own ORF finder to nail down the ends of the protein genes, since no one bothers with ORF finders for the 14 or so different mitochondrial genetic codes.  Most people apparently blast the genome to find the protein genes, then adjust the ends of the ORFs by hand—I might end up doing that myself, though my computer science soul cringes a bit at doing things by hand.  With only 12 or 13 proteins, though, it may not be worth the trouble of programming (unless I decide to do a population study that requires annotating lots of mitochondrial genomes).

I also ran the draft genome (with the repeat variants present but still not correctly ordered) through tRNAscan-SE to find the tRNAs.  The dogma server had found hundreds of tRNAs (the long repeat includes some tRNA genes), but tRNAscan-SE only found 7 tRNA genes.  I looked in mitochondrion genome papers and found that was a common result for mollusk mitochondria—people were using BLAST to find the tRNA genes (there are usually around 20).  I talked with Todd Lowe, the author of tRNAscan-SE, and he admitted that they had never developed good models for mitochondrial tRNAs.  Slugs and snails may be particularly difficult, as it seems there is some post-transcriptional modification to make the aminoacyl tails base pair.  I’ve put the tRNA search aside for now (though I’ll probably look for the tRNAs using BLAST), but Todd and I will look at it together later this summer, perhaps putting together some mitochondrial tRNA models to improve tRNAscan-SE.  There are lots of annotated mitchondrial genomes out there, so it should not be too hard to put together some decent covariance models of the tRNAs.

So the annotation is mostly on hold now.  Can I get some sleep or work on something else?

No, I can’t do that, because I’m still being bugged by the fact that I haven’t quite used all the data I have to try to order the repeats.  I’m now working on a new Python program to try mapping the paired-ends to the genome to try to come up with better ordering of the repeats.  BWA is not much use to me for this, as I’ve not figured out a way to get it to report multiple hits (one read is in repeat S04, and its partner is later in the genome in S02 or S08, for example).  So I’m doing crude mapping using suffix arrays and trying to figure out how to get the right balance between tolerating sequencing error and identifying subtly different repeat blocks.  I’m also trying to figure out how much I should use the quality information (if at all).  Maybe when I’ve got this program working, I’ll be able to sleep through the night and go back to working on a different project.

2011 May 31

A use for an Ion Torrent

Filed under: Uncategorized — gasstationwithoutpumps @ 09:21
Tags: , ,

I’ve been wondering what an Ion Torrent sequencer is useful for.  I mainly deal with de novo assembly of genomes, which needs a lot more data than an Ion Torrent sequencer provides, even for assembling bacterial genomes.  The high error rate and relatively short read length of the Ion Torrent reads is also a problem.  For de-novo sequencing, almost everyone is going with the Illumina platform, which provides barely long enough reads (a little over 100-long at each end of a pair) at the lowest cost.  I like to have some longer 454 reads to throw into the mix, but they are more expensive and confuse some of the de Bruijn graph assemblers.

This past week, though, I’ve been working on a problem that might be ideal for the Ion Torrent: assembling the mitochondrial sequence of the banana slug, Ariolimax dolichophallus.  I’ve been trying to assemble it from 10x whole-genome shotgun sequence using Illumina reads (with paired ends that were too close together, so many of the reads overlap in the middle).  The library prep looks like it was very good at excluding mitochondria:  the mitochondrial genomes seem to have little more coverage than the nuclear genome. [Correction: I must have dropped a decimal point somewhere—the coverage is indeed much higher for the mitochondrion: more like 200x than 10x.]

Since mitochondrial genomes are the primary way of identifying eukaryotic species (often using only a tiny snippet, the “barcode of life“), there is a lot of value in being able to determine the genome quickly and cheaply. A mitochondrial genome is much shorter than bacterial genomes (only about 15 kbases), which makes the low coverage, short reads, and high error rates not much oof a problem. If you have over 100x coverage on a short genome, you can still align and assemble it despite the noise, especially since repeats are not a problem in mitochondria.

Isolating mitochondrial DNA is also supposed to be relatively easy, so it might be good for Ion Torrent to put out a “mitochondrial genome kit” that makes isolating the mitochondrial DNA, sequencing it, and assembling the resulting genome very cheap.  This would take the rather thin taxonomic sampling of 2654 mitochondrial genomes at NCBI to hundreds of thousands in just a few years.  The key thing is to make the library prep very cheap and simple, since otherwise one could do barcoding to multiplex samples and piggyback on sequencing runs on the larger batch machines.

Average error rate and length of Ion Torrent reads after trimming off the bad bases at the end of the read. Note: this is not my data, so I can't provide any information about the library prep, regents or chips used, or any of the other information that may have a major bearing on the quality of the data.  Anyone who has Ion Torrent data that has a reference genome that can be mapped to can do a similar plot.  It would be useful to do such plots for all the platforms in common use, but I don't have the data for it.

For this data, the error rate on 50-base-pair reads is about 1%.  That is much worse than Illumina (which gets 1% around 90 base-pair reads) or 454 (which gets 1% around 400 base-pair reads).  Note that the Q values are reasonably (perhaps slightly optimistically) calibrated in that an error rate of x needs a cutoff about -10 log_10 x

I expect that Ion Torrent will improve their read length and accuracy, but without plots like this one, it is difficult to compare how platforms really perform.  The raw "number of bases" and "read length ignoring error" figures that get touted by the companies are so misleading as to verge on fraud.


%d bloggers like this: