I got the grading for the last lab of winter quarter done yesterday (it took me several days longer than I expected: even allowing an hour per paper, they took me more like 2 hours each). I have to turn in grades today, and I just found out last night that the graders had not finished grading homework 11, so I need to grade that also.
Before I found out that I unexpectedly had even more grading to do, I had taken an hour to write a short Python program to compare my data from 23andme with my data from Dante Labs. The two seem very concordant, and so I now believe I have gotten good data from Dante Labs:
23_and_me | vcf from Dante Labs
638531 genotype_sites | 3499617 genotype_sites
chr no-call haploid diploid matches| chr no-call haploid diploid mismatches
chrM 306 3995 0 0 | chrM 0 21 0 2
chr1 1177 0 48337 11756 | chr1 0 0 267787 65
chr2 1121 0 50654 12303 | chr2 0 0 287274 59
chr3 1013 0 42011 10184 | chr3 0 0 240607 43
chr4 1006 0 38468 9719 | chr4 0 0 261070 56
chr5 885 0 36147 8733 | chr5 0 0 212325 41
chr6 956 0 43067 8505 | chr6 0 0 204102 53
chr7 857 0 33500 8348 | chr7 0 0 203647 44
chr8 626 0 31057 7690 | chr8 0 0 183159 21
chr9 700 0 25746 6431 | chr9 0 0 148637 65
chr10 656 0 29869 7494 | chr10 0 0 177729 35
chr11 605 0 30337 7523 | chr11 0 0 174926 28
chr12 677 0 28755 7366 | chr12 0 0 165987 26
chr13 392 0 21688 5571 | chr13 0 0 133928 30
chr14 472 0 19489 4871 | chr14 0 0 111553 23
chr15 452 0 18554 4757 | chr15 0 0 103635 16
chr16 504 0 19893 5019 | chr16 0 0 107064 20
chr17 510 0 18891 4370 | chr17 0 0 89744 24
chr18 307 0 17368 4591 | chr18 0 0 100640 12
chr19 551 0 14366 3554 | chr19 0 0 75073 29
chr20 295 0 14486 3603 | chr20 0 0 73566 19
chr21 227 0 8380 2261 | chr21 0 0 52060 18
chr22 244 0 8671 2073 | chr22 0 0 45153 12
chrX 1033 14970 527 3663 | chrX 0 74099 1801 36
chrY 506 3226 1 161 | chrY 0 1393 2637 2
total 16078 22191 600262 150546 | total 0 75513 3424104 779
Count of types of genotype
CT 41086 | CT 696904
AG 40444 | AG 696669
CC 142899 | CC 358678
GG 142411 | GG 358891
TT 104122 | TT 320081
AA 104648 | AA 319335
GT 9760 | GT 173342
AC 9810 | AC 172678
CG 321 | CG 178718
AT 215 | AT 148808
C 5797 | C 18824
G 5495 | G 19064
T 5343 | T 18873
A 5272 | A 18752
-- 16078 | -- 0
II 3245 | II 0
DD 1259 | DD 0
I 195 | I 0
D 89 | D 0
DI 42 | DI 0
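A minimal sketch of how tallies like these could be produced; it is not necessarily how the original program works. It assumes each call is already available as a (chromosome, genotype) pair, with genotype strings in the 23andme style: `--` for a no-call, one letter for a haploid call, two letters for a diploid call. The names `classify` and `tally` are made up for the example.

```python
from collections import Counter, defaultdict


def classify(genotype):
    """Return 'no-call', 'haploid', or 'diploid' for a 23andme-style genotype."""
    if "-" in genotype:
        return "no-call"
    return "haploid" if len(genotype) == 1 else "diploid"


def tally(calls):
    """Count call categories per chromosome and genotype types overall.

    `calls` is an iterable of (chromosome, genotype) pairs.
    """
    per_chrom = defaultdict(Counter)
    genotype_types = Counter()
    for chrom, genotype in calls:
        per_chrom[chrom][classify(genotype)] += 1
        # Sort the alleles so that 'TC' and 'CT' are counted together.
        genotype_types["".join(sorted(genotype))] += 1
    return per_chrom, genotype_types
```

Sorting the alleles inside each genotype is what makes a single `CT` row possible in the genotype-type table, regardless of the order in which the two alleles were reported.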
There are only 779 sites where both 23andme and Dante Labs call a variant and disagree about what it is. That is about 0.5% of the 150546 sites where the two data sets match up, which is lower than I would have expected given the differences in the technology and the error rates of DNA chips. I think that 23andme is being fairly conservative and not calling many of the low-quality hybridization reads.
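The kind of site-by-site comparison described here could be sketched as follows. This is an illustration under stated assumptions, not the program used for the table. It assumes the 23andme raw file is tab-separated with columns rsid, chromosome, position, genotype; that the VCF has a single sample column with `GT` as the first FORMAT field; that both files use the same reference build; and that VCF chromosomes are named `chr1` … `chrM` while 23andme uses `1` … `MT`. All function names are hypothetical.

```python
def parse_23andme(lines):
    """Map (chrom, pos) -> genotype string such as 'CT', 'A', or '--'."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")[:4]
        # Assumed naming convention: 'MT' -> 'chrM', '1' -> 'chr1', and so on.
        chrom = "chrM" if chrom == "MT" else "chr" + chrom
        calls[(chrom, int(pos))] = genotype
    return calls


def parse_vcf_snps(lines):
    """Map (chrom, pos) -> frozenset of called alleles, for SNP records only."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alleles = [ref] + alt.split(",")
        if any(a not in ("A", "C", "G", "T") for a in alleles):
            continue  # skip indels and symbolic alleles
        indices = fields[9].split(":")[0].replace("|", "/").split("/")
        if "." in indices:
            continue  # skip missing genotypes
        calls[(chrom, pos)] = frozenset(alleles[int(i)] for i in indices)
    return calls


def compare(calls_23andme, calls_vcf):
    """Return (shared, disagree) over sites called as plain bases in both."""
    shared = disagree = 0
    for site, vcf_alleles in calls_vcf.items():
        genotype = calls_23andme.get(site)
        if not genotype or set(genotype) - set("ACGT"):
            continue  # absent, no-call, or insertion/deletion code
        shared += 1
        # Comparing allele sets treats haploid 'A' and diploid 'AA' as equal.
        if frozenset(genotype) != vcf_alleles:
            disagree += 1
    return shared, disagree
```

Comparing sets of alleles rather than strings is a design choice: it ignores allele order and lets a haploid call on one side agree with a homozygous diploid call on the other. With the totals in the table, 779 disagreements out of 150546 shared sites works out to about 0.52%.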
The biggest difference seems to be that Dante Labs does not cover the mitochondrion—the very small number of variant calls there could be mismapping of reads from homologous regions of the nuclear genome. Of course, 23andme does extremely thorough coverage of the mitochondrion, in order to get as much maternal haplotype data as feasible. If you are looking for maternal ancestry information or mitochondrial variants related to disease, the Dante Labs whole-genome sequencing is not the way to go.
The 23andme data also has a lot of coverage of the Y chromosome, in an attempt to get as much paternal haplotype information as possible, but the VCF file has few calls on the Y chromosome, and many of them are diploid calls, probably from homology to the X chromosome (the 23andme sites appear to be carefully chosen to avoid the homologous regions of the X and Y chromosomes, which may or may not be reasonable, depending on what is going on in those regions). Again, if you are mainly interested in ancestry information, the Dante Labs whole-genome sequencing is probably not the way to go.
The Dante Labs vcf file does not include deletion and insertion genotypes (the I and D codes in the 23andme data), but I think that the full data Dante Labs sent me on disk may have that information in a different VCF file. It may be a while before I have time to examine that more detailed data.
There are about 5.5 times as many SNPs in the VCF file as in the 23andme file, but only about a quarter of the 23andme sites are matched by the Dante Labs variants—the rest may be places where I am homozygous for the reference allele, which the VCF file does not report, or they may be places where Dante Labs had insufficient coverage to do a variant call. It will take a lot more work for me to analyze the Dante Labs data to figure out which is correct. The 23andme genotype data has a lot more homozygous calls than heterozygous ones, so I suspect that the bulk of the differences will be just that I am homozygous for the reference allele.
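The ratios quoted in this paragraph can be checked directly from the table totals. The snippet below is just that arithmetic; the variable names are made up for the example, and the counts are copied from the tables above.

```python
# Totals copied from the tables above.
sites_23andme = 638531
sites_vcf = 3499617
matched_sites = 150546

# 23andme genotype-type counts, split into homozygous and heterozygous calls.
homozygous = {"CC": 142899, "GG": 142411, "TT": 104122, "AA": 104648}
heterozygous = {"CT": 41086, "AG": 40444, "GT": 9760, "AC": 9810,
                "CG": 321, "AT": 215}

vcf_to_chip_ratio = sites_vcf / sites_23andme
matched_fraction = matched_sites / sites_23andme
hom_to_het_ratio = sum(homozygous.values()) / sum(heterozygous.values())

print(f"VCF sites per 23andme site:      {vcf_to_chip_ratio:.2f}")
print(f"fraction of 23andme sites matched: {matched_fraction:.3f}")
print(f"23andme homozygous : heterozygous: {hom_to_het_ratio:.2f}")
```

The three printed values correspond to the "about 5.5 times", "about a quarter", and "a lot more homozygous calls than heterozygous ones" statements respectively.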
The most common SNP variants in the Dante Labs VCF file are CT (or the equivalent on the other strand AG), which is to be expected, as C⇒T conversion is common in DNA, because of C⇒U deamination and subsequent treatment of U as T in replication.
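The strand equivalence mentioned here (CT on one strand reads as AG on the other) can be made explicit with a small helper. This is a sketch; `strand_neutral` is a hypothetical name, and the choice of the alphabetically smaller form as the label is arbitrary.

```python
# Map each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def strand_neutral(genotype):
    """Return a label that is the same whichever strand the genotype was read on."""
    same_strand = "".join(sorted(genotype))
    other_strand = "".join(sorted(genotype.translate(COMPLEMENT)))
    # Pick the alphabetically smaller of the two spellings as the label.
    return min(same_strand, other_strand)
```

Under this labeling `CT` and `AG` fall into one class, as do `GT` and `AC`, while `CG` and `AT` are their own complements and each stays in a class of its own.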
The Dante Labs data shows a much higher proportion of CG and AT variants than the 23andme data. I don’t know how to interpret that. Perhaps when I get the fullgenomes data, which uses a different sequencing technology, I’ll be able to compare VCF files and see if there is a technology effect here.
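To put numbers on that difference, the heterozygous counts from the genotype-type table can be converted to shares per substitution class. This is only arithmetic on the table; the grouping into four classes and the name `class_shares` are choices made for the example.

```python
# Heterozygous genotype counts copied from the table above.
het_counts = {
    "23andme": {"CT": 41086, "AG": 40444, "GT": 9760, "AC": 9810,
                "CG": 321, "AT": 215},
    "Dante Labs": {"CT": 696904, "AG": 696669, "GT": 173342, "AC": 172678,
                   "CG": 178718, "AT": 148808},
}


def class_shares(counts):
    """Return each substitution class as a fraction of all heterozygous calls."""
    total = sum(counts.values())
    return {
        "CT/AG": (counts["CT"] + counts["AG"]) / total,
        "GT/AC": (counts["GT"] + counts["AC"]) / total,
        "CG": counts["CG"] / total,
        "AT": counts["AT"] / total,
    }


for source, counts in het_counts.items():
    shares = class_shares(counts)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{source}: {summary}")
```

On these counts, CG and AT together are well under 1% of the heterozygous 23andme calls but roughly 16% of the Dante Labs ones.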
I clearly have a lot more work to do to interpret the data, but this preliminary look convinces me that I have good data from Dante Labs.
I retract my former claim that Dante Labs is a scam, with apologies to them. It appears that they just had very bad delivery times and poor customer service. If they are now delivering data, they may actually be a good deal, as their prices are much lower than other whole-genome sequencing services. (Of course, it is still possible that they are only delivering data to a fraction of their customers, but I have no information about that; I know only that the data they eventually sent me seems to be good.)