Now that my grading is done, I finally had some time to look at the whole-genome sequencing data I got from fullgenomes.com. I ordered the sequencing on 7 March 2019, shipped the spit kit on 13 March 2019, and got all the data by 3 May 2019. Their price of $1175 for 30× whole-genome sequencing is fairly typical of the direct-to-consumer sequencing outfits. (Dante Labs is much cheaper, but a number of people have been unhappy with slow delivery or non-delivery of data.)
Here is a summary of what I saw in the fullgenomes snpeff.vep.vcf file:
There are 4906211 genotype sites.
Count by filter values:
. 4906211
By chromosome, the number of no-call, haploid, and diploid genotypes:
chr1 0 0 383857
chr2 0 0 390198
chr3 0 0 323097
chr4 0 0 348597
chr5 0 0 286416
chr6 0 0 297452
chr7 0 0 279633
chr8 0 0 240326
chr9 0 0 216318
chr10 0 0 247472
chr11 0 0 233892
chr12 0 0 224952
chr13 0 0 189629
chr14 0 0 150472
chr15 0 0 143682
chr16 0 0 146914
chr17 0 0 136966
chr18 0 0 136431
chr19 0 0 109775
chr20 0 0 125200
chr21 0 0 87210
chr22 0 0 83531
chrX 0 105204 12991
chrY 0 5979 0
chrM 0 17 0
By type of genotype, there are
CT 854737
AG 828712
AA 416556
TT 414670
CC 401031
GG 400383
DI 269235
AC 266527
AT 245779
GT 239801
CG 225762
II 176278
DD 46625
A 26598
T 26573
G 23312
C 23061
I 11656
*T 2890
*A 2522
*G 1757
*C 1746
I don’t know what the *T, *A, *G, and *C sites are. The diploid sites on the X chromosome may be sites that have close homologs on the Y chromosome—close enough that mapping algorithms see them as being the same.
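A tally like the one above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script that produced the numbers: it assumes a single-sample VCF, and it assumes that indel sites are labeled 23andme-style, with the shortest allele at the site called D and any longer allele called I. The names `genotype_label`, `tally`, and `SAMPLE` are made up for this example. One relevant detail from the VCF specification (version 4.2 and later): an ALT allele written as `*` marks an allele that is missing because of an overlapping upstream deletion, which may be where labels such as *T come from.

```python
from collections import Counter

def genotype_label(ref, alt, gt):
    """Return a label such as 'CT', 'DI', or 'A'; None means no-call."""
    alleles = [ref] + alt.split(",")
    indices = gt.replace("|", "/").split("/")
    if "." in indices:
        return None
    called = [alleles[int(i)] for i in indices]
    # Assumption: at indel sites the shortest allele is 'D', longer ones are 'I'.
    if any(len(a) != 1 for a in alleles):
        shortest = min(len(a) for a in alleles)
        called = ["D" if len(a) == shortest else "I" for a in called]
    return "".join(sorted(called))

def tally(vcf_lines):
    """Count no-call/haploid/diploid genotypes per chromosome, and label counts."""
    by_chrom = {}
    by_label = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, ref, alt = fields[0], fields[3], fields[4]
        gt_index = fields[8].split(":").index("GT")
        gt = fields[9].split(":")[gt_index]
        label = genotype_label(ref, alt, gt)
        counts = by_chrom.setdefault(chrom, Counter())
        if label is None:
            counts["no-call"] += 1
        else:
            counts["haploid" if len(label) == 1 else "diploid"] += 1
            by_label[label] += 1
    return by_chrom, by_label

# Made-up records, purely for illustration.
SAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tme",
    "chr1\t100\t.\tC\tT\t50\t.\t.\tGT:DP\t0/1:30",
    "chr1\t200\t.\tA\tAT\t50\t.\t.\tGT\t1/1",
    "chr1\t400\t.\tC\tT,*\t50\t.\t.\tGT\t1/2",
    "chrX\t300\t.\tG\tA\t50\t.\t.\tGT\t1",
]
by_chrom, by_label = tally(SAMPLE)
print(dict(by_chrom["chr1"]), dict(by_label))
```

Because `*` sorts before the letters, a genotype with one spanning-deletion allele and one T comes out as *T under this labeling.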
What I wanted to do first was to compare the fullgenomes data with the 23andme genotyping, as I had done with the Dante Labs data. That turned out to be somewhat difficult, because fullgenomes called variants relative to the latest reference genome (GRCh38), while 23andme and Dante Labs both used the older reference genome (GRCh37 = hg19). The same site has different coordinates in the two references, so the call sets can’t be compared by position without converting one of them.
I have variant calls for the Dante Labs data on both references from Google’s DeepVariant. That allows two comparisons. First, I can compare the Dante Labs calls (which I believe come from the GATK pipeline) with the DeepVariant calls on the same reads, to see how much difference the variant caller makes. Second, I can compare the fullgenomes calls with the DeepVariant calls on the Dante Labs data, where both the variant caller and the sequencing method differ.
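The comparison itself can be sketched as a dictionary join on position. This is a minimal illustration, assuming each call set has already been reduced to a mapping from (chromosome, position) to a genotype label; the function name `compare_calls` and the toy data are hypothetical. It ignores a real complication: two callers can represent the same indel at different positions, so position-based matching may undercount agreement on indels.

```python
def compare_calls(calls_a, calls_b):
    """Count matches and mismatches at sites called in both sets."""
    shared = calls_a.keys() & calls_b.keys()
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return {
        "matches": matches,
        "mismatches": len(shared) - matches,
        "only_a": len(calls_a) - len(shared),  # called only in the first set
        "only_b": len(calls_b) - len(shared),  # called only in the second set
    }

# Toy data, not real calls.
gatk = {("chr1", 100): "CT", ("chr1", 200): "AA", ("chrX", 300): "G"}
deepvariant = {
    ("chr1", 100): "CT",
    ("chr1", 200): "AG",
    ("chr2", 50): "TT",
    ("chrX", 300): "GG",
}
result = compare_calls(gatk, deepvariant)
print(result)
```

In the toy data, the chrX site counts as a mismatch because one set calls it haploid (G) and the other diploid (GG).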
dantelabs GATK | dantelabs-deepvariant
3499617 genotype_sites | 5884795 genotype_sites
chr no-call haploid diploid matches | chr no-call haploid diploid mismatches
chrM 0 21 0 0 | chrM 0 0 0 0
chr1 0 0 267787 262884 | chr1 0 0 446877 4556
chr2 0 0 287274 282806 | chr2 0 0 459298 4145
chr3 0 0 240607 237530 | chr3 0 0 367320 2822
chr4 0 0 261070 257783 | chr4 0 0 400640 2868
chr5 0 0 212325 210325 | chr5 0 0 327319 1827
chr6 0 0 204102 201254 | chr6 0 0 318344 2616
chr7 0 0 203647 199384 | chr7 0 0 350292 3857
chr8 0 0 183159 181044 | chr8 0 0 280030 1902
chr9 0 0 148637 143697 | chr9 0 0 257715 4696
chr10 0 0 177729 174246 | chr10 0 0 295033 3087
chr11 0 0 174926 172444 | chr11 0 0 275753 2256
chr12 0 0 165987 163409 | chr12 0 0 269918 2286
chr13 0 0 133928 132587 | chr13 0 0 199795 1215
chr14 0 0 111553 109462 | chr14 0 0 178071 1979
chr15 0 0 103635 101314 | chr15 0 0 169214 2201
chr16 0 0 107064 104448 | chr16 0 0 194206 2398
chr17 0 0 89744 88302 | chr17 0 0 166728 1300
chr18 0 0 100640 99358 | chr18 0 0 154510 1143
chr19 0 0 75073 73488 | chr19 0 0 146076 1415
chr20 0 0 73566 71604 | chr20 0 0 131282 1844
chr21 0 0 52060 50356 | chr21 0 0 101284 1595
chr22 0 0 45153 43796 | chr22 0 0 84522 1248
chrX 0 74099 1801 73134 | chrX 0 88863 84484 2496
chrY 0 1393 2637 1709 | chrY 0 3098 43760 2020
chr1_gl000191_random 0 0 0 0 | chr1_gl000191_random 0 0 99 0
chr1_gl000192_random 0 0 0 0 | chr1_gl000192_random 0 0 976 0
chr4_ctg9_hap1 0 0 0 0 | chr4_ctg9_hap1 0 0 263 0
chr4_gl000193_random 0 0 0 0 | chr4_gl000193_random 0 0 1887 0
chr4_gl000194_random 0 0 0 0 | chr4_gl000194_random 0 0 2374 0
chr6_apd_hap1 0 0 0 0 | chr6_apd_hap1 0 0 24 0
chr6_cox_hap2 0 0 0 0 | chr6_cox_hap2 0 0 383 0
chr6_dbb_hap3 0 0 0 0 | chr6_dbb_hap3 0 0 284 0
chr6_mann_hap4 0 0 0 0 | chr6_mann_hap4 0 0 798 0
chr6_mcf_hap5 0 0 0 0 | chr6_mcf_hap5 0 0 132 0
chr6_qbl_hap6 0 0 0 0 | chr6_qbl_hap6 0 0 960 0
chr6_ssto_hap7 0 0 0 0 | chr6_ssto_hap7 0 0 812 0
chr7_gl000195_random 0 0 0 0 | chr7_gl000195_random 0 0 3267 0
chr8_gl000196_random 0 0 0 0 | chr8_gl000196_random 0 0 11 0
chr8_gl000197_random 0 0 0 0 | chr8_gl000197_random 0 0 1 0
chr9_gl000198_random 0 0 0 0 | chr9_gl000198_random 0 0 1866 0
chr9_gl000199_random 0 0 0 0 | chr9_gl000199_random 0 0 7465 0
chr9_gl000200_random 0 0 0 0 | chr9_gl000200_random 0 0 1 0
chr9_gl000201_random 0 0 0 0 | chr9_gl000201_random 0 0 12 0
chr11_gl000202_random 0 0 0 0 | chr11_gl000202_random 0 0 78 0
chr17_ctg5_hap1 0 0 0 0 | chr17_ctg5_hap1 0 0 608 0
chr17_gl000203_random 0 0 0 0 | chr17_gl000203_random 0 0 444 0
chr17_gl000204_random 0 0 0 0 | chr17_gl000204_random 0 0 92 0
chr17_gl000205_random 0 0 0 0 | chr17_gl000205_random 0 0 2874 0
chr17_gl000206_random 0 0 0 0 | chr17_gl000206_random 0 0 13 0
chr18_gl000207_random 0 0 0 0 | chr18_gl000207_random 0 0 100 0
chr19_gl000208_random 0 0 0 0 | chr19_gl000208_random 0 0 2216 0
chr19_gl000209_random 0 0 0 0 | chr19_gl000209_random 0 0 371 0
chr21_gl000210_random 0 0 0 0 | chr21_gl000210_random 0 0 21 0
chrUn_gl000211 0 0 0 0 | chrUn_gl000211 0 0 1967 0
chrUn_gl000212 0 0 0 0 | chrUn_gl000212 0 0 1316 0
chrUn_gl000213 0 0 0 0 | chrUn_gl000213 0 0 266 0
chrUn_gl000214 0 0 0 0 | chrUn_gl000214 0 0 2159 0
chrUn_gl000215 0 0 0 0 | chrUn_gl000215 0 0 129 0
chrUn_gl000216 0 0 0 0 | chrUn_gl000216 0 0 8530 0
chrUn_gl000217 0 0 0 0 | chrUn_gl000217 0 0 2138 0
chrUn_gl000218 0 0 0 0 | chrUn_gl000218 0 0 1480 0
chrUn_gl000219 0 0 0 0 | chrUn_gl000219 0 0 5820 0
chrUn_gl000220 0 0 0 0 | chrUn_gl000220 0 0 1730 0
chrUn_gl000221 0 0 0 0 | chrUn_gl000221 0 0 958 0
chrUn_gl000222 0 0 0 0 | chrUn_gl000222 0 0 2193 0
chrUn_gl000223 0 0 0 0 | chrUn_gl000223 0 0 12 0
chrUn_gl000224 0 0 0 0 | chrUn_gl000224 0 0 3738 0
chrUn_gl000225 0 0 0 0 | chrUn_gl000225 0 0 15234 0
chrUn_gl000226 0 0 0 0 | chrUn_gl000226 0 0 257 0
chrUn_gl000227 0 0 0 0 | chrUn_gl000227 0 0 80 0
chrUn_gl000228 0 0 0 0 | chrUn_gl000228 0 0 1299 0
chrUn_gl000229 0 0 0 0 | chrUn_gl000229 0 0 1080 0
chrUn_gl000230 0 0 0 0 | chrUn_gl000230 0 0 409 0
chrUn_gl000231 0 0 0 0 | chrUn_gl000231 0 0 1118 0
chrUn_gl000232 0 0 0 0 | chrUn_gl000232 0 0 2148 0
chrUn_gl000233 0 0 0 0 | chrUn_gl000233 0 0 433 0
chrUn_gl000234 0 0 0 0 | chrUn_gl000234 0 0 2281 0
chrUn_gl000235 0 0 0 0 | chrUn_gl000235 0 0 1224 0
chrUn_gl000236 0 0 0 0 | chrUn_gl000236 0 0 131 0
chrUn_gl000237 0 0 0 0 | chrUn_gl000237 0 0 493 0
chrUn_gl000238 0 0 0 0 | chrUn_gl000238 0 0 19 0
chrUn_gl000239 0 0 0 0 | chrUn_gl000239 0 0 80 0
chrUn_gl000240 0 0 0 0 | chrUn_gl000240 0 0 696 0
chrUn_gl000241 0 0 0 0 | chrUn_gl000241 0 0 1665 0
chrUn_gl000242 0 0 0 0 | chrUn_gl000242 0 0 32 0
chrUn_gl000243 0 0 0 0 | chrUn_gl000243 0 0 328 0
chrUn_gl000244 0 0 0 0 | chrUn_gl000244 0 0 83 0
chrUn_gl000245 0 0 0 0 | chrUn_gl000245 0 0 110 0
chrUn_gl000246 0 0 0 0 | chrUn_gl000246 0 0 73 0
chrUn_gl000247 0 0 0 0 | chrUn_gl000247 0 0 196 0
chrUn_gl000248 0 0 0 0 | chrUn_gl000248 0 0 26 0
total 0 75513 3424104 3436364 | total 0 91961 5792834 57772
Count of types of genotype
CT 696904 | CT 777273
AG 696669 | AG 757560
CC 358678 | CC 686562
AA 319335 | AA 715299
TT 320081 | TT 713424
GG 358891 | GG 633290
AC 172678 | AC 228469
GT 173342 | GT 208315
CG 178718 | CG 196041
AT 148808 | AT 212648
DD 0 | DD 249895
DI 0 | DI 237611
II 0 | II 176447
T 18873 | T 22689
A 18752 | A 22586
G 19064 | G 20696
C 18824 | C 20492
I 0 | I 5271
D 0 | D 227
DeepVariant makes a lot more calls, mainly because it also reports places where it decides that the genotype is homozygous reference, which GATK doesn’t report, but also because the GATK calls were filtered to remove the low-evidence calls, while DeepVariant was set up to report everything.
DeepVariant makes a huge number of diploid calls on X and Y (84,484 on chrX and 43,760 on chrY, against 1,801 and 2,637 for GATK), which is a little suspicious.
The ratio of mismatches to matches is 0.01681, which is about a 1.65% discrepancy rate (57,772 mismatches out of 3,494,136 compared sites). I don’t know which of the variant callers is better on this data, but DeepVariant was supposedly better on some recent tests on autosomal chromosomes (I’ve not looked up the paper yet).
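The two figures differ because they use different denominators: the ratio divides mismatches by matches, while the discrepancy rate divides mismatches by all compared sites. A short check of the arithmetic, using the totals from the GATK versus DeepVariant table (the variable names are just for this example):

```python
matches = 3_436_364      # sites where GATK and DeepVariant agree
mismatches = 57_772      # sites where they disagree

ratio = mismatches / matches                # mismatches per match
rate = mismatches / (matches + mismatches)  # fraction of compared sites that disagree

print(f"ratio = {ratio:.5f}")  # ratio = 0.01681
print(f"rate  = {rate:.2%}")   # rate  = 1.65%
```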
Comparing the Dante Labs DeepVariant calls with the fullgenomes calls (on GRCh38) shows a bigger difference:
dantelabs-deepvariant | fullgenomes snpeff.vep.vcf.gz
5508932 genotype_sites | 4906211 genotype_sites
chr no-call haploid diploid matches | chr no-call haploid diploid mismatches
chr1 0 0 443442 336398 | chr1 0 0 383857 23983
chr2 0 0 434888 356031 | chr2 0 0 390198 17928
chr3 0 0 355613 296731 | chr3 0 0 323097 12684
chr4 0 0 380259 320989 | chr4 0 0 348597 13450
chr5 0 0 316748 262579 | chr5 0 0 286416 11314
chr6 0 0 299986 253300 | chr6 0 0 297452 8599
chr7 0 0 314759 252073 | chr7 0 0 279633 14099
chr8 0 0 259602 217907 | chr8 0 0 240326 8761
chr9 0 0 242579 183556 | chr9 0 0 216318 19382
chr10 0 0 280582 224653 | chr10 0 0 247472 12447
chr11 0 0 256993 214312 | chr11 0 0 233892 8491
chr12 0 0 252220 205678 | chr12 0 0 224952 8601
chr13 0 0 209801 165921 | chr13 0 0 189629 11023
chr14 0 0 161466 132483 | chr14 0 0 150472 6803
chr15 0 0 154740 123209 | chr15 0 0 143682 8381
chr16 0 0 169234 128717 | chr16 0 0 146914 9197
chr17 0 0 158498 111002 | chr17 0 0 136966 12128
chr18 0 0 150941 123100 | chr18 0 0 136431 5881
chr19 0 0 129958 94953 | chr19 0 0 109775 6398
chr20 0 0 147206 95398 | chr20 0 0 125200 20597
chr21 0 0 94029 64558 | chr21 0 0 87210 14591
chr22 0 0 93982 58824 | chr22 0 0 83531 14887
chrX 0 93759 70714 96500 | chrX 0 105204 12991 14629
chrY 0 2818 34115 2134 | chrY 0 5979 0 3321
chrM 0 0 0 0 | chrM 0 17 0 0
total 0 96577 5412355 4321006 | total 0 111200 4795011 287575
Count of types of genotype
CT 761277 | CT 854737
AG 741910 | AG 828712
AA 636195 | AA 416556
TT 635850 | TT 414670
CC 609922 | CC 401031
GG 557555 | GG 400383
DI 234030 | DI 269235
AC 224266 | AC 266527
AT 209513 | AT 245779
GT 204389 | GT 239801
CG 192211 | CG 225762
II 173913 | II 176278
DD 231324 | DD 46625
A 23799 | A 26598
T 23687 | T 26573
G 21821 | G 23312
C 21561 | C 23061
I 5438 | I 11656
*T 0 | *T 2890
*A 0 | *A 2522
*G 0 | *G 1757
*C 0 | *C 1746
D 271 | D 0
Now the ratio of mismatches to matches is 0.06655, a 6.2% discrepancy rate (287,575 mismatches out of 4,608,581 compared sites). A few of the discrepancies were haploid/diploid differences on chromosome X, but that is only about 1100 differences. None of the mismatches involve the weird *A, *C, *G, and *T genotype calls.
I do have 4,321,006 genotype calls that I am now pretty confident of, as they were called by two different variant callers from two different sequencing runs using different sequencing technology.
But I’m not sure which data set or variant caller to favor on the 287,575 disagreements, nor what to do about the locations where one variant caller made a call for a site and the other didn’t. The fullgenomes data includes a gVCF file, which has calls for every base that got reads mapped to it, but I’ve not tried extracting data from that format yet (it’s bad enough having to try to extract data from the two different vcf formats).
I was planning to compare the 23andme data with each of the whole-genome vcf calls, making the assumption that the sequencing and variant caller that agrees most with the hybridization-based genotyping by 23andme would be the most accurate. (I also want to make a revised “23andme” data set that replaces any genotyping calls where both whole-genome sequences agreed with each other, but disagreed with 23andme.)
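Both parts of that plan can be sketched with the same position-keyed dictionaries. This is a rough illustration, assuming all three call sets are already on the same reference and reduced to mappings from (chromosome, position) to a genotype label; the function names and toy data are hypothetical. One caveat the sketch does not handle: a VCF that omits homozygous-reference sites gives no entry for those positions, so a missing site is not the same as a disagreement.

```python
def agreement(array_calls, wgs_calls):
    """Fraction of shared sites where a WGS call set agrees with the array."""
    shared = array_calls.keys() & wgs_calls.keys()
    if not shared:
        return None
    return sum(array_calls[s] == wgs_calls[s] for s in shared) / len(shared)

def revise_array_calls(array_calls, wgs_a, wgs_b):
    """Replace array calls only where both WGS sets agree with each other
    and disagree with the array; return the revised calls and changed sites."""
    revised = dict(array_calls)
    changed = []
    for site, array_gt in array_calls.items():
        gt_a, gt_b = wgs_a.get(site), wgs_b.get(site)
        if gt_a is not None and gt_a == gt_b and gt_a != array_gt:
            revised[site] = gt_a
            changed.append(site)
    return revised, changed

# Toy data, not real calls.
array = {("chr1", 100): "CT", ("chr1", 200): "AA", ("chr1", 300): "GG"}
wgs_a = {("chr1", 100): "CT", ("chr1", 200): "AG", ("chr1", 300): "AG"}
wgs_b = {("chr1", 100): "CT", ("chr1", 200): "AG", ("chr1", 300): "GG"}

revised, changed = revise_array_calls(array, wgs_a, wgs_b)
print(agreement(array, wgs_a), agreement(array, wgs_b), changed)
```

In the toy data only the site at position 200 is replaced, because that is the only place where the two sequencing-based calls agree with each other and contradict the array.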
To make this all work, I need to have all the variant calling be relative to the same reference genome, which means either lifting Dante Labs and 23andme to GRCh38 or going the other way and moving the fullgenomes vcf files to GRCh37. I could also try having Kishwar do the DeepVariant calls on both reference genomes for the fullgenomes data.
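The core idea of lifting coordinates is a lookup in a list of aligned blocks between the two assemblies. The sketch below shows only that idea on made-up blocks; `BLOCKS` and `lift` are hypothetical names, and the numbers are not real GRCh37 or GRCh38 coordinates. Real tools such as UCSC liftOver, CrossMap, or Picard LiftoverVcf read the blocks from a chain file and also have to deal with strand flips and sites where the reference allele changed, none of which is handled here.

```python
from bisect import bisect_right

# Hypothetical aligned blocks for one chromosome, sorted by old_start:
# (old_start, old_end, new_start), with half-open intervals [old_start, old_end).
BLOCKS = [(0, 1000, 0), (1000, 5000, 1200), (5200, 9000, 5400)]

def lift(pos, blocks=BLOCKS):
    """Map a position on the old assembly to the new one, or None if unmapped."""
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, pos) - 1   # last block starting at or before pos
    if i < 0:
        return None
    old_start, old_end, new_start = blocks[i]
    if pos >= old_end:
        return None                     # pos falls in a gap between blocks
    return new_start + (pos - old_start)

print(lift(500), lift(1500), lift(5100))
```

Positions that fall in a gap between blocks have no counterpart in the other assembly, so any comparison after lifting has to expect some sites to drop out.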
I’ll need to think a bit about what would be most useful.