Now that my grading is done, I finally had some time to look at the whole-genome sequencing data I got from fullgenomes.com. I ordered the sequencing on 7 March 2019, shipped the spit kit on 13 March 2019, and got all the data by 3 May 2019. Their price, $1175, for 30× whole-genome sequencing, is fairly typical of the direct-to-consumer sequencing outfits. (Dante Labs is much cheaper, but a number of people have been unhappy with the slow or non-delivery of data.)
Here is a summary of what I saw in the fullgenomes snpeff.vep.vcf file:
There are 4906211 genotype sites.
Count by filter values:
. 4906211
By chromosome, the number of no-call, haploid, and diploid genotypes:
chr1 0 0 383857
chr2 0 0 390198
chr3 0 0 323097
chr4 0 0 348597
chr5 0 0 286416
chr6 0 0 297452
chr7 0 0 279633
chr8 0 0 240326
chr9 0 0 216318
chr10 0 0 247472
chr11 0 0 233892
chr12 0 0 224952
chr13 0 0 189629
chr14 0 0 150472
chr15 0 0 143682
chr16 0 0 146914
chr17 0 0 136966
chr18 0 0 136431
chr19 0 0 109775
chr20 0 0 125200
chr21 0 0 87210
chr22 0 0 83531
chrX 0 105204 12991
chrY 0 5979 0
chrM 0 17 0
By type of genotype, there are
CT 854737
AG 828712
AA 416556
TT 414670
CC 401031
GG 400383
DI 269235
AC 266527
AT 245779
GT 239801
CG 225762
II 176278
DD 46625
A 26598
T 26573
G 23312
C 23061
I 11656
*T 2890
*A 2522
*G 1757
*C 1746
I don’t know what the *T, *A, *G, and *C sites are. The diploid sites on the X chromosome may be sites that have close homologs on the Y chromosome—close enough that mapping algorithms see them as being the same.
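A summary like the one above can be produced with a short script. The following is a minimal sketch, not the script actually used for this post: it assumes a single-sample VCF, and the one-letter allele encoding (`I` for an allele longer than the reference, `D` for a shorter one, symbols sorted within a genotype) is a guess at how the genotype types were labeled. The function names `allele_code` and `summarize_vcf` are hypothetical. In the VCF specification a `*` ALT allele marks an allele removed by an overlapping deletion, which may be what the `*T`-style sites are.

```python
import re
from collections import Counter, defaultdict


def allele_code(allele, ref):
    """Reduce one allele to a single symbol (assumed encoding)."""
    if allele == "*":              # VCF spanning-deletion allele
        return "*"
    if len(allele) > len(ref):     # longer than reference: insertion
        return "I"
    if len(allele) < len(ref):     # shorter than reference: deletion
        return "D"
    # Same length as the reference: a single base, or "?" for multi-base alleles
    return allele if len(allele) == 1 else "?"


def summarize_vcf(lines):
    """Tally no-call/haploid/diploid per chromosome and genotype types overall."""
    by_chrom = defaultdict(Counter)
    by_type = Counter()
    for line in lines:
        if line.startswith("#"):   # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, ref, alts = fields[0], fields[3], fields[4].split(",")
        format_keys = fields[8].split(":")
        sample = fields[9].split(":")
        gt = sample[format_keys.index("GT")]
        indices = re.split(r"[/|]", gt)   # unphased "/" or phased "|"
        if "." in indices:
            by_chrom[chrom]["no-call"] += 1
            continue
        alleles = [ref] + alts
        codes = sorted(allele_code(alleles[int(i)], ref) for i in indices)
        by_chrom[chrom]["haploid" if len(codes) == 1 else "diploid"] += 1
        by_type["".join(codes)] += 1
    return by_chrom, by_type
```

For a compressed file, the lines could come from `gzip.open(path, "rt")`; a real VCF library would handle more edge cases than this sketch does.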
What I wanted to do first was to compare the fullgenomes data with the 23andme genotyping, like I had done with the Dante Labs data. That turned out to be somewhat difficult, because fullgenomes called variants relative to the latest reference genome (GRCh38), while 23andme and Dante Labs both used the older reference genome (GRCh37 = hg19). The two references use different coordinates, so sites cannot simply be compared by position.
I have variant calls for the Dante Labs data on both references using Google’s DeepVariant, which lets me make two comparisons. First, I can compare the Dante Labs calls (which I believe are done with the GATK pipeline) with the DeepVariant calls on the same data, to see how much difference the variant caller makes. Second, I can compare the fullgenomes calls with the DeepVariant calls on the Dante Labs data, where both the variant caller and the sequencing method differ.
dantelabs GATK |dantelabs-deepvariant
3499617 genotype_sites |5884795 genotype_sites
chr no-call haploid diploid matches| chr no-call haploid diploid mismatches
chrM 0 21 0 0 | chrM 0 0 0 0
chr1 0 0 267787 262884 | chr1 0 0 446877 4556
chr2 0 0 287274 282806 | chr2 0 0 459298 4145
chr3 0 0 240607 237530 | chr3 0 0 367320 2822
chr4 0 0 261070 257783 | chr4 0 0 400640 2868
chr5 0 0 212325 210325 | chr5 0 0 327319 1827
chr6 0 0 204102 201254 | chr6 0 0 318344 2616
chr7 0 0 203647 199384 | chr7 0 0 350292 3857
chr8 0 0 183159 181044 | chr8 0 0 280030 1902
chr9 0 0 148637 143697 | chr9 0 0 257715 4696
chr10 0 0 177729 174246 | chr10 0 0 295033 3087
chr11 0 0 174926 172444 | chr11 0 0 275753 2256
chr12 0 0 165987 163409 | chr12 0 0 269918 2286
chr13 0 0 133928 132587 | chr13 0 0 199795 1215
chr14 0 0 111553 109462 | chr14 0 0 178071 1979
chr15 0 0 103635 101314 | chr15 0 0 169214 2201
chr16 0 0 107064 104448 | chr16 0 0 194206 2398
chr17 0 0 89744 88302 | chr17 0 0 166728 1300
chr18 0 0 100640 99358 | chr18 0 0 154510 1143
chr19 0 0 75073 73488 | chr19 0 0 146076 1415
chr20 0 0 73566 71604 | chr20 0 0 131282 1844
chr21 0 0 52060 50356 | chr21 0 0 101284 1595
chr22 0 0 45153 43796 | chr22 0 0 84522 1248
chrX 0 74099 1801 73134 | chrX 0 88863 84484 2496
chrY 0 1393 2637 1709 | chrY 0 3098 43760 2020
chr1_gl000191_random 0 0 0 0 | chr1_gl000191_random 0 0 99 0
chr1_gl000192_random 0 0 0 0 | chr1_gl000192_random 0 0 976 0
chr4_ctg9_hap1 0 0 0 0 | chr4_ctg9_hap1 0 0 263 0
chr4_gl000193_random 0 0 0 0 | chr4_gl000193_random 0 0 1887 0
chr4_gl000194_random 0 0 0 0 | chr4_gl000194_random 0 0 2374 0
chr6_apd_hap1 0 0 0 0 | chr6_apd_hap1 0 0 24 0
chr6_cox_hap2 0 0 0 0 | chr6_cox_hap2 0 0 383 0
chr6_dbb_hap3 0 0 0 0 | chr6_dbb_hap3 0 0 284 0
chr6_mann_hap4 0 0 0 0 | chr6_mann_hap4 0 0 798 0
chr6_mcf_hap5 0 0 0 0 | chr6_mcf_hap5 0 0 132 0
chr6_qbl_hap6 0 0 0 0 | chr6_qbl_hap6 0 0 960 0
chr6_ssto_hap7 0 0 0 0 | chr6_ssto_hap7 0 0 812 0
chr7_gl000195_random 0 0 0 0 | chr7_gl000195_random 0 0 3267 0
chr8_gl000196_random 0 0 0 0 | chr8_gl000196_random 0 0 11 0
chr8_gl000197_random 0 0 0 0 | chr8_gl000197_random 0 0 1 0
chr9_gl000198_random 0 0 0 0 | chr9_gl000198_random 0 0 1866 0
chr9_gl000199_random 0 0 0 0 | chr9_gl000199_random 0 0 7465 0
chr9_gl000200_random 0 0 0 0 | chr9_gl000200_random 0 0 1 0
chr9_gl000201_random 0 0 0 0 | chr9_gl000201_random 0 0 12 0
chr11_gl000202_random 0 0 0 0 | chr11_gl000202_random 0 0 78 0
chr17_ctg5_hap1 0 0 0 0 | chr17_ctg5_hap1 0 0 608 0
chr17_gl000203_random 0 0 0 0 | chr17_gl000203_random 0 0 444 0
chr17_gl000204_random 0 0 0 0 | chr17_gl000204_random 0 0 92 0
chr17_gl000205_random 0 0 0 0 | chr17_gl000205_random 0 0 2874 0
chr17_gl000206_random 0 0 0 0 | chr17_gl000206_random 0 0 13 0
chr18_gl000207_random 0 0 0 0 | chr18_gl000207_random 0 0 100 0
chr19_gl000208_random 0 0 0 0 | chr19_gl000208_random 0 0 2216 0
chr19_gl000209_random 0 0 0 0 | chr19_gl000209_random 0 0 371 0
chr21_gl000210_random 0 0 0 0 | chr21_gl000210_random 0 0 21 0
chrUn_gl000211 0 0 0 0 | chrUn_gl000211 0 0 1967 0
chrUn_gl000212 0 0 0 0 | chrUn_gl000212 0 0 1316 0
chrUn_gl000213 0 0 0 0 | chrUn_gl000213 0 0 266 0
chrUn_gl000214 0 0 0 0 | chrUn_gl000214 0 0 2159 0
chrUn_gl000215 0 0 0 0 | chrUn_gl000215 0 0 129 0
chrUn_gl000216 0 0 0 0 | chrUn_gl000216 0 0 8530 0
chrUn_gl000217 0 0 0 0 | chrUn_gl000217 0 0 2138 0
chrUn_gl000218 0 0 0 0 | chrUn_gl000218 0 0 1480 0
chrUn_gl000219 0 0 0 0 | chrUn_gl000219 0 0 5820 0
chrUn_gl000220 0 0 0 0 | chrUn_gl000220 0 0 1730 0
chrUn_gl000221 0 0 0 0 | chrUn_gl000221 0 0 958 0
chrUn_gl000222 0 0 0 0 | chrUn_gl000222 0 0 2193 0
chrUn_gl000223 0 0 0 0 | chrUn_gl000223 0 0 12 0
chrUn_gl000224 0 0 0 0 | chrUn_gl000224 0 0 3738 0
chrUn_gl000225 0 0 0 0 | chrUn_gl000225 0 0 15234 0
chrUn_gl000226 0 0 0 0 | chrUn_gl000226 0 0 257 0
chrUn_gl000227 0 0 0 0 | chrUn_gl000227 0 0 80 0
chrUn_gl000228 0 0 0 0 | chrUn_gl000228 0 0 1299 0
chrUn_gl000229 0 0 0 0 | chrUn_gl000229 0 0 1080 0
chrUn_gl000230 0 0 0 0 | chrUn_gl000230 0 0 409 0
chrUn_gl000231 0 0 0 0 | chrUn_gl000231 0 0 1118 0
chrUn_gl000232 0 0 0 0 | chrUn_gl000232 0 0 2148 0
chrUn_gl000233 0 0 0 0 | chrUn_gl000233 0 0 433 0
chrUn_gl000234 0 0 0 0 | chrUn_gl000234 0 0 2281 0
chrUn_gl000235 0 0 0 0 | chrUn_gl000235 0 0 1224 0
chrUn_gl000236 0 0 0 0 | chrUn_gl000236 0 0 131 0
chrUn_gl000237 0 0 0 0 | chrUn_gl000237 0 0 493 0
chrUn_gl000238 0 0 0 0 | chrUn_gl000238 0 0 19 0
chrUn_gl000239 0 0 0 0 | chrUn_gl000239 0 0 80 0
chrUn_gl000240 0 0 0 0 | chrUn_gl000240 0 0 696 0
chrUn_gl000241 0 0 0 0 | chrUn_gl000241 0 0 1665 0
chrUn_gl000242 0 0 0 0 | chrUn_gl000242 0 0 32 0
chrUn_gl000243 0 0 0 0 | chrUn_gl000243 0 0 328 0
chrUn_gl000244 0 0 0 0 | chrUn_gl000244 0 0 83 0
chrUn_gl000245 0 0 0 0 | chrUn_gl000245 0 0 110 0
chrUn_gl000246 0 0 0 0 | chrUn_gl000246 0 0 73 0
chrUn_gl000247 0 0 0 0 | chrUn_gl000247 0 0 196 0
chrUn_gl000248 0 0 0 0 | chrUn_gl000248 0 0 26 0
total 0 75513 3424104 3436364 | total 0 91961 5792834 57772
Count of types of genotype
CT 696904 | CT 777273
AG 696669 | AG 757560
CC 358678 | CC 686562
AA 319335 | AA 715299
TT 320081 | TT 713424
GG 358891 | GG 633290
AC 172678 | AC 228469
GT 173342 | GT 208315
CG 178718 | CG 196041
AT 148808 | AT 212648
DD 0 | DD 249895
DI 0 | DI 237611
II 0 | II 176447
T 18873 | T 22689
A 18752 | A 22586
G 19064 | G 20696
C 18824 | C 20492
I 0 | I 5271
D 0 | D 227
DeepVariant makes a lot more calls, mainly because it also reports places where it decides that the genotype is homozygous reference, which GATK doesn’t report. In addition, the GATK calls were filtered to remove the low-evidence calls, while DeepVariant was set up to report everything.
DeepVariant makes a huge number of diploid calls on X and Y, which is a little suspicious.
The ratio of mismatches to matches is 0.01681 (57,772/3,436,364), which is about a 1.65% discrepancy rate among the sites called in both sets. I don’t know which of the variant callers is better on this data, but DeepVariant was supposedly better on some recent tests on autosomal chromosomes (I’ve not looked up the paper yet).
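The two figures differ only in the denominator: the ratio divides mismatches by matches, while the discrepancy rate divides by all sites compared. A minimal sketch of the arithmetic, using the totals from the table above (the function name `discrepancy` is just for illustration):

```python
def discrepancy(matches, mismatches):
    """Return (mismatch-to-match ratio, fraction of compared sites that disagree)."""
    ratio = mismatches / matches
    rate = mismatches / (matches + mismatches)
    return ratio, rate


# Totals from the GATK vs. DeepVariant comparison on the Dante Labs data
ratio, rate = discrepancy(matches=3436364, mismatches=57772)
print(f"ratio = {ratio:.5f}, rate = {rate:.2%}")
# prints: ratio = 0.01681, rate = 1.65%
```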
Comparing the Dante Labs DeepVariant calls with the fullgenomes calls (on GRCh38) shows a bigger difference:
dantelabs-deepvariant |fullgenomes snpeff.vep.vcf.gz
5508932 genotype_sites |4906211 genotype_sites
chr no-call haploid diploid matches| chr no-call haploid diploid mismatches
chr1 0 0 443442 336398 | chr1 0 0 383857 23983
chr2 0 0 434888 356031 | chr2 0 0 390198 17928
chr3 0 0 355613 296731 | chr3 0 0 323097 12684
chr4 0 0 380259 320989 | chr4 0 0 348597 13450
chr5 0 0 316748 262579 | chr5 0 0 286416 11314
chr6 0 0 299986 253300 | chr6 0 0 297452 8599
chr7 0 0 314759 252073 | chr7 0 0 279633 14099
chr8 0 0 259602 217907 | chr8 0 0 240326 8761
chr9 0 0 242579 183556 | chr9 0 0 216318 19382
chr10 0 0 280582 224653 | chr10 0 0 247472 12447
chr11 0 0 256993 214312 | chr11 0 0 233892 8491
chr12 0 0 252220 205678 | chr12 0 0 224952 8601
chr13 0 0 209801 165921 | chr13 0 0 189629 11023
chr14 0 0 161466 132483 | chr14 0 0 150472 6803
chr15 0 0 154740 123209 | chr15 0 0 143682 8381
chr16 0 0 169234 128717 | chr16 0 0 146914 9197
chr17 0 0 158498 111002 | chr17 0 0 136966 12128
chr18 0 0 150941 123100 | chr18 0 0 136431 5881
chr19 0 0 129958 94953 | chr19 0 0 109775 6398
chr20 0 0 147206 95398 | chr20 0 0 125200 20597
chr21 0 0 94029 64558 | chr21 0 0 87210 14591
chr22 0 0 93982 58824 | chr22 0 0 83531 14887
chrX 0 93759 70714 96500 | chrX 0 105204 12991 14629
chrY 0 2818 34115 2134 | chrY 0 5979 0 3321
chrM 0 0 0 0 | chrM 0 17 0 0
total 0 96577 5412355 4321006 | total 0 111200 4795011 287575
Count of types of genotype
CT 761277 | CT 854737
AG 741910 | AG 828712
AA 636195 | AA 416556
TT 635850 | TT 414670
CC 609922 | CC 401031
GG 557555 | GG 400383
DI 234030 | DI 269235
AC 224266 | AC 266527
AT 209513 | AT 245779
GT 204389 | GT 239801
CG 192211 | CG 225762
II 173913 | II 176278
DD 231324 | DD 46625
A 23799 | A 26598
T 23687 | T 26573
G 21821 | G 23312
C 21561 | C 23061
I 5438 | I 11656
*T 0 | *T 2890
*A 0 | *A 2522
*G 0 | *G 1757
*C 0 | *C 1746
D 271 | D 0
Now the ratio of mismatches to matches is 0.06655 (287,575/4,321,006), a 6.2% discrepancy rate. A few of the discrepancies were haploid/diploid differences on chromosome X, but that accounts for only about 1100 differences. None of the mismatches involve the weird *A, *C, *G, *T genotype calls.
I do have 4,321,006 genotype calls that I am now pretty confident of, as they were called by two different variant callers from two different sequencing runs using different sequencing technology.
But I’m not sure which data set or variant caller to favor on the 287,575 disagreements, nor what to do about the locations where one variant caller made a call for a site and the other didn’t. The fullgenomes data includes a gVCF file, which has calls for every base that got reads mapped to it, but I’ve not tried extracting data from that format yet (it’s bad enough having to try to extract data from the two different vcf formats).
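The bookkeeping for agreements, disagreements, and sites called by only one caller can be sketched as follows. This is a minimal illustration, not the comparison script used for the tables above: it assumes each call set has already been loaded into a dict keyed by `(chromosome, position)` on the same reference genome, with genotypes normalized to comparable strings. The name `compare_calls` and the dict layout are hypothetical.

```python
def compare_calls(calls_a, calls_b):
    """Classify sites from two call sets keyed by (chrom, pos)."""
    shared = calls_a.keys() & calls_b.keys()   # sites called in both sets
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return {
        "matches": matches,
        "mismatches": len(shared) - matches,
        "only_a": len(calls_a) - len(shared),  # called only in the first set
        "only_b": len(calls_b) - len(shared),  # called only in the second set
    }


# Tiny made-up example
a = {("chr1", 100): "CT", ("chr1", 200): "AA", ("chr2", 50): "GG"}
b = {("chr1", 100): "CT", ("chr1", 200): "AG", ("chrX", 7): "T"}
print(compare_calls(a, b))
# prints: {'matches': 1, 'mismatches': 1, 'only_a': 1, 'only_b': 1}
```

The `only_a` and `only_b` counts are where a gVCF could help, since it would show whether the missing site was homozygous reference or simply had no reads.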
I was planning to compare the 23andme data with each of the whole-genome vcf calls, making the assumption that the sequencing and variant caller that agrees most with the hybridization-based genotyping by 23andme would be the most accurate. (I also want to make a revised “23andme” data set that replaces any genotyping calls where both whole-genome sequences agreed with each other, but disagreed with 23andme.)
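The revised data set described above could be built along these lines. This is only a sketch under several assumptions: all three call sets are dicts keyed by `(chromosome, position)` on one reference genome, and the genotypes have already been put on the same strand with the same allele ordering (array data often needs that normalization). The names `agreement` and `revise_array_calls` are hypothetical.

```python
def agreement(array_calls, wgs_calls):
    """Fraction of array sites also called by WGS where the genotypes agree."""
    shared = [site for site in array_calls if site in wgs_calls]
    if not shared:
        return None
    same = sum(1 for site in shared if array_calls[site] == wgs_calls[site])
    return same / len(shared)


def revise_array_calls(array_calls, wgs_a, wgs_b):
    """Replace an array call only where both WGS sets agree with each other
    and disagree with the array."""
    revised = dict(array_calls)
    changed = []
    for site, array_gt in array_calls.items():
        gt_a, gt_b = wgs_a.get(site), wgs_b.get(site)
        if gt_a is not None and gt_a == gt_b and gt_a != array_gt:
            revised[site] = gt_a
            changed.append(site)
    return revised, changed
```

Sites where the two sequencing runs disagree with each other, or where only one of them has a call, are left at the array value in this sketch.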
To make this all work, I need to have all the variant calling be relative to the same reference genome, which means either lifting Dante Labs and 23andme to GRCh38 or going the other way and moving the fullgenomes vcf files to GRCh37. I could also try having Kishwar do the DeepVariant calls on both reference genomes for the fullgenomes data.
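Lifting coordinates between references works by mapping positions through blocks that align between the two assemblies. Real conversions use UCSC chain files with tools such as liftOver or CrossMap; the toy sketch below only illustrates the idea. The block coordinates are invented, the name `lift` is hypothetical, and strand flips and allele changes between references are ignored.

```python
import bisect

# Hypothetical aligned blocks on one chromosome:
# (source_start, source_end, target_start), 0-based, half-open, sorted by start
BLOCKS = [(0, 1000, 500), (1200, 3000, 1500), (3000, 4000, 3400)]


def lift(pos, blocks=BLOCKS):
    """Map a source position to the target assembly, or None if unaligned."""
    starts = [block[0] for block in blocks]
    i = bisect.bisect_right(starts, pos) - 1   # last block starting at or before pos
    if i < 0:
        return None
    src_start, src_end, dst_start = blocks[i]
    if pos >= src_end:                         # falls in a gap between blocks
        return None
    return dst_start + (pos - src_start)


print(lift(100))    # prints: 600
print(lift(1100))   # prints: None (position is in an unaligned gap)
print(lift(3500))   # prints: 3900
```

Positions that fall in unaligned gaps are one reason a lifted call set ends up smaller than the original, so some sites would be lost in either direction.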
I’ll need to think a bit about what would be most useful.