Gas station without pumps

2019 June 18

Comparing fullgenomes.com with Dante Labs

Filed under: Uncategorized — gasstationwithoutpumps @ 23:58

Now that my grading is done, I finally had some time to look at the whole-genome sequencing data I got from fullgenomes.com. I ordered the sequencing on 7 March 2019, shipped the spit kit on 13 March 2019, and got all the data by 3 May 2019.  Their price of $1175 for 30× whole-genome sequencing is fairly typical of the direct-to-consumer sequencing outfits.  (Dante Labs is much cheaper, but a number of people have been unhappy with slow delivery or non-delivery of data.)

Here is a summary of what I saw in the fullgenomes snpeff.vep.vcf file:

There are 4906211 genotype sites.
Count by filter values:
	 . 	 4906211
By chromosome, the number of no-call, haploid, and diploid genotypes:
	 chr1 	 0 	 0 	 383857
	 chr2 	 0 	 0 	 390198
	 chr3 	 0 	 0 	 323097
	 chr4 	 0 	 0 	 348597
	 chr5 	 0 	 0 	 286416
	 chr6 	 0 	 0 	 297452
	 chr7 	 0 	 0 	 279633
	 chr8 	 0 	 0 	 240326
	 chr9 	 0 	 0 	 216318
	 chr10 	 0 	 0 	 247472
	 chr11 	 0 	 0 	 233892
	 chr12 	 0 	 0 	 224952
	 chr13 	 0 	 0 	 189629
	 chr14 	 0 	 0 	 150472
	 chr15 	 0 	 0 	 143682
	 chr16 	 0 	 0 	 146914
	 chr17 	 0 	 0 	 136966
	 chr18 	 0 	 0 	 136431
	 chr19 	 0 	 0 	 109775
	 chr20 	 0 	 0 	 125200
	 chr21 	 0 	 0 	 87210
	 chr22 	 0 	 0 	 83531
	 chrX 	 0 	 105204 	 12991
	 chrY 	 0 	 5979 	 0
	 chrM 	 0 	 17 	 0

By type of genotype, there are
	 CT 	 854737
	 AG 	 828712
	 AA 	 416556
	 TT 	 414670
	 CC 	 401031
	 GG 	 400383
	 DI 	 269235
	 AC 	 266527
	 AT 	 245779
	 GT 	 239801
	 CG 	 225762
	 II 	 176278
	 DD 	 46625
	 A 	 26598
	 T 	 26573
	 G 	 23312
	 C 	 23061
	 I 	 11656
	 *T 	 2890
	 *A 	 2522
	 *G 	 1757
	 *C 	 1746

I don’t know what the *T, *A, *G, and *C sites are.  The diploid sites on the X chromosome may be sites that have close homologs on the Y chromosome—close enough that mapping algorithms see them as being the same.
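A tally like the one above can be sketched in a few lines of Python. This is an illustration only, not the script that produced the numbers in this post. It assumes a single-sample, tab-separated VCF with GT as the first FORMAT field, and it assumes a 23andme-style coding of indels in which the longest allele at a site is written as I and any shorter allele as D. The names `allele_code` and `tally` are made up for the example.

```python
from collections import Counter, defaultdict

def allele_code(allele, site_alleles):
    """Collapse one allele to a single letter: a base, '*', or I/D at indel sites."""
    if allele == "*":
        return "*"
    longest = max(len(a) for a in site_alleles if a != "*")
    if longest == 1:
        return allele  # every allele is a single base: plain SNP site
    # ASSUMPTION: longest allele is the "insertion" form, anything shorter is "D"
    return "I" if len(allele) == longest else "D"

def tally(vcf_lines):
    """Return ({chrom: [no_call, haploid, diploid]}, Counter of genotype types)."""
    by_chrom = defaultdict(lambda: [0, 0, 0])
    by_type = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, ref, alt = fields[0], fields[3], fields[4]
        alleles = [ref] + alt.split(",")
        # GT is the first colon-separated item of the sample column
        gt = fields[9].split(":")[0].replace("|", "/").split("/")
        if "." in gt:
            by_chrom[chrom][0] += 1
            continue
        by_chrom[chrom][min(len(gt), 2)] += 1  # 1 allele: haploid, 2: diploid
        codes = sorted(allele_code(alleles[int(i)], alleles) for i in gt)
        by_type["".join(codes)] += 1
    return dict(by_chrom), by_type

# Four made-up records: a het SNP, a het deletion, a haploid X call, a '*' allele
example = [
    "chr1\t100\t.\tC\tT\t.\t.\t.\tGT\t0/1",
    "chr1\t200\t.\tAT\tA\t.\t.\t.\tGT\t0/1",
    "chrX\t300\t.\tG\tA\t.\t.\t.\tGT\t1",
    "chr1\t400\t.\tG\tT,*\t.\t.\t.\tGT\t1/2",
]
by_chrom, by_type = tally(example)
```

Sorting the allele letters within a genotype means that C/T and T/C are counted together, which matches the way the genotype types are listed above.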

What I wanted to do first was to compare the fullgenomes data with the 23andme genotyping, as I had done with the Dante Labs data.  That turned out to be somewhat difficult, because fullgenomes called variants relative to the latest reference genome (GRCh38), while 23andme and Dante Labs both used the older reference genome (GRCh37 = hg19).  That difference means that all the coordinates are different, so simple position-by-position comparisons are difficult.

I have variant calls for the Dante Labs data on both reference genomes from Google’s DeepVariant, which allows two comparisons. First, I can compare the Dante Labs calls (which I believe were made with the GATK pipeline) with the DeepVariant calls on the same data, to see how much difference the variant caller makes. Second, I can compare the fullgenomes calls with the DeepVariant calls on the Dante Labs data, where both the variant caller and the sequencing method differ.
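The core of such a comparison can be sketched as follows. This is a simplified illustration, not the actual comparison script: it assumes each call set has already been loaded into a dictionary keyed by (chromosome, position) on the same reference genome, with genotypes normalized to the same string form, and the function name `compare_calls` is made up for the example.

```python
def compare_calls(calls_a, calls_b):
    """Count agreement between two call sets keyed by (chrom, pos).

    A "match" is a site present in both sets with the same genotype;
    a "mismatch" is a site present in both sets with different genotypes.
    Sites called in only one set are counted separately.
    """
    shared = calls_a.keys() & calls_b.keys()
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return {
        "matches": matches,
        "mismatches": len(shared) - matches,
        "only_a": len(calls_a) - len(shared),
        "only_b": len(calls_b) - len(shared),
    }

# Made-up toy data: two shared sites (one agreeing), one site unique to each set
gatk = {("chr1", 100): "CT", ("chr1", 200): "AA", ("chr2", 50): "GG"}
deepvariant = {("chr1", 100): "CT", ("chr1", 200): "AG", ("chr3", 75): "TT"}
summary = compare_calls(gatk, deepvariant)
```

Keeping the "only in one set" counts separate matters here, because a caller that reports homozygous-reference sites will have many sites that the other caller never lists.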

dantelabs GATK                         |dantelabs-deepvariant
3499617 genotype_sites                 |5884795 genotype_sites                 
  chr   no-call haploid diploid matches|  chr   no-call haploid diploid mismatches
  chrM      0      21       0      0   |  chrM      0       0       0      0   
  chr1      0       0  267787 262884   |  chr1      0       0  446877   4556   
  chr2      0       0  287274 282806   |  chr2      0       0  459298   4145   
  chr3      0       0  240607 237530   |  chr3      0       0  367320   2822   
  chr4      0       0  261070 257783   |  chr4      0       0  400640   2868   
  chr5      0       0  212325 210325   |  chr5      0       0  327319   1827   
  chr6      0       0  204102 201254   |  chr6      0       0  318344   2616   
  chr7      0       0  203647 199384   |  chr7      0       0  350292   3857   
  chr8      0       0  183159 181044   |  chr8      0       0  280030   1902   
  chr9      0       0  148637 143697   |  chr9      0       0  257715   4696   
  chr10     0       0  177729 174246   |  chr10     0       0  295033   3087   
  chr11     0       0  174926 172444   |  chr11     0       0  275753   2256   
  chr12     0       0  165987 163409   |  chr12     0       0  269918   2286   
  chr13     0       0  133928 132587   |  chr13     0       0  199795   1215   
  chr14     0       0  111553 109462   |  chr14     0       0  178071   1979   
  chr15     0       0  103635 101314   |  chr15     0       0  169214   2201   
  chr16     0       0  107064 104448   |  chr16     0       0  194206   2398   
  chr17     0       0   89744  88302   |  chr17     0       0  166728   1300   
  chr18     0       0  100640  99358   |  chr18     0       0  154510   1143   
  chr19     0       0   75073  73488   |  chr19     0       0  146076   1415   
  chr20     0       0   73566  71604   |  chr20     0       0  131282   1844   
  chr21     0       0   52060  50356   |  chr21     0       0  101284   1595   
  chr22     0       0   45153  43796   |  chr22     0       0   84522   1248   
  chrX      0   74099    1801  73134   |  chrX      0   88863   84484   2496   
  chrY      0    1393    2637   1709   |  chrY      0    3098   43760   2020   
  chr1_gl000191_random     0       0       0      0   |  chr1_gl000191_random     0       0      99      0   
  chr1_gl000192_random     0       0       0      0   |  chr1_gl000192_random     0       0     976      0   
  chr4_ctg9_hap1     0       0       0      0   |  chr4_ctg9_hap1     0       0     263      0   
  chr4_gl000193_random     0       0       0      0   |  chr4_gl000193_random     0       0    1887      0   
  chr4_gl000194_random     0       0       0      0   |  chr4_gl000194_random     0       0    2374      0   
  chr6_apd_hap1     0       0       0      0   |  chr6_apd_hap1     0       0      24      0   
  chr6_cox_hap2     0       0       0      0   |  chr6_cox_hap2     0       0     383      0   
  chr6_dbb_hap3     0       0       0      0   |  chr6_dbb_hap3     0       0     284      0   
  chr6_mann_hap4     0       0       0      0   |  chr6_mann_hap4     0       0     798      0   
  chr6_mcf_hap5     0       0       0      0   |  chr6_mcf_hap5     0       0     132      0   
  chr6_qbl_hap6     0       0       0      0   |  chr6_qbl_hap6     0       0     960      0   
  chr6_ssto_hap7     0       0       0      0   |  chr6_ssto_hap7     0       0     812      0   
  chr7_gl000195_random     0       0       0      0   |  chr7_gl000195_random     0       0    3267      0   
  chr8_gl000196_random     0       0       0      0   |  chr8_gl000196_random     0       0      11      0   
  chr8_gl000197_random     0       0       0      0   |  chr8_gl000197_random     0       0       1      0   
  chr9_gl000198_random     0       0       0      0   |  chr9_gl000198_random     0       0    1866      0   
  chr9_gl000199_random     0       0       0      0   |  chr9_gl000199_random     0       0    7465      0   
  chr9_gl000200_random     0       0       0      0   |  chr9_gl000200_random     0       0       1      0   
  chr9_gl000201_random     0       0       0      0   |  chr9_gl000201_random     0       0      12      0   
  chr11_gl000202_random     0       0       0      0   |  chr11_gl000202_random     0       0      78      0   
  chr17_ctg5_hap1     0       0       0      0   |  chr17_ctg5_hap1     0       0     608      0   
  chr17_gl000203_random     0       0       0      0   |  chr17_gl000203_random     0       0     444      0   
  chr17_gl000204_random     0       0       0      0   |  chr17_gl000204_random     0       0      92      0   
  chr17_gl000205_random     0       0       0      0   |  chr17_gl000205_random     0       0    2874      0   
  chr17_gl000206_random     0       0       0      0   |  chr17_gl000206_random     0       0      13      0   
  chr18_gl000207_random     0       0       0      0   |  chr18_gl000207_random     0       0     100      0   
  chr19_gl000208_random     0       0       0      0   |  chr19_gl000208_random     0       0    2216      0   
  chr19_gl000209_random     0       0       0      0   |  chr19_gl000209_random     0       0     371      0   
  chr21_gl000210_random     0       0       0      0   |  chr21_gl000210_random     0       0      21      0   
  chrUn_gl000211     0       0       0      0   |  chrUn_gl000211     0       0    1967      0   
  chrUn_gl000212     0       0       0      0   |  chrUn_gl000212     0       0    1316      0   
  chrUn_gl000213     0       0       0      0   |  chrUn_gl000213     0       0     266      0   
  chrUn_gl000214     0       0       0      0   |  chrUn_gl000214     0       0    2159      0   
  chrUn_gl000215     0       0       0      0   |  chrUn_gl000215     0       0     129      0   
  chrUn_gl000216     0       0       0      0   |  chrUn_gl000216     0       0    8530      0   
  chrUn_gl000217     0       0       0      0   |  chrUn_gl000217     0       0    2138      0   
  chrUn_gl000218     0       0       0      0   |  chrUn_gl000218     0       0    1480      0   
  chrUn_gl000219     0       0       0      0   |  chrUn_gl000219     0       0    5820      0   
  chrUn_gl000220     0       0       0      0   |  chrUn_gl000220     0       0    1730      0   
  chrUn_gl000221     0       0       0      0   |  chrUn_gl000221     0       0     958      0   
  chrUn_gl000222     0       0       0      0   |  chrUn_gl000222     0       0    2193      0   
  chrUn_gl000223     0       0       0      0   |  chrUn_gl000223     0       0      12      0   
  chrUn_gl000224     0       0       0      0   |  chrUn_gl000224     0       0    3738      0   
  chrUn_gl000225     0       0       0      0   |  chrUn_gl000225     0       0   15234      0   
  chrUn_gl000226     0       0       0      0   |  chrUn_gl000226     0       0     257      0   
  chrUn_gl000227     0       0       0      0   |  chrUn_gl000227     0       0      80      0   
  chrUn_gl000228     0       0       0      0   |  chrUn_gl000228     0       0    1299      0   
  chrUn_gl000229     0       0       0      0   |  chrUn_gl000229     0       0    1080      0   
  chrUn_gl000230     0       0       0      0   |  chrUn_gl000230     0       0     409      0   
  chrUn_gl000231     0       0       0      0   |  chrUn_gl000231     0       0    1118      0   
  chrUn_gl000232     0       0       0      0   |  chrUn_gl000232     0       0    2148      0   
  chrUn_gl000233     0       0       0      0   |  chrUn_gl000233     0       0     433      0   
  chrUn_gl000234     0       0       0      0   |  chrUn_gl000234     0       0    2281      0   
  chrUn_gl000235     0       0       0      0   |  chrUn_gl000235     0       0    1224      0   
  chrUn_gl000236     0       0       0      0   |  chrUn_gl000236     0       0     131      0   
  chrUn_gl000237     0       0       0      0   |  chrUn_gl000237     0       0     493      0   
  chrUn_gl000238     0       0       0      0   |  chrUn_gl000238     0       0      19      0   
  chrUn_gl000239     0       0       0      0   |  chrUn_gl000239     0       0      80      0   
  chrUn_gl000240     0       0       0      0   |  chrUn_gl000240     0       0     696      0   
  chrUn_gl000241     0       0       0      0   |  chrUn_gl000241     0       0    1665      0   
  chrUn_gl000242     0       0       0      0   |  chrUn_gl000242     0       0      32      0   
  chrUn_gl000243     0       0       0      0   |  chrUn_gl000243     0       0     328      0   
  chrUn_gl000244     0       0       0      0   |  chrUn_gl000244     0       0      83      0   
  chrUn_gl000245     0       0       0      0   |  chrUn_gl000245     0       0     110      0   
  chrUn_gl000246     0       0       0      0   |  chrUn_gl000246     0       0      73      0   
  chrUn_gl000247     0       0       0      0   |  chrUn_gl000247     0       0     196      0   
  chrUn_gl000248     0       0       0      0   |  chrUn_gl000248     0       0      26      0   

  total     0   75513 3424104 3436364   |  total     0   91961 5792834  57772   

Count of types of genotype
   CT  696904                          |   CT  777273                          
   AG  696669                          |   AG  757560                          
   CC  358678                          |   CC  686562                          
   AA  319335                          |   AA  715299                          
   TT  320081                          |   TT  713424                          
   GG  358891                          |   GG  633290                          
   AC  172678                          |   AC  228469                          
   GT  173342                          |   GT  208315                          
   CG  178718                          |   CG  196041                          
   AT  148808                          |   AT  212648                          
   DD       0                          |   DD  249895                          
   DI       0                          |   DI  237611                          
   II       0                          |   II  176447                          
   T    18873                          |   T    22689                          
   A    18752                          |   A    22586                          
   G    19064                          |   G    20696                          
   C    18824                          |   C    20492                          
   I        0                          |   I     5271                          
   D        0                          |   D      227                          

DeepVariant makes a lot more calls, mainly because it also reports places where it decides that the genotype is homozygous reference (which GATK doesn’t report), but also because the GATK calls were filtered to remove the low-evidence calls, while DeepVariant was set up to report everything.
DeepVariant makes a huge number of diploid calls on X and Y, which is a little suspicious.
The ratio of mismatches to matches is 0.01681, about a 1.65% discrepancy rate. I don’t know which of the variant callers is better on this data, but DeepVariant was supposedly better on some recent tests on autosomal chromosomes (I’ve not looked up the paper yet).

Comparing the Dante Labs DeepVariant calls with the fullgenomes calls (on GRCh38) shows a bigger difference:


dantelabs-deepvariant                  |fullgenomes snpeff.vep.vcf.gz
5508932 genotype_sites                 |4906211 genotype_sites                 
  chr   no-call haploid diploid matches|  chr   no-call haploid diploid mismatches
  chr1      0       0  443442 336398   |  chr1      0       0  383857  23983   
  chr2      0       0  434888 356031   |  chr2      0       0  390198  17928   
  chr3      0       0  355613 296731   |  chr3      0       0  323097  12684   
  chr4      0       0  380259 320989   |  chr4      0       0  348597  13450   
  chr5      0       0  316748 262579   |  chr5      0       0  286416  11314   
  chr6      0       0  299986 253300   |  chr6      0       0  297452   8599   
  chr7      0       0  314759 252073   |  chr7      0       0  279633  14099   
  chr8      0       0  259602 217907   |  chr8      0       0  240326   8761   
  chr9      0       0  242579 183556   |  chr9      0       0  216318  19382   
  chr10     0       0  280582 224653   |  chr10     0       0  247472  12447   
  chr11     0       0  256993 214312   |  chr11     0       0  233892   8491   
  chr12     0       0  252220 205678   |  chr12     0       0  224952   8601   
  chr13     0       0  209801 165921   |  chr13     0       0  189629  11023   
  chr14     0       0  161466 132483   |  chr14     0       0  150472   6803   
  chr15     0       0  154740 123209   |  chr15     0       0  143682   8381   
  chr16     0       0  169234 128717   |  chr16     0       0  146914   9197   
  chr17     0       0  158498 111002   |  chr17     0       0  136966  12128   
  chr18     0       0  150941 123100   |  chr18     0       0  136431   5881   
  chr19     0       0  129958  94953   |  chr19     0       0  109775   6398   
  chr20     0       0  147206  95398   |  chr20     0       0  125200  20597   
  chr21     0       0   94029  64558   |  chr21     0       0   87210  14591   
  chr22     0       0   93982  58824   |  chr22     0       0   83531  14887   
  chrX      0   93759   70714  96500   |  chrX      0  105204   12991  14629   
  chrY      0    2818   34115   2134   |  chrY      0    5979       0   3321   
  chrM      0       0       0      0   |  chrM      0      17       0      0   

  total     0   96577 5412355 4321006   |  total     0  111200 4795011 287575   

Count of types of genotype
   CT  761277                          |   CT  854737                          
   AG  741910                          |   AG  828712                          
   AA  636195                          |   AA  416556                          
   TT  635850                          |   TT  414670                          
   CC  609922                          |   CC  401031                          
   GG  557555                          |   GG  400383                          
   DI  234030                          |   DI  269235                          
   AC  224266                          |   AC  266527                          
   AT  209513                          |   AT  245779                          
   GT  204389                          |   GT  239801                          
   CG  192211                          |   CG  225762                          
   II  173913                          |   II  176278                          
   DD  231324                          |   DD   46625                          
   A    23799                          |   A    26598                          
   T    23687                          |   T    26573                          
   G    21821                          |   G    23312                          
   C    21561                          |   C    23061                          
   I     5438                          |   I    11656                          
   *T       0                          |   *T    2890                          
   *A       0                          |   *A    2522                          
   *G       0                          |   *G    1757                          
   *C       0                          |   *C    1746                          
   D      271                          |   D        0                          

Now the ratio of mismatches to matches is 0.06655, a 6.2% discrepancy rate. A few of the discrepancies were haploid/diploid differences on chromosome X, but that is only about 1100 differences. None of the mismatches involve the weird *A, *C, *G, *T genotype calls.
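The two figures quoted for each comparison are different quantities: the ratio is mismatches divided by matches, while the discrepancy rate is mismatches divided by all sites compared (matches plus mismatches). A short Python check using the totals from the two tables above (the function name `discrepancy` is just for this example):

```python
def discrepancy(matches, mismatches):
    """Return (mismatch-to-match ratio, mismatch fraction of all compared sites)."""
    ratio = mismatches / matches
    rate = mismatches / (matches + mismatches)
    return ratio, rate

# Totals taken from the "total" rows of the two comparison tables
comparisons = {
    "Dante Labs GATK vs. Dante Labs DeepVariant": (3436364, 57772),
    "Dante Labs DeepVariant vs. fullgenomes": (4321006, 287575),
}
for label, (matches, mismatches) in comparisons.items():
    ratio, rate = discrepancy(matches, mismatches)
    print(f"{label}: ratio {ratio:.5f}, discrepancy rate {rate:.2%}")
```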

I do have 4,321,006 genotype calls that I am now pretty confident of, as they were called by two different variant callers from two different sequencing runs using different sequencing technology.

But I’m not sure which data set or variant caller to favor on the 287,575 disagreements, nor what to do about the locations where one variant caller made a call for a site and the other didn’t. The fullgenomes data includes a gVCF file, which has calls for every base that got reads mapped to it, but I’ve not tried extracting data from that format yet (it’s bad enough having to try to extract data from the two different VCF formats).

I was planning to compare the 23andme data with each of the whole-genome vcf calls, making the assumption that the sequencing and variant caller that agrees most with the hybridization-based genotyping by 23andme would be the most accurate. (I also want to make a revised “23andme” data set that replaces any genotyping calls where both whole-genome sequences agreed with each other, but disagreed with 23andme.)
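The revision rule for the “23andme” data set can be sketched like this. It is a hypothetical illustration of the rule as stated, not working pipeline code: it assumes all three call sets are dictionaries keyed by the same site identifier on the same reference genome, with genotypes in the same normalized string form, and the name `revise_array_calls` is invented for the example.

```python
def revise_array_calls(array_calls, wgs_a, wgs_b):
    """Replace an array genotype only where both WGS call sets agree with
    each other and disagree with the array. Returns (revised, changed_sites)."""
    revised = dict(array_calls)  # copy, so the original data is left untouched
    changed = []
    for site, array_gt in array_calls.items():
        gt_a = wgs_a.get(site)
        gt_b = wgs_b.get(site)
        # Require a call in BOTH whole-genome sets, identical to each other
        if gt_a is not None and gt_a == gt_b and gt_a != array_gt:
            revised[site] = gt_a
            changed.append(site)
    return revised, changed

# Made-up toy data: only rs2 meets the rule (both WGS sets say AG, array says AA)
array = {"rs1": "CT", "rs2": "AA", "rs3": "GG", "rs4": "TT"}
dante = {"rs1": "CT", "rs2": "AG", "rs3": "AG"}
fullgenomes = {"rs1": "CT", "rs2": "AG", "rs3": "GG", "rs4": "CT"}
revised, changed = revise_array_calls(array, dante, fullgenomes)
```

Sites where the two whole-genome call sets disagree, or where only one of them has a call, are left as the array reported them.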

To make this all work, I need to have all the variant calling be relative to the same reference genome, which means either lifting the Dante Labs and 23andme data over to GRCh38 or going the other way and moving the fullgenomes VCF files to GRCh37. I could also try having Kishwar do the DeepVariant calls on both reference genomes for the fullgenomes data.

I’ll need to think a bit about what would be most useful.
