Gas station without pumps

2019 March 14

Another spit kit sent

Filed under: Uncategorized — gasstationwithoutpumps @ 00:09

I sent in another spit kit on 2019 Mar 13, this one to fullgenomes. I had ordered from them shortly before Dante Labs sent me email saying that my data had been ready for months; Dante Labs had just neglected to tell me how to get the data. Dante Labs also says that they are finally sending me the raw data I requested (not just a VCF file). Getting the raw data is good, as I can run it through the best mapping and variant-calling software pipelines and check those calls against the ones made by the lab.

When I get the fullgenomes data, I’ll compare it with the Dante Labs data—I expect some differences, as they use different sequencing platforms (Illumina for fullgenomes, BGI for Dante Labs).

I’ve done a minimal comparison of the Dante Labs VCF file with the 23andme data—just looking at the top hits in the Promethease analysis of each.  They seem to be saying the same thing.  When I get some time, I’ll write a little script that goes through the VCF file looking for the genotype at each site in the 23andme data, to see how many discrepancies there are.
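A minimal sketch of what such a comparison script might look like, assuming a standard tab-separated 23andme raw-data file and a single-sample VCF (field handling is simplified, and indel and multi-allelic edge cases are glossed over):

```python
import gzip

def read_23andme(path):
    """Yield (chrom, pos, genotype) from a 23andme raw-data file.

    Lines are rsid<TAB>chromosome<TAB>position<TAB>genotype;
    '#' lines are comments, and '--' means no call.
    """
    with open(path) as f:
        for line in f:
            if line.startswith('#'):
                continue
            rsid, chrom, pos, geno = line.rstrip('\n').split('\t')
            yield chrom, int(pos), geno

def read_vcf_genotypes(path):
    """Return {(chrom, pos): sorted allele string} from a single-sample VCF."""
    calls = {}
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt') as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.rstrip('\n').split('\t')
            chrom, pos = fields[0].lstrip('chr'), int(fields[1])
            alleles = [fields[3]] + fields[4].split(',')  # REF + ALTs
            fmt = fields[8].split(':')
            sample = fields[9].split(':')
            gt = sample[fmt.index('GT')].replace('|', '/')
            if '.' in gt:
                continue  # site present but not called
            calls[(chrom, pos)] = ''.join(
                sorted(alleles[int(i)] for i in gt.split('/')))
    return calls

def compare(kit_path, vcf_path):
    """Count agreements and discrepancies at sites present in both files."""
    vcf = read_vcf_genotypes(vcf_path)
    match = mismatch = absent = 0
    for chrom, pos, geno in read_23andme(kit_path):
        if geno == '--':
            continue  # 23andme no-call
        call = vcf.get((chrom, pos))
        if call is None:
            absent += 1  # not in VCF: either hom-ref or not covered
        elif ''.join(sorted(geno)) == call:
            match += 1
        else:
            mismatch += 1
    return match, mismatch, absent
```

Note that sites absent from a variant-only VCF can't be scored at all without consulting the reference genome, which is exactly the VCF limitation discussed below.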

One problem with VCF files is that they only report the differences from the reference genome—there is no way to distinguish “not covered by enough reads” from “homozygous for the reference allele”.  The gVCF format attempts to fix that, but at the expense of an enormous increase in file size.
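To illustrate what the gVCF buys you: non-variant stretches are encoded as reference blocks whose INFO field carries an END= coordinate, so a lookup can distinguish "confidently homozygous reference" from "no usable data". This is my own sketch, not any standard tool, and it assumes a GATK-style gVCF with GT and GQ fields:

```python
def classify_position(gvcf_lines, chrom, pos, min_gq=20):
    """Classify one genome position using gVCF records.

    Reference blocks carry END=... in INFO and a 0/0 genotype; a
    position not covered by any record, or covered only below the
    min_gq quality threshold, counts as 'no data'.
    """
    for line in gvcf_lines:
        if line.startswith('#'):
            continue
        f = line.rstrip('\n').split('\t')
        if f[0] != chrom:
            continue
        start = int(f[1])
        info = dict(kv.split('=') for kv in f[7].split(';') if '=' in kv)
        end = int(info.get('END', start))  # reference blocks span start..END
        if not (start <= pos <= end):
            continue
        fmt = f[8].split(':')
        sample = f[9].split(':')
        gq = int(sample[fmt.index('GQ')]) if 'GQ' in fmt else 0
        if gq < min_gq:
            return 'no data'
        gt = sample[fmt.index('GT')]
        return 'ref-call' if gt in ('0/0', '0|0') else 'variant'
    return 'no data'
```

A plain variant-only VCF has no analogue of the first return path: every position outside its records is simply silent.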

I think that there might be a use for a new format that provides terse genotype information (in a format like that used by 23andme) for every location in SNPedia (110,026 currently) or for every one in dbSNP (113,862,023 validated clusters in build 150).  Doing 114M locations in the format used by 23andme would take about 3GB, which is smaller than a gVCF file, but much larger than the variant-only VCF files (about 175MB).  The 23andme format is much less informative than the VCF file, as it just has the genotype call, with no information about how reliable the call is, so I’m not really sure it would be worthwhile to create a 3GB file in such a format.
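The 3GB figure is easy to check with back-of-the-envelope arithmetic; the field widths below are my own rough assumptions about a 23andme-style tab-separated line:

```python
# Size estimate for a 23andme-style file covering all of dbSNP build 150's
# validated clusters.  Field widths are rough assumptions, not measurements.
N_SITES = 113_862_023          # validated clusters in dbSNP build 150
BYTES_PER_LINE = (
    11   # rsid, e.g. "rs123456789"
    + 2  # chromosome
    + 9  # position (up to 9 digits)
    + 2  # genotype call, e.g. "AG"
    + 4  # three tabs + newline
)
total_gb = N_SITES * BYTES_PER_LINE / 1e9
print(f"about {total_gb:.1f} GB")  # about 3.2 GB
```

Roughly 3GB either way, so the estimate in the text holds up.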

It will be a few weeks before I do anything interesting with the genome data, as I have about 53 hours of grading still to do over the next week and a half.


  1. Hi, so did you compare the data with Dante Labs data? I have gotten 4x results from Dante Labs and am yet to receive my full 30x results, but I wanted to ask you that apart from Promethease, what other resources you used to analyze your data?

    Comment by OphthaDox — 2020 March 11 @ 10:15 | Reply

  2. I’ve done some comparisons, but nothing very informative yet. I had a grad student redo the mapping and variant calling, using a different pipeline, but again, the results were not very informative. I’m also interested in doing haplotyping, as I have my father’s genome also. I might have time to get back to this project over the summer, when I’m not spending all my time grading and fixing the undergraduate curriculum.

    Comment by gasstationwithoutpumps — 2020 March 11 @ 13:45 | Reply

