Gas station without pumps

2019 March 14

Another spit kit sent

Filed under: Uncategorized — gasstationwithoutpumps @ 00:09
I sent in another spit kit 2019 Mar 13—this one to fullgenomes.com.  I had ordered from them shortly before Dante Labs sent me email saying that my data had been ready for months—Dante Labs had just neglected to tell me how to get the data.  Dante Labs also says that they are finally sending me the raw data I requested (not just a VCF file). Getting the raw data is good, as I can run it through the best mapping and variant-calling software pipelines, and check those calls against the ones made by the lab.

When I get the fullgenomes data, I’ll compare it with the Dante Labs data—I expect some differences, as they use different sequencing platforms (Illumina for fullgenomes, BGI for Dante Labs).

I’ve done a minimal comparison of the Dante Labs VCF file with the 23andme data—just looking at the top hits in the Promethease analysis of each.  They seem to be saying the same thing.  When I get some time, I’ll write a little script that goes through the VCF file looking for the genotype at each site in the 23andme data, to see how many discrepancies there are.
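A minimal sketch of what such a script might look like. It assumes that both files use the same reference build and strand, that the VCF has a single sample column with a GT field, and that matching on chromosome and position is good enough. The function names are made up for illustration, and only single-base calls are compared. Sites that are in the 23andme data but missing from the VCF are tallied separately, since a missing site could be either homozygous reference or simply not called.

```python
import gzip
from collections import Counter

BASES = set("ACGT")

def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

def norm_chrom(chrom):
    """Strip a leading 'chr' and map 'M' to 'MT' so both files use the same names."""
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return "MT" if chrom == "M" else chrom

def read_23andme(lines):
    """Map (chrom, pos) -> sorted genotype string, skipping comments, no-calls, and indels."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue
        _rsid, chrom, pos, genotype = fields[:4]
        if not genotype or not set(genotype) <= BASES:
            continue  # "--" no-calls and I/D indel markers
        calls[(norm_chrom(chrom), int(pos))] = "".join(sorted(genotype))
    return calls

def read_vcf(lines):
    """Map (chrom, pos) -> sorted genotype string for single-base records."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 10:
            continue
        alleles = [fields[3]] + fields[4].split(",")  # REF first, then each ALT
        if any(a not in BASES for a in alleles):
            continue  # skip indels and symbolic alleles
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt = fields[9].split(":")[keys.index("GT")].replace("|", "/").split("/")
        if not all(i.isdigit() for i in gt):
            continue  # "." means no call
        key = (norm_chrom(fields[0]), int(fields[1]))
        calls[key] = "".join(sorted(alleles[int(i)] for i in gt))
    return calls

def same_call(a, b):
    """Equal genotypes, or a haploid call that agrees with a homozygous diploid one."""
    return a == b or (len(a) != len(b) and set(a) == set(b))

def compare(chip, vcf):
    """Tally matches, mismatches, and chip sites that the VCF does not mention."""
    tally = Counter()
    for key, genotype in chip.items():
        if key not in vcf:
            tally["absent_from_vcf"] += 1
        elif same_call(genotype, vcf[key]):
            tally["match"] += 1
        else:
            tally["mismatch"] += 1
    return tally
```

To run it on real files, pass `open_text(path)` for each file to `read_23andme` and `read_vcf`, then print the result of `compare`.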

One problem with variant-only VCF files is that they report only the differences from the reference genome, so there is no way to distinguish “not covered by enough reads” from “homozygous for the reference allele”.  The gVCF format attempts to fix that, but at the expense of an enormous increase in file size.

I think that there might be a use for a new format that provides terse genotype information (in a format like that used by 23andme) for every location in SNPedia (110,026 currently) or for every one in dbSNP (113,862,023 validated clusters in build 150).  Doing 114M locations in the format used by 23andme would take about 3GB, which is smaller than a gVCF file, but much larger than the variant-only VCF files (about 175MB).  The 23andme format is much less informative than the VCF file, as it just has the genotype call, with no information about how reliable the call is, so I’m not really sure it would be worthwhile to create a 3GB file in such a format.
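The 3GB figure can be sanity-checked with a quick back-of-the-envelope calculation. The sketch below assumes one tab-separated line per site (rsid, chromosome, position, genotype), as in the 23andme download; the two example lines are illustrative stand-ins for a short and a long record, not real data.

```python
def line_bytes(rsid, chrom, pos, genotype):
    """Bytes in one tab-separated, newline-terminated genotype line."""
    return len(f"{rsid}\t{chrom}\t{pos}\t{genotype}\n".encode("ascii"))

# Illustrative short and long records (hypothetical values).
short_line = line_bytes("rs4477212", "1", 82154, "AA")
long_line = line_bytes("rs123456789", "12", 123456789, "AG")

n_sites = 113_862_023  # validated dbSNP clusters quoted above

low_gb = n_sites * short_line / 1e9
high_gb = n_sites * long_line / 1e9
print(f"{short_line}-{long_line} bytes per line, "
      f"{low_gb:.1f}-{high_gb:.1f} GB for {n_sites:,} sites")
```

With most rsids and positions near the long end of that range, the total lands close to 3GB.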

It will be a few weeks before I do anything interesting with the genome data, as I still have about 53 hours of grading to do over the next week and a half.
