Gas station without pumps

2019 March 27

Comparing 23andme and Dante Labs data

Filed under: Uncategorized — gasstationwithoutpumps @ 06:22

I got the grading for the last lab of winter quarter done yesterday (it took me several days longer than I expected, even allowing for an hour per paper; they took me more like 2 hours each). I have to turn in grades today, and I just found out last night that the graders had not finished grading homework 11, so I need to grade that also.

Before I found out that I have unexpectedly even more grading to do, I had taken an hour to write a short Python program to compare my data from 23andme with my data from Dante Labs. The two seem very concordant, and so I now believe I have gotten good data from Dante Labs:

23_and_me                              | vcf from Dante Labs
638531 genotype_sites                  | 3499617 genotype_sites
  chr   no-call haploid diploid matches|   chr   no-call haploid diploid mismatches
  chrM    306    3995       0      0   |   chrM      0      21       0      2   
  chr1   1177       0   48337  11756   |   chr1      0       0  267787     65   
  chr2   1121       0   50654  12303   |   chr2      0       0  287274     59   
  chr3   1013       0   42011  10184   |   chr3      0       0  240607     43   
  chr4   1006       0   38468   9719   |   chr4      0       0  261070     56   
  chr5    885       0   36147   8733   |   chr5      0       0  212325     41   
  chr6    956       0   43067   8505   |   chr6      0       0  204102     53   
  chr7    857       0   33500   8348   |   chr7      0       0  203647     44   
  chr8    626       0   31057   7690   |   chr8      0       0  183159     21   
  chr9    700       0   25746   6431   |   chr9      0       0  148637     65   
  chr10   656       0   29869   7494   |   chr10     0       0  177729     35   
  chr11   605       0   30337   7523   |   chr11     0       0  174926     28   
  chr12   677       0   28755   7366   |   chr12     0       0  165987     26   
  chr13   392       0   21688   5571   |   chr13     0       0  133928     30   
  chr14   472       0   19489   4871   |   chr14     0       0  111553     23   
  chr15   452       0   18554   4757   |   chr15     0       0  103635     16   
  chr16   504       0   19893   5019   |   chr16     0       0  107064     20   
  chr17   510       0   18891   4370   |   chr17     0       0   89744     24   
  chr18   307       0   17368   4591   |   chr18     0       0  100640     12   
  chr19   551       0   14366   3554   |   chr19     0       0   75073     29   
  chr20   295       0   14486   3603   |   chr20     0       0   73566     19   
  chr21   227       0    8380   2261   |   chr21     0       0   52060     18   
  chr22   244       0    8671   2073   |   chr22     0       0   45153     12   
  chrX   1033   14970     527   3663   |   chrX      0   74099    1801     36   
  chrY    506    3226       1    161   |   chrY      0    1393    2637      2   

  total 16078   22191  600262 150546   |   total     0   75513 3424104    779                     

Count of types of genotype
   CT   41086                          |    CT  696904
   AG   40444                          |    AG  696669
   CC  142899                          |    CC  358678
   GG  142411                          |    GG  358891
   TT  104122                          |    TT  320081
   AA  104648                          |    AA  319335
   GT    9760                          |    GT  173342
   AC    9810                          |    AC  172678
   CG     321                          |    CG  178718
   AT     215                          |    AT  148808
   C     5797                          |    C    18824
   G     5495                          |    G    19064
   T     5343                          |    T    18873
   A     5272                          |    A    18752
   --   16078                          |    --       0
   II    3245                          |    II       0
   DD    1259                          |    DD       0
   I      195                          |    I        0
   D       89                          |    D        0
   DI      42                          |    DI       0

There are only 779 sites where both 23andme and Dante Labs call a variant and disagree about what it is. That is a disagreement rate of about 0.5%, which is lower than I would have expected given the differences in the technology and the error rates of DNA chips. I think that 23andme is being fairly conservative and not calling many of the low-quality hybridization reads.
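The program that produced the table is not shown here. As a rough illustration only, a comparison of this kind could be sketched as below. It assumes the usual tab-separated 23andme raw-data layout (rsid, chromosome, position, genotype), a single-sample VCF with the genotype in the GT field of the tenth column, chromosome names of the form chr1 in the VCF, and both files using the same reference build and strand. The function names are made up for the sketch.

```python
BASES = {"A", "C", "G", "T"}


def parse_23andme(lines):
    """Map (chrom, pos) -> genotype with alleles sorted, e.g. 'TC' -> 'CT'."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")[:4]
        # Assumption: the VCF uses chr1..chr22, chrX, chrY, chrM.
        chrom = "chrM" if chrom == "MT" else "chr" + chrom
        calls[(chrom, int(pos))] = "".join(sorted(genotype))
    return calls


def parse_vcf_snps(lines):
    """Map (chrom, pos) -> sorted genotype for single-base variants only."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alleles = [ref] + alt.split(",")
        if any(a not in BASES for a in alleles):
            continue  # skip indels and symbolic alleles
        gt = fields[9].split(":")[0].replace("|", "/")
        try:
            called = [alleles[int(i)] for i in gt.split("/")]
        except (ValueError, IndexError):
            continue  # no-call such as './.'
        calls[(chrom, pos)] = "".join(sorted(called))
    return calls


def compare(chip, vcf):
    """Count how each 23andme site relates to the VCF calls."""
    counts = {"match": 0, "mismatch": 0, "not_in_vcf": 0, "no_call": 0}
    for site, genotype in chip.items():
        if "-" in genotype:
            counts["no_call"] += 1
        elif site not in vcf:
            counts["not_in_vcf"] += 1  # hom-ref or uncovered: cannot tell
        elif vcf[site] == genotype:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts
```

A real comparison would need more care than this, for example with haploid versus diploid calls on the X chromosome and with the insertion and deletion codes in the 23andme file.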

The biggest difference seems to be that Dante Labs does not cover the mitochondrion—the very small number of variant calls there could be mismapping of reads from homologous regions of the nuclear genome. Of course, 23andme does extremely thorough coverage of the mitochondrion, in order to get as much maternal haplotype data as feasible. If you are looking for maternal ancestry information or mitochondrial variants related to disease, the Dante Labs whole-genome sequencing is not the way to go.

The 23andme data also has a lot of coverage of the Y chromosome, in an attempt to get as much paternal haplotype information as possible, but the VCF file has few calls on the Y chromosome, and many of them are diploid calls, probably from homology to the X chromosome (the 23andme sites appear to be carefully chosen to avoid the homologous regions of the X and Y chromosomes, which may or may not be reasonable, depending on what is going on in those regions). Again, if you are mainly interested in ancestry information, the Dante Labs whole-genome sequencing is probably not the way to go.

The Dante Labs VCF file does not include deletion and insertion genotypes (the I and D codes in the 23andme data), but I think that the full data Dante Labs sent me on disk may have that information in a different VCF file. It may be a while before I have time to examine that more detailed data.

There are about 5.5 times as many SNPs in the VCF file as in the 23andme file, but only about a quarter of the 23andme sites are matched by the Dante Labs variants—the rest may be places where I am homozygous for the reference allele, which the VCF file does not report, or they may be places where Dante Labs had insufficient coverage to do a variant call. It will take a lot more work for me to analyze the Dante Labs data to figure out which is correct. The 23andme genotype data has a lot more homozygous calls than heterozygous ones, so I suspect that the bulk of the differences will be just that I am homozygous for the reference allele.
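The ratios quoted above can be recomputed from the table. The short sketch below does that arithmetic; it assumes that the total of the "matches" column counts the 23andme sites that also appear in the VCF file, and the variable names are only illustrative.

```python
# Numbers copied from the table above.
chip_sites = 638_531     # 23andme genotype sites
vcf_sites = 3_499_617    # Dante Labs VCF genotype sites
shared_sites = 150_546   # "matches" total (assumed: sites present in both files)
mismatches = 779

# 23andme diploid SNP genotype counts (indel codes II, DD, DI left out).
chip_diploid = {
    "CT": 41086, "AG": 40444, "CC": 142899, "GG": 142411, "TT": 104122,
    "AA": 104648, "GT": 9760, "AC": 9810, "CG": 321, "AT": 215,
}

homozygous = sum(n for g, n in chip_diploid.items() if g[0] == g[1])
heterozygous = sum(n for g, n in chip_diploid.items() if g[0] != g[1])

vcf_to_chip_ratio = vcf_sites / chip_sites
shared_fraction = shared_sites / chip_sites
mismatch_rate = mismatches / shared_sites
homozygous_fraction = homozygous / (homozygous + heterozygous)

print(f"VCF sites per 23andme site:      {vcf_to_chip_ratio:.2f}")
print(f"23andme sites found in the VCF:  {shared_fraction:.1%}")
print(f"disagreement among shared sites: {mismatch_rate:.2%}")
print(f"homozygous share of 23andme SNP calls: {homozygous_fraction:.1%}")
```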

The most common SNP variants in the Dante Labs VCF file are CT (or the equivalent on the other strand AG), which is to be expected, as C⇒T conversion is common in DNA, because of C⇒U deamination and subsequent treatment of U as T in replication.
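The strand equivalence can be made explicit by pooling each heterozygous genotype with its complement on the other strand. The sketch below does that for the Dante Labs heterozygous counts from the table; the helper name `strand_neutral` is made up for this example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def strand_neutral(genotype):
    """Return one canonical label for a genotype and its other-strand twin."""
    same = "".join(sorted(genotype))
    other = "".join(sorted(genotype.translate(COMPLEMENT)))
    return min(same, other)


# Dante Labs heterozygous genotype counts, copied from the table above.
vcf_het = {"CT": 696904, "AG": 696669, "GT": 173342,
           "AC": 172678, "CG": 178718, "AT": 148808}

pooled = Counter()
for genotype, count in vcf_het.items():
    pooled[strand_neutral(genotype)] += count

for label, count in pooled.most_common():
    print(label, count)
```

CG and AT are their own complements, so they stay as single classes, while CT pools with AG and GT pools with AC.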

The Dante Labs data shows a much higher proportion of CG and AT variants than the 23andme data, and I don’t know how to interpret that. Perhaps when I get the fullgenomes data, which uses a different sequencing technology, I’ll be able to compare VCF files and see if there is a technology effect here.

I clearly have a lot more work to do to interpret the data, but this preliminary look convinces me that I have good data from Dante Labs.

I retract my former claim that Dante Labs is a scam with apologies to them—it appears that they just had very bad delivery times and poor customer service. If they are now delivering data, they may actually be a good deal, as their prices are much lower than other whole-genome sequencing services. (Of course, it is still possible that they are only delivering data to a fraction of their customers, but I have no information about that—only that the data they eventually sent me seems to be good.)

2019 March 14

Another spit kit sent

Filed under: Uncategorized — gasstationwithoutpumps @ 00:09

I sent in another spit kit 2019 Mar 13—this one to fullgenomes.com.  I had ordered from them shortly before Dante Labs sent me email saying that my data had been ready for months—Dante Labs had just neglected to tell me how to get the data.  Dante Labs also says that they are finally sending me the raw data I requested (not just a VCF file). Getting the raw data is good, as I can run it through the best mapping and variant-calling software pipelines, and check those calls against the ones made by the lab.

When I get the fullgenomes data, I’ll compare it with the Dante Labs data—I expect some differences, as they use different sequencing platforms (Illumina for fullgenomes, BGI for Dante Labs).

I’ve done a minimal comparison of the Dante Labs VCF file with the 23andme data—just looking at the top hits in the Promethease analysis of each.  They seem to be saying the same thing.  When I get some time, I’ll write a little script that goes through the VCF file looking for the genotype at each site in the 23andme data, to see how many discrepancies there are.

One problem with VCF files is that they only report the differences from the reference genome—there is no way to distinguish “not covered by enough reads” from “homozygous for the reference allele”.  The gVCF format attempts to fix that, but at the expense of an enormous increase in file size.

I think that there might be a use for a new format that provides terse genotype information (in a format like that used by 23andme) for every location in SNPedia (110,026 currently) or for every one in dbSNP (113,862,023 validated clusters in build 150).  Doing 114M locations in the format used by 23andme would take about 3GB, which is smaller than a gVCF file, but much larger than the variant-only VCF files (about 175MB).  The 23andme format is much less informative than the VCF file, as it just has the genotype call, with no information about how reliable the call is, so I’m not really sure it would be worthwhile to create a 3GB file in such a format.
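The 3GB figure can be checked with back-of-the-envelope arithmetic. The sketch below multiplies the dbSNP count quoted above by the length of one line in the 23andme layout; the example line is hypothetical (a made-up rsid and position of roughly typical length), so the result is only an order-of-magnitude estimate.

```python
# One line in the 23andme layout: rsid, chromosome, position, genotype.
# The values are invented; only the length of the line matters here.
example_line = "rs123456789\t12\t123456789\tAG\n"

dbsnp_sites = 113_862_023      # validated clusters in build 150, as quoted above
bytes_per_line = len(example_line.encode("ascii"))

estimated_bytes = dbsnp_sites * bytes_per_line
print(f"{bytes_per_line} bytes per line, about {estimated_bytes / 1e9:.1f} GB in total")
```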

It will be a few weeks before I do anything interesting with the genome data, as I have about 53 hours of grading still to do over the next week and a half.

2019 February 17

Full-genome sequencing pricing

Filed under: Uncategorized — gasstationwithoutpumps @ 12:23

In the comments on Dante Labs is a scam, there has been some discussion on pricing of whole-genome sequencing.  There are a lot of companies out there with different business models, different pricing schemes, and subtly different offerings—all of which is undoubtedly confusing to consumers.  I’ve been trying to collect pricing information for the past year, and I’m still often confused by the offerings.

Consumers buy sequencing for two main purposes: to find out about their ancestry and to find out about the genetic risks to their health.

For ancestry, there is no real need for sequencing—the information from DNA microarrays (as used by companies like 23andme or ancestry.com) is more than sufficient, and those companies have big proprietary databases that allow more precise ancestry information than the public databases accessible to companies that do full sequencing.  The microarray approach is currently far cheaper than sequencing, though the difference is shrinking.

The major, well-documented risk factors for health are also covered by the DNA microarrays, but there are thousands of risk factors being discovered and published every year, and the DNA microarray tests need to be redesigned and rerun on a regular basis to keep up. If whole-genome sequencing is done, almost all of the data needed for analysis is collected at once, and only analysis needs to be redone.  (This is not quite true: long-read sequencing is beginning to provide information about structural rearrangements of the genome that are not visible in the older short-read technologies, and some of these structural rearrangements are clinically significant, though usually only in cancer tumors, not in the germ line.)

For most consumers mildly interested in ancestry and genetic risks, the 23andme $200 package is all they need.  If they are just interested in ancestry, there are even cheaper options ($100 from 23andme or ancestry.com—I have no idea which is better).

My interest in my genome is to try to figure out the genetics of my inherited low heart rate.  It is not a common condition, and it seems to be beneficial rather than harmful (at any rate, my ancestors who had it were mostly long-lived), so the microarrays are not looking for variants that might be responsible.  Whole genome sequencing would give me a much larger pool of variants to examine to try to track down the cause.  To get high probability of seeing every variant, I would need 30× sequencing of my whole genome.  If I thought that the problem was in a protein-coding gene, I could get 100× exome sequencing instead.
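One way to see why average depth matters is an idealized model in which the number of reads covering a site is Poisson-distributed around the mean depth. That model is an assumption made for this sketch (real coverage is less even than that), and the threshold of 10 reads is an arbitrary illustrative choice, not a figure from any particular variant caller.

```python
import math


def prob_depth_at_least(mean_depth, min_reads):
    """P(depth >= min_reads) when depth ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(min_reads)
    )
    return 1.0 - below


for depth in (15, 30):
    p = prob_depth_at_least(depth, min_reads=10)
    print(f"{depth}x average depth: P(at least 10 reads at a site) = {p:.6f}")
```

Under this model, the fraction of sites with too few reads to call shrinks very quickly as the average depth rises, which is the usual argument for 30× over 15×.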

The problem with whole-genome sequencing is that everybody has about a million variants, almost all of which are irrelevant to any specific health question.  The variants that have already been studied and well documented are not too hard to deal with, but most of them are already in the DNA microarrays, so whole-genome sequencing doesn’t offer much more on them.  Looking for a rare variant that has not been well studied is much harder—which of the millions of base changes matters?

The popular, and expensive, approach in recent genomics literature is to do genome-wide association studies (GWAS).  These take a large population of people with and without the phenotype of interest, then look for variants that reliably separate the groups.  If there are many possible hypotheses (generally in the thousands or millions), a huge population is needed to separate out the real signal from random noise.  Many of the early GWAS papers were later shown to have bogus results, because the researchers did not have a proper appreciation of how easy it was to fool themselves.

Earlier studies focussed on families, where there is a lot of common genetic background, and each additional person in the study cuts the candidate hypothesis pool almost in half.  To narrow down from a million candidate variants to only one would take a little over 20 closely related people (assuming that the phenotype was caused by just a single variant—always a dangerous assumption).  I can probably get 4 or 5 of my relatives to participate in a study like this, but probably not 20.  I don’t think I want to pay for 20 whole-genome sequencing runs out of my own pocket anyway.

I have some hope of working with a smaller number of samples, though, as there has been an open-access paper on inherited bradycardia implicating about 16 genes.  If I have variants in those genes or their promoters, they are likely to be the interesting variants, even if no one has previously seen or studied the variants.  Of course, the size of the region means I’m likely to have about 80 variants in those regions just by chance, so I’ll still need to have some of my relatives’ genomes to narrow down the possibilities, but 8 or 9 relatives may be enough to get a solid conjecture.  (Proving that the variant is responsible would be more difficult—I’d either need a much larger cohort or someone would have to do genetic experiments in animal models.)
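The halving argument in the two paragraphs above is just a logarithm. The sketch below computes how many relatives are needed if each one keeps a fixed fraction of the remaining candidates; the function name and the 0.6 figure (standing in for "almost in half") are assumptions made for illustration.

```python
import math


def relatives_needed(candidates, fraction_kept=0.5, target=1):
    """Relatives needed to shrink `candidates` variants down to `target`,
    if each relative keeps `fraction_kept` of the remaining candidates."""
    if candidates <= target:
        return 0
    return math.ceil(math.log(target / candidates) / math.log(fraction_kept))


print(relatives_needed(1_000_000))              # whole genome, exact halving
print(relatives_needed(80))                     # ~80 variants near the 16 genes
print(relatives_needed(80, fraction_kept=0.6))  # less than perfect halving
```

With exact halving, 80 candidates need 7 relatives; allowing for imperfect halving pushes that to about 9, in line with the estimate of 8 or 9 above.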

How expensive is the whole-genome sequencing anyway?  It can be hard to tell, as different labs offer different packages and many require more than the advertised price.

A university research lab like UC Davis will do the DNA library prep and 30× sequencing for about $1000, but not the extraction of the DNA from a spit kit or cheek swabs.  That is a fairly cheap procedure (about $50, I think), but arranging for one lab to do the extraction and ship to another lab increases the complexity of the logistics, to the point where I don’t think I’d ever get around to doing it.  Storing the sequencing results (FASTQ files), doing the mapping of the reads to a reference genome to get BAM files, and calling variants to get VCF files adds to the cost, though cloud-based systems are available that make this reasonably cheap (I think about $50 a year for storage and about $50 for the analysis).  Interpreting the VCF files can be aided by using Promethease for $12 to find relevant entries in SNPedia.

Fullgenomes.com offers packages from $545 to $2900, with an extra $250 for analysis.  The most relevant package for what I want would be the 30× sequencing package for $1295, probably without their $250 analysis, which I suspect is not much more than a consumer-friendly rewrite of the results from Promethease (which can be very hard to read, so most consumers would need the rewrite).  Their pricing is a little weird, as the 15× sequencing is less than half the price of 30×, while the underlying technology should make the 30× cheaper per base.  I’ll have to check on exactly what is included in the $1295 package, as that is looking like the best deal I can find right now.

BGI advertises bulk whole-genome sequencing at low prices for researchers, but never responded to my email (from my university account) trying to get actual prices.  A lot of other companies (like Novogene) also have “request a quote” buttons.  My usual reaction to that is that if you have to ask the price, you can’t afford it.  Secret pricing is almost always ridiculously high pricing, and I prefer not to deal with companies that have secret pricing.

Dante Labs advertises very low prices, but does not deliver results—they seem to be a scam.

Veritas Genetics offers a low price ($999), but that does not include giving you back your data—they want to hang onto it and sell you additional “tests” that cost ridiculously large amounts.  I believe they will sell the VCF file (but not the BAM or FASTQ files it is based on) for an additional fee.

Most of the other companies I’ve seen have 30× whole-genome sequencing priced at over $2000, which is a little out of my price range.

 

2018 December 26

Dante Labs is probably not a scam

Filed under: Uncategorized — gasstationwithoutpumps @ 15:51

As readers may remember from Spit kit sent, I spent $500 in March 2018 in an attempt to get my genome sequenced through Dante Labs. I worried at the time that their price was too low for current sequencing capabilities, but I thought that I’d give them a try.

I am now convinced that Dante Labs is indeed a scam. (Update, 2019 March 15: I am no longer convinced one way or the other.  They have, finally, sent me the promised data.)

On 2018 August 31, they sent me an e-mail:

Your Raw Data are ready!

Dear Customer, 

We’re excited to let you know that we completed the sequencing of your sample. 

Over a month later, I asked when I would get the data they claimed to have.  On 2018 Oct 9, they sent me an e-mail:

Dear Kevin,
we apologize for this delay.
Your hard disk is among the next few to be shipped.
Again, we are sorry for this inconvenience.
Best regards,
Paul
Dante Labs

It is now four months since they claimed to have data for me, and almost three months since they claimed that mine was “among the next few to be shipped”.

It is clear now that they are incapable of delivering the whole genome sequence that they promised.  I suspect that they are doing a variant of a Ponzi scheme, where they use the money of new customers to pay for sequencing of a few select early customers, so that they don’t have all their customers complaining about non-delivery.

For my part, I’ve given up on ever expecting anything from them. Now I have to decide if it is worthwhile to report them to the Better Business Bureau and appropriate district attorneys.  I can, at least, warn all my readers not to do business with them.

I’ll also have to look for a different, more reputable business to get my whole-genome sequencing done.  While I’m looking, I wonder whether it would be worth the $200 for a SNP panel from 23andme. That would not answer the most interesting question for me (the genetic cause of our family’s inherited bradycardia), but it might provide some data of interest.

If anyone has suggestions for whole-genome sequencing companies I should check, please let me know.

UPDATE: 2019 March 8.  Dante Labs informed me today that my data “has been ready for a while” and there is indeed a VCF report for me on their web site (with a 2018 August 11 date on it, though I don’t believe it was there then).   They had promised me the raw data, which they never sent.

I will check the VCF file to see whether it is consistent with 23andme data for me, and see whether Promethease can process the data (there is mention on the Promethease website that they have had trouble with Dante Labs data, though no reasons are given).  I have also ordered another whole-genome sequencing from fullgenomes.com, which I will also check for consistency.

If the VCF file for Dante Labs turns out to be correct and usable, I will remove the accusation of “scam” and just say that their customer service is terrible.

UPDATE: 2019 Mar 15.  The hard disk with the FASTQ and alignment files (the raw data Dante Labs promised me last year) arrived yesterday.  After I catch up with my grading (only 51 more hours to go: 10 hours more grading on Lab 5, then 41 hours of grading on Lab 6, which comes in on Monday), I’ll be consulting with the experts at UCSC about the best way to re-analyze the data.

Promethease had no trouble processing the VCF file from Dante Labs, but their analysis assumed that any variant not mentioned was not tested, while most were sequenced but homozygous for the reference allele.  There needs to be a better format for communicating genotypes!

The preliminary results from the Promethease analysis are compatible with the 23andme data, but I’ll need to write a Python program to compare the 23andme genotype with the VCF file to see how much they differ.  This may be a bit tricky, as I’ll probably first have to create a 23andme reference, so that I can guess what the genotype is where the VCF has no variant reported or remove from the comparison any places where my 23andme genotype matches the reference.  Again, this will have to wait until my grading is done.
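The second option mentioned above (dropping sites where the 23andme genotype matches the reference) could be sketched roughly as follows. The sketch assumes that a lookup from site to reference base is already available from somewhere, which is the hard part; `reference_bases` and both function names are hypothetical.

```python
BASES = set("ACGT")


def expected_in_vcf(genotype, ref_base):
    """True if a called SNP genotype has at least one non-reference allele,
    so that a variant-only VCF file should list the site."""
    if not genotype or not set(genotype) <= BASES:
        return False  # no-calls ('--') and indel codes (I, D) are skipped
    return any(allele != ref_base for allele in genotype)


def comparable_sites(chip_calls, reference_bases):
    """Keep only the 23andme sites that a variant-only VCF should report."""
    return {
        site: genotype
        for site, genotype in chip_calls.items()
        if site in reference_bases
        and expected_in_vcf(genotype, reference_bases[site])
    }
```

Any site kept by this filter but missing from the VCF file would then be a real discrepancy (or a coverage gap) rather than an unreported homozygous-reference call.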

Right now, I am cautiously optimistic that Dante Labs is not a scam—though their delays in delivering data are not real encouraging.   If they really deliver whole-genome data at scale at the prices they are currently charging, I’ll be impressed.  If they are only delivering to a few selected customers, then not so impressed.

If the data checks out ok, I will have some of my relatives send samples to Dante Labs for sequencing; the price is so much lower that it is worth some risk.

2018 April 13

Spit kit sent

Filed under: Uncategorized — gasstationwithoutpumps @ 09:22

I sent in my spit kit (formally “ORAgene⋅Dx For collection of human DNA”) to Dante Labs for whole-genome sequencing (WGS) for myself yesterday, as the next step after ordering the sequencing as mentioned in Personal genome sequencing.

I expect that it will take about 3 months before I have any data from them, as they are not paying for quick turnaround from the sequencing labs, and even quick turnaround would be a couple of weeks with the low-cost sequencing methods.   Sometime this summer I’ll either be posting some (limited) information about the results or complaining about a scam—we’ll have to wait and see.
