Gas station without pumps

2019 June 18

Comparing fullgenomes.com with Dante Labs

Filed under: Uncategorized — gasstationwithoutpumps @ 23:58

Now that my grading is done, I finally had some time to look at the whole-genome sequencing data I got from fullgenomes.com. I ordered the sequencing on 7 March 2019, shipped the spit kit on 13 March 2019, and got all the data by 3 May 2019.  Their price, $1175, for 30× whole-genome sequencing, is fairly typical of the direct-to-consumer sequencing outfits.  (Dante Labs is much cheaper, but a number of people have been unhappy with the slow or non-delivery of data.)

Here is a summary of what I saw in the fullgenomes snpeff.vep.vcf file:

There are 4906211 genotype sites.
Count by filter values:
	 . 	 4906211
By chromosome, the number of no-call, haploid, and diploid genotypes:
	 chr1 	 0 	 0 	 383857
	 chr2 	 0 	 0 	 390198
	 chr3 	 0 	 0 	 323097
	 chr4 	 0 	 0 	 348597
	 chr5 	 0 	 0 	 286416
	 chr6 	 0 	 0 	 297452
	 chr7 	 0 	 0 	 279633
	 chr8 	 0 	 0 	 240326
	 chr9 	 0 	 0 	 216318
	 chr10 	 0 	 0 	 247472
	 chr11 	 0 	 0 	 233892
	 chr12 	 0 	 0 	 224952
	 chr13 	 0 	 0 	 189629
	 chr14 	 0 	 0 	 150472
	 chr15 	 0 	 0 	 143682
	 chr16 	 0 	 0 	 146914
	 chr17 	 0 	 0 	 136966
	 chr18 	 0 	 0 	 136431
	 chr19 	 0 	 0 	 109775
	 chr20 	 0 	 0 	 125200
	 chr21 	 0 	 0 	 87210
	 chr22 	 0 	 0 	 83531
	 chrX 	 0 	 105204 	 12991
	 chrY 	 0 	 5979 	 0
	 chrM 	 0 	 17 	 0

By type of genotype, there are
	 CT 	 854737
	 AG 	 828712
	 AA 	 416556
	 TT 	 414670
	 CC 	 401031
	 GG 	 400383
	 DI 	 269235
	 AC 	 266527
	 AT 	 245779
	 GT 	 239801
	 CG 	 225762
	 II 	 176278
	 DD 	 46625
	 A 	 26598
	 T 	 26573
	 G 	 23312
	 C 	 23061
	 I 	 11656
	 *T 	 2890
	 *A 	 2522
	 *G 	 1757
	 *C 	 1746

I don’t know what the *T, *A, *G, and *C sites are.  The diploid sites on the X chromosome may be sites that have close homologs on the Y chromosome—close enough that mapping algorithms see them as being the same.
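A summary like the one above can be produced with a short script. The following is a minimal sketch, not the script actually used for these tables: it assumes a single-sample VCF, and it assumes that the I and D labels mean "longer than the shortest allele at the site" and "the shortest allele at the site", which is only a guess at the convention behind the DI, II, and DD rows. The function names and the demo records are made up for illustration.

```python
from collections import Counter, defaultdict

def genotype_label(ref, alts, gt):
    """Label a GT field such as '0/1' as 'CT', 'DI', 'A', ... (None for a no-call)."""
    alleles = [ref] + alts
    shortest = min(len(a) for a in alleles)
    indel_site = any(len(a) != shortest for a in alleles)
    called = []
    for idx in gt.replace("|", "/").split("/"):
        if idx == ".":
            return None
        allele = alleles[int(idx)]
        if indel_site:
            # Assumed convention: longer alleles are "I", the shortest is "D".
            called.append("I" if len(allele) > shortest else "D")
        else:
            called.append(allele)
    return "".join(sorted(called))

def summarize(vcf_lines):
    """Return per-chromosome [no-call, haploid, diploid] counts and genotype-type counts."""
    by_chrom = defaultdict(lambda: [0, 0, 0])
    by_type = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        f = line.rstrip("\n").split("\t")
        chrom, ref, alts = f[0], f[3], f[4].split(",")
        # Find GT by name in the FORMAT column rather than assuming its position.
        gt = f[9].split(":")[f[8].split(":").index("GT")]
        label = genotype_label(ref, alts, gt)
        if label is None:
            by_chrom[chrom][0] += 1
        else:
            by_chrom[chrom][len(label)] += 1  # index 1 = haploid, 2 = diploid
            by_type[label] += 1
    return dict(by_chrom), by_type

# Made-up records, only to exercise the code.
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t100\t.\tC\tT\t50\t.\t.\tGT:DP\t0/1:30",
    "chr1\t200\t.\tA\tAG\t50\t.\t.\tGT:DP\t1/1:28",
    "chr1\t300\t.\tGTT\tG\t50\t.\t.\tGT\t0|1",
    "chrX\t400\t.\tG\tA\t50\t.\t.\tGT\t1",
]
by_chrom, by_type = summarize(demo)
print(by_chrom)
print(by_type)
```

For a real file, the lines could come from `gzip.open(path, "rt")` instead of the demo list.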

What I wanted to do first was to compare the fullgenomes data with the 23andme genotyping, like I had done with the Dante Labs data.  That turned out to be somewhat difficult, as fullgenomes called variants relative to the latest reference genome (GRCh38), while 23andme and Dante Labs both used the older (GRCh37 = hg19) reference genome.  That difference means that all the coordinates are different, so the calls cannot be compared position by position without first converting coordinates.

I have variant calls for the Dante Labs data against both references using Google’s DeepVariant, so I could compare the Dante Labs calls (which I believe are done with the GATK pipeline) with the DeepVariant calls on the same data, and I could compare the fullgenomes calls with the DeepVariant calls on the Dante Labs data.

I can compare the Dante Labs data with the two variant callers, to see how much difference variant calling makes, and then compare the fullgenomes data with the DeepVariant calls on the Dante Labs data, where both the variant caller and the sequencing method differ.
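The comparison itself is simple once each call set is reduced to a table of genotype labels keyed by position. Here is a minimal sketch, assuming that only sites called in both sets are counted as matches or mismatches and that both sets use the same reference coordinates; the function name and the toy data are hypothetical.

```python
def compare_calls(calls_a, calls_b):
    """Count agreement between two call sets.

    Each argument maps (chrom, pos) -> genotype label such as 'CT'.
    Sites called in only one set are tallied separately, not as mismatches.
    """
    shared = calls_a.keys() & calls_b.keys()
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return {
        "matches": matches,
        "mismatches": len(shared) - matches,
        "only_a": len(calls_a) - len(shared),
        "only_b": len(calls_b) - len(shared),
    }

# Toy data, not taken from any of the real call sets.
caller_a = {("chr1", 100): "CT", ("chr1", 200): "AA", ("chrX", 50): "G"}
caller_b = {("chr1", 100): "CT", ("chr1", 200): "AG", ("chr2", 10): "TT"}
result = compare_calls(caller_a, caller_b)
print(result)
```

Keeping the "only in one set" counts separate matters here, because the two callers report very different numbers of sites.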

dantelabs GATK                         |dantelabs-deepvariant
3499617 genotype_sites                 |5884795 genotype_sites                 
  chr   no-call haploid diploid matches|  chr   no-call haploid diploid mismatches
  chrM      0      21       0      0   |  chrM      0       0       0      0   
  chr1      0       0  267787 262884   |  chr1      0       0  446877   4556   
  chr2      0       0  287274 282806   |  chr2      0       0  459298   4145   
  chr3      0       0  240607 237530   |  chr3      0       0  367320   2822   
  chr4      0       0  261070 257783   |  chr4      0       0  400640   2868   
  chr5      0       0  212325 210325   |  chr5      0       0  327319   1827   
  chr6      0       0  204102 201254   |  chr6      0       0  318344   2616   
  chr7      0       0  203647 199384   |  chr7      0       0  350292   3857   
  chr8      0       0  183159 181044   |  chr8      0       0  280030   1902   
  chr9      0       0  148637 143697   |  chr9      0       0  257715   4696   
  chr10     0       0  177729 174246   |  chr10     0       0  295033   3087   
  chr11     0       0  174926 172444   |  chr11     0       0  275753   2256   
  chr12     0       0  165987 163409   |  chr12     0       0  269918   2286   
  chr13     0       0  133928 132587   |  chr13     0       0  199795   1215   
  chr14     0       0  111553 109462   |  chr14     0       0  178071   1979   
  chr15     0       0  103635 101314   |  chr15     0       0  169214   2201   
  chr16     0       0  107064 104448   |  chr16     0       0  194206   2398   
  chr17     0       0   89744  88302   |  chr17     0       0  166728   1300   
  chr18     0       0  100640  99358   |  chr18     0       0  154510   1143   
  chr19     0       0   75073  73488   |  chr19     0       0  146076   1415   
  chr20     0       0   73566  71604   |  chr20     0       0  131282   1844   
  chr21     0       0   52060  50356   |  chr21     0       0  101284   1595   
  chr22     0       0   45153  43796   |  chr22     0       0   84522   1248   
  chrX      0   74099    1801  73134   |  chrX      0   88863   84484   2496   
  chrY      0    1393    2637   1709   |  chrY      0    3098   43760   2020   
  chr1_gl000191_random     0       0       0      0   |  chr1_gl000191_random     0       0      99      0   
  chr1_gl000192_random     0       0       0      0   |  chr1_gl000192_random     0       0     976      0   
  chr4_ctg9_hap1     0       0       0      0   |  chr4_ctg9_hap1     0       0     263      0   
  chr4_gl000193_random     0       0       0      0   |  chr4_gl000193_random     0       0    1887      0   
  chr4_gl000194_random     0       0       0      0   |  chr4_gl000194_random     0       0    2374      0   
  chr6_apd_hap1     0       0       0      0   |  chr6_apd_hap1     0       0      24      0   
  chr6_cox_hap2     0       0       0      0   |  chr6_cox_hap2     0       0     383      0   
  chr6_dbb_hap3     0       0       0      0   |  chr6_dbb_hap3     0       0     284      0   
  chr6_mann_hap4     0       0       0      0   |  chr6_mann_hap4     0       0     798      0   
  chr6_mcf_hap5     0       0       0      0   |  chr6_mcf_hap5     0       0     132      0   
  chr6_qbl_hap6     0       0       0      0   |  chr6_qbl_hap6     0       0     960      0   
  chr6_ssto_hap7     0       0       0      0   |  chr6_ssto_hap7     0       0     812      0   
  chr7_gl000195_random     0       0       0      0   |  chr7_gl000195_random     0       0    3267      0   
  chr8_gl000196_random     0       0       0      0   |  chr8_gl000196_random     0       0      11      0   
  chr8_gl000197_random     0       0       0      0   |  chr8_gl000197_random     0       0       1      0   
  chr9_gl000198_random     0       0       0      0   |  chr9_gl000198_random     0       0    1866      0   
  chr9_gl000199_random     0       0       0      0   |  chr9_gl000199_random     0       0    7465      0   
  chr9_gl000200_random     0       0       0      0   |  chr9_gl000200_random     0       0       1      0   
  chr9_gl000201_random     0       0       0      0   |  chr9_gl000201_random     0       0      12      0   
  chr11_gl000202_random     0       0       0      0   |  chr11_gl000202_random     0       0      78      0   
  chr17_ctg5_hap1     0       0       0      0   |  chr17_ctg5_hap1     0       0     608      0   
  chr17_gl000203_random     0       0       0      0   |  chr17_gl000203_random     0       0     444      0   
  chr17_gl000204_random     0       0       0      0   |  chr17_gl000204_random     0       0      92      0   
  chr17_gl000205_random     0       0       0      0   |  chr17_gl000205_random     0       0    2874      0   
  chr17_gl000206_random     0       0       0      0   |  chr17_gl000206_random     0       0      13      0   
  chr18_gl000207_random     0       0       0      0   |  chr18_gl000207_random     0       0     100      0   
  chr19_gl000208_random     0       0       0      0   |  chr19_gl000208_random     0       0    2216      0   
  chr19_gl000209_random     0       0       0      0   |  chr19_gl000209_random     0       0     371      0   
  chr21_gl000210_random     0       0       0      0   |  chr21_gl000210_random     0       0      21      0   
  chrUn_gl000211     0       0       0      0   |  chrUn_gl000211     0       0    1967      0   
  chrUn_gl000212     0       0       0      0   |  chrUn_gl000212     0       0    1316      0   
  chrUn_gl000213     0       0       0      0   |  chrUn_gl000213     0       0     266      0   
  chrUn_gl000214     0       0       0      0   |  chrUn_gl000214     0       0    2159      0   
  chrUn_gl000215     0       0       0      0   |  chrUn_gl000215     0       0     129      0   
  chrUn_gl000216     0       0       0      0   |  chrUn_gl000216     0       0    8530      0   
  chrUn_gl000217     0       0       0      0   |  chrUn_gl000217     0       0    2138      0   
  chrUn_gl000218     0       0       0      0   |  chrUn_gl000218     0       0    1480      0   
  chrUn_gl000219     0       0       0      0   |  chrUn_gl000219     0       0    5820      0   
  chrUn_gl000220     0       0       0      0   |  chrUn_gl000220     0       0    1730      0   
  chrUn_gl000221     0       0       0      0   |  chrUn_gl000221     0       0     958      0   
  chrUn_gl000222     0       0       0      0   |  chrUn_gl000222     0       0    2193      0   
  chrUn_gl000223     0       0       0      0   |  chrUn_gl000223     0       0      12      0   
  chrUn_gl000224     0       0       0      0   |  chrUn_gl000224     0       0    3738      0   
  chrUn_gl000225     0       0       0      0   |  chrUn_gl000225     0       0   15234      0   
  chrUn_gl000226     0       0       0      0   |  chrUn_gl000226     0       0     257      0   
  chrUn_gl000227     0       0       0      0   |  chrUn_gl000227     0       0      80      0   
  chrUn_gl000228     0       0       0      0   |  chrUn_gl000228     0       0    1299      0   
  chrUn_gl000229     0       0       0      0   |  chrUn_gl000229     0       0    1080      0   
  chrUn_gl000230     0       0       0      0   |  chrUn_gl000230     0       0     409      0   
  chrUn_gl000231     0       0       0      0   |  chrUn_gl000231     0       0    1118      0   
  chrUn_gl000232     0       0       0      0   |  chrUn_gl000232     0       0    2148      0   
  chrUn_gl000233     0       0       0      0   |  chrUn_gl000233     0       0     433      0   
  chrUn_gl000234     0       0       0      0   |  chrUn_gl000234     0       0    2281      0   
  chrUn_gl000235     0       0       0      0   |  chrUn_gl000235     0       0    1224      0   
  chrUn_gl000236     0       0       0      0   |  chrUn_gl000236     0       0     131      0   
  chrUn_gl000237     0       0       0      0   |  chrUn_gl000237     0       0     493      0   
  chrUn_gl000238     0       0       0      0   |  chrUn_gl000238     0       0      19      0   
  chrUn_gl000239     0       0       0      0   |  chrUn_gl000239     0       0      80      0   
  chrUn_gl000240     0       0       0      0   |  chrUn_gl000240     0       0     696      0   
  chrUn_gl000241     0       0       0      0   |  chrUn_gl000241     0       0    1665      0   
  chrUn_gl000242     0       0       0      0   |  chrUn_gl000242     0       0      32      0   
  chrUn_gl000243     0       0       0      0   |  chrUn_gl000243     0       0     328      0   
  chrUn_gl000244     0       0       0      0   |  chrUn_gl000244     0       0      83      0   
  chrUn_gl000245     0       0       0      0   |  chrUn_gl000245     0       0     110      0   
  chrUn_gl000246     0       0       0      0   |  chrUn_gl000246     0       0      73      0   
  chrUn_gl000247     0       0       0      0   |  chrUn_gl000247     0       0     196      0   
  chrUn_gl000248     0       0       0      0   |  chrUn_gl000248     0       0      26      0   

  total     0   75513 3424104 3436364   |  total     0   91961 5792834  57772   

Count of types of genotype
   CT  696904                          |   CT  777273                          
   AG  696669                          |   AG  757560                          
   CC  358678                          |   CC  686562                          
   AA  319335                          |   AA  715299                          
   TT  320081                          |   TT  713424                          
   GG  358891                          |   GG  633290                          
   AC  172678                          |   AC  228469                          
   GT  173342                          |   GT  208315                          
   CG  178718                          |   CG  196041                          
   AT  148808                          |   AT  212648                          
   DD       0                          |   DD  249895                          
   DI       0                          |   DI  237611                          
   II       0                          |   II  176447                          
   T    18873                          |   T    22689                          
   A    18752                          |   A    22586                          
   G    19064                          |   G    20696                          
   C    18824                          |   C    20492                          
   I        0                          |   I     5271                          
   D        0                          |   D      227                          

DeepVariant makes a lot more calls, mainly because it also reports places where it decides that the genotype is homozygous reference (which GATK doesn’t report), but also because the GATK calls were filtered to remove the low-evidence calls, while DeepVariant was set up to report everything.
DeepVariant does a huge number of diploid calls on X and Y, which is a little suspicious.
The ratio of mismatches to matches is 0.01681, about a 1.65% discrepancy rate. I don’t know which of the genome callers is better on this data, but DeepVariant was supposedly better on some recent tests on autosomal chromosomes (I’ve not looked up the paper yet).

Comparing the Dante Labs DeepVariant calls with the fullgenomes calls (on GRCh38) shows a bigger difference:


dantelabs-deepvariant                  |fullgenomes snpeff.vep.vcf.gz
5508932 genotype_sites                 |4906211 genotype_sites                 
  chr   no-call haploid diploid matches|  chr   no-call haploid diploid mismatches
  chr1      0       0  443442 336398   |  chr1      0       0  383857  23983   
  chr2      0       0  434888 356031   |  chr2      0       0  390198  17928   
  chr3      0       0  355613 296731   |  chr3      0       0  323097  12684   
  chr4      0       0  380259 320989   |  chr4      0       0  348597  13450   
  chr5      0       0  316748 262579   |  chr5      0       0  286416  11314   
  chr6      0       0  299986 253300   |  chr6      0       0  297452   8599   
  chr7      0       0  314759 252073   |  chr7      0       0  279633  14099   
  chr8      0       0  259602 217907   |  chr8      0       0  240326   8761   
  chr9      0       0  242579 183556   |  chr9      0       0  216318  19382   
  chr10     0       0  280582 224653   |  chr10     0       0  247472  12447   
  chr11     0       0  256993 214312   |  chr11     0       0  233892   8491   
  chr12     0       0  252220 205678   |  chr12     0       0  224952   8601   
  chr13     0       0  209801 165921   |  chr13     0       0  189629  11023   
  chr14     0       0  161466 132483   |  chr14     0       0  150472   6803   
  chr15     0       0  154740 123209   |  chr15     0       0  143682   8381   
  chr16     0       0  169234 128717   |  chr16     0       0  146914   9197   
  chr17     0       0  158498 111002   |  chr17     0       0  136966  12128   
  chr18     0       0  150941 123100   |  chr18     0       0  136431   5881   
  chr19     0       0  129958  94953   |  chr19     0       0  109775   6398   
  chr20     0       0  147206  95398   |  chr20     0       0  125200  20597   
  chr21     0       0   94029  64558   |  chr21     0       0   87210  14591   
  chr22     0       0   93982  58824   |  chr22     0       0   83531  14887   
  chrX      0   93759   70714  96500   |  chrX      0  105204   12991  14629   
  chrY      0    2818   34115   2134   |  chrY      0    5979       0   3321   
  chrM      0       0       0      0   |  chrM      0      17       0      0   

  total     0   96577 5412355 4321006   |  total     0  111200 4795011 287575   

Count of types of genotype
   CT  761277                          |   CT  854737                          
   AG  741910                          |   AG  828712                          
   AA  636195                          |   AA  416556                          
   TT  635850                          |   TT  414670                          
   CC  609922                          |   CC  401031                          
   GG  557555                          |   GG  400383                          
   DI  234030                          |   DI  269235                          
   AC  224266                          |   AC  266527                          
   AT  209513                          |   AT  245779                          
   GT  204389                          |   GT  239801                          
   CG  192211                          |   CG  225762                          
   II  173913                          |   II  176278                          
   DD  231324                          |   DD   46625                          
   A    23799                          |   A    26598                          
   T    23687                          |   T    26573                          
   G    21821                          |   G    23312                          
   C    21561                          |   C    23061                          
   I     5438                          |   I    11656                          
   *T       0                          |   *T    2890                          
   *A       0                          |   *A    2522                          
   *G       0                          |   *G    1757                          
   *C       0                          |   *C    1746                          
   D      271                          |   D        0                          

Now the ratio of mismatches to matches is 0.06655, a 6.2% discrepancy rate. A few of the discrepancies were haploid/diploid differences on chromosome X, but that is only about 1100 differences. None of the mismatches involve the weird *A, *C, *G, and *T genotype calls.
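The two numbers quoted for each comparison are slightly different quantities: the ratio divides mismatches by matches, while the discrepancy rate divides mismatches by all sites compared. A small sketch using the totals from the two tables (the function name is just for illustration):

```python
def discrepancy(matches, mismatches):
    """Return (mismatch/match ratio, mismatches as a fraction of all compared sites)."""
    ratio = mismatches / matches
    rate = mismatches / (matches + mismatches)
    return ratio, rate

# Totals from the GATK vs. DeepVariant table (both on the Dante Labs data).
gatk_vs_dv = discrepancy(3436364, 57772)
# Totals from the DeepVariant (Dante Labs) vs. fullgenomes table.
dv_vs_fg = discrepancy(4321006, 287575)

for name, (ratio, rate) in [("GATK vs DeepVariant", gatk_vs_dv),
                            ("DeepVariant vs fullgenomes", dv_vs_fg)]:
    print(f"{name}: ratio {ratio:.5f}, discrepancy rate {rate:.2%}")
```

The ratio is always a little larger than the rate, which is why 0.01681 goes with 1.65% and 0.06655 goes with 6.2%.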

I do have 4,321,006 genotype calls that I am now pretty confident of, as they were called by two different variant callers from two different sequencing runs using different sequencing technology.

But I’m not sure which data set or variant caller to favor on the 287,575 disagreements, nor what to do about the locations where one variant caller made a call for a site and the other didn’t. The fullgenomes data includes a gVCF file, which has calls for every base that got reads mapped to it, but I’ve not tried extracting data from that format yet (it’s bad enough having to try to extract data from the two different vcf formats).

I was planning to compare the 23andme data with each of the whole-genome vcf calls, making the assumption that the sequencing and variant caller that agrees most with the hybridization-based genotyping by 23andme would be the most accurate. (I also want to make a revised “23andme” data set that replaces any genotyping calls where both whole-genome sequences agreed with each other, but disagreed with 23andme.)

To make this all work, I need to have all the variant calling be relative to the same reference genome, which means either lifting Dante Labs and 23andme to GRCh38 or reversing that and moving the fullgenomes vcf files to GRCh37. I could also try having Kishwar do the DeepVariant calls on both reference genomes for the fullgenomes data.
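To show what lifting coordinates involves, here is a toy sketch. Real conversions use alignment chain files and tools such as UCSC liftOver; the block table below is entirely made up, and the sketch ignores strand changes and changes in the reference base. It only illustrates the idea that each aligned block shifts positions by its own offset, and that positions falling between blocks have no counterpart in the other assembly.

```python
from bisect import bisect_right

# Hypothetical aligned blocks per chromosome: (old_start, old_end, new_start),
# half-open intervals, sorted by old_start. Not real GRCh37/GRCh38 data.
BLOCKS = {
    "chr1": [(0, 1000, 0), (1000, 5000, 1200), (6000, 9000, 6150)],
}

def lift(chrom, pos, blocks=BLOCKS):
    """Map a position to the new assembly, or return None if it is not in any block."""
    chrom_blocks = blocks.get(chrom, [])
    starts = [b[0] for b in chrom_blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    old_start, old_end, new_start = chrom_blocks[i]
    if pos >= old_end:
        return None  # pos falls in a gap between blocks
    return new_start + (pos - old_start)

print(lift("chr1", 1500))  # inside the second block
print(lift("chr1", 5500))  # in a gap, so no mapping
```

Even with correct coordinates, a lifted VCF still needs care, since the reference allele at a site can differ between assemblies.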

I’ll need to think a bit about what would be most useful.

2019 February 17

Full-genome sequencing pricing

Filed under: Uncategorized — gasstationwithoutpumps @ 12:23

In the comments on Dante Labs is a scam, there has been some discussion on pricing of whole-genome sequencing.  There are a lot of companies out there with different business models, different pricing schemes, and subtly different offerings—all of which is undoubtedly confusing to consumers.  I’ve been trying to collect pricing information for the past year, and I’m still often confused by the offerings.

Consumers buy sequencing for two main purposes: to find out about their ancestry and to find out about the genetic risks to their health.

For ancestry, there is no real need for sequencing—the information from DNA microarrays (as used by companies like 23andme or ancestry.com) is more than sufficient, and those companies have big proprietary databases that allow more precise ancestry information than the public databases accessible to companies that do full sequencing.  The microarray approach is currently far cheaper than sequencing, though the difference is shrinking.

The major, well-documented risk factors for health are also covered by the DNA microarrays, but there are thousands of risk factors being discovered and published every year, and the DNA microarray tests need to be redesigned and rerun on a regular basis to keep up. If whole-genome sequencing is done, almost all of the data needed for analysis is collected at once, and only analysis needs to be redone.  (This is not quite true: long-read sequencing is beginning to provide information about structural rearrangements of the genome that are not visible in the older short-read technologies, and some of these structural rearrangements are clinically significant, though usually only in cancer tumors, not in the germ line.)

For most consumers mildly interested in ancestry and genetic risks, the 23andme $200 package is all they need.  If they are just interested in ancestry, there are even cheaper options ($100 from 23andme or ancestry.com—I have no idea which is better).

My interest in my genome is to try to figure out the genetics of my inherited low heart rate.  It is not a common condition, and it seems to be beneficial rather than harmful (at any rate, my ancestors who had it were mostly long-lived), so the microarrays are not looking for variants that might be responsible.  Whole genome sequencing would give me a much larger pool of variants to examine to try to track down the cause.  To get high probability of seeing every variant, I would need 30× sequencing of my whole genome.  If I thought that the problem was in a protein-coding gene, I could get 100× exome sequencing instead.

The problem with whole-genome sequencing is that everybody has about a million variants, almost all of which are irrelevant to any specific health question.  The variants that have already been studied and well documented are not too hard to deal with, but most of them are already in the DNA microarrays, so whole-genome sequencing doesn’t offer much more on them.  Looking for a rare variant that has not been well studied is much harder—which of the millions of base changes matters?

The popular, and expensive, approach in recent genomics literature is to do genome-wide association studies (GWAS).  These take a large population of people with and without the phenotype of interest, then look for variants that reliably separate the groups.  If there are many possible hypotheses (generally in the thousands or millions), a huge population is needed to separate out the real signal from random noise.  Many of the early GWAS papers were later shown to have bogus results, because the researchers did not have a proper appreciation of how easy it was to fool themselves.

Earlier studies focussed on families, where there is a lot of common genetic background, and each additional person in the study cuts the candidate hypothesis pool almost in half.  To narrow down from a million candidate variants to only one would take a little over 20 closely related people (assuming that the phenotype was caused by just a single variant—always a dangerous assumption).  I can probably get 4 or 5 of my relatives to participate in a study like this, but probably not 20.  I don’t think I want to pay for 20 whole-genome sequencing runs out of my own pocket anyway.

I have some hope of working with a smaller number of samples, though, as there has been an open-access paper on inherited bradycardia implicating about 16 genes.  If I have variants in those genes or their promoters, they are likely to be the interesting variants, even if no one has previously seen or studied the variants.  Of course, the size of the region means I’m likely to have about 80 variants in those regions just by chance, so I’ll still need to have some of my relatives’ genomes to narrow down the possibilities, but 8 or 9 relatives may be enough to get a solid conjecture.  (Proving that the variant is responsible would be more difficult—I’d either need a much larger cohort or someone would have to do genetic experiments in animal models.)
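The halving argument in the last two paragraphs is easy to check numerically. This is a minimal sketch under the idealized assumption that each additional relative keeps a fixed fraction of the remaining candidates; the function name and the `keep_fraction` parameter are mine, and the 0.6 used in the last line is only an illustrative guess at what "almost in half" might mean.

```python
def relatives_needed(candidates, keep_fraction=0.5):
    """Count how many relatives are needed to cut the candidate pool to one or fewer."""
    if not 0 < keep_fraction < 1:
        raise ValueError("keep_fraction must be between 0 and 1")
    remaining = float(candidates)
    count = 0
    while remaining > 1:
        remaining *= keep_fraction  # each relative removes a share of the candidates
        count += 1
    return count

genome_wide = relatives_needed(1_000_000)     # whole-genome candidate pool
candidate_genes = relatives_needed(80)        # variants in the implicated genes
less_ideal = relatives_needed(80, 0.6)        # if each relative removes only 40%
print(genome_wide, candidate_genes, less_ideal)
```

With exact halving, a million candidates take 20 relatives and 80 candidates take 7, so the estimate of 8 or 9 relatives leaves some margin for less-than-perfect halving.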

How expensive is the whole-genome sequencing anyway?  It can be hard to tell, as different labs offer different packages and many require more than the advertised price.

A university research lab like UC Davis will do the DNA library prep and 30× sequencing for about $1000, but not the extraction of the DNA from a spit kit or cheek swabs.  That is a fairly cheap procedure (about $50, I think), but arranging for one lab to do the extraction and ship to another lab increases the complexity of the logistics, to the point where I don’t think I’d ever get around to doing it.  Storing the sequencing results (FASTQ files), doing the mapping of the reads to a reference genome to get BAM files, and calling variants to get VCF files adds to the cost, though cloud-based systems are available that make this reasonably cheap (I think about $50 a year for storage and about $50 for the analysis).  Interpreting the VCF files can be aided by using Promethease for $12 to find relevant entries in SNPedia.

Fullgenomes.com offers packages from $545 to $2900, with an extra $250 for analysis.  The most relevant package for what I want would be the 30× sequencing package for $1295, probably without their $250 analysis, which I suspect is not much more than a consumer-friendly rewrite of the results from Promethease (which can be very hard to read, so most consumers would need the rewrite).  Their pricing is a little weird, as the 15× sequencing is less than half the price of 30×, while the underlying technology should make the 30× cheaper per base.  I’ll have to check on exactly what is included in the $1295 package, as that is looking like the best deal I can find right now.

BGI advertises bulk whole-genome sequencing at low prices for researchers, but never responded to my email (from my university account) trying to get actual prices.  A lot of other companies (like Novogene) also have “request a quote” buttons.  My usual reaction to that is that if you have to ask the price, you can’t afford it.  Secret pricing is almost always ridiculously high pricing, and I prefer not to deal with companies that have secret pricing.

Dante Labs advertises very low prices, but does not deliver results—they seem to be a scam.

Veritas Genetics offers a low price ($999), but that does not include giving you back your data—they want to hang onto it and sell you additional “tests” that cost ridiculously large amounts.  I believe they will sell the VCF file (but not the BAM or FASTQ files it is based on) for an additional fee.

Most of the other companies I’ve seen have 30× whole-genome sequencing priced at over $2000, which is a little out of my price range.

 

2015 April 23

Very long couple of days

Yesterday and today have been draining.

Yesterday, I had three classes each 70 minutes long: banana slug genomics, applied electronics for bioengineers, and a guest lecture for another class on protein structure.  I also had my usual 2 hours of office hours, delayed by half an hour because of the guest lecture.

The banana-slug-genomics class is going well.  My co-instructor (Ed Green) has done most of the organizing and has either arranged guest lectures or taught classes himself. This week and part of next we are getting preliminary reports from the 5 student groups on how the assemblies are coming.  No one has done an assembly yet, but there has been a fair amount of data cleanup and prep work (adapter removal, error correction, and estimates of what kmer sizes will work best in the de Bruijn graphs for assembly).  The data is quite clean, and we have about 23-fold coverage currently, which is just a little low for making good contigs.   (See https://banana-slug.soe.ucsc.edu/data_overview for more info about the data.) Most of the data is from a couple of lanes of HiSeq sequencing (2×100 bp) from 2 libraries (insert sizes around 370 and 600), but some is from an early MiSeq run (2×300 bp), used to confirm that the libraries were good before the HiSeq run.  In class, we decided to seek a NextSeq run (2×250 bp), either with the same libraries or with a new one, so that we could get more data quickly (we can get the data by next week, rather than waiting 2 or 3 weeks for a HiSeq run to piggyback on).  With the new data, we’ll have more than enough shotgun data for making the contigs.  The mate-pair libraries for scaffolding are still not ready (they’ve been failing quality checks and need to be redone), or we would run one of them on the NextSeq run.  We’ll probably also do a transcriptome library (in part to check quality of scaffolding, and in part to annotate the genome), and possibly a small-RNA library (a UCSC special interest).

The applied electronics lecture had a lot to cover, because the material on hysteresis that was not covered on Monday needed to be done before today’s lab, plus I had to show students how to interpret the 74HC14N datasheet for the Schmitt trigger, as we run them at 3.3V, but specs are only given for 2V, 4.5V, and 6V.  I also had to explain how the relaxation oscillator works (see last year’s blog post for the circuit they are using for the capacitance touch sensor).
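As a rough illustration of the two steps (estimating thresholds at 3.3V, then getting the oscillation frequency), here is a minimal sketch. It assumes linear interpolation between the two nearest specified supply voltages, an inverting Schmitt trigger with rail-to-rail output, and a single feedback resistor charging and discharging the capacitor. The threshold numbers and component values are placeholders, not values from the datasheet or from the lab.

```python
import math

def interpolate(v, v0, y0, v1, y1):
    """Linearly interpolate a spec value between two supply voltages."""
    return y0 + (y1 - y0) * (v - v0) / (v1 - v0)

def oscillator_frequency(r, c, vdd, v_low, v_high):
    """Frequency of an inverting Schmitt-trigger RC relaxation oscillator.

    Charging from v_low to v_high (toward vdd) and discharging from v_high
    to v_low (toward 0) are both exponential with time constant r*c.
    """
    t_charge = r * c * math.log((vdd - v_low) / (vdd - v_high))
    t_discharge = r * c * math.log(v_high / v_low)
    return 1.0 / (t_charge + t_discharge)

# Placeholder thresholds at 2V and 4.5V (illustrative only, not datasheet values).
vdd = 3.3
v_high = interpolate(vdd, 2.0, 1.2, 4.5, 2.4)
v_low = interpolate(vdd, 2.0, 0.6, 4.5, 1.4)

# Hypothetical component values: 100 kilohm and 100 pF.
f_untouched = oscillator_frequency(100e3, 100e-12, vdd, v_low, v_high)
f_touched = oscillator_frequency(100e3, 200e-12, vdd, v_low, v_high)
print(f"thresholds: {v_low:.3f} V and {v_high:.3f} V")
print(f"frequency: {f_untouched:.0f} Hz, with doubled capacitance: {f_touched:.0f} Hz")
```

The frequency is inversely proportional to the capacitance, which is what makes the circuit usable as a touch sensor: a finger adds capacitance and the frequency drops.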

Before getting to all the stuff on hysteresis, I had to finish up the data analysis for Tuesday’s lab, showing them how to fit models to the measured magnitude of impedance of the loudspeakers using gnuplot.  The fitting is fairly tricky, as the resistor has to be fit in one part of the curve, the inductor in another, and the RLC parameters for the resonance peak in yet another.  Furthermore, the radius of convergence is pretty small for the RLC parameters, so we had to do a lot of guessing reasonable values and seeing if we got convergence.  (See my post of 2 years ago for models that worked for measurements I made then.)
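For readers who want to see what is being fit, here is a sketch of one plausible form of the model, written in Python rather than gnuplot and only evaluating the model, not fitting it. It assumes a series resistor and inductor plus a parallel RLC section for the resonance peak; that structure is my reading of the description above, and all parameter values are hypothetical.

```python
import math

def speaker_impedance(f, R, L, Rm, Lm, Cm):
    """Magnitude of impedance (ohms) at frequency f > 0 (Hz).

    Assumed model: R and L in series with a parallel Rm-Lm-Cm section
    that produces the resonance peak.
    """
    w = 2 * math.pi * f
    z_resonance = 1 / (1 / Rm + 1 / (1j * w * Lm) + 1j * w * Cm)
    return abs(R + 1j * w * L + z_resonance)

# Hypothetical parameters, chosen only to give a peak in the audio range.
params = dict(R=6.0, L=0.2e-3, Rm=30.0, Lm=20e-3, Cm=300e-6)
f_res = 1 / (2 * math.pi * math.sqrt(params["Lm"] * params["Cm"]))

for f in (10, f_res, 1000, 20000):
    print(f"{f:9.1f} Hz  |Z| = {speaker_impedance(f, **params):7.2f} ohm")
```

The printout shows why the fit has to be done in pieces: at low frequency the curve is dominated by R, near `f_res` by the parallel section, and at high frequency by L.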

After the overstuffed electronics lecture, I had to move to the next classroom over and give a guest lecture on protein structure.  For this lecture I did some stuff on the chalk board, but mostly worked with 3D Darling models. When I did the guest lecture last year, I prepared a bunch of PDB files of protein structures to show the class, but I didn’t have the time or energy for that this year, so decided to do it all with the physical models.  I told students that the Darling models (which are the best kits I’ve seen for studying protein structure) are available for check out at the library, and that I had instructions for building protein chains with the Darling models plus homework in Spring 2011 with suggestions of things to build.  The protein structure lecture went fairly well, but I’m not sure how much students learned from it (as opposed to just being entertained).  The real learning comes from building the models oneself, but I did not have the luxury of making assignments for the course—nor would I have had time to grade them.

Speaking of grading, right after my 2 hours of office hours (full, as usual, with students wanting waivers for requirements that they had somehow neglected to fulfill), I had a stack of prelab assignments to grade for the hysteresis lab.  The results were not very encouraging, so I rewrote a section of my book to try to clarify the points that gave the students the most difficulty, adding in some scaffolding that I had thought would be unnecessary.  I’ve got too many students who can’t read something (like the derivation of the oscillation frequency for a relaxation oscillator on Wikipedia) and apply the same reasoning to their slightly different relaxation oscillator.  All they could do was copy the equations (which did not quite apply).  I put the updated book on the web site at about 11:30 p.m., emailed the students about it, ordered some more inductors for the power-amp lab, made my lunch for today, and crashed.

This morning, I got up around 6:30 a.m. (as I’ve been doing all quarter, though I am emphatically not a morning person), to make a thermos of tea and process my half-day’s backlog of email (I get 50–100 messages a day, many of them needing immediate attention). I cycled up to work in time to open the lab at 10 a.m., then was there supervising students until after 7:30 p.m. I had sort of expected that this time, as I knew that this lab was a long one (see Hysteresis lab too long from last year, and that was when the hysteresis lab was a two-day lab, not just one day).  Still, it made for a very long day.

I probably should be grading redone assignments today (I have a pile that were turned in Monday), but I don’t have the mental energy needed for grading tonight.  Tomorrow will be busy again, as I have banana-slug genomics, a visiting collaborator from UW, the electronics lecture (which needs to be about electrodes, and I’m not an expert on electrochemistry), and the grad research symposium all afternoon. I’ll also be getting another stack of design reports (14 of them, about 5 pages each) for this week’s lab, to fill up my weekend with grading. Plus I need to update a couple more chapters of the book before students get to them.

2012 June 21

Crowdfunding genome project

Filed under: Uncategorized — gasstationwithoutpumps @ 20:37

Manuel Corpas is trying to get the genome of 5 members of his family sequenced, so that he can release the data for public analysis and development of genome analysis tools.

[Crowdfunding Genome Project] Day 2: BGI Officially Agrees Sequencing « Manuel Corpas’ Blog.

Donations Sought For Whole Genome Sequencing: 40 Days To Go!

He previously released the genotyping of the same 5 members of his family, so you know that he is serious about doing a public release of the data.

2012 May 17

Performance of benchtop sequencers

Filed under: Uncategorized — gasstationwithoutpumps @ 18:17

I just read a recent article in Nature Biotechnology about the new small “benchtop” sequencing machines: Performance comparison of benchtop high-throughput sequencing platforms.  The authors compared the sequencers on de novo assembly of a pathogenic E. coli genome.

Unfortunately, since the article is published by Nature Publishing Group, it is hidden behind an expensive paywall ($32 for the article if your library does not subscribe).

The bottom line of the article is well summarized in the abstract, though:

The MiSeq had the highest throughput per run (1.6 Gb/run, 60 Mb/h) and lowest error rates. The 454 GS Junior generated the longest reads (up to 600 bases) and most contiguous assemblies but had the lowest throughput (70 Mb/run, 9 Mb/h). Run in 100-bp mode, the Ion Torrent PGM had the highest throughput (80–100 Mb/h). Unlike the MiSeq, the Ion Torrent PGM and 454 GS Junior both produced homopolymer-associated indel errors (1.5 and 0.38 errors per 100 bases, respectively).

The MiSeq generally came out looking best in most of the measures, because of its low error rate and large amount of data.  The short reads caused some problems in not being able to place some repeats, resulting in somewhat shorter contigs than when two 454 GS Junior runs were used.  The MiSeq was the only one of the instruments run with paired ends (none were run with mate pairs), and there are repeats longer than the read lengths, so none of the assemblies got down to one contig per replicon.

The error rate on the Ion Torrent was very high, though I understand that the company has come out with more improvements since the experiment was done, so the numbers may not be representative of results you would get today.

I look forward to a similar comparison of long-read sequencers later this year, when the Oxford Nanopore machine can be compared to the PacBio machine, and to the benchtop short-read machines tested in this paper.
