Gas station without pumps

2018 December 26

Dante Labs is probably not a scam

Filed under: Uncategorized — gasstationwithoutpumps @ 15:51

As readers may remember from Spit kit sent, I spent $500 in March 2018 in an attempt to get my genome sequenced through Dante Labs. I worried at the time that their price was too low for current sequencing capabilities, but I thought that I’d give them a try.

I am now convinced that Dante Labs is indeed a scam. (Update: 2019 March 15—I'm no longer convinced one way or the other.  They have, finally, sent me the promised data.)

On 2018 August 31, they sent me an e-mail:

Your Raw Data are ready!

Dear Customer, 

We’re excited to let you know that we completed the sequencing of your sample. 

Over a month later, I asked when I would get the data they claimed to have.  On 2018 Oct 9, they sent me an e-mail:

Dear Kevin,
we apologize for this delay.
Your hard disk is among the next few to be shipped.
Again, we are sorry for this inconvenience.
Best regards,
Paul
Dante Labs

It is now four months since they claimed to have data for me, and almost three months since they claimed that mine was "among the next few to be shipped".

It is clear now that they are incapable of delivering the whole-genome sequence that they promised.  I suspect that they are doing a variant of a Ponzi scheme, where they use the money of new customers to pay for sequencing of a few select early customers, so that they don't have all their customers complaining about non-delivery.

For my part, I've given up on ever expecting anything from them—now I have to decide if it is worthwhile to report them to the Better Business Bureau and appropriate district attorneys.  I can, at least, warn all my readers not to do business with them.

I'll also have to look for a different, more reputable business to get my whole-genome sequencing done.  While I'm looking, I wonder whether it would be worth the $200 for a SNP panel from 23andme—that would not answer the most interesting question for me (the genetic cause of our family's inherited bradycardia), but it might provide some data of interest.
If anyone has suggestions for whole-genome sequencing companies I should check, please let me know.

UPDATE: 2019 March 8.  Dante Labs informed me today that my data "has been ready for a while" and there is indeed a VCF report for me on their web site (with a 2018 August 11 date on it, though I don't believe it was there then).   They had promised me the raw data, which they never sent.

I will check the VCF file to see whether it is consistent with 23andme data for me, and see whether Promethease can process the data (there is mention on the Promethease website that they have had trouble with Dante Labs data, though no reasons are given).  I have also ordered another whole-genome sequencing from fullgenomes.com, which I will also check for consistency.

If the VCF file for Dante Labs turns out to be correct and usable, I will remove the accusation of “scam” and just say that their customer service is terrible.

UPDATE: 2019 Mar 15.  The hard disk with the FASTQ and alignment files (the raw data Dante Labs promised me last year) arrived yesterday.  After I catch up with my grading (only 51 more hours to go—10 hours more grading on Lab 5, then 41 hours of grading on Lab 6, which comes in on Monday), I'll be consulting with the experts at UCSC about the best way to re-analyze the data.

Promethease had no trouble processing the VCF file from Dante Labs, but their analysis assumed that any variant not mentioned was not tested, when most such positions had actually been sequenced and were homozygous for the reference allele.  There needs to be a better format for communicating genotypes!

The preliminary results from the Promethease analysis are compatible with the 23andme data, but I'll need to write a Python program to compare the 23andme genotype with the VCF file to see how much they differ.  This may be a bit tricky, as I'll probably first have to create a 23andme reference, so that I can guess the genotype where the VCF reports no variant, or remove from the comparison any places where my 23andme genotype matches the reference.  Again, this will have to wait until my grading is done, but a minimal sketch of the comparison is below.
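Something like this is what I have in mind: a minimal sketch, assuming the standard 23andme raw-data format (tab-separated rsid, chromosome, position, genotype) and assuming the Dante Labs VCF carries rsids in its ID column (the file names are placeholders):

```python
# Minimal sketch of the 23andme-vs-VCF comparison.  Assumptions (mine, not
# verified): the 23andme file is tab-separated rsid/chromosome/position/genotype,
# the VCF is single-sample with rsids in its ID column, and the file names
# are placeholders.
import gzip

def read_23andme(path):
    """Return rsid -> genotype string (e.g. 'AG') from a 23andme raw-data file."""
    calls = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#"):
                continue
            rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
            calls[rsid] = genotype
    return calls

def read_vcf(path):
    """Return rsid -> genotype string from a single-sample VCF."""
    calls = {}
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            rsid, ref, alt = fields[2], fields[3], fields[4]
            alleles = [ref] + alt.split(",")
            gt = fields[9].split(":")[fields[8].split(":").index("GT")]
            indices = [i for i in gt.replace("|", "/").split("/") if i != "."]
            calls[rsid] = "".join(alleles[int(i)] for i in indices)
    return calls

chip = read_23andme("genome_23andme.txt")   # placeholder file name
vcf = read_vcf("dantelabs.vcf.gz")          # placeholder file name
shared = set(chip) & set(vcf)
agree = sum(1 for r in shared if sorted(chip[r]) == sorted(vcf[r]))
print(f"{len(shared)} shared rsids, {agree} agree, {len(shared) - agree} differ")
print(f"{len(set(chip) - shared)} chip sites absent from the VCF "
      "(no-call or homozygous reference; can't tell which without a reference)")
```

The absent-from-VCF count in the last line is exactly the ambiguity I complained about above: a VCF that reports only variant sites cannot distinguish homozygous-reference from untested.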

Right now, I am cautiously optimistic that Dante Labs is not a scam—though their delays in delivering data are not really encouraging.   If they really deliver whole-genome data at scale at the prices they are currently charging, I'll be impressed.  If they are only delivering to a few selected customers, I'll be much less impressed.

If the data checks out OK, I will have some of my relatives send samples to Dante Labs for sequencing—the price is so much lower that it is worth some risk.

2015 October 2

What is probability?

Filed under: Uncategorized — gasstationwithoutpumps @ 17:43

Today’s class in Bioinformatics: Models and Algorithms went fairly well.

I started by collecting the first assignment and asking if there were any questions about anything in the class.  I used a somewhat longer wait time than I usually do for that prompt, and was rewarded with a reasonable question from someone who had clearly been hesitant to ask.  I've been finding that "Wait Time" is one of the most powerful techniques in Teach Like a Champion.

The first part of class was just a quick summary of DNA sequencing techniques, with an emphasis on the effect the different technologies had on the sequence data collected (read lengths, number of reads, error models).  Many of the students had already had much more extensive coverage of DNA sequencing elsewhere (there is an undergraduate course about half of which is sequencing technology), and several students were able to volunteer more up-to-date information about library preparation than I have (since they worked directly with the stuff in wet labs).

I reserved the last 15 minutes of the class for a simple question that I asked the students: "What is probability?"

I managed to elicit many concepts related to probability, which I affirmed and talked about briefly, even though they weren’t directly part of the definition of probability.  This included things like “frequency”, “observable data”, and “randomness”. One volunteered concept that I put off for later was “conditional probability”—we need to get a definition of probability before we can deal with conditional probability.

Somewhat surprising this year was that the first concept volunteered was that we needed an "event space".  That is usually the hardest concept to lead students to, so I was surprised that it came up first.  It took a while to get someone to bring up the concept of "number"—that probabilities are numeric.  Someone also came up with the idea "the event space equals 1", which I pressed them to make more precise and rigorous; it quickly became that the sum of the probabilities of events is 1.  I snuck in that probabilities of events meant a function (I usually press the students to come up with the word "function", but we were running low on time), and got them to give me the range of the function.

Someone in the class volunteered that summation only worked for discrete event spaces and that integration was needed for continuous ones (the same person who had brought up event spaces initially—clearly someone who had paid attention in a probability class—possibly as a math major, since people who just regard probability as a tool to apply rarely remember or think about the definitions).

So by the end of the class we had a definition of probability that is good enough for this course (put into symbols just after the list):

  • A function from an event space to [0,1]
  • that sums (or integrates) to 1.
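In symbols (my rendering, not something we wrote on the board), that amounts to:

```latex
P : E \to [0,1], \qquad
\sum_{e \in E} P(e) = 1 \quad \text{(discrete } E\text{)}
\qquad \text{or} \qquad
\int_{E} P(e)\, de = 1 \quad \text{(continuous } E\text{)}
```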

I did not have time to point out that this definition does not inherently have any notion of frequency, observable data, or randomness.  Those all come up in applications of probability, not in the underlying mathematical definition.  One of my main purposes in asking a simple definitional question (material that should have been coming from prerequisite courses) was to get broad student participation, setting the expectation that everyone contributes.  I think I got about 50% of the students saying something in class today, and I’ll try to get the ones who didn’t speak to say something on Monday.  Unfortunately, I only know about 2 names out of the 19 students, and it takes me forever to learn names, so I may have to resort to random cold calling from the class list.

In retrospect, I wish I had spent 5–10 minutes less time on DNA sequencing, so that there was a bit more time to go into probability, but it won’t hurt to review the definition of probability on Monday.

2014 October 22

Banana Slug genome crowd funding

Filed under: Uncategorized — gasstationwithoutpumps @ 21:20

T-shirt design from the first offering of the class. (click for high-res image)

A few years ago, I taught a Banana Slug Genomics course, based on some sequencing done for free as a training exercise for a new technician.  I've mentioned the course occasionally on this blog.

The initial, donated sequencing runs did not produce enough data, or data of high enough quality, to assemble the genome to an annotatable state, though we did get a lot of snippets and a reasonable estimate of the genome size (about 2.3 Gb total and about 1.2 Gb unique, so a lot of repeats).  All the class notes are in a wiki at https://banana-slug.soe.ucsc.edu/ and the genome size estimates are at https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish.

I did manage to assemble the mitochondrion after the class ended (notes at https://banana-slug.soe.ucsc.edu/computer_resources:assemblies:mitochondrion), but I now think I made a serious error in doing the assembly, treating variants due to a heterogeneous mitochondrial population as repeats instead.  The mitochondrion was relatively easy, because it is much shorter than the nuclear genome (probably in the range 23kb to 36kb, depending on whether the repeats are real) and has many more copies in the DNA library, so coverage was high enough to assemble it—the hard part was just selecting the relevant reads out of the sea of nuclear reads.

Ariolimax dolichophallus at UCSC, from larger image at http://commons.wikipedia.org/wiki/File:Banana_slug_at_UCSC.jpg

The banana slug genomics class has not been taught since Spring 2011, because there was no new data, and we’d milked the small amount of sequence data we had for all that we could get for it.  I’ve played with the idea of trying to get more sequence data, but Ariolimax dolichophallus is not the sort of organism that funding agencies love: it isn’t a pathogen, it isn’t a crop, it isn’t an agricultural pest, and it isn’t a popular model organism for studying basic biology. Although it has some cool biology (only capable of moving forward, genital opening on the side of its head, penis as long as its body, sex for up to 24 hours, sometimes will gnaw off penis to separate after sex, …), funding agencies just don’t see why anyone should care about the UCSC mascot.

Obviously, if anyone is ever going to determine the genome of this terrestrial mollusk, it will be UCSC, and the sequencing will be done because it is a cool thing to do, not for monetary gain.  Of course, there is a lot of teaching value in having new data on an organism that is not closely related to any of the already sequenced organisms—the students will have to do almost everything from scratch, for real, as there is no back-of-the-book to look up answers in.

At one point I considered asking alumni for donations to fund more sequence data, but our dean at the time didn't like the idea (or perhaps the course) and squelched the plan, not allowing us to send any requests to alumni. When the University started getting interested in crowd funding, I put out tentative feelers to development about getting the project going, but the development people I talked with all left the University, so the project fizzled.  I had a full teaching load, so I did not push to add starting a crowd-funding campaign (and teaching a course based on it) to my workload.

This fall, seemingly out of nowhere (but perhaps prompted by the DNA Day celebrations last spring or by the upcoming 50-year anniversary of UCSC), I was asked what it would take to actually get a complete draft genome of the slug—someone else was interested in pushing it forward!  I talked with other faculty, and we decided that we could make some progress for about $5k–10k, and that for $20k in sequencing we could probably create a draft genome with most of the genes annotated.  This is a lot cheaper than 5 years ago, when we did the first banana slug sequencing.

Although the top tentacles of the banana slug are called eyestalks and are light sensing, they do not have vertebrate-style eyes as shown in this cartoon. Nor do they stick out quite that much.

And now there is a crowd funding campaign at http://proj.at/1rqVNj8 to raise $20k to do the project right!  They even put together this silly video to advertise the project:

Nader Pourmand will supervise students building the DNA library for sequencing during the winter, and Ed Green and I will teach the grad students in the spring how to assemble and annotate the genome.  Ed has much more experience at that than I have, having worked with Neanderthal, Denisovan, polar bear, alligator, and other eukaryotic genomes, while I've only worked on tiny prokaryotic ones. (He's also more famous and more photogenic, which is why he is in the advertising video.) We're both taking on this class as overload this year (it will make my 6th course, in addition to my over-300-student advising load and administrative jobs), because we really like the project. Assuming that we get good data and can assemble the slug genome into big enough pieces to find genes, we'll put up a genome browser for the slug.

I'm hoping that this time the class can do a better job of the Wiki, so that it is easier to find things on it and there is more background information.  I'd like the site to be a comprehensive overview of banana-slug facts and research, as well as a detailed lab notebook of the process we follow in constructing the genome.

Everyone, watch the video, visit the crowd funding site, read the info there (and as much of the Wiki as you can stomach), and tell your friends about the banana-slug-sequencing effort.  (Oh, and if you feel like donating, we’ll put the money to very good use.)

Update 30 Oct 2014: UCSC has put out a press release about the project.

Update 31 Oct 2014: It looks like they’ve made a better URL for the crowd-funding project: http://crowdfund.ucsc.edu/sluggenome

2012 May 17

Performance of benchtop sequencers

Filed under: Uncategorized — gasstationwithoutpumps @ 18:17

I just read a recent article in Nature Biotechnology about the new small “benchtop” sequencing machines: Performance comparison of benchtop high-throughput sequencing platforms.  The authors compared the sequencers on de novo assembly of a pathogenic E. coli genome.

Unfortunately, since the article is published by Nature Publishing Group, it is hidden behind an expensive paywall ($32 for the article if your library does not subscribe).

The bottom line of the article is well summarized in the abstract, though:

The MiSeq had the highest throughput per run (1.6 Gb/run, 60 Mb/h) and lowest error rates. The 454 GS Junior generated the longest reads (up to 600 bases) and most contiguous assemblies but had the lowest throughput (70 Mb/run, 9 Mb/h). Run in 100-bp mode, the Ion Torrent PGM had the highest throughput (80–100 Mb/h). Unlike the MiSeq, the Ion Torrent PGM and 454 GS Junior both produced homopolymer-associated indel errors (1.5 and 0.38 errors per 100 bases, respectively).

The MiSeq came out looking best on most of the measures, because of its low error rate and large amount of data.  Its short reads caused some problems in placing repeats, resulting in somewhat shorter contigs than when two 454 GS Junior runs were used.  The MiSeq was the only one of the instruments run with paired ends (none were run with mate pairs), and there are repeats longer than the read lengths, so none of the assemblies got down to one contig per replicon.

The error rate on the Ion Torrent was very high, though I understand that the company has come out with more improvements since the experiment was done, so the numbers may not be representative of results you would get today.

I look forward to a similar comparison of long-read sequencers later this year, when the Oxford Nanopore machine can be compared to the PacBio machine, and to the benchtop short-read machines tested in this paper.

2012 April 15

Reconstructing genes from PacBio reads

Filed under: Uncategorized — gasstationwithoutpumps @ 11:09

In Working with other people’s data, I talked a bit about the data I was using for a gene sequencing project:

… there is a prokaryotic gene that is highly variable and it varies by having a couple of repeat regions that have more or fewer copies of a longish repeat sequence.  The repeat sequence is not always the same, but recent duplication events can result in 1000-long identical blocks.  The repeat block itself may be over 3000 bases long. The goal of the project is to accurately reconstruct the gene for 100s of different strains.

I mentioned the importance of long reads:

More important for me than the average length is the greatly increased number of long reads. That extra length makes a huge difference in how easy it is to reconstruct the repeats—having reads that actually span the repeat blocks makes reconstruction much, much easier. 

And I talked about the error model:

I used a very different error model for Pacbio reads than I would for other sequencing technologies.  …  For the PacBio reads, I treated all errors as indel errors, because indels are very common and base substitution is fairly rare.  

Today I want to talk a little about how the reconstruction works, and about some errors in PacBio reads, other than short indels, that are important to consider in reconstructing the gene sequence.

The basic method I'm using for reconstructing a gene is a simple iterative procedure for improving a guess at the gene sequence (a rough Python skeleton follows the list):

  • Find subset of the reads that are similar to the guess.
  • Orient the reads so that they are all on the same strand as the guess.
  • Build a profile hidden Markov model (HMM) from the guess and retrain it on the reads found. Use model surgery to improve the structure of the HMM.
  • Find a larger set of reads that are similar to the consensus of the retrained HMM, and orient them.
  • Align the larger set using the HMM.
  • Make a new guess based on the consensus of the alignment.
  • Starting from this new guess, repeat the procedure.
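As a skeleton (my sketch, not my actual scripts), the loop looks something like this; the stubs stand in for external tools and are not working implementations:

```python
# Skeleton of the iterative reconstruction loop described above.  The stubs
# are placeholders for external tools (blastn for search and orientation,
# SAM's buildmodel for HMM training); only the control flow is meaningful.

def find_similar(reads, sequence, strict):   # e.g. blastn hits above a score cutoff
    raise NotImplementedError

def orient(read, sequence):                  # reverse-complement minus-strand reads
    raise NotImplementedError

def train_hmm(initial, reads):               # e.g. SAM buildmodel with model surgery
    raise NotImplementedError

def consensus(hmm):                          # consensus sequence of the profile HMM
    raise NotImplementedError

def align_and_consensus(hmm, reads):         # align reads to HMM, take column consensus
    raise NotImplementedError

def reconstruct(seed, reads, n_rounds=10):
    guess = seed
    for _ in range(n_rounds):
        # Find reads similar to the guess and put them on the guess's strand.
        training = [orient(r, guess) for r in find_similar(reads, guess, strict=True)]
        # Build a profile HMM from the guess and retrain it on the reads;
        # model surgery improves the HMM's structure.
        hmm = train_hmm(initial=guess, reads=training)
        # Widen the net using the retrained consensus, then realign and re-guess.
        wider = [orient(r, consensus(hmm))
                 for r in find_similar(reads, consensus(hmm), strict=False)]
        guess = align_and_consensus(hmm, wider)
    return guess
```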

There are a lot of details to be filled in to make this outline of the process functional. I’ll cover just a few of them in this post.

Selecting seeds

One of the most important things in any iterative-improvement method is starting from a good initial guess (I call the initial guess a “seed”).  Unless you are working on very easy problems, iterative-improvement methods nearly always suffer from “local optima”—you can only find the best solution close to the guess you start with.  If your seed is really bad, you may converge to something other than the best overall solution.

A technique that has been used in other genome assembly projects is to start from a “reference sequence” and look for changes.  Because the gene I’m trying to sequence has been sequenced in other strains, this initially looks like a good approach—big chunks of the gene have pretty high conservation.  But there is a big danger of “reference bias”—reconstructing the gene to look like the reference, even though the data supports a different model of the repeats better.

I have seen that the iterative-improvement method I'm using is reluctant to change the number of repeats in a block. If it starts from a seed with too few repeats, it is unlikely to align the repeats so perfectly that many reads support adding an extra repeat in the same place (which is what would be needed for model surgery to insert the repeat).  The correction of the number of repeats does happen sometimes, but not often, so starting from a reference sequence would introduce a large bias toward keeping the number of repeats the same.  Getting the number of repeats right is precisely the problem that the expensive long-read sequencing is supposed to be addressing, so I cannot tolerate this reference bias.

Instead of starting from a reference sequence, I start from one of the reads.  This means that the iteration does not have any reference bias, but it brings up another question: which read should I use as a seed?  Much of the work I’ve done over the last few months has been on ways to choose good seeds.

I started with an obvious method, suggested by something I saw on the PacBio site but now can't find: using the longest reads as seeds.  This turned out to be a terrible choice.  Looking at the histogram of lengths in my Working with other people's data post, you can easily see that in the C2 chemistry, almost all the really long reads are artifacts: they're longer than the input DNA fragment.  What is not so obvious is that the long reads in the older chemistry were also full of artifacts. The artifacts seem to be of two types: doubled reads and circular reads without adapters.

The doubled reads look like what would happen if two of the double-stranded DNA molecules were ligated together before the hairpin adapters were added in library preparation.  If you take a chunk out of the middle of such a doubled molecule, you’d see a circular permutation of the original DNA fragment.

The circular reads without adapters look like what would happen if the fragment were circularized without an adapter, so that the end of the molecule is followed directly by its reverse complement.  I suppose that could result from the same sort of mechanism as a doubled read: if you ligate two of your double-stranded molecules together head to head (instead of head to tail), then a chunk out of the middle would look like a mirroring about one of the ends of the original molecule.  I looked at a few of the very long reads from the C2 chemistry, and they looked like the sequence followed by the reversed sequence, though there were some other artifacts that were harder to characterize.

It would be tempting to use quality data to select seeds, but much of what I had was in FASTA format without quality values.  In any case, I expect a problem similar to selecting by length: there may be some very good reads that have duplication artifacts or that come from contamination of the initial DNA sample.

I ended up using a reference sequence, but only to select reads to use as seeds.  I did not use a full gene as a reference, but took the three non-repeat regions (there are two different variable repeat blocks in this gene) and looked for reads that matched both before and after the first repeat block, or both before and after the second repeat block, with a consistent orientation of the matches (a sketch of this filter is below).  Rather than taking just one seed, I took ten: five attempting to span the first repeat block and five attempting to span the second.  After training for a few iterations from each seed, I took the consensus of the HMM that scored reads best from each set of five, then merged those consensus sequences to get a full-length consensus sequence.  That full-length consensus was then used as a seed for more iterations.
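The flank-matching filter might look something like this (my illustration; the region names and the hit-table format are made up, with strand and start coordinates of the kind blastn reports):

```python
# Sketch of the seed-selection filter.  A read qualifies as a seed candidate
# for a repeat block if blastn matched it to both flanking non-repeat regions
# on the same strand, in spanning order.  The hit-table format is invented.

def spanning_reads(hits, left_flank, right_flank):
    """hits: read -> list of (region, strand, read_start) tuples from blastn."""
    seeds = []
    for read, read_hits in hits.items():
        by_key = {(region, strand): start for region, strand, start in read_hits}
        for strand in "+-":
            if (left_flank, strand) in by_key and (right_flank, strand) in by_key:
                a = by_key[(left_flank, strand)]
                b = by_key[(right_flank, strand)]
                # Plus strand: left flank comes first in the read; minus: reversed.
                if (a < b) == (strand == "+"):
                    seeds.append((read, strand))
    return seeds

# e.g. candidates spanning the first repeat block:
# spanning_reads(blast_hits, "flank1", "flank2")
```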

It turned out that the most important criterion for a successful reconstruction of the gene was finding at least one good read to use as a seed spanning the first repeat block and one good read to use as a seed spanning the second repeat block.

HMM regularizers

The HMM program I'm using is the buildmodel program from the SAM HMM suite, mainly because I'm very familiar with the suite, having used it for about 15 years in protein structure prediction.  It is not optimized for the problem I'm giving it here, as I don't want to change the emission probabilities of the states, and I am mainly interested in its model-surgery capabilities (which are not as robust as I'd like, as they have not been exercised much).  I solved the problem of the emission probabilities by giving the HMMs a very stiff emission regularizer and a moderately stiff transition regularizer that makes mismatches much more expensive than insertions and deletions.  The transition regularizer was trained from an alignment of a few hundred good reads to the correct sequence.  I ended up using different regularizers for the C1 and C2 chemistries, because I had the data to train them separately, but they are not so different that one would get a huge difference in results from using the wrong one.

Orienting Reads

One problem with SAM's buildmodel is that it expects all the training data to be on the same strand.  (This was never a limitation when working with protein sequences, but it is a big problem when working with unoriented reads.)  I needed to orient the reads, but I quickly found out that there are a lot of strand-switching artifacts in PacBio reads.  A lot of them look like circularization problems, sometimes with fragments that are not full length.  That is, the reads look like A A' or even A A' A, rather than random chimeras, but the mirroring is not necessarily at the end of the original DNA fragment.  I suspect that careful library prep makes a big difference in how much of which sort of artifact one sees, but that is well outside the control of a bioinformatician.

To select and orient reads, I used blastn to find matches between all the reads and the current seeds.  After selecting some number of them as training reads, I used the blastn output to choose an orientation for either the whole read or for two parts of the read, pretending that it had the form A A' and looking for the best place to put the symmetric cut point (sketched below).  Orienting subparts of reads gave me much cleaner training data for the HMMs.  It was an absolutely essential step in removing artifacts that otherwise trapped the iterative improvement in bogus optima.  Just discarding reads that matched both strands would have eliminated too many of the best reads.
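Choosing the symmetric cut point could be sketched like this (my illustration; the real decision worked from blastn hit coordinates rather than per-base votes):

```python
# Sketch of choosing the cut point for a suspected A A' read: given per-base
# strand votes derived from blast hits (+1 plus strand, -1 minus strand,
# 0 uncovered), find the split maximizing plus votes before the cut plus
# minus votes after it.  (One would also score the mirrored A' A assignment
# and the two whole-read orientations, and keep the best of all four.)

def best_cut(votes):
    minus_after = sum(1 for v in votes if v < 0)
    plus_before = 0
    best = (0, minus_after)              # cut at 0: the whole read is the A' part
    for i, v in enumerate(votes):
        if v > 0:
            plus_before += 1
        elif v < 0:
            minus_after -= 1
        if plus_before + minus_after > best[1]:
            best = (i + 1, plus_before + minus_after)
    return best                          # (cut position, votes explained)

# best_cut([+1, +1, 0, -1, -1]) -> (2, 4): split after base 2
```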

Note that this read orientation relies on there not being any significant mirroring happening in the real DNA being sequenced.  That was true for the gene I was working on, but would definitely not be true for all sequencing one might want to do.  For example, David Bernick identified transpositions in the Pyrobaculum oguniense genome that result from homologous (and so far as we can determine identical) regions of the genome that are oriented in opposite directions.  Trying to reconstruct a piece of DNA that contained such an inverted repeat would need a different orientation method than the one that I'm using for the gene I've been reconstructing from PacBio data.
There are still a lot of details I’ve not talked about, and I haven’t gotten to one of the interesting results (how many reads are needed to do a decent reconstruction), but this post is getting too long, so I’ll leave those for a later post.
