Gas station without pumps

2015 April 23

Very long couple of days

Yesterday and today have been draining.

Yesterday, I had three classes each 70 minutes long: banana slug genomics, applied electronics for bioengineers, and a guest lecture for another class on protein structure.  I also had my usual 2 hours of office hours, delayed by half an hour because of the guest lecture.

The banana-slug-genomics class is going well.  My co-instructor (Ed Green) has done most of the organizing and has either arranged guest lectures or taught classes himself. This week and part of next we are getting preliminary reports from the 5 student groups on how the assemblies are coming.  No one has done an assembly yet, but there has been a fair amount of data cleanup and prep work (adapter removal, error correction, and estimates of what k-mer sizes will work best in the de Bruijn graphs for assembly).  The data is quite clean, and we have about 23-fold coverage currently, which is just a little low for making good contigs.  (See https://banana-slug.soe.ucsc.edu/data_overview for more info about the data.) Most of the data is from a couple of lanes of HiSeq sequencing (2×100 bp) from 2 libraries (insert sizes around 370 and 600), but some is from an early MiSeq run (2×300 bp), used to confirm that the libraries were good before the HiSeq run.  In class, we decided to seek a NextSeq run (2×250 bp), either with the same libraries or with a new one, so that we could get more data quickly (we can get the data by next week, rather than waiting 2 or 3 weeks for a HiSeq run to piggyback on).  With the new data, we’ll have more than enough shotgun data for making the contigs.  The mate-pair libraries for scaffolding are still not ready (they’ve been failing quality checks and need to be redone), or we would run one of them on the NextSeq run.  We’ll probably also do a transcriptome library (in part to check quality of scaffolding, and in part to annotate the genome), and possibly a small-RNA library (a UCSC special interest).
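For readers who want the coverage arithmetic spelled out, here is a minimal sketch in Python.  Fold coverage is just total sequenced bases divided by genome size.  The genome size and target coverage in the usage note are hypothetical round numbers chosen for illustration, not measurements from the banana-slug project, and the function names are my own.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average number of times each genome position is sequenced."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def extra_bases_needed(current_coverage: float,
                       target_coverage: float,
                       genome_size: float) -> float:
    """Bases still to be sequenced to raise coverage to the target."""
    return max(0.0, (target_coverage - current_coverage) * genome_size)
```

With a hypothetical 2 Gbase genome, `fold_coverage(46e9, 2e9)` gives 23-fold coverage, and `extra_bases_needed(23, 40, 2e9)` says that another 34 Gbases would be needed to reach a (hypothetical) 40-fold target.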

The applied electronics lecture had a lot to cover, because the material on hysteresis that was not covered on Monday needed to be done before today’s lab, plus I had to show students how to interpret the 74HC14N datasheet for the Schmitt trigger, as we run them at 3.3V, but specs are only given for 2V, 4.5V, and 6V.  I also had to explain how the relaxation oscillator works (see last year’s blog post for the circuit they are using for the capacitance touch sensor).
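The relaxation-oscillator reasoning can be sketched in a few lines of Python.  This assumes the simplest form of the circuit: an inverting Schmitt trigger whose output swings rail to rail (0 V to the supply voltage) and charges a capacitor through a single feedback resistor.  The threshold voltages are parameters because they have to be estimated from the datasheet for the supply voltage actually used; the values in the usage note are hypothetical, not datasheet numbers.

```python
import math


def relaxation_frequency(r_ohms: float, c_farads: float,
                         v_dd: float, v_low: float, v_high: float) -> float:
    """Oscillation frequency of an RC relaxation oscillator built around
    an inverting Schmitt trigger with thresholds v_low < v_high.

    Assumes the output swings all the way from 0 V to v_dd.
    """
    if not 0 < v_low < v_high < v_dd:
        raise ValueError("need 0 < v_low < v_high < v_dd")
    tau = r_ohms * c_farads
    # Output high: capacitor charges from v_low toward v_dd until v_high.
    t_charge = tau * math.log((v_dd - v_low) / (v_dd - v_high))
    # Output low: capacitor discharges from v_high toward 0 V until v_low.
    t_discharge = tau * math.log(v_high / v_low)
    return 1.0 / (t_charge + t_discharge)
```

For example, `relaxation_frequency(100e3, 1e-9, 3.3, 1.1, 2.2)` uses made-up thresholds at one third and two thirds of the supply, for which the period reduces to 2RC ln 2.  Adding capacitance (as a finger on a touch plate does) lengthens both half-periods and lowers the frequency.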

Before getting to all the stuff on hysteresis, I had to finish up the data analysis for Tuesday’s lab, showing them how to fit models to the measured magnitude of impedance of the loudspeakers using gnuplot.  The fitting is fairly tricky, as the resistor has to be fit in one part of the curve, the inductor in another, and the RLC parameters for the resonance peak in yet another.  Furthermore, the radius of convergence is pretty small for the RLC parameters, so we had to do a lot of guessing reasonable values and seeing if we got convergence.  (See my post of 2 years ago for models that worked for measurements I made then.)
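To make the three-region fitting concrete, here is a sketch in Python (rather than gnuplot) of one common simple loudspeaker model: a series resistor and inductor, plus a parallel RLC section for the mechanical resonance.  This is an assumed model for illustration, with parameter names of my own choosing; the models actually fitted in class may differ.

```python
import math


def speaker_impedance(freq_hz: float, r_s: float, l_s: float,
                      r_p: float, l_p: float, c_p: float) -> complex:
    """Complex impedance of series R-L plus a parallel R-L-C resonance."""
    w = 2 * math.pi * freq_hz
    z_series = complex(r_s, w * l_s)            # voice-coil R and L
    y_parallel = 1 / r_p + 1 / (1j * w * l_p) + 1j * w * c_p
    return z_series + 1 / y_parallel            # add the resonance peak


def resonant_frequency(l_p: float, c_p: float) -> float:
    """Frequency at which the parallel L and C cancel."""
    return 1 / (2 * math.pi * math.sqrt(l_p * c_p))
```

The magnitude `abs(speaker_impedance(...))` approaches `r_s` well below resonance, is dominated by `l_s` at high frequency, and peaks near `r_s + r_p` at `resonant_frequency(l_p, c_p)`, which is why each group of parameters has to be fit on a different part of the curve.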

After the overstuffed electronics lecture, I had to move to the next classroom over and give a guest lecture on protein structure.  For this lecture I did some stuff on the chalk board, but mostly worked with 3D Darling models. When I did the guest lecture last year, I prepared a bunch of PDB files of protein structures to show the class, but I didn’t have the time or energy for that this year, so decided to do it all with the physical models.  I told students that the Darling models (which are the best kits I’ve seen for studying protein structure) are available for check out at the library, and that I had instructions for building protein chains with the Darling models plus homework in Spring 2011 with suggestions of things to build.  The protein structure lecture went fairly well, but I’m not sure how much students learned from it (as opposed to just being entertained).  The real learning comes from building the models oneself, but I did not have the luxury of making assignments for the course—nor would I have had time to grade them.

Speaking of grading, right after my 2 hours of office hours (full, as usual, with students wanting waivers for requirements that they had somehow neglected to fulfill), I had a stack of prelab assignments to grade for the hysteresis lab.  The results were not very encouraging, so I rewrote a section of my book to try to clarify the points that gave the students the most difficulty, adding in some scaffolding that I had thought would be unnecessary.  I’ve got too many students who can’t read something (like the derivation of the oscillation frequency for a relaxation oscillator on Wikipedia) and apply the same reasoning to their slightly different relaxation oscillator.  All they could do was copy the equations (which did not quite apply).  I put the updated book on the web site at about 11:30 p.m., emailed the students about it, ordered some more inductors for the power-amp lab, made my lunch for today, and crashed.

This morning, I got up around 6:30 a.m. (as I’ve been doing all quarter, though I am emphatically not a morning person), to make a thermos of tea, and process my half-day’s backlog of email (I get 50–100 messages a day, many of them needing immediate attention). I cycled up to work in time to open the lab at 10 a.m., then was there supervising students until after 7:30 p.m. I had sort of expected that this time, as I knew that this lab was a long one (see “Hysteresis lab too long” from last year, and that was when the hysteresis lab was a two-day lab, not just one day).  Still, it made for a very long day.

I probably should be grading redone assignments today (I have a pile that were turned in Monday), but I don’t have the mental energy needed for grading tonight.  Tomorrow will be busy again, as I have banana-slug genomics, a visiting collaborator from UW, the electronics lecture (which needs to be about electrodes, and I’m not an expert on electrochemistry), and the grad research symposium all afternoon. I’ll also be getting another stack of design reports (14 of them, about 5 pages each) for this week’s lab, to fill up my weekend with grading. Plus I need to update a couple more chapters of the book before students get to them.

2011 October 13

Single-cell genome sequencing

I read a news blurb in GenomeWeb Daily News: Researchers Sequence Genome of Single Marine Bacterial Cell. The paper they are referring to is

Efficient de novo assembly of single-cell bacterial genomes from short-read data sets.
Hamidreza Chitsaz, Joyclyn L Yee-Greenbaum, Glenn Tesler, Mary-Jane Lombardo, Christopher L Dupont, Jonathan H Badger, Mark Novotny, Douglas B Rusch, Louise J Fraser, Niall A Gormley, Ole Schulz-Trieglaff, Geoffrey P Smith, Dirk J Evers, Pavel A Pevzner, & Roger S Lasken.
Nature Biotechnology 29: 915–921 (2011) doi:10.1038/nbt.1966

Sorry, it is behind a pay wall, but you can read it at any UC library, since the University of California subscribes.

To do sequencing you need a lot of DNA, but a single cell provides only one copy of the molecule.  The trick is to make many copies—in this case using a technique called Multiple Displacement Amplification (MDA) using phi29 DNA polymerase, to go from femtograms to micrograms of DNA (a billion-fold amplification).  Unfortunately, the amplification is not uniform across the genome, and chimeras can be formed.

The main advance in this paper is computational, rather than wet-lab techniques.  They developed new genome assembly methods that are more tolerant of the non-uniform coverage of MDA than the standard algorithms.  The algorithm used is a variant of the standard de Bruijn graph method used in Velvet.  Instead of throwing away low-coverage contigs (which in standard assemblies are often the result of sequencing errors or contaminants) using a fixed threshold, Velvet-SC uses a gradually increasing threshold.  When discarding a contig allows two others to merge, the average coverage of the merged contig is recomputed, which may rescue a previously moderately low coverage contig that now has a unique neighbor to merge with. The whole algorithm consists of error correction using Euler-SR, followed by assembly using Velvet-SC.
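The difference between a fixed and a gradually increasing threshold is easier to see on a toy example.  The Python sketch below is my own simplification, not the Velvet-SC implementation: contigs are nodes with a length and a coverage, edges are plain successor links, and all the names are hypothetical.

```python
def _remove(nodes, succ, name):
    """Delete a contig and every edge pointing to it."""
    del nodes[name], succ[name]
    for outs in succ.values():
        outs.discard(name)


def _merge_chains(nodes, succ):
    """Merge a -> b whenever a has one successor and b one predecessor,
    recomputing the length-weighted average coverage."""
    merged = True
    while merged:
        merged = False
        pred = {n: set() for n in nodes}
        for a, outs in succ.items():
            for b in outs:
                pred[b].add(a)
        for a in list(nodes):
            if len(succ[a]) != 1:
                continue
            (b,) = succ[a]
            if b != a and len(pred[b]) == 1:
                (la, ca), (lb, cb) = nodes[a], nodes[b]
                nodes[a] = (la + lb, (la * ca + lb * cb) / (la + lb))
                succ[a] = succ.pop(b)
                del nodes[b]
                merged = True
                break


def assemble_fixed(nodes, succ, threshold):
    """Discard every contig below the threshold at once, then merge."""
    nodes, succ = dict(nodes), {n: set(s) for n, s in succ.items()}
    for name in [n for n, (_, cov) in nodes.items() if cov < threshold]:
        _remove(nodes, succ, name)
    _merge_chains(nodes, succ)
    return nodes


def assemble_gradual(nodes, succ, max_threshold):
    """Raise the threshold one step at a time, merging after each removal."""
    nodes, succ = dict(nodes), {n: set(s) for n, s in succ.items()}
    _merge_chains(nodes, succ)
    for threshold in range(1, max_threshold + 1):
        while True:
            low = [n for n, (_, cov) in nodes.items() if cov < threshold]
            if not low:
                break
            _remove(nodes, succ, min(low, key=lambda n: nodes[n][1]))
            _merge_chains(nodes, succ)
    return nodes


# Toy graph: A branches to B (real but poorly amplified) and E (an error).
toy_nodes = {"A": (1000, 30.0), "B": (100, 3.0),
             "E": (50, 1.0), "C": (1000, 25.0)}
toy_succ = {"A": {"B", "E"}, "B": {"C"}, "E": {"C"}, "C": set()}
```

With a fixed threshold of 4, `assemble_fixed(toy_nodes, toy_succ, 4)` throws away both B and E and leaves A and C as separate contigs.  `assemble_gradual(toy_nodes, toy_succ, 4)` removes E first (at threshold 2), which lets A, B, and C merge into one 2100-base contig whose average coverage is well above the final threshold, so the low-coverage B is rescued.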

The authors tested their assembly method using DNA from single E. coli cells and a single S. aureus cell. But this was not real single-cell sequencing, as they made several amplifications from different E. coli cells, and sequenced the ones that seemed to be most complete (containing 10 pre-selected loci). They do not say how many cells they had to amplify to get the 2 E. coli and one S. aureus DNA sets that they sequenced. Similarly for the uncultured marine bacterium, they did many MDA reactions for different cells, and selected one reaction product based on the 16S ribosomal gene.

Data from MDA amplification is pretty noisy: only about 93% of the 100bp paired-end Illumina reads were mappable to the reference genome (vs. 99% for sequencing from a multicellular clonal population).  Most of the problem was amplification of minor contaminants, but up to 2% of the reads were chimeric reads due to the MDA process itself.

Their final assembly of the novel marine bacterium had 823 contigs with an N50 of 30,293 and a longest contig of 113,282.  Since the gaps between the contigs may be due to the biases of MDA, and there is no culture to get more copies of the same bacterium from, it may not be possible to close this genome.  They do not discuss finishing the genome by doing PCR on metagenomic DNA, though they mention that the SAR324 clade is found in numerous marine samples.  They do discuss the possibility of using mate-pair libraries to order and orient the contigs, but are worried that the chimeric rearrangement that occurs every 10–30 kbases in MDA would limit the usefulness of mate-pair libraries.
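For anyone unfamiliar with the N50 statistic quoted above: it is the contig length such that contigs of that length or longer contain at least half of the assembled bases.  A minimal Python sketch (the function name is my own):

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half
    of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For contigs of lengths 2, 3, 4, 5, and 6 (total 20), the two longest contigs already hold 11 bases, so `n50([2, 3, 4, 5, 6])` is 5.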

2011 April 9

Genome Assemblathon 2

There is a new genome assembly competition/experiment starting up.  This one uses real data, not synthetic data, so the “right” answer is not known.  There will, however, be some data available for evaluation that is not available to the assemblers, so it should be possible to tell the fairly good methods from the poor ones, even if fine distinctions between similar results are not feasible.

Two animals are being sequenced: a parrot (a budgerigar named Mr. B) and a fish (a Lake Malawi Cichlid).  The raw reads are intended to be released by June 1.  Most of the sequence data will be from Illumina machines at BGI or Broad, because that is currently the cheapest technology for high-volume sequencing.  There is also some 454 data for the parrot (about 3x), and many of the Illumina runs will be with the new Illumina chemistry (TruSeq v3) to try to get around some problems with sequencing GC-rich regions of the parrot genome.  I believe that the fish genome will be just Illumina runs from the Broad Institute, using the standard set of insert sizes that they use.

One interesting aspect of the Assemblathon is figuring out how to get the raw data to the researchers.  They currently expect over a terabyte for each species, which can be rather slow to deliver over the internet.  They are planning to mail terabyte hard drives around the world, rather than try to ship the information electronically.

People wanting to join the Assemblathon should contact Ian Korf at UC Davis.  (If you need me to find his e-mail address for you, then you certainly don’t have the skills needed to assemble a genome computationally.)

Although I’m teaching a course on genome assembly now, I have not written any tools capable of tackling a eukaryotic genome.  I may be developing some next year, if I can find something clever that can be implemented by one person and hasn’t already been done.  So far, some of my best ideas have turned out to already be in the literature, along with some clever stuff that I didn’t think of.

Even very simple ideas (like counting k-mers) are tricky to do well.  I was quite impressed by the memory and time efficiency of the Jellyfish program, for example, which uses very tight bit packing and very low overhead locking of multiple threads on a shared-memory machine to do very fast k-mer counting.  As a former computer engineer, it warms my heart to see some of the bit-twiddling techniques that were being dismissed as irrelevant details by the “high-level” computer scientists now becoming important for big algorithmic problems, not just device drivers and low-level hardware.
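As a small illustration of the bit-packing idea (and nothing like Jellyfish's actual lock-free hash table), here is a naive Python sketch that packs each k-mer into an integer at 2 bits per base and updates it with a shift and a mask as it slides along the read.  The names are my own, and reverse complements are not merged, which a real counter would usually do.

```python
from collections import Counter

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq: str, k: int) -> Counter:
    """Count k-mers, keyed by their 2-bit-per-base packed integer."""
    mask = (1 << (2 * k)) - 1       # keeps only the last k bases
    counts = Counter()
    packed = 0
    valid = 0                       # consecutive unambiguous bases seen
    for base in seq.upper():
        code = BASE_CODE.get(base)
        if code is None:            # N or other ambiguity: restart window
            packed, valid = 0, 0
            continue
        packed = ((packed << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[packed] += 1
    return counts


def unpack(packed: int, k: int) -> str:
    """Turn a packed k-mer back into its string of bases."""
    return "".join("ACGT"[(packed >> (2 * (k - 1 - i))) & 3]
                   for i in range(k))
```

Counting 3-mers in `"ACGTACGT"` this way finds ACG and CGT twice each and GTA and TAC once each.  A Python dictionary of integers still has enormous per-entry overhead compared with a tightly packed table, which is exactly the sort of detail that matters at genome scale.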

2011 January 16

New DNA sequencing platforms

Filed under: Uncategorized — gasstationwithoutpumps @ 15:02

In his Omics! Omics! blog, Keith Robison has just posted twice about the MiSeq sequencer to come out this summer from Illumina as a major competitor to Ion Torrent for low-cost sequencing: My, Oh My, Oh MiSeq!! and Whither Ion Torrent? There was already some crowding in the low-price sequencer market with the 454 GS Jr.: First of a Torrent?

I’m wondering whether Ion Torrent can reduce library prep cost enough to stay ahead of Illumina in the low-price, benchtop sequencer market.
