Gas station without pumps

2014 September 30

Ebola genome browser

Filed under: Uncategorized — gasstationwithoutpumps @ 21:00

For the past week, I’ve been watching the genome browser team (led by Jim Kent) scramble to get together an information resource to aid in the fight against the Ebola virus.  They went public today:

We are excited to announce the release of a Genome Browser and information portal for the Jun. 2014 assembly of the Ebola virus (UCSC version eboVir3, GenBank accession KM034562) submitted by the Broad Institute. We have worked closely with the Pardis Sabeti lab at the Broad Institute and other Ebola experts throughout the world to incorporate annotations that will be useful to those studying Ebola. Annotation tracks included in this initial release include genes from NCBI, B- and T-cell epitopes from the IEDB, structural annotations from UniProt and a wealth of SNP data from the 2014 publication by the Sabeti lab. This initial release also contains a 160-way alignment comprising 158 Ebola virus sequences from various African outbreaks and 2 Marburg virus sequences. You can find links to the Ebola virus Genome Browser and more information on the Ebola virus itself on our Ebola Portal page.

Bulk downloads of the sequence and annotation data are available via the Genome Browser FTP server or the Downloads page. The Ebola virus (eboVir3) browser annotation tracks were generated by UCSC and collaborators worldwide. See the Credits page for a detailed list of the organizations and individuals who contributed to this release and the conditions for use of these data.


Matthew Speir
UCSC Genome Bioinformatics Group

2013 November 12

Long weekend, little accomplished

Filed under: Uncategorized — gasstationwithoutpumps @ 22:47

I just had a 4-day weekend, in which I got little more accomplished than a usual 2-day weekend.

Much of Saturday was spent trying to use PacBio reads to improve a draft genome of a V. cholerae strain that I had built with 454 reads a couple of years ago. There was no problem getting blasr to map the reads, and I could call variants with “samtools mpileup”, though that took 2 CPU days to complete. Unfortunately, that did not tell me what I really needed to know, which was whether the original assembly was in the right order. I found a couple of places where the PacBio read mapping indicated problems (either the reads all terminated their mapping at nearly the same point, or they suddenly switched from aligning very well to aligning poorly). Unfortunately, I’ve not yet figured out a good way to automate this detection, so I’m not sure I can find all the places that might have problems. Dips in average quality of the mpileup consensus over 50- or 100-base windows pull out the places where the alignments get bad, but not where they suddenly stop. Furthermore, once I’ve identified the bad regions, I still need to break the genome apart there, rebuild the bad regions from the PacBio reads that map nearby, and see if I can stitch the genome back together (probably with extra repeats that had not been resolved in the 454 assembly). I’m considering backing off and building a new genome assembly from just the PacBio reads (after cleaning them up using PacBio2CA and the 454 reads) and the Celera Assembler. I can then compare the genome built from the PacBio reads and the one built from the 454 reads and resolve any discrepancies. Sigh, this project keeps getting bigger, just as I think I’m almost done.
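The two signals described above (windows where consensus quality dips, and points where many read alignments begin or end together) can be sketched roughly as follows. This is only an illustrative heuristic, not the code used for this project: it assumes the per-base consensus qualities and the (start, end) reference coordinates of each alignment have already been extracted into plain Python lists, and the function names and thresholds (`drop`, `min_fraction`, `min_reads`) are hypothetical choices.

```python
from collections import Counter

def quality_dips(quals, window=50, drop=0.5):
    """Start positions of non-overlapping windows whose mean quality
    falls below `drop` times the mean quality of the whole sequence."""
    overall = sum(quals) / len(quals)
    dips = []
    for lo in range(0, len(quals), window):
        chunk = quals[lo:lo + window]
        if sum(chunk) / len(chunk) < drop * overall:
            dips.append(lo)
    return dips

def boundary_clusters(alignments, window=50, min_fraction=0.5, min_reads=3):
    """Start positions of windows where most overlapping reads begin or end.

    alignments: list of (start, end) reference coordinates, half-open.
    A window is flagged when at least `min_reads` reads overlap it and at
    least `min_fraction` of those reads have an alignment boundary inside it.
    """
    boundaries = Counter()
    for start, end in alignments:
        # A read counts once per window, even if it starts and ends there.
        for w in {start // window, (end - 1) // window}:
            boundaries[w] += 1
    flagged = []
    for w in sorted(boundaries):
        lo, hi = w * window, (w + 1) * window
        # Simple O(windows x reads) overlap count; fine for a sketch.
        covering = sum(1 for s, e in alignments if s < hi and e > lo)
        if covering >= min_reads and boundaries[w] / covering >= min_fraction:
            flagged.append(lo)
    return flagged

# Toy data (made up): a quality dip at 100-150, and five reads that all
# stop aligning just before position 500 while two reads span the region.
quals = [40] * 100 + [5] * 50 + [40] * 100
reads = [(0, 1000), (0, 1000), (100, 498), (180, 495),
         (260, 499), (310, 497), (390, 496)]
dip_positions = quality_dips(quals)
cluster_positions = boundary_clusters(reads)
print(dip_positions, cluster_positions)
```

The boundary count is normalized by local coverage so that deep regions are not flagged simply for having many reads. On a real genome the thresholds would need tuning, and the windows at the two ends of each contig would be flagged trivially unless excluded.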

On Sunday, I did a bunch of small tasks: raked leaves and shredded them, updated the grad alumni web page, announced the Freshman Design Seminar class (which will happen winter quarter, though once again as a “Group Tutorial” to prototype the course before submitting the official paperwork), wrote a letter of recommendation for a student applying to grad schools, scanned in the flyer for “Planet of the Abes” (the recent Dinosaur Prom show), scanned a 35-year-old t-shirt of mine so that I can get another copy made, updated my paper list to include the just-released PNAS paper, wrote a blog post, and caught up with a lot of my e-mail (though there are still some advising e-mails that I haven’t taken care of).


The flyer, drawn by Hunter Wallraff (who holds the copyright), for the Dinosaur Prom Improv show. Because the edge of the drawing was not reproduced on the flyer, I had to finish the S and add an E by hand for “Broadway Playhouse” to get something usable for the titling of the video. I did not correct the error in the URL for westperformingarts.com.

On Monday, I did a lot of grading, wrote another blog post, used the Planet of the Abes flyer to make titles for the Dinosaur Prom video, and rendered the video (tying up my laptop all night).  I also cleaned up the scan of the old t-shirt and converted it to SVG so that a new silk screen printing can be done. I’ve tried looking for the copyright holder for the design, but I have no idea how to find him (or her)—Google image searches bring up nothing similar, and there is no signature on the design or the shirt. I started working on my slides for the talk I have to give on Thursday, but did not get much done.

This morning, I responded to more e-mail, wrote another blog post, did more grading, and returned the activity monitor I’ve been wearing for the past 2 weeks to the Sleep Center. In the afternoon, I did more grading at Gayle’s Bakery in Capitola, met with my son’s consultant teacher for a couple of hours, bought the usual weekly load of soy milk (only 2.5 gallons this week), did some other grocery shopping, finished the grading, recorded the grades, cleared the rest of the advising e-mail, and compared results on group theory problems with my son.  We’re a bit behind schedule there—he’s not finished all the Chapter 1 and Chapter 2 problems I assigned, and we’re supposed to be finishing Chapter 3 this week—I’ve not even assigned Chapter 3 problems yet.

Things I wanted to do this weekend but didn’t:

  • Get the slides done for Thursday’s talk
  • Get the Program Learning Objectives written for bioinformatics
  • Get assessment plans defined and written for the Program Learning Objectives for both bioengineering and bioinformatics.
  • Create a draft of a revised curriculum for the third track in bioengineering (which also needs a new name and a clearer focus).
  • Rewrite the handout for the next programming assignment in the Bioinformatics: Models and Algorithms course.
  • Write code for looking for regions of the Helicobacter pylori genome that are possibly swapped in the current assembly and test for which rearrangement is most consistent with our data.
  • Start testing the BitScope differential input device they sent me.
  • Start working on Chapter 3 problems in group theory.
  • Start writing a paper on the segmenter that I described in my blog 3 months ago.
  • Clear the leaves off the roof before the rains start, since the leaves form dams that keep the rain from running off into the gutters properly.

There were probably other things, but I forget what they were now.  Once the to-do list gets longer than my piece of paper can hold, things fall off it.

 

2013 May 16

Snarky critiques

Filed under: Uncategorized — gasstationwithoutpumps @ 09:07

I just read a marvelously snarky critique of the ENCODE papers (which most of the bioinformaticians I know considered flawed in their overestimates of how much of the human genome is “functional”). Perhaps the best of the critiques is this one: On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE.

The article accuses the ENCODE authors of several academic sins:

Oddly, ENCODE not only uses the wrong concept of functionality, it uses it wrongly and inconsistently.

Sadly, the authors of ENCODE decided to disregard evolutionary conservation as a criterion for identifying function.

Some of their comments are marvelously snarky:

According to Eric Lander, a Human Genome Project luminary, ENCODE is the “Google Maps of the human genome” (Durbin et al. 2010). We beg to differ, ENCODE is considerably worse than even Apple Maps.

The article provides solid reasoning for why the estimate that about 80% of the genome is functional is completely bogus, and provides more reasonable estimates:

Ward and Kellis (2012) confirmed that ~5% of the genome is interspecifically conserved, and by using intraspecific variation, found evidence of lineage-specific constraint suggesting that an additional 4% of the human genome is under selection (i.e., functional), bringing the total fraction of the genome that is certain to be functional to approximately 9%. The journal Science used this value to proclaim “No More Junk DNA” (Hurtley 2012), thus, in effect rounding up 9% to 100%.

The ENCODE project produced a lot of good data, but some of the hype surrounding it irritated a lot of biologists and bioinformaticians, who are pleased to see the ENCODE hype so amusingly and accurately skewered.

What bioinformaticians do

Filed under: Uncategorized — gasstationwithoutpumps @ 08:33

I recently read two blog posts about what bioinformaticians do (though both claim to be about “what it takes”):

The first post is talking about a shift from “bioinformatics” to “computational biology”—that is, a shift from designing algorithms and data structures to answer biological questions to asking biological questions for which computational tools already exist.  It has quotes with some hype about job opportunities in bioinformatics, but it also has some counterpoints about more realistic views of the bioinformatics job market.  The tone of the piece overall is that bioinformatics is the best of all possible fields.

The second post has a less exalted view of bioinformatics, pointing out that most bioinformatics jobs are data wrangling.  They do say that even data wranglers can do research if they want to, which makes them better off than most wet-lab technicians.

Both posts stress the importance of programming, statistics, and knowing some biology.

 

2013 April 29

Scientists need math

Filed under: Uncategorized — gasstationwithoutpumps @ 14:28

At the beginning of April (but not on April Fool’s Day), the Wall Street Journal published an essay by E.O. Wilson (a famous biologist): Great Scientists Don’t Need Math. The gist of the article is that Dr. Wilson never learned much math and did well in biology, so others can too:

Wilson’s Principle No. 1: It is far easier for scientists to acquire needed collaboration from mathematicians and statisticians than it is for mathematicians and statisticians to find scientists able to make use of their equations.

Wilson’s Principle No. 2: For every scientist, there exists a discipline for which his or her level of mathematical competence is enough to achieve excellence.

The first principle is probably true, but is more a sociological statement than one inherent to the disciplines: applied mathematicians and statisticians welcome collaborations with all sorts of scientists and are happy to learn about and work on real problems that come up elsewhere, while biologists (particularly old-school ones like Dr. Wilson) tend not to be interested in anything outside their own labs and those of their close collaborators and competitors.

The second principle is possibly also true, though much less so than in the past.  Biology used to be a major refuge for innumerate scientists, but modern biology requires a really strong foundation in statistics, far more than most biology students are trained in. The number of positions for innumerate scientists is rapidly shrinking, while the supply of innumerate biology PhDs is growing rapidly.  In the highly competitive job market for biology research, those who follow E. O. Wilson’s advice have a markedly smaller chance of getting the jobs they desire. Of course, Dr. Wilson seems to be unaware of the decades-long oversupply of biology researchers:

During my decades of teaching biology at Harvard, I watched sadly as bright undergraduates turned away from the possibility of a scientific career, fearing that, without strong math skills, they would fail. This mistaken assumption has deprived science of an immeasurable amount of sorely needed talent. It has created a hemorrhage of brain power we need to stanch.

An undergrad degree in biology (even from Harvard) has not gotten many students much more than low-level technician jobs for most of that time (admission to grad school is the better option, as biology PhDs have been able to get temporary postdoc positions at least).  Perhaps Dr. Wilson considers a dead-end job at little more than minimum wage a suitable scientific career—many others do not.

Dr. Wilson does make one unsubstantiated claim that I agree with:

The annals of theoretical biology are clogged with mathematical models that either can be safely ignored or, when tested, fail. Possibly no more than 10% have any lasting value. Only those linked solidly to knowledge of real living systems have much chance of being used.

Biology is a data-driven science, not a model-driven science (a distinction that physicists trying to jump into the field often miss). Most of “mathematical biology” has been an attempt to apply physics-like models in places where they don’t really fit. But there has been a big change in the past 10–15 years, as high-throughput experiments have become common in biology. Now mathematics (mainly statistics) is needed to make any sense out of the experimental results, and biologists with inadequate training in statistics end up drawing ludicrously wrong conclusions from their experiments, often claiming high significance for random noise. To understand the data requires more than Wilson’s “intuition”: it requires a solid understanding of the statistics of big data and multiple hypotheses, as humans are very good at perceiving patterns in random noise.
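The multiple-hypotheses problem is easy to demonstrate with a small simulation. The sketch below assumes every null hypothesis is true and that the tests are independent with continuous test statistics, so each p-value is uniform on [0, 1) and a random draw can stand in for a real test. The function names and the choice of 10,000 tests are illustrative, and Bonferroni is used only as the simplest standard correction, not as the one best suited to any particular experiment.

```python
import random

def family_wise_error(alpha, n_tests):
    """Chance of at least one false positive among n_tests
    independent tests in which every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests

def count_false_positives(n_tests, threshold, seed=0):
    """Count 'significant' results on pure noise.

    Each uniform random draw plays the role of a p-value
    from a test whose null hypothesis is true.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n_tests) if rng.random() < threshold)

n_tests, alpha = 10_000, 0.05
naive = count_false_positives(n_tests, alpha)
# Bonferroni correction: divide the threshold by the number of tests.
corrected = count_false_positives(n_tests, alpha / n_tests)

print(f"uncorrected 'discoveries' in pure noise: {naive}")
print(f"Bonferroni-corrected 'discoveries':      {corrected}")
print(f"P(>=1 false positive in 100 tests) = {family_wise_error(alpha, 100):.3f}")
```

With an uncorrected 0.05 threshold, roughly 5% of the noise-only tests come out “significant”, which amounts to hundreds of spurious hits in a high-throughput experiment. With the corrected threshold, almost none do.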

I was pointed to Dr. Wilson’s WSJ essay by Iddo Friedberg’s post Terrible advice from a great scientist, which has a somewhat different critique of the essay. He accuses Wilson of “not recognizing the generalization from an outlier cannot serve as a viable model, or even an argument to support his position.”  Iddo makes several other points, some of them the same as mine—go read his post! Of course, like me, Dr. Friedberg is a bioinformatician and so sees the central role of statistics in 21st century biology.  Perhaps the two of us are wrong, and innumerate biologists will again have glorious scientific careers, but I think the odds are against it.
