Gas station without pumps

2012 September 19

Automated assessment of protein structure prediction

Filed under: Uncategorized — gasstationwithoutpumps @ 22:04

A former student of mine today sent me a link to preliminary results from the latest CASP competition: Automated assessment of protein structure prediction in CASP10+ROLL (Hard targets).

The CASP community-wide experiment is an attempt to measure progress in the field of computational prediction of protein structure from sequence.  The idea of the experiment is to distribute the sequences for proteins whose structures have not been released, but are known or about to be known (data collected and preliminary models built from the data).  The predictors use the sequences to predict the structures and register their predictions with the organizers of CASP.  When the structures are released, the organizers compare the registered predictions with the actual structures and report who has done particularly well.  A conference is held at which the prediction groups who did particularly well on one or more aspects of prediction report how they did it.

These CASP competitions happen every two years, and I’ve been to many of them, generally doing well enough to be invited to speak. For the past few years, since I’ve had no funding, I’ve stopped development of my protein-structure prediction methods, and just maintained the old web servers that provide a free prediction service to the community.

The School of Engineering wants to bill me for the electricity and machine-room space that the old computers use, but I have no grant to charge them to.  The bill could be considerably reduced if the machines were replaced by newer machines that were smaller, faster, and lower power, but I have no funding for that either.  If anyone has some rack-mount Linux nodes less than 5 years old that they want to donate, say 40–50 cores with local disks for every 2–8 cores, we could probably reduce the footprint in the machine room a lot.

Although I’ve not been doing active development lately, I did enter the SAM-T06 (written in 2006, based on methods developed in 2004) and SAM-T08 (written in 2008, based on improvements to SAM-T06 tested in 2006) servers into CASP-ROLL and CASP-10. My intent was just to provide a historical baseline of old methods that could be used for measuring progress in the field.

Although official results and results for human-involved predictions will not be available until the conference in Italy, December 9–12, the server-based predictions have been informally evaluated by Yang Zhang, whose server has done very well (best by several measures) in the past few CASPs.

In Zhang’s evaluations, the SAM-T08 server did quite well on the “hard” targets (3rd best of 67 servers), despite having had no development over the past 4 years, just weekly automatic updates to its library of models.  The method was developed to find “remote homologs”—proteins that are related to the target being predicted, but not closely related.  It seems to still be doing well at that task.

On the “easy” targets, where finding homologous proteins whose structure is known is easy, the task becomes one of choosing among different homologs, getting the alignment to the homologs as accurate as possible, and (possibly) combining information from different homologs.  The SAM-T08 method is not particularly good at choosing among homologs, and generally includes a few that are a bit too distant when there are many to choose among.  As a result, among the easy targets, SAM-T08 drops to 42nd out of 67 servers in Zhang’s automated assessment.  There isn’t a huge difference on the easy targets among the top 57 or so servers by his measure, as they are all pretty much pulling up the same templates and making minor tweaks to them.  The CASP assessor will probably pull out a variety of different measures to try to make finer distinctions among the methods.

If you combine the good results for the hard targets and the almost-as-good-as-everyone-else results for the easy targets, the SAM-T08 server comes out 8th of 67 servers for all targets. The older SAM-T06 server is in the middle of the pack, at 35th out of 67.  (Note: choosing other metrics will order the servers differently, so I make no claim that the “8th” place position is in any way a robust estimate of the relative quality of the many servers.)
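As a toy illustration of why the ordering depends on the metric, here is a minimal Python sketch. The server names and scores are made up for the example (they are not Zhang's data, and the mean score used here is an assumption, not his actual measure). It ranks three hypothetical servers by their mean score on hard targets, on easy targets, and on all targets together.

```python
from statistics import mean

# Hypothetical per-target scores (0 to 1, higher is better).
# These numbers are invented for illustration; they are NOT real CASP results.
scores = {
    "server_A": {"hard": [0.40, 0.35, 0.45], "easy": [0.80, 0.82, 0.78]},
    "server_B": {"hard": [0.30, 0.28, 0.32], "easy": [0.88, 0.90, 0.86]},
    "server_C": {"hard": [0.25, 0.30, 0.20], "easy": [0.85, 0.84, 0.86]},
}

def hard_mean(name):
    """Mean score over the hard targets only."""
    return mean(scores[name]["hard"])

def easy_mean(name):
    """Mean score over the easy targets only."""
    return mean(scores[name]["easy"])

def overall_mean(name):
    """Mean score over all targets, hard and easy pooled together."""
    return mean(scores[name]["hard"] + scores[name]["easy"])

def rank_by(metric):
    """Return the server names sorted best-first under the given metric."""
    return sorted(scores, key=metric, reverse=True)

hard_rank = rank_by(hard_mean)
easy_rank = rank_by(easy_mean)
overall_rank = rank_by(overall_mean)

print("hard:   ", hard_rank)
print("easy:   ", easy_rank)
print("overall:", overall_rank)
```

In this toy data the server that is best on hard targets is last on easy targets, yet still comes out first overall, because its lead on the hard targets is larger than its deficit on the easy ones. A different way of pooling or weighting the targets could easily reorder the list.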

In a way it is very heartening that without my putting in any more work, my servers still do quite well. In another way it is depressing that the protein-structure prediction field seems to have made no progress in the past 4 years (and maybe the past 6).  I guess there is still some hope that a human-assisted prediction did much better but just hasn’t been automated yet, but I’m not holding my breath.  In the past few CASPs, the best human-assisted predictions were not really human assisted, but just the best servers run for longer, perhaps with hints taken automatically from other servers.

In a way, this lack of progress reinforces my decision to leave the field of protein-structure prediction, even though I still had a lot of ideas that could have been tried.  Almost all the ideas I had would have taken a lot of work to make tiny incremental improvements, and NIH has no interest in funding the hard work it takes to make small improvements.

NIH was looking for grant proposals that promised magical leaps forward, but I don’t think that there are any magical leaps coming in the next 10 years, and I was not willing to lie about that in grant proposals.  So NIH stopped funding me, giving the money to people who were better at hyping their research.

I was getting tired of having panels reject my proposals, sometimes for bogus reasons. On one proposal, one reviewer commented that my group didn’t need the money, even though at the time I had only one or two years left on a single grant that supported 2 grad students. I guess the reviewer had me confused with a different group that had 30 or more grad students and postdocs in it.  I’ve never had funding for a postdoc (though I paid one for a year out of money that was budgeted for my summer salary and a grad student).

I suppose in a way I really didn’t need the money: if I hadn’t been dedicated to training grad students I could have done the work by myself without funding. I was already converting all the funds for my summer salary into grad student support, and most of the computers I used over the years were ones surplussed from other projects that would otherwise have been thrown out. I never had enough grant funding to buy my own cluster, though I did once have enough to buy a file server and UPS for it, and something like 6–10 years ago I did buy a new desktop computer for my office.  It would be nice to work on new computers, and to get a file server that is not so old and slow, but none of the federal agencies are interested in funding small equipment grants, and it isn’t worth the effort of writing such a proposal just to get it turned down.

It would have been nice to be able to hire someone to clean up the research code and make it better documented, more distributable, and maintainable. I tried a couple of times to get grants for that, but NIH would rather see the code quietly disappear, or have me spend 2–3 years doing it for free.  I’m not going to ask again, so the code will probably fade into oblivion.

There are a lot of unpublished bits of research in the SAM-T08 server (the scoring function for H-bonds without explicit hydrogens, the scoring function for disulfide bonds, the improvements to the HMM scoring in SAM, … ), but my writer’s block kicks in whenever I try to write them up, so without a co-author they are unlikely ever to get written.

Oh well, I don’t want to think about it any more; I’ll just get depressed.  Better to think about the courses I’ll be teaching and the new research collaborations I’m working on, where I can do productive work without writing f***ing grant proposals.

2011 January 31

The Assemblathon

Filed under: Uncategorized — gasstationwithoutpumps @ 03:35

There is a new bioinformatics contest: The Assemblathon.  The idea is to provide a CASP-like competition for judging the quality of genome assembly methods from next-generation sequence data.

Actually there are two contests, the Assemblathon and the de Novo Genome Assessment Project (dnGASP), which is based in Europe.

I’ve been thinking that I ought to be entering the Assemblathon, since genome assembly is one of the fields I’m planning to investigate for my sabbatical, but I have no software for assembling Illumina reads (or simulations of them, which is what the Assemblathon is using).  The deadline for submitting assemblies is Feb 6, so there is certainly no time for developing new software (even if I had time for research with my teaching load this year). My main interest is in small genomes (prokaryotes), but the Assemblathon is focused on diploid genomes about the size of a human chromosome.

Of course, the results of the Assemblathon will be coming out just in time for the Banana Slug Genomics class this spring, and we’ll certainly be looking at the results very carefully and trying to get our hands on the software that performs best, because the banana slug genome is probably at least as difficult as the Assemblathon pseudo-genome.  (It may be polyploid and it is almost certainly bigger than one human chromosome.)


2010 December 10

CASP9 at Asilomar

Filed under: Uncategorized — gasstationwithoutpumps @ 19:24

I just got back yesterday from the CASP9 protein-structure-prediction conference at Asilomar.  Despite some rain, the stay at Asilomar was fairly pleasant.  I did not waste my entire summer working on CASP this year, and so I had a fairly relaxed time at the conference, not needing to worry about how I did.

Merrill Hall, where most of the conference took place (lectures at one end, posters at the other).

I did have to select some posters for a “methods” session at the end of the conference. I was not given a list of posters that had been selected to present at other sessions, and so I kept ending up picking posters of people who were already presenting. I did eventually manage to find three good posters that had not already been selected.

At least half of each afternoon was free time (to make up for sessions running late into the night), and on Monday I went to the beach. I took dozens of pictures, but I think I like best this picture of a gull drinking. I tried to identify the gull from information on the web, and I believe it is a juvenile Glaucous-winged Gull, but I could well be wrong.

Coming back from the beach, I liked the way these cypress trees framed the path.

There was also a beautiful sunset on the way to dinner that I wish I had taken more time to stop and admire. It would certainly have been more worthwhile than the food.
