Gas station without pumps

2011 September 24

On Theoretical/Computational Sciences

GMP (which is short for GeekMommyProf, not Gay Male Prof, as I initially thought) put out a call for blog entries on her blog Academic Jungle: Call for Entries: Carnival on Theoretical/Computational Sciences, asking computational and theoretical scientists to write “a little bit about what your work entails, what you enjoy/dislike, what types of problems you tackle, what made you chose your specialization, etc.” (Deadline: Sunday, Sept 25.)

As a bioinformatician, I certainly do computational work, but probably not in the sense that GMP means.  My computational work is not theoretical, nor does it involve modeling physical processes with computational models.  Bioinformatics is not a model-driven field the way physics is.  (In fact, the very lumping together of theoretical and computational work implies to me that GMP is a physicist or chemist, though her blogger profile is deliberately vague about the field.)

To give an example of data-driven computational work that is not model-based, consider the problem of predicting the secondary structure of a protein.  In its simplest form, the task is to predict, for each amino acid in a protein, whether it is part of an α-helix, a β-strand, or neither.  The best prediction methods are not mechanistic models based on physics or chemistry (those have been terribly unsuccessful—not much better than chance performance).  Instead, machine-learning methods based on neural nets or random forests are trained on thousands of proteins.  These classifiers get fairly high accuracy, without having anything in them remotely resembling an explanation for how they work. (Actually, almost all the different machine-learning methods have been applied to this classification problem, but neural nets and random forests have had the best performance, perhaps because those methods rely least on any prior understanding of the underlying phenomena.)
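To give a concrete feel for this kind of black-box prediction, here is a minimal sketch (not any published predictor; real methods use much richer inputs, such as evolutionary profiles, and far more training data): a random forest trained on one-hot-encoded sliding windows of residues. The sequences, labels, and helper names here are all made up for illustration.

```python
# Toy sketch of per-residue secondary-structure prediction
# (H = helix, E = strand, C = coil) with a random forest over
# a sliding window of one-hot-encoded amino acids.
# All data here is synthetic; no real predictor works from so little.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def window_features(seq, center, half_width=3):
    """One-hot encode a window of residues around `center`;
    positions off either end of the sequence stay all-zero."""
    feats = np.zeros((2 * half_width + 1, len(AMINO_ACIDS)))
    for k, pos in enumerate(range(center - half_width, center + half_width + 1)):
        if 0 <= pos < len(seq):
            feats[k, AA_INDEX[seq[pos]]] = 1.0
    return feats.ravel()

def featurize(sequences, labels):
    X, y = [], []
    for seq, lab in zip(sequences, labels):
        for i in range(len(seq)):
            X.append(window_features(seq, i))
            y.append(lab[i])
    return np.array(X), np.array(y)

# Toy training set: alanine-rich stretches labeled helix,
# valine-rich stretches labeled strand, glycines as coil.
train_seqs = ["AAAAAAAGGVVVVVV", "VVVVVGGAAAAAAA"]
train_labs = ["HHHHHHHCCEEEEEE", "EEEEECCHHHHHHH"]

X_train, y_train = featurize(train_seqs, train_labs)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Predict on a sequence the forest has not seen.
test_seq = "AAAAAAVVVVVV"
pred = [clf.predict([window_features(test_seq, i)])[0] for i in range(len(test_seq))]
print("".join(pred))
```

Nothing in the fitted forest "explains" helix formation; it is purely a statistical mapping from windows of residues to labels, which is exactly the black-box character described above.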

It is rare in bioinformatics that we get to build models that explain how things work.  Instead we rely on measuring the predictive accuracy of “black-box” predictors, where we can control the inputs and look at the outputs, but the workings inside are not accessible.

We judge the models based on what fraction of the answers they get right on real data, but we have to be very careful to keep training data and testing data disjoint. It is easy to get perfect accuracy on examples you’ve already seen, but that tells us nothing about how the method would perform on genuinely new examples.  Physicists routinely have such confidence in the correctness of their models that they set their parameters using all the data—something that would be regarded as tantamount to fraud in bioinformatics.
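A toy demonstration of the danger: a classifier that effectively memorizes its training data scores perfectly on examples it has already seen, even when the labels carry no signal at all. The data below are synthetic random numbers, chosen precisely so that there is nothing real to learn.

```python
# Why training-set accuracy is meaningless: a 1-nearest-neighbor
# classifier "memorizes" its training set, so it scores perfectly on
# examples it has already seen, while on genuinely new examples
# (here, with purely random labels) it does no better than chance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_train = rng.integers(0, 2, size=200)   # labels are pure noise

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # exactly 1.0: each point
                                         # is its own nearest neighbor
test_acc = clf.score(rng.normal(size=(200, 10)),
                     rng.integers(0, 2, size=200))  # near 0.5

print(train_acc, round(test_acc, 2))
```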

Biological data is not randomly sampled (the usual assumption in almost all machine-learning theory).  Instead, we have huge sample biases, due to biologists researching what interests the community or due to limitations of experimental techniques.  As an example of intensity of interest: of 15 million protein sequences in the “non-redundant” protein database, 490,000 (3%) are from strains of HIV, though there are only a dozen HIV proteins.  As an example of experimental limitations: there are fewer than 900 structures of membrane proteins out of the 70,000 structures in PDB, even though membrane proteins make up 25–35% of proteins in most genomes.  Of those membrane protein structures, there are only about 300 distinct proteins (PDB has a lot of “redundancy”—different structures of the same protein under slightly different conditions).

One of the clearest signs of a “newbie” paper in bioinformatics is insufficient attention to making sure that the cross-validation tests are clean.  It is necessary to remove duplicate and near-duplicate data, or at least ensure that the training data and the testing data have no close pairs.  Otherwise the results of the cross-validation experiment do not provide any information about how well the method would work on data that has not been seen before, which is the whole point of doing cross-validation testing.  Whenever I see a paper that gets astonishingly good results, the first thing I do is to check how they created their testing data—almost always I find that they have neglected to take the standard precautions against fooling themselves.
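The precaution can be sketched as a greedy redundancy filter applied before the data is split. The identity measure below (exact positional matches between equal-length strings) is a deliberate oversimplification; real pipelines compute sequence similarity with alignment-based tools such as BLAST or CD-HIT, but the filtering logic is the same.

```python
# Sketch of the standard precaution: before splitting data for
# cross-validation, greedily discard any sequence too similar to one
# already kept, so no near-duplicate pair can straddle the
# train/test boundary.  The identity measure here is a toy
# (position-by-position matches); real pipelines use alignments.
def identity(a, b):
    """Fraction of positions at which two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def remove_redundant(seqs, max_identity=0.4):
    kept = []
    for s in seqs:
        if all(identity(s, k) <= max_identity for k in kept):
            kept.append(s)
    return kept

seqs = ["ACDEFGHIKL",
        "ACDEFGHIKV",   # 90% identical to the first: dropped
        "LKIHGFEDCA",
        "WWWWWWWWWW"]
print(remove_redundant(seqs))
# → ['ACDEFGHIKL', 'LKIHGFEDCA', 'WWWWWWWWWW']
```

Only after this filtering (or an equivalent guarantee that no training sequence is close to any test sequence) do the cross-validation numbers say anything about performance on unseen data.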

I came into bioinformatics through a rather indirect route: B.S. and M.S. in mathematics, then Ph.D. in computer science, then 10 years as a computer engineering professor, teaching VLSI design and doing logic minimization.  I left the field of logic minimization, because it was dominated by one research group, and all the papers in the field were edited and reviewed by members of that research group.  I thought I could compete on quality of research, but when they started rejecting my papers and later publishing parts of them under their own names, I knew I had to leave the field (which, thanks largely to their driving out other research groups, has made almost no progress in the 15 years since).  While I was looking for a new field, I happened to have an office next to a computer scientist who was just moving from machine learning to bioinformatics.  I joined him in that switch and we founded a new department.

I had to learn a lot of new material to work in my new field (Bayesian statistics, machine learning, basic biology, biochemistry, protein structure, … ).  I ended up taking courses (some undergrad, but mostly grad courses) in all those fields.  Unlike some other faculty trying to switch fields, I had no grant funding for time off to learn the new field, nor any reduction in my teaching load.  Since I have hit a dry spell in funding, I am on sabbatical this year, looking at switching fields again (still within bioinformatics, though).

If I were to do my training and career over again, I would not change much—perhaps I would switch out of logic minimization sooner (I hung on until I got tenure, despite the clear message that the dominant group felt the field wasn’t big enough for competitors), and I would take more science classes as an undergrad (as a math major, I didn’t need to take science, and I didn’t).  I would also take more engineering classes, both as an undergrad and as a grad student. I’d take statistics much sooner (neither math nor computer science students routinely take statistics, and both should).  I’d also keep up the number of courses I took as a professor, at a steady one a year.  (I was doing that for a while, but for the past few years I’ve stopped, for no good reason.)



  1. What area of bioinformatics are you switching to? And since you had to teach yourself a lot of statistics, do you have good resources to recommend? I am in the same boat, since I am switching research focus, but with a 4/4 teaching load and 3 small kids, I don’t have time to take a course.

    Comment by bkm — 2011 September 26 @ 05:26 | Reply

    • Undecided what I’m switching to. See my post on my plans for sabbatical leave. Hmm, it has been 3 months, so I probably should do a progress report.

      I don’t have a book to recommend. The Bayesian statistics course I took used a draft book by David Draper that I don’t think he has published yet. It might not be suitable for self-study in any case—Draper is an excellent teacher, which may have hidden any flaws the book has as a self-study text.

      I’ve found that I rarely have the self-discipline to learn a whole new field without a course to provide some structure. I can read up on a single topic (like principal components analysis or random forests) on my own, but I need a course (either taking one or teaching one) to structure learning about a bigger topic. For quick tutorials on single topics, I’ve found the Wikipedia articles to be pretty good. I’ve also contributed to Wikipedia in a small way, correcting errors and improving a few articles. (I urge other professors to do this also, but few bother to take the time, despite the fact that their students are relying on Wikipedia for survey-article information.)

      I don’t know what I would do if I had a 4/4 teaching load and small kids if I needed to switch fields. That looks like a rather overwhelming load.

      Comment by gasstationwithoutpumps — 2011 September 26 @ 07:55 | Reply

  2. Thank you so very much for contributing!

    the very lumping together of theoretical and computational work implies to me that GMP is a physicist or chemist
    :-) True, my background is in physics but also electrical engineering, and I do research at the very physics end of EE (or the applied end of physics). While I agree that theoretical physics is largely model-based, this is no longer universal; there are whole areas, especially in condensed matter, where data mining is nowadays the modus operandi. I grouped the two together because a strong mathematical background is needed to be either a theorist or a computational scientist, in contrast to most bench or otherwise hands-on work. But, like any other classification, it has limited merits.

    Comment by GMP — 2011 September 26 @ 16:12 | Reply

  3. I’d appreciate help for my daughter to learn neural networks. She has taken AP CS and AB Calculus. Is this sufficient background to learn neural networks? She’s interested in applying neural networks to disease modeling or diagnosis. What background course work should she do to prepare to learn neural networks? Thank you!

    Comment by EWong — 2013 April 22 @ 21:53 | Reply

    • The main thing you need for neural networks is matrix math (linear algebra) and the ability to apply the chain rule in matrix computations. If she had matrices in pre-calc, she probably has enough math to do neural nets with just Calculus AB. Some presentations will assume multi-variable differential calculus (using gradients), so it might be easier to wait until she has that, but with sufficient motivation, Calculus AB and matrix multiplication suffice.
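      To show how little machinery is involved, here is a small sketch: a one-hidden-layer network, with the gradient derived by the chain rule on the matrix expressions and checked against a finite-difference estimate. The data are random numbers, purely for illustration.

```python
# Bare-bones illustration of the matrix math and chain rule behind
# neural-network training: one hidden layer, squared-error loss, and
# a hand-derived gradient checked against a finite-difference estimate.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))     # 5 examples, 3 inputs
y = rng.normal(size=(5, 1))     # 5 targets
W1 = rng.normal(size=(3, 4))    # input -> hidden weights
W2 = rng.normal(size=(4, 1))    # hidden -> output weights

def forward(W1, W2):
    H = np.tanh(X @ W1)          # hidden activations
    out = H @ W2                 # linear output layer
    return H, out

def loss(W1, W2):
    _, out = forward(W1, W2)
    return 0.5 * np.sum((out - y) ** 2)

# Backpropagation is just the chain rule applied to the matrices above.
H, out = forward(W1, W2)
d_out = out - y                          # dL/d_out
dW2 = H.T @ d_out                        # dL/dW2
d_H = (d_out @ W2.T) * (1 - H ** 2)      # chain through tanh
dW1 = X.T @ d_H                          # dL/dW1

# Finite-difference check of one entry of dW1.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (loss(W1p, W2) - loss(W1, W2)) / eps
print(abs(numeric - dW1[0, 0]) < 1e-4)   # prints True
```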

      Comment by gasstationwithoutpumps — 2013 April 23 @ 01:03 | Reply

      • Thanks much! Do you have any resources on learning neural networks you’d recommend for self-study?

        Comment by EWong — 2013 April 23 @ 18:59 | Reply

        • No. There are undoubtedly many books, web sites, and video tutorials on neural nets, but I’ve not looked at any of them in the past decade, so I have nothing to recommend.

          Comment by gasstationwithoutpumps — 2013 April 23 @ 20:30 | Reply

          • Thanks for responding so promptly. Appreciate your blog posts!

            Comment by EWong — 2013 April 24 @ 21:55 | Reply

  4. […] Gasstationwithoutpumps contributed a post on his work as a bioinformatician. In contrast to, for instance, computational physics where typically one uses a computer to solve a mathematical model, his work is not model-driven but data-driven. He says “It is rare in bioinformatics that we get to build models that explain how things work. Instead we rely on measuring the predictive accuracy of “black-box” predictors, where we can control the inputs and look at the outputs, but the workings inside are not accessible.” He also talks about the research path he took to his current field. […]

    Pingback by Back to the Jungle: Theory and Computation Carnival (Repost) | xykademiqz — 2015 December 10 @ 13:41 | Reply
