GMP (which is short for GeekMommyProf, not Gay Male Prof, as I initially thought) put out a call for blog entries on her blog Academic Jungle ("Call for Entries: Carnival on Theoretical/Computational Sciences"), asking computational and theoretical scientists to write "a little bit about what your work entails, what you enjoy/dislike, what types of problems you tackle, what made you chose your specialization, etc." (deadline: Sunday, Sept 25).
As a bioinformatician, I certainly do computational work, but probably not in the sense that GMP means. My computational work is not theoretical work, nor does it involve modeling physical processes computationally. Bioinformatics is not a model-driven field the way physics is. (In fact, the very lumping together of theoretical and computational work implies to me that GMP is a physicist or chemist, though her blogger profile is deliberately vague about her field.)
To give an example of data-driven computational work that is not model-based, consider the example of predicting the secondary structure of a protein. In the simplest form, we are trying to predict for each amino acid in a protein whether it is part of an α-helix, a β-strand, or neither. The best prediction methods are not mechanistic models based on physics or chemistry (those have been terribly unsuccessful—not much better than chance performance). Instead, machine learning methods based on neural nets or random forests are trained on thousands of proteins. These classifiers get fairly high accuracy, without having anything in them remotely resembling an explanation for how they work. (Actually, almost all the different machine-learning methods have been applied to this classification problem, but neural nets and random forests have had the best performance, perhaps because those methods rely least on any prior understanding of the underlying phenomena.)
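To make that concrete, here is a minimal sketch of the standard windowed approach: each residue is encoded as a one-hot window of its sequence neighborhood, and a random forest is trained to predict a per-residue label (H = helix, E = strand, C = other). This is an illustration of the general technique, not any particular published predictor; the helper names, the toy sequences, and their label strings are made up for the example, and a real training set would contain thousands of proteins with labels derived from solved structures.

```python
# Sketch of per-residue secondary-structure prediction with a random forest.
# Toy data only; helper names and sequences are invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
WINDOW = 7  # residues of context on each side of the position being predicted

def window_features(seq, pos, window=WINDOW):
    """One-hot encode residues in [pos - window, pos + window];
    positions falling outside the sequence stay all-zero."""
    feats = np.zeros((2 * window + 1, len(AMINO_ACIDS)))
    for k, j in enumerate(range(pos - window, pos + window + 1)):
        if 0 <= j < len(seq) and seq[j] in AA_INDEX:
            feats[k, AA_INDEX[seq[j]]] = 1.0
    return feats.ravel()

def encode(sequences, labels):
    """Turn (sequence, label-string) pairs into one training row per residue."""
    X, y = [], []
    for seq, lab in zip(sequences, labels):
        for pos in range(len(seq)):
            X.append(window_features(seq, pos))
            y.append(lab[pos])
    return np.array(X), np.array(y)

# Placeholder data: a real training set would be thousands of proteins with
# per-residue labels derived from experimentally solved structures.
train_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GSSGSSGENLYFQGHM"]
train_labs = ["CHHHHHHHHHHHHCCEEEEECCHHHHHHHHCCC", "CCCCCCCCEEEECCCC"]

X_train, y_train = encode(train_seqs, train_labs)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Predict an H/E/C label for each residue of a new sequence.
test_seq = "MSHHWGYGKHNGPEHWHKDFPIAK"
pred = clf.predict(np.array([window_features(test_seq, p) for p in range(len(test_seq))]))
print("".join(pred))
```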
It is rare in bioinformatics that we get to build models that explain how things work. Instead we rely on measuring the predictive accuracy of “black-box” predictors, where we can control the inputs and look at the outputs, but the workings inside are not accessible.
We judge the models based on what fraction of the answers they get right on real data, but we have to be very careful to keep training data and testing data disjoint. It is easy to get perfect accuracy on examples you’ve already seen, but that tells us nothing about how the method would perform on genuinely new examples. Physicists routinely have such confidence in the correctness of their models that they set their parameters using all the data—something that would be regarded as tantamount to fraud in bioinformatics.
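Here is a tiny synthetic illustration of why accuracy on already-seen examples tells us nothing: a flexible classifier can memorize pure noise and look perfect on its training set, while doing no better than chance on held-out examples. The data are random by construction; nothing here is specific to any real prediction task.

```python
# Training accuracy vs. held-out accuracy on data with no signal at all.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))     # features with no real signal in them
y = rng.integers(0, 2, size=500)   # labels unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("accuracy on the training data:", clf.score(X_train, y_train))  # essentially perfect
print("accuracy on held-out data:", clf.score(X_test, y_test))        # about chance (0.5)
```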
Biological data is not randomly sampled (independent, identically distributed sampling being the usual assumption in almost all machine-learning theory). Instead, we have huge sample biases, due either to biologists researching what interests the community or to limitations of experimental techniques. As an example of intensity of interest, of the 15 million protein sequences in the “non-redundant” protein database, 490,000 (3%) are from HIV virus strains, even though there are only a dozen HIV proteins. As an example of experimental limitations, there are fewer than 900 structures of membrane proteins among the 70,000 structures in PDB, even though membrane proteins make up 25–35% of the proteins in most genomes. Of those membrane-protein structures, only about 300 are distinct proteins (PDB has a lot of “redundancy”: different structures of the same protein under slightly different conditions).
One of the clearest signs of a “newbie” paper in bioinformatics is insufficient attention to making sure that the cross-validation tests are clean. It is necessary to remove duplicate and near-duplicate data, or at least ensure that the training data and the testing data have no close pairs. Otherwise the results of the cross-validation experiment do not provide any information about how well the method would work on data that has not been seen before, which is the whole point of doing cross-validation testing. Whenever I see a paper that gets astonishingly good results, the first thing I do is to check how they created their testing data—almost always I find that they have neglected to take the standard precautions against fooling themselves.
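One common way to keep the folds clean (a sketch of the general idea, not a prescription) is to cluster sequences by similarity first (tools such as CD-HIT are often used for this) and then split by cluster rather than by individual sequence, so that near-duplicates can never end up on both sides of the split. The features, labels, and cluster ids below are synthetic placeholders standing in for real data.

```python
# Cluster-aware cross-validation: whole clusters go to either training or
# testing, so near-duplicate sequences never straddle the split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

# Synthetic placeholders: in practice X, y, and the cluster ids would come
# from real sequences, real labels, and a similarity-based clustering.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 40))            # one feature row per sequence
y = rng.integers(0, 2, size=n)          # one label per sequence
clusters = rng.integers(0, 60, size=n)  # cluster id per sequence (near-duplicates share an id)

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=clusters):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print("cross-validated accuracy:", np.mean(scores))
```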
I came into bioinformatics through a rather indirect route: B.S. and M.S. in mathematics, then a Ph.D. in computer science, then 10 years as a computer engineering professor, teaching VLSI design and doing logic minimization. I left the field of logic minimization because it was dominated by one research group, and all the papers in the field were edited and reviewed by members of that group. I thought I could compete on quality of research, but when they started rejecting my papers and later publishing parts of them under their own names, I knew I had to leave the field (which, thanks largely to their driving out other research groups, has made almost no progress in the 15 years since). While I was looking for a new field, I happened to have an office next to a computer scientist who was just moving from machine learning to bioinformatics. I joined him in that switch, and we founded a new department.
I had to learn a lot of new material to work in my new field (Bayesian statistics, machine learning, basic biology, biochemistry, protein structure, … ). I ended up taking courses (some undergrad, but mostly grad courses) in all those fields. Unlike some other faculty trying to switch fields, I had no grant funding for time off to learn the new field, nor any reduction in my teaching load. Since I have hit a dry spell in funding, I am on sabbatical this year, looking at switching fields again (still within bioinformatics, though).
If I were to do my training and career over again, I would not change much. Perhaps I would switch out of logic minimization sooner (I hung on until I got tenure, despite the clear message that the dominant group felt the field wasn't big enough for competitors), and I would take more science classes as an undergrad (as a math major, I didn't need to take science, and I didn't). I would also take more engineering classes, both as an undergrad and as a grad student. I'd take statistics much sooner (neither math nor computer science students routinely take statistics, and both should). And I'd keep taking courses as a professor, at a steady one a year. (I was doing that for a while, but for the past few years I've stopped, for no good reason.)