Gas station without pumps

2011 November 23

Stanford’s on-line courses reviewed

Along with thousands of other people, I was interested in Stanford’s experiment in online computer science courses (see my posts Stanford Engineering Everywhere, Mark Guzdial doubts AI course is real).  I just read an interesting review of them by “Puzzler”: Thoughts on Programming: Review of 2011 free Stanford online classes. In brief, Puzzler believes that the database and machine-learning courses were successful and that the AI course (which got the most publicity) was a failure (both in content and in delivery mode).  Puzzler felt that the database course was accessible to high schoolers, but that the machine-learning course required more math than most high schools offered.

The machine-learning course might be a good one for my son to take next year, though, to broaden the knowledge of machine learning that he is getting this year as part of his home schooling. By then he’ll probably have enough calculus to be able to handle it—he already has enough linear algebra, I believe.

There are 12 Stanford Engineering Everywhere courses for Spring 2012: three intro to computer science, three AI (including the successful Machine Learning course), four linear systems and optimization courses from EE, and two other CS courses.  They will also be putting a number of seminars and webinars on the web.

2011 October 11

Optimal defocus estimation—possible explanation for evolution of astigmatism

Filed under: Uncategorized — gasstationwithoutpumps @ 14:46

I found this article, Optimal defocus estimation in individual natural images, rather interesting, despite being well outside my current field.

The basic idea is a simple one: compute how defocusing by various amounts affects the Fourier transforms of images, and use modern machine-learning techniques to learn filters for detecting the amount of defocus.  If the lens system is not perfect, one can get a lot of extra information, using astigmatism and chromatic aberration to make the defocus estimates more precise. This paper gave me some insight into why astigmatism has not been eliminated by natural selection.  I had always assumed that it was either not a serious enough defect to have strong selection against it or that it was linked to other, desirable traits that kept it around in the population.  This paper points out a possible selective advantage to astigmatism: it allows faster focusing in low-light conditions (when the chromatic aberration signal may not be available).
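
To make the idea concrete, here is a rough sketch of that kind of pipeline (my own guess, not the authors’ method; the radially averaged power-spectrum features and the random-forest classifier are illustrative choices):

    # Hypothetical sketch: estimate the defocus level of an image patch from its
    # power spectrum.  The features and classifier are illustrative guesses, not
    # the method from the paper.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def radial_power_spectrum(patch, n_bins=32):
        """Radially averaged log power spectrum of a square grayscale patch."""
        power = np.abs(np.fft.fftshift(np.fft.fft2(patch))) ** 2
        h, w = patch.shape
        y, x = np.indices((h, w))
        r = np.hypot(y - h / 2, x - w / 2)
        bins = np.linspace(0, r.max(), n_bins + 1)
        which = np.clip(np.digitize(r.ravel(), bins) - 1, 0, n_bins - 1)
        counts = np.bincount(which, minlength=n_bins)
        sums = np.bincount(which, weights=power.ravel(), minlength=n_bins)
        return np.log(sums / np.maximum(counts, 1) + 1e-12)

    def train_defocus_classifier(patches, defocus_labels):
        """Train on patches whose defocus is known (e.g., sharp images blurred
        by known amounts); defocus_labels gives the discrete defocus level."""
        X = np.array([radial_power_spectrum(p) for p in patches])
        clf = RandomForestClassifier(n_estimators=200)
        clf.fit(X, np.array(defocus_labels))
        return clf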

The paper did not go so far as to provide fast algorithms for autofocus using the new technique, but it looks like it might be feasible.  It would be nice to have an autofocus algorithm for digital cameras that did not require as much slow lens movement as the current ones do.

2011 September 24

On Theoretical/Computational Sciences

GMP (which is short for GeekMommyProf, not Gay Male Prof, as I initially thought) put out a call for entries on her blog, Academic Jungle: Call for Entries: Carnival on Theoretical/Computational Sciences, asking computational and theoretical scientists to write “a little bit about what your work entails, what you enjoy/dislike, what types of problems you tackle, what made you chose your specialization, etc.”  (Deadline: Sunday, Sept 25.)

As a bioinformatician, I certainly do computational work, but probably not in the sense that GMP means.  My computational work is not theoretical work, nor is it modeling of physical processes.  Bioinformatics is not a model-driven field the way physics is.  (In fact, the very lumping together of theoretical and computational work implies to me that GMP is a physicist or chemist, though her blogger profile is deliberately vague about the field.)

To give an example of data-driven computational work that is not model-based, consider predicting the secondary structure of a protein.  In the simplest form, we are trying to predict, for each amino acid in a protein, whether it is part of an α-helix, a β-strand, or neither.  The best prediction methods are not mechanistic models based on physics or chemistry (those have been terribly unsuccessful—not much better than chance performance).  Instead, machine-learning methods based on neural nets or random forests are trained on thousands of proteins.  These classifiers get fairly high accuracy without having anything in them remotely resembling an explanation for how they work. (Actually, almost all the different machine-learning methods have been applied to this classification problem, but neural nets and random forests have had the best performance, perhaps because those methods rely least on any prior understanding of the underlying phenomena.)
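
For concreteness, here is a minimal sketch of such a window-based secondary-structure classifier.  The window size, the one-hot encoding, and the use of a random forest are my own illustrative choices, not any particular published method.

    # Toy window-based secondary-structure predictor: classify each residue as
    # helix (H), strand (E), or coil (C) from a one-hot-encoded sequence window.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    def encode_windows(sequence, half_window=7):
        """One-hot encode a sliding window of residues around each position."""
        n_aa = len(AMINO_ACIDS)
        width = 2 * half_window + 1
        features = np.zeros((len(sequence), width * n_aa))
        for i in range(len(sequence)):
            for j in range(-half_window, half_window + 1):
                pos = i + j
                if 0 <= pos < len(sequence) and sequence[pos] in AA_INDEX:
                    features[i, (j + half_window) * n_aa + AA_INDEX[sequence[pos]]] = 1
        return features

    def train_ss_predictor(proteins):
        """proteins: list of (sequence, labels) pairs, labels being a string of
        'H', 'E', or 'C' for each residue."""
        X = np.vstack([encode_windows(seq) for seq, _ in proteins])
        y = np.concatenate([list(labels) for _, labels in proteins])
        clf = RandomForestClassifier(n_estimators=300)
        clf.fit(X, y)
        return clf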

It is rare in bioinformatics that we get to build models that explain how things work.  Instead we rely on measuring the predictive accuracy of “black-box” predictors, where we can control the inputs and look at the outputs, but the workings inside are not accessible.

We judge the models based on what fraction of the answers they get right on real data, but we have to be very careful to keep training data and testing data disjoint. It is easy to get perfect accuracy on examples you’ve already seen, but that tells us nothing about how the method would perform on genuinely new examples.  Physicists routinely have such confidence in the correctness of their models that they set their parameters using all the data—something that would be regarded as tantamount to fraud in bioinformatics.

Biological data is not randomly sampled (the usual assumption made in almost all machine-learning theory).  Instead, we have huge sample biases due to biologists researching what interests the community or due to limitations of experimental techniques.  As an example of intensity of interest, of the 15 million protein sequences in the “non-redundant” protein database, 490,000 (3%) are from HIV strains, though there are only a dozen HIV proteins.  As an example of limitations of experiments, there are fewer than 900 structures of membrane proteins out of the 70,000 structures in PDB, even though membrane proteins make up 25–35% of the proteins in most genomes.  Of those membrane-protein structures, there are only about 300 distinct proteins (PDB has a lot of “redundancy”—different structures of the same protein under slightly different conditions).

One of the clearest signs of a “newbie” paper in bioinformatics is insufficient attention to making sure that the cross-validation tests are clean.  It is necessary to remove duplicate and near-duplicate data, or at least ensure that the training data and the testing data have no close pairs.  Otherwise the results of the cross-validation experiment do not provide any information about how well the method would work on data that has not been seen before, which is the whole point of doing cross-validation testing.  Whenever I see a paper that gets astonishingly good results, the first thing I do is to check how they created their testing data—almost always I find that they have neglected to take the standard precautions against fooling themselves.
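
Here is a toy sketch of that precaution: filter out near-duplicate sequences before making cross-validation folds.  The crude pairwise-identity check below stands in for the clustering tools (CD-HIT, BLAST-based filtering) that would be used in practice, and the 30% identity threshold is a common rule of thumb, not a universal standard.

    # Toy redundancy filtering before cross-validation, so that no test sequence
    # has a near-duplicate in the training folds.
    import random

    def fraction_identical(a, b):
        """Very crude identity measure: matching positions over the shorter length
        (real pipelines would align the sequences first)."""
        n = min(len(a), len(b))
        if n == 0:
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / n

    def remove_near_duplicates(sequences, threshold=0.30):
        """Greedily keep only sequences below the identity threshold to all kept ones."""
        kept = []
        for seq in sequences:
            if all(fraction_identical(seq, k) < threshold for k in kept):
                kept.append(seq)
        return kept

    def make_folds(sequences, n_folds=5, seed=0):
        """Split the redundancy-reduced set into disjoint cross-validation folds."""
        nonredundant = remove_near_duplicates(list(sequences))
        random.Random(seed).shuffle(nonredundant)
        return [nonredundant[i::n_folds] for i in range(n_folds)]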

I came into bioinformatics through a rather indirect route: B.S. and M.S. in mathematics, then Ph.D. in computer science, then 10 years as a computer engineering professor, teaching VLSI design and doing logic minimization.  I left the field of logic minimization, because it was dominated by one research group, and all the papers in the field were edited and reviewed by members of that research group.  I thought I could compete on quality of research, but when they started rejecting my papers and later publishing parts of them under their own names, I knew I had to leave the field (which, thanks largely to their driving out other research groups, has made almost no progress in the 15 years since).  While I was looking for a new field, I happened to have an office next to a computer scientist who was just moving from machine learning to bioinformatics.  I joined him in that switch and we founded a new department.

I had to learn a lot of new material to work in my new field (Bayesian statistics, machine learning, basic biology, biochemistry, protein structure, … ).  I ended up taking courses (some undergrad, but mostly grad courses) in all those fields.  Unlike some other faculty trying to switch fields, I had no grant funding for time off to learn the new field, nor any reduction in my teaching load.  Since I have hit a dry spell in funding, I am on sabbatical this year, looking at switching fields again (still within bioinformatics, though).

If I were to do my training and career over again, I would not change much—I would perhaps switch out of logic minimization sooner (I hung on until I got tenure, despite the clear message that the dominant group felt that the field wasn’t big enough for competitors), and I would take more science classes as an undergrad (as a math major, I didn’t need to take science, and I didn’t).  I would also take more engineering classes, both as an undergrad and as a grad student. I’d also take statistics much sooner (neither math nor computer science students routinely take statistics, and both should).  I’d also up the number of courses I took as a professor, to a steady one a year.  (I was doing that for a while, but for the past few years I’ve stopped, for no good reason.)

2011 February 3

Bad Math, Bad Thinking by Devlin

Filed under: Uncategorized — gasstationwithoutpumps @ 22:13

Mathematician Keith Devlin of Stanford University rants about Body-Mass Index and DNA identification in his monthly column Bad Math, Bad Thinking: the BMI and DNA Identification Revisited.

While I have some problems with BMI as a measure, I don’t think that Devlin’s arguments are sound. Oh, the math is ok, but the interpretation is wrong.  His previous post Do You Believe in Fairies, Unicorns, or the BMI? makes a much better case against over-interpretation of BMI.  In that earlier column he pointed out that muscle, bone, and fat have different densities, and so a short, muscular person could end up with a high BMI, even if they were distinctly not fat.  But in the new column, he assumes that density is constant across all people.

The point of BMI, as Devlin pointed out in his earlier column, was to summarize trends in populations: how does weight vary with height?  Using BMI as a single normalized measure of obesity is ok for populations but fails in many individual cases, and any single measure is likely to fail that way.  What should be done is to gather measurements (height, weight, waist size, age, gender, and anything else believed to be relevant) and longevity numbers (or whatever else you are trying to predict) from a huge number of people, and build models using good machine-learning techniques.  The models can be tested by cross-validation (which the machine-learning community now understands well), and their predictive power can be compared with the single predictors that MDs use (BMI, waist size, cholesterol level, … ).
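
As a rough sketch of the comparison I have in mind (the feature set, outcome variable, and models below are placeholders, not a real study design), one could measure cross-validated predictive power for BMI alone versus a model trained on several measurements:

    # Compare cross-validated predictive power of BMI alone against a model that
    # uses several measurements.  All names here are placeholders.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def compare_predictors(height_m, weight_kg, waist_cm, age, sex, outcome):
        """Return mean cross-validated R^2 for BMI alone and for a multi-feature
        model.  Inputs are 1-D numpy arrays, one entry per person; outcome is
        whatever is being predicted (longevity, say)."""
        bmi = weight_kg / height_m ** 2          # BMI = weight (kg) / height (m) squared
        X_bmi = bmi.reshape(-1, 1)
        X_all = np.column_stack([height_m, weight_kg, waist_cm, age, sex])

        bmi_score = cross_val_score(LinearRegression(), X_bmi, outcome, cv=5).mean()
        full_score = cross_val_score(RandomForestRegressor(n_estimators=200),
                                     X_all, outcome, cv=5).mean()
        return bmi_score, full_score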

Similarly, his arguments about DNA identification are ok, but he fails to explain how to interpret DNA identification data properly, leaving the reader with the vague statement “That elusive, crucial concept the truth lies somewhere between those two extremes,” where the extremes are 1 in a billion and 50%.  He also pretends that the problem is one of comparing all-against-all in a database of a million DNA profiles.

More realistically, what we need to know as a society is the expected number of false positives per year, which can be estimated from the number of queries, the probability of any particular false match, and the number of profiles in the database.  It is reasonable to debate how many false-positive matches are acceptable, given the other protections against false accusation in the criminal justice system.  There are also some inherent limitations on how accurate DNA matching can be: identical twins are one obvious source of false positives, which Devlin’s simplistic reasoning does not account for.
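
To make that arithmetic concrete (the numbers below are made up for illustration, not real forensic figures), the expected number of coincidental matches per year is roughly the number of queries times the number of profiles times the per-comparison false-match probability:

    # Back-of-the-envelope estimate of coincidental DNA matches per year.
    # All three numbers are made up for illustration.
    queries_per_year = 50_000          # crime-scene profiles searched per year
    database_size = 1_000_000          # profiles in the database
    false_match_probability = 1e-9     # chance an unrelated profile matches by coincidence

    expected_false_positives = (queries_per_year * database_size
                                * false_match_probability)
    print(expected_false_positives)    # 50.0 coincidental matches expected per year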
