Gas station without pumps

2013 May 16

Snarky critiques

Filed under: Uncategorized — gasstationwithoutpumps @ 09:07

I just read a marvelously snarky critique of the ENCODE papers (which most of the bioinformaticians I know considered flawed in their overestimates of how much of the human genome is “functional”).  Perhaps the best of the critiques is this one: On the immortality of television sets: “function” in the human genome according to the evolution-free gospel of ENCODE.

The article accuses the ENCODE authors of several academic sins:

Oddly, ENCODE not only uses the wrong concept of functionality, it uses it wrongly and inconsistently.

Sadly, the authors of ENCODE decided to disregard evolutionary conservation as a criterion for identifying function.

Some of their comments are marvelously snarky:

According to Eric Lander, a Human Genome Project luminary, ENCODE is the “Google Maps of the human genome” (Durbin et al. 2010). We beg to differ, ENCODE is considerably worse than even Apple Maps.

The article provides solid reasoning for why the estimate that about 80% of the genome is functional is completely bogus, and provides more reasonable estimates:

Ward and Kellis (2012) confirmed that ~5% of the genome is interspecifically conserved, and by using intraspecific variation, found evidence of lineage-specific constraint suggesting that an additional 4% of the human genome is under selection (i.e., functional), bringing the total fraction of the genome that is certain to be functional to approximately 9%. The journal Science used this value to proclaim “No More Junk DNA” (Hurtley 2012), thus, in effect rounding up 9% to 100%.

The ENCODE project produced a lot of good data, but some of the hype surrounding it irritated a lot of biologists and bioinformaticians, who are pleased to see the ENCODE hype so amusingly and accurately skewered.

What bioinformaticians do

Filed under: Uncategorized — gasstationwithoutpumps @ 08:33

I recently read two blog posts about what bioinformaticians do (though both claim to be about “what it takes”):

The first post is talking about a shift from “bioinformatics” to “computational biology”—that is, a shift from designing algorithms and data structures to answer biological questions to asking biological questions for which computational tools already exist.  It has quotes with some hype about job opportunities in bioinformatics, but it also has some counterpoints about more realistic views of the bioinformatics job market.  The tone of the piece overall is that bioinformatics is the best of all possible fields.

The second post has a less exalted view of bioinformatics, pointing out that most bioinformatics jobs are data wrangling.  They do say that even data wranglers can do research if they want to, which makes them better off than most wet-lab technicians.

Both posts stress the importance of programming, statistics, and knowing some biology.

2013 April 29

Scientists need math

Filed under: Uncategorized — gasstationwithoutpumps @ 14:28

At the beginning of April (but not on April Fool’s Day), the Wall Street Journal published an essay by E.O. Wilson (a famous biologist): Great Scientists Don’t Need Math. The gist of the article is that Dr. Wilson never learned much math and did well in biology, so others can do so also:

Wilson’s Principle No. 1: It is far easier for scientists to acquire needed collaboration from mathematicians and statisticians than it is for mathematicians and statisticians to find scientists able to make use of their equations.

Wilson’s Principle No. 2: For every scientist, there exists a discipline for which his or her level of mathematical competence is enough to achieve excellence.

The first principle is probably true, but is more a sociological statement than one inherent to the disciplines: applied mathematicians and statisticians welcome collaborations with all sorts of scientists and are happy to learn about and work on real problems that come up elsewhere, while biologists (particularly old-school ones like Dr. Wilson) tend not to be interested in anything outside their own labs and those of their close collaborators and competitors.

The second principle is possibly also true, though much less so than in the past.  Biology used to be a major refuge for innumerate scientists, but modern biology requires a really strong foundation in statistics, far more than most biology students are trained in. The number of positions for innumerate scientists is rapidly shrinking, while the supply of innumerate biology PhDs is growing rapidly.  In the highly competitive job market for biology research, those who follow E. O. Wilson’s advice have a markedly smaller chance of getting the jobs they desire. Of course, Dr. Wilson seems to be unaware of the decades-long oversupply of biology researchers:

During my decades of teaching biology at Harvard, I watched sadly as bright undergraduates turned away from the possibility of a scientific career, fearing that, without strong math skills, they would fail. This mistaken assumption has deprived science of an immeasurable amount of sorely needed talent. It has created a hemorrhage of brain power we need to stanch.

An undergrad degree in biology (even from Harvard) has not gotten many students much more than low-level technician jobs for most of that time (admission to grad school is the better option, as biology PhDs have been able to get temporary postdoc positions at least).  Perhaps Dr. Wilson considers a dead-end job at little more than minimum wage a suitable scientific career—many others do not.

Dr. Wilson does make one unsubstantiated claim that I agree with:

The annals of theoretical biology are clogged with mathematical models that either can be safely ignored or, when tested, fail. Possibly no more than 10% have any lasting value. Only those linked solidly to knowledge of real living systems have much chance of being used.

Biology is a data-driven science, not a model-driven science (a distinction that physicists trying to jump into the field often miss).  Most of “mathematical biology” has been an attempt to apply physics-like models in places where they don’t really fit.  But there has been a big change in the past 10–15 years, as high-throughput experiments have become common in biology.  Now mathematics (mainly statistics) is needed to make any sense out of the experimental results, and biologists with inadequate training in statistics end up making ludicrously wrong conclusions from their experiments, often claiming high significance for random noise.  To understand the data requires more than Wilson’s “intuition”—it requires a solid understanding of the statistics of big data and multiple hypotheses, as humans are very good at perceiving patterns in random noise.
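To make the multiple-hypothesis point concrete, here is a minimal sketch in Python. The number of tests and the 0.05 threshold are assumptions chosen for illustration, not values from any particular experiment. It relies on the fact that p-values are uniformly distributed when the null hypothesis is true, so pure noise tested many times still yields plenty of "significant" results unless the threshold is corrected (here with a simple Bonferroni correction).

```python
import random

rng = random.Random(1)   # fixed seed so the sketch is repeatable

m = 20_000               # hypothetical number of hypotheses tested (e.g., one per gene)
alpha = 0.05             # conventional per-test significance threshold

# Under the null hypothesis (pure noise), p-values are uniform on [0, 1).
p_values = [rng.random() for _ in range(m)]

# Naive analysis: call anything with p < alpha "significant".
naive_hits = sum(p < alpha for p in p_values)

# Bonferroni correction: divide the threshold by the number of tests.
bonferroni_hits = sum(p < alpha / m for p in p_values)

print(f"expected false positives at p < {alpha}: {m * alpha:.0f}")
print(f"observed 'significant' results in pure noise: {naive_hits}")
print(f"results surviving Bonferroni correction: {bonferroni_hits}")
```

Roughly 5% of the noise-only tests pass the naive threshold, which is about a thousand spurious "discoveries" with these assumed numbers.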

I was pointed to Dr. Wilson’s WSJ essay by Iddo Friedberg’s post Terrible advice from a great scientist, which has a somewhat different critique of the essay. He accuses Wilson of “not recognizing the generalization from an outlier cannot serve as a viable model, or even an argument to support his position.”  Iddo makes several other points, some of them the same as mine—go read his post! Of course, like me, Dr. Friedberg is a bioinformatician and so sees the central role of statistics in 21st century biology.  Perhaps the two of us are wrong, and innumerate biologists will again have glorious scientific careers, but I think the odds are against it.

2012 December 10

Not normal, but what is it?

Filed under: Uncategorized — gasstationwithoutpumps @ 10:53

Robert W. Jernigan, in his “statpics” blog, often posts pictures of normal (and other) distributions from the real world, and almost equally often posts pictures that someone else claimed to be normal distributions that clearly aren’t.  In the post statpics: Not Normal Either and Why, he shows a picture of cup lid stacks that someone had labeled a normal distribution, when it was not a distribution at all.  He then shows a similar-looking picture of stacks of scallop shells:

Copied from Graphical Methods for Presenting Facts by  Willard Cope Brinton (1914), as scanned by Google in Google Books.

He blithely asserts that this isn’t a normal distribution either, since the number of ribs is discrete, though it is a nice pictorial representation of a histogram. I wanted him to go further, though, than just saying “not normal”, to talking about what sort of distributions the scallop shells might be from. Binomial? Poisson? How do you choose a good discrete distribution when you don’t know the underlying mechanism?

As a bioinformatician, I often need to come up with distributions for modeling inherently discrete random variables, and I find the statistical literature unhelpful—there seem to be a few well-studied examples, but little or no guidance for applications outside those few classic examples.  Perhaps I’m just too ignorant in the field to be finding the right sources.

For example, I often approximate length distributions for protein or DNA sequences with lognormal distributions.

Example of a lognormal distribution fit to some DNA sequence lengths from a simulation. The fit is even better with a million samples rather than 100,000. Real sequence-length data is not quite this clean, but often fits a lognormal distribution fairly well.

When I am looking at fitting distributions, I’m usually most interested in one of the tails (for computing p-value or E-value) rather than the peak of the distribution, so plotting probability on a log scale is very helpful.
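For readers who want to try this sort of fit, here is a minimal sketch in Python. The data are a hypothetical stand-in (integer-rounded draws from a lognormal with assumed parameters mu=6.0 and sigma=0.5), not the simulation behind the plot above. The fit simply takes the mean and standard deviation of the log lengths, which is the maximum-likelihood fit for a continuous lognormal, and then compares observed and fitted tail probabilities, since the tail is what matters for p-values.

```python
import math
import random
import statistics

def fit_lognormal(lengths):
    """Fit a lognormal by taking the mean and std. dev. of the log lengths."""
    logs = [math.log(x) for x in lengths]
    return statistics.fmean(logs), statistics.pstdev(logs)

def lognormal_tail(x, mu, sigma):
    """P(X >= x) for a continuous lognormal with parameters mu and sigma."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

rng = random.Random(0)
# Hypothetical stand-in data: discrete "sequence lengths" made by rounding
# lognormal draws. The parameters 6.0 and 0.5 are assumptions for illustration.
lengths = [max(1, round(rng.lognormvariate(6.0, 0.5))) for _ in range(100_000)]

mu, sigma = fit_lognormal(lengths)
print(f"fitted mu={mu:.3f} sigma={sigma:.3f}")

# Compare the upper tails; on real data this is where a poor fit shows up.
for x in (1000, 2000, 4000):
    observed = sum(1 for v in lengths if v >= x) / len(lengths)
    fitted = lognormal_tail(x, mu, sigma)
    print(f"length >= {x}: observed {observed:.3g}, fitted {fitted:.3g}")
```

With only 100,000 samples the far tail has very few observations, so the observed frequencies there are noisy. That is one reason to look at the fitted curve on a log scale rather than trusting raw counts.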

Now, lognormal distributions are inherently continuous, and the sequence lengths are inherently discrete, but I don’t know any discrete distributions that are approximately lognormal, in the way the binomial distributions are approximately normal. A lot of sequence-length distributions appear to be approximately lognormal, but I have no idea why. Even knowing how I generated the sequences in the plot above does not give me a clue why they should fit a lognormal distribution so well.

We also often use Gumbel distributions for the null model for the results of searches, because they are a limiting distribution for “max” in the way that normal distributions are a limiting distribution for “sum”. But the Gumbel distributions are continuous, and we are often taking maxes of nonnegative-integer-valued scores—what is the appropriate discrete analog for Gumbel distributions?

If anyone reading this blog is a statistician and can provide me pointers to understandable literature on choosing discrete distributions, or can tell me what discrete distributions might be approximately lognormal, I’d appreciate knowing about them.

2012 November 10

A probability question

Filed under: Uncategorized — gasstationwithoutpumps @ 21:59

Sam Shah, one of the math teacher bloggers that I read, posted a bioinformatics-related question on A biology question that is actually a probability question « Continuous Everywhere but Differentiable Nowhere:

Let’s say you have a sequence of 3 billion nucleotides. What is the probability that there is a sequence of 20 nucleotides that repeats somewhere in the sequence? You may assume that there are 4 nucleotides (A, C, T, G) and when coming up with the 3 billion nucleotide sequence, they are all equally likely to appear.

This is the sort of combinatorics question that comes up a lot in building null models for bioinformatics, when we want to know just how weird something we’ve found really is.

Of course, we usually end up asking for the expected number of occurrences of a particular event, rather than the probability of the event, since expected values are additive even when the events aren’t independent.  So let me change the problem to

In a sequence of N bases (independent, uniformly distributed), what is the expected number of repeated k-mers (k≪N)? Plug in N=3E9 and k=20.

The probability that any particular k-mer occurs in a particular position is 4^(-k), so the expected number of occurrences of that k-mer is N/4^k, or about 2.7E-3 for the values of N and k given. Oops, we should count both strands, so double that to 5.46E-3.

When the expected number is that small, we can use it equally well as the probability of there being one or more such k-mers. (Note: this assumes 4^k ≫ N.)
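The arithmetic for a single fixed k-mer can be checked with a few lines of Python. This is a sketch under the same assumptions as the text (independent, uniformly distributed bases, both strands counted). The function name is made up for illustration, and the conversion from expected count to probability uses a Poisson approximation, which is my assumption rather than something derived above.

```python
import math

def expected_occurrences(n_bases, k, both_strands=True):
    """Expected number of occurrences of one fixed k-mer in a uniform random sequence."""
    positions = n_bases - k + 1          # start positions; essentially N when k << N
    strands = 2 if both_strands else 1
    return strands * positions / 4**k

lam = expected_occurrences(3_000_000_000, 20)

# Poisson approximation (an assumption): P(at least one) = 1 - exp(-lam),
# which is nearly equal to lam itself when lam is small.
p_at_least_one = -math.expm1(-lam)

print(f"expected occurrences: {lam:.3g}")
print(f"P(at least one):      {p_at_least_one:.3g}")
```

The two numbers differ only by about lam squared over 2, which is why the expected count can stand in for the probability here.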

Now let’s look at each k-mer that actually occurs (all 2N of them), and estimate how many other k-mers match. There are roughly 2N/4^k for each (we can ignore little differences like N vs. N-1), so there are 4N^2/4^k total pairs. But we’ve counted each pair twice, so the expected number of pairs is only 2N^2/4^k, which is about 16E6 for N=3E9 and k=20.
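The pair count follows the same pattern. Here is a minimal sketch in Python that mirrors the reasoning step by step. The function name is hypothetical, and like the text it ignores small corrections such as N versus N-1.

```python
def expected_matching_pairs(n_bases, k):
    """Expected number of matching k-mer pairs, counting both strands."""
    kmers = 2 * n_bases                   # k-mer start positions on both strands
    matches_per_kmer = kmers / 4**k       # expected matches for each k-mer
    return kmers * matches_per_kmer / 2   # halve, since each pair was counted twice

pairs = expected_matching_pairs(3_000_000_000, 20)
print(f"expected matching pairs for k=20: {pairs:.3g}")
```

So in a uniform random 3-billion-base sequence, repeated 20-mers are not rare at all: there are millions of them.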

We have to take k up to about 32 before we get expected numbers below 1, and up to about 36 before having a repetition is surprising in a uniform random stream.
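To see where those thresholds come from, one can scan k until the expected number of pairs, 2N^2/4^k, drops below a chosen cutoff. This is a small sketch in Python. The function names and the cutoff of 1 are illustrative choices, and what counts as "surprising" is a judgment call rather than a fixed rule.

```python
def expected_pairs(n_bases, k):
    """Expected number of matching k-mer pairs, both strands: 2 N^2 / 4^k."""
    return 2 * n_bases**2 / 4**k

def smallest_k_below(n_bases, threshold):
    """Smallest k for which the expected number of matching pairs is below threshold."""
    k = 1
    while expected_pairs(n_bases, k) >= threshold:
        k += 1
    return k

N = 3_000_000_000
k_below_one = smallest_k_below(N, 1)
print(f"smallest k with expected pairs < 1: {k_below_one}")
for k in (28, 32, 36):
    print(f"k={k}: expected pairs = {expected_pairs(N, k):.3g}")
```

Each extra base in k divides the expected count by 4, so going from k=32 to k=36 cuts it by a factor of 256.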
