Gas station without pumps

2013 May 16

What bioinformaticians do

Filed under: Uncategorized — gasstationwithoutpumps @ 08:33

I recently read two blog posts about what bioinformaticians do (though both claim to be about “what it takes”):

The first post talks about a shift from “bioinformatics” to “computational biology”—that is, a shift away from designing algorithms and data structures to answer biological questions, toward asking biological questions for which computational tools already exist. It has quotes with some hype about job opportunities in bioinformatics, but it also has some counterpoints giving a more realistic view of the bioinformatics job market. The tone of the piece overall is that bioinformatics is the best of all possible fields.

The second post has a less exalted view of bioinformatics, pointing out that most bioinformatics jobs are data wrangling.  They do say that even data wranglers can do research if they want to, which makes them better off than most wet-lab technicians.

Both posts stress the importance of programming, statistics, and knowing some biology.

 

2013 April 29

Scientists need math

Filed under: Uncategorized — gasstationwithoutpumps @ 14:28

At the beginning of April (but not on April Fool’s Day), the Wall Street Journal published an essay by E.O. Wilson (a famous biologist): Great Scientists Don’t Need Math. The gist of the article is that Dr. Wilson never learned much math and did well in biology, so others can do so also:

Wilson’s Principle No. 1: It is far easier for scientists to acquire needed collaboration from mathematicians and statisticians than it is for mathematicians and statisticians to find scientists able to make use of their equations.

Wilson’s Principle No. 2: For every scientist, there exists a discipline for which his or her level of mathematical competence is enough to achieve excellence.

The first principle is probably true, but it is more a sociological statement than one inherent to the disciplines. Applied mathematicians and statisticians welcome collaborations with all sorts of scientists and are happy to learn about and work on real problems that come up elsewhere, while biologists (particularly old-school ones like Dr. Wilson) tend not to be interested in anything outside their own labs and those of their close collaborators and competitors.

The second principle is possibly also true, though much less so than in the past. Biology used to be a major refuge for innumerate scientists, but modern biology requires a really strong foundation in statistics, far more than most biology students are trained in. The number of positions for innumerate scientists is shrinking rapidly, while the supply of innumerate biology PhDs keeps growing. In the highly competitive job market for biology research, those who follow E. O. Wilson's advice have a markedly smaller chance of getting the jobs they desire. Of course, Dr. Wilson seems to be unaware of the decades-long oversupply of biology researchers:

During my decades of teaching biology at Harvard, I watched sadly as bright undergraduates turned away from the possibility of a scientific career, fearing that, without strong math skills, they would fail. This mistaken assumption has deprived science of an immeasurable amount of sorely needed talent. It has created a hemorrhage of brain power we need to stanch.

An undergrad degree in biology (even from Harvard) has not gotten many students much more than low-level technician jobs for most of that time (admission to grad school is the better option, as biology PhDs have been able to get temporary postdoc positions at least).  Perhaps Dr. Wilson considers a dead-end job at little more than minimum wage a suitable scientific career—many others do not.

Dr. Wilson does make one unsubstantiated claim that I agree with:

The annals of theoretical biology are clogged with mathematical models that either can be safely ignored or, when tested, fail. Possibly no more than 10% have any lasting value. Only those linked solidly to knowledge of real living systems have much chance of being used.

Biology is a data-driven science, not a model-driven science (a distinction that physicists trying to jump into the field often miss).  Most of “mathematical biology” has been an attempt to apply physics-like models in places where they don’t really fit.  But there has been a big change in the past 10–15 years, as high-throughput experiments have become common in biology.  Now mathematics (mainly statistics) is needed to make any sense out of the experimental results, and biologists with inadequate training in statistics end up making ludicrously wrong conclusions from their experiments, often claiming high significance for random noise.  To understand the data requires more than Wilson’s “intuition”—it requires a solid understanding of the statistics of big data and multiple hypotheses, as humans are very good at perceiving patterns in random noise.
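
To make that concrete with a toy example of my own (not something from Wilson's essay): run enough hypothesis tests on pure noise and some of them will look “significant”, unless the threshold is corrected for the number of tests. A minimal sketch in Python, assuming numpy and scipy are available:

```python
# Toy illustration of the multiple-hypothesis problem: 10,000 t-tests on
# pure noise.  Roughly 5% pass the naive 0.05 threshold; essentially none
# survive a Bonferroni-corrected threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_samples, alpha = 10_000, 20, 0.05

naive_hits = corrected_hits = 0
for _ in range(n_tests):
    a = rng.normal(size=n_samples)  # both samples come from the SAME distribution,
    b = rng.normal(size=n_samples)  # so every "discovery" is a false positive
    p = stats.ttest_ind(a, b).pvalue
    naive_hits += p < alpha
    corrected_hits += p < alpha / n_tests  # Bonferroni correction

print(f"naive 'significant' results:  {naive_hits} of {n_tests}")
print(f"Bonferroni-corrected results: {corrected_hits} of {n_tests}")
```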

I was pointed to Dr. Wilson’s WSJ essay by Iddo Friedberg’s post Terrible advice from a great scientist, which has a somewhat different critique of the essay. He accuses Wilson of “not recognizing the generalization from an outlier cannot serve as a viable model, or even an argument to support his position.”  Iddo makes several other points, some of them the same as mine—go read his post! Of course, like me, Dr. Friedberg is a bioinformatician and so sees the central role of statistics in 21st century biology.  Perhaps the two of us are wrong, and innumerate biologists will again have glorious scientific careers, but I think the odds are against it.

2012 December 10

Not normal, but what is it?

Filed under: Uncategorized — gasstationwithoutpumps @ 10:53

Robert W. Jernigan, in his “statpics” blog, often posts pictures of normal (and other) distributions from the real world, and almost equally often posts pictures that someone else has claimed to show a normal distribution but that clearly don't. In the post statpics: Not Normal Either and Why, he shows a picture of cup-lid stacks that someone had labeled a normal distribution, when it was not a distribution at all. He then shows a similar-looking picture of stacks of scallop shells:

Copied from Graphical Methods for Presenting Facts by Willard Cope Brinton (1914), as scanned by Google in Google Books.

He blithely asserts that this isn't a normal distribution either, since the number of ribs is discrete, though it is a nice pictorial representation of a histogram. I wanted him to go further than just saying “not normal”, though, and talk about what sort of distribution the scallop shells might be from. Binomial? Poisson? How do you choose a good discrete distribution when you don't know the underlying mechanism?

As a bioinformatician, I often need to come up with distributions for modeling inherently discrete random variables, and I find the statistical literature unhelpful—there seem to be a few well-studied examples, but little or no guidance for applications outside those few classic examples.  Perhaps I’m just too ignorant in the field to be finding the right sources.
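
In practice, the best I can usually do is pick a few plausible candidates, fit each one, and compare likelihoods. A minimal sketch of that approach, with made-up counts and with Poisson and negative binomial as the (arbitrarily chosen) candidates:

```python
# Compare two candidate discrete distributions for count data of unknown
# mechanism by fitting each and comparing log-likelihood / AIC.
import numpy as np
from scipy import stats

counts = np.array([3, 5, 2, 8, 1, 4, 12, 0, 6, 2, 7, 3])  # made-up example data

# Poisson: the maximum-likelihood rate is just the sample mean.
lam = counts.mean()
loglik_pois = stats.poisson.logpmf(counts, lam).sum()

# Negative binomial: method-of-moments estimates (only sensible if variance > mean).
mean, var = counts.mean(), counts.var(ddof=1)
if var > mean:
    p = mean / var
    r = mean * p / (1 - p)
    loglik_nb = stats.nbinom.logpmf(counts, r, p).sum()
else:
    loglik_nb = float("-inf")  # under-dispersed data: negative binomial won't fit

print(f"Poisson:           log-likelihood {loglik_pois:7.2f}, AIC {2 * 1 - 2 * loglik_pois:7.2f}")
print(f"negative binomial: log-likelihood {loglik_nb:7.2f}, AIC {2 * 2 - 2 * loglik_nb:7.2f}")
```

This tells you which of the candidates you tried fits least badly, but it still does not answer the question I really have, which is how to choose the candidates in the first place.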

For example, I often approximate length distributions for protein or DNA sequences with lognormal distributions.

Example of a lognormal distribution fit to some DNA sequence lengths from a simulation—the fit is even better with a million samples than with 100,000. Real sequence length data is not quite this clean, but it often fits a lognormal distribution fairly well too.

When I am looking at fitting distributions, I'm usually most interested in one of the tails (for computing a p-value or E-value) rather than the peak of the distribution, so plotting probability on a log scale is very helpful.
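
Something like the following is how I produce plots like the one above, though this sketch uses freshly simulated toy lengths rather than the data in the figure:

```python
# Fit a lognormal to "sequence lengths" and inspect the fit with the
# probability axis on a log scale, so the tails are visible.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
lengths = rng.lognormal(mean=5.0, sigma=0.4, size=100_000).round()  # toy lengths

shape, loc, scale = stats.lognorm.fit(lengths, floc=0)  # floc=0: standard 2-parameter form

x = np.linspace(lengths.min(), lengths.max(), 500)
plt.hist(lengths, bins=200, density=True, alpha=0.5, label="observed lengths")
plt.plot(x, stats.lognorm.pdf(x, shape, loc, scale), label="lognormal fit")
plt.yscale("log")  # log scale shows how well (or badly) the tails are fit
plt.xlabel("sequence length")
plt.ylabel("probability density")
plt.legend()
plt.show()
```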

Now, lognormal distributions are inherently continuous, and the sequence lengths are inherently discrete, but I don’t know any discrete distributions that are approximately lognormal, in the way the binomial distributions are approximately normal. A lot of sequence-length distributions appear to be approximately lognormal, but I have no idea why. Even knowing how I generated the sequences in the plot above does not give me a clue why they should fit a lognormal distribution so well.
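
The closest thing I can come up with on my own is to discretize a lognormal by integrating it over unit-width bins. That at least gives a well-defined discrete distribution, even if it is my own ad hoc construction rather than anything with known theory behind it:

```python
# An ad hoc "discretized lognormal": give integer n the probability mass of
# a continuous lognormal(mu, sigma) between n - 0.5 and n + 0.5.
import numpy as np
from scipy import stats

def discrete_lognormal_pmf(n, mu, sigma):
    """P(X = n) for integers n >= 1, by binning a lognormal(mu, sigma)."""
    n = np.asarray(n, dtype=float)
    upper = stats.lognorm.cdf(n + 0.5, s=sigma, scale=np.exp(mu))
    lower = stats.lognorm.cdf(np.maximum(n - 0.5, 0.0), s=sigma, scale=np.exp(mu))
    return upper - lower

# Sanity check: summed over a wide range of integers, the masses should be ~1.
ns = np.arange(1, 100_000)
print(discrete_lognormal_pmf(ns, mu=5.0, sigma=0.4).sum())
```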

We also often use Gumbel distributions as the null model for the results of searches, because they are the limiting distribution for “max” in the way that normal distributions are the limiting distribution for “sum”. But the Gumbel distributions are continuous, and we are often taking maxes of nonnegative-integer-valued scores—what is the appropriate discrete analog of the Gumbel distribution?
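
In practice I just fit the continuous Gumbel to the discrete maxima and hope the approximation is good enough in the tail I care about. A toy check of my own, using made-up Poisson "scores" rather than real search scores:

```python
# Fit a continuous Gumbel to maxima of integer-valued scores and compare
# its tail to the empirical tail.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# 10,000 "searches", each reporting the max of 5,000 integer-valued scores.
scores = rng.poisson(lam=30, size=(10_000, 5_000))
maxima = scores.max(axis=1)

loc, scale = stats.gumbel_r.fit(maxima)
print(f"fitted Gumbel: location {loc:.2f}, scale {scale:.2f}")

# Compare empirical and fitted P(max >= t) over the observed range of maxima.
for t in range(int(maxima.min()), int(maxima.max()) + 1):
    empirical = (maxima >= t).mean()
    fitted = stats.gumbel_r.sf(t - 0.5, loc, scale)  # crude continuity correction
    print(f"t={t:3d}  empirical {empirical:.4f}  Gumbel fit {fitted:.4f}")
```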

If anyone reading this blog is a statistician and can provide me pointers to understandable literature on choosing discrete distributions, or can tell me what discrete distributions might be approximately lognormal, I’d appreciate knowing about them.

2012 November 10

A probability question

Filed under: Uncategorized — gasstationwithoutpumps @ 21:59

Sam Shah, one of the math teacher bloggers that I read, posted a bioinformatics-related question on A biology question that is actually a probability question « Continuous Everywhere but Differentiable Nowhere:

Let’s say you have a sequence of 3 billion nucleotides. What is the probability that there is a sequence of 20 nucleotides that repeats somewhere in the sequence? You may assume that there are 4 nucleotides (A, C, T, G) and when coming up with the 3 billion nucleotide sequence, they are all equally likely to appear.

This is the sort of combinatorics question that comes up a lot in building null models for bioinformatics, when we want to know just how weird something we’ve found really is.

Of course, we usually end up asking for the expected number of occurrences of a particular event, rather than the probability of the event, since expected values are additive even when the events aren’t independent.  So let me change the problem to

In a sequence of N bases (independent, uniformly distributed), what is the expected number of repeated k-mers (k ≪ N)? Plug in N=3E9 and k=20.

The probability that any particular k-mer occurs in a particular position is 4^-k, so the expected number of occurrences of that k-mer is N/4^k, or about 2.7E-3 for the values of N and k given. Oops, we should count both strands, so double that to 5.46E-3.

When the expected number is that small, we can use it equally well as the probability of there being one or more such k-mers. (Note: this assumes 4^k ≫ N.)

Now let's look at each k-mer that actually occurs (all 2N of them), and estimate how many other k-mers match. There are roughly 2N/4^k for each (we can ignore little differences like N vs. N-1), so there are 4N^2/4^k total pairs. But we've counted each pair twice, so the expected number of pairs is only 2N^2/4^k, which is 16E6 for N=3E9 and k=20.

We have to take k up to about 32 before we get expected numbers below 1, and up to about 36 before having a repetition is surprising in a uniform random stream.
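
The arithmetic is easy to check by plugging the formulas above into a few lines of Python:

```python
# Expected counts from the argument above: a specific k-mer (both strands)
# and the expected number of matching pairs among all observed k-mers.
N = 3e9
for k in (20, 32, 36):
    per_kmer = 2 * N / 4**k   # expected occurrences of one fixed k-mer, both strands
    pairs = 2 * N**2 / 4**k   # expected repeated pairs, as derived above
    print(f"k={k}: one fixed k-mer {per_kmer:.3g}, repeated pairs {pairs:.3g}")
```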

2012 October 15

Grading taking longer than I thought

Filed under: Uncategorized — gasstationwithoutpumps @ 00:34

It is taking much longer than I thought to do the grading—it’s almost midnight and I’m still only half done with the grading for tomorrow.

The problem is with the design of the assignment—this is the first time I’ve given the assignment to do conversion from FASTA to FASTQ and from FASTQ to FASTA, and I had not carefully thought through all the problems that might arise.  I should have given it several more weeks of thought, but I was running behind on setting up the course, largely because of the amount of time I’d been putting into the design of the circuits course.  I ended up slapping the assignment together quickly, and just checking that it was doable, not polishing the assignment handout or building a good test suite.

It is often this way with the first run of an assignment, and it is frustrating both for me and for the students.  That’s one reason I try to introduce only one or two new assignments each year, as it often takes a few passes with feedback from different groups of students to knock the rough edges off a programming assignment.

Here are some of the problems with the assignment:

  1. I did not provide a specific enough output spec to uniquely determine the output file.  This means that I had to read the output with a somewhat forgiving program to compare it to the input, not just use diff.  Writing that comparison program took time, and it still does not detect some of the things that can go wrong (it is a bit too forgiving).
  2. I relied on the students reading the Wikipedia description of the FASTA and FASTQ formats, but the description there did not really cover all the common variants (which include extra returns within sequences and quality sequences), so some students wrote very fragile parsers. I’ll have to find or write clearer input specs for next year.
  3. I did not provide a sufficient set of test files to detect problems.  I picked up a few test files from the students themselves, but I only discovered the need for several of the tests on reading the student code and thinking of ways that it didn’t quite work.  I would think up ways to break a particular student’s fragile code, then have to apply that test to everyone else’s code to be fair.
  4. I’m still not satisfied with my test suite, because the conversion task does not really test whether the input parser is working correctly.  The input parser needs to remove white space from the sequences, to provide pure DNA or protein sequences internally, but spaces and returns in the output are legal, so blind copying without proper parsing is not easily detected in this task. Some of the best conversion output came from students who had not properly represented sequences internally, which means I need to do deep reading of the code to see if there are problems, and can’t rely on the I/O test to uncover them.

For next year, I think I want to keep the input-parser assignment, but I need to change the task the parser is used for to one that more reliably detects whether the students have correctly implemented the parser and that has a unique correct output—conversion between the formats is a useful program to have, but it is not a good test of the parser (although I had initially thought it would be).
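
For what it's worth, here is a minimal sketch of the kind of whitespace-tolerant FASTA reading I have in mind. This is not the reference solution for the assignment, just an illustration of the internal representation I want students to end up with:

```python
# Whitespace-tolerant FASTA reader: accumulate sequence lines for each record,
# stripping blanks and line breaks so the internal sequence is a pure string.
def read_fasta(handle):
    """Yield (identifier, comment, sequence) tuples from a FASTA file object."""
    name, comment, seq_parts = None, "", []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, comment, "".join(seq_parts)
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            comment = fields[1] if len(fields) > 1 else ""
            seq_parts = []
        elif name is not None:
            # Drop spaces and stray whitespace inside sequence lines.
            seq_parts.append("".join(line.split()))
    if name is not None:
        yield name, comment, "".join(seq_parts)

# Typical use:
#   with open("reads.fa") as f:
#       for name, comment, seq in read_fasta(f):
#           print(name, len(seq))
```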

One of the students anonymously commented on my previous post:

The choice for the 2nd assignment was pretty unfortunate, IMHO. It was very tedious/time consuming, spec was frustrating, with limited opportunities to learn/do something interesting. I suspect most people won’t roll their own standard format parsers, but rather use a library/package.
Looking forward to more interesting homework :)

I agree with the comment about the spec—the specs for FASTA and FASTQ are frustratingly vague and what is acceptable varies from program to program. Unfortunately this is true of almost all bioinformatics formats—even the formats that have official standards often have ambiguities. Although there is pedagogic value in having students realize that format descriptions are often frustratingly incomplete and inadequate, this may not be the best place in the course to have such frustration, as students are still adapting to Python and there is too much else going on in the assignment.

The point about libraries and packages gets at a more fundamental pedagogical point of the course. Libraries and packages don’t appear by magic—someone has to write them, and part of the point of this course is to prepare people to write such libraries. Yes, lots of people have written fasta and fastq parsers before, some of which are good and some of which are incredibly fragile or disgustingly slow.  Most people don’t discover that fragility until the package breaks in the middle of some urgent project.  Having people write their own parsers for “easy” formats gives them a greater appreciation for good packages and some idea what to test in a package before relying on it.

Since many of these grad students will be going on to create new tools that do new tasks, I also want them to think about file formats in their new tools before creating yet another vaguely specified and potentially ambiguous format.  There is a great tendency for students (whether from a computer science or biology background) to slap together a format that is good enough for the example they are working on, without thinking through what would happen if people started using it for other purposes.  Both FASTA and FASTQ have this slapped-together feel, and it causes pain for bioinformaticians on a regular basis.  A lot of data wrangling involves dealing with data that doesn’t quite fit in some format and the various incompatible kluges people have used to try to force the format to represent the data anyway.

As for the time-consuming aspect, that is also partly the result of this being the first time for this assignment. My assignments are usually somewhat time-consuming, but this one was a little bigger than I thought. For next year, I’ll probably do only one of the parsers (not sure yet whether the fasta+qual or fastq is the better one to require), but require some manipulation of the sequences, like end-trimming of sequences to remove low-quality tails.  That would cut the amount of code almost in half, while providing a better test that students had gotten the semantics right.
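
Roughly, I'm imagining something like the following for the end-trimming (a sketch only, assuming the quality string has already been decoded to numeric Phred scores; the real assignment would pin down the exact trimming rule):

```python
# Trim the low-quality tail of a read: scan from the 3' end and cut at the
# last position whose quality meets the threshold.
def trim_low_quality_tail(seq, quals, threshold=20):
    """Return (seq, quals) truncated where the 3' tail drops below threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]

# Example:
#   trim_low_quality_tail("ACGTACGT", [30, 32, 31, 28, 25, 12, 8, 3])
#   -> ("ACGTA", [30, 32, 31, 28, 25])
```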

 

