# Gas station without pumps

## 2012 December 10

### Not normal, but what is it?

Filed under: Uncategorized — gasstationwithoutpumps @ 10:53

Robert W. Jernigan, in his “statpics” blog, often posts pictures of normal (and other) distributions from the real world, and almost equally often posts pictures that someone else claimed to be normal distributions but that clearly aren’t. In the post “statpics: Not Normal Either and Why”, he shows a picture of cup lid stacks that someone had labeled a normal distribution, when it was not a distribution at all. He then shows a similar-looking picture of stacks of scallop shells:

Copied from Graphical Methods for Presenting Facts by Willard Cope Brinton (1914), as scanned by Google in Google Books.

He blithely asserts that this isn’t a normal distribution either, since the number of ribs is discrete, though it is a nice pictorial representation of a histogram. I wanted him to go further than just saying “not normal”, though, and talk about what sort of distribution the scallop-shell rib counts might come from. Binomial? Poisson? How do you choose a good discrete distribution when you don’t know the underlying mechanism?
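One crude first check when the mechanism is unknown is the variance-to-mean ratio of the counts: it is below 1 for a binomial, exactly 1 for a Poisson, and above 1 for a negative binomial. The Python sketch below is only a heuristic, not a goodness-of-fit test. The function names and the `tolerance` cutoff are made up for illustration, and the sample is synthetic, not the scallop data.

```python
import random
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio of a sample of nonnegative integer counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return pvariance(counts) / m


def suggest_family(counts, tolerance=0.1):
    """Very rough hint at a family; the tolerance is an arbitrary assumption."""
    d = dispersion_index(counts)
    if d < 1 - tolerance:
        return "underdispersed (binomial-like)"
    if d > 1 + tolerance:
        return "overdispersed (negative-binomial-like)"
    return "roughly equidispersed (Poisson-like)"


# Synthetic sample: 5000 draws from Binomial(20, 0.5), built from coin flips.
rng = random.Random(0)
binomial_sample = [sum(rng.random() < 0.5 for _ in range(20)) for _ in range(5000)]

print(dispersion_index(binomial_sample), suggest_family(binomial_sample))
```

A ratio near 1 does not prove the data are Poisson, and many other discrete families share any given ratio, so this narrows the candidates rather than choosing one.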

As a bioinformatician, I often need to come up with distributions for modeling inherently discrete random variables, and I find the statistical literature unhelpful—there seem to be a few well-studied examples, but little or no guidance for applications outside those few classic examples.  Perhaps I’m just too ignorant in the field to be finding the right sources.

For example, I often approximate length distributions for protein or DNA sequences with lognormal distributions.

Example of a lognormal distribution fit to some DNA sequence lengths from a simulation. The fit is even better with a million samples rather than 100,000. Real sequence-length data is not quite this clean, but it often fits a lognormal distribution fairly well too.

When I am looking at fitting distributions, I’m usually most interested in one of the tails (for computing a p-value or E-value) rather than the peak of the distribution, so plotting probability on a log scale is very helpful.
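A minimal Python sketch of this kind of fit follows. It estimates the lognormal parameters from the mean and standard deviation of the log-lengths, then compares fitted and observed upper-tail probabilities on a log10 scale. The lengths are synthetic stand-ins drawn with assumed parameters (6.0 and 0.5), not the simulation behind the plot, and the function names are hypothetical.

```python
import math
import random
from statistics import fmean, stdev


def fit_lognormal(lengths):
    """Estimate (mu, sigma) as the mean and standard deviation of log(length)."""
    logs = [math.log(x) for x in lengths]
    return fmean(logs), stdev(logs)


def log10_upper_tail(x, mu, sigma):
    """log10 of P(X > x) under a lognormal with parameters mu and sigma."""
    z = (math.log(x) - mu) / sigma
    # erfc stays accurate far into the tail, where 1 - cdf would round to zero.
    return math.log10(0.5 * math.erfc(z / math.sqrt(2.0)))


def log10_empirical_tail(x, lengths):
    """log10 of the observed fraction of lengths greater than x (None if zero)."""
    frac = sum(1 for v in lengths if v > x) / len(lengths)
    return math.log10(frac) if frac > 0 else None


# Synthetic stand-in for sequence lengths (assumed parameters, not real data).
rng = random.Random(1)
lengths = [max(1, round(rng.lognormvariate(6.0, 0.5))) for _ in range(20000)]
mu_hat, sigma_hat = fit_lognormal(lengths)

for x in (500, 1000, 2000, 4000):
    print(x, log10_upper_tail(x, mu_hat, sigma_hat), log10_empirical_tail(x, lengths))
```

Comparing the two log10 columns shows tail disagreements that a linear-scale plot would flatten into the axis. The empirical column runs out once no sample exceeds the threshold, which is exactly where the fitted tail is doing the extrapolating.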

Now, lognormal distributions are inherently continuous, and the sequence lengths are inherently discrete, but I don’t know any discrete distributions that are approximately lognormal, in the way the binomial distributions are approximately normal. A lot of sequence-length distributions appear to be approximately lognormal, but I have no idea why. Even knowing how I generated the sequences in the plot above does not give me a clue why they should fit a lognormal distribution so well.
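One workaround, which sidesteps the question rather than answering it, is to discretize the continuous distribution: give each integer k the probability mass that the lognormal puts on the interval (k-1, k]. The Python sketch below does that. The choice of interval (rounding up rather than to the nearest integer) and all the names are assumptions made for illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the lognormal CDF


def lognormal_cdf(x, mu, sigma):
    """P(Y <= x) for Y lognormal with parameters mu and sigma."""
    if x <= 0:
        return 0.0
    return _STD_NORMAL.cdf((math.log(x) - mu) / sigma)


def discretized_lognormal_pmf(k, mu, sigma):
    """P(K = k) for K = ceil(Y): the mass Y puts on the interval (k-1, k]."""
    if k < 1:
        return 0.0
    return lognormal_cdf(k, mu, sigma) - lognormal_cdf(k - 1, mu, sigma)


# The masses telescope, so they sum to the CDF at the largest k considered.
total = sum(discretized_lognormal_pmf(k, 3.0, 0.5) for k in range(1, 10001))
print(total)
```

This gives a proper distribution on the positive integers, but it says nothing about why sequence lengths look lognormal. The difference of two CDF values also loses precision far out in the upper tail, so tail probabilities are better computed from the continuous survival function directly.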

We also often use Gumbel distributions for the null model for the results of searches, because they are a limiting distribution for “max” in the way that normal distributions are a limiting distribution for “sum”. But the Gumbel distributions are continuous, and we are often taking maxes of nonnegative-integer-valued scores. What is the appropriate discrete analog for Gumbel distributions?
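When the individual scores can be treated as independent draws from a known discrete distribution, the distribution of the maximum can be written down exactly, with no continuous limit: P(max ≤ k) = F(k)^n. The Python sketch below assumes i.i.d. scores, which real search scores may not satisfy, and uses a geometric score distribution purely as a hypothetical example. The function names are made up.

```python
import math


def geometric_sf(k, p):
    """P(X > k) for a geometric score on {0, 1, 2, ...} with success probability p."""
    return 1.0 if k < 0 else (1.0 - p) ** (k + 1)


def max_sf(k, n, sf):
    """P(max of n i.i.d. scores > k), stable for tiny tail probabilities."""
    s = sf(k)
    if s >= 1.0:
        return 1.0
    # 1 - (1 - s)**n, written with log1p/expm1 to avoid cancellation.
    return -math.expm1(n * math.log1p(-s))


def max_pmf(k, n, sf):
    """P(max == k) = P(max > k - 1) - P(max > k)."""
    return max_sf(k - 1, n, sf) - max_sf(k, n, sf)


def score_sf(k):
    """Hypothetical per-comparison score tail: geometric with p = 0.5."""
    return geometric_sf(k, 0.5)


# Tail probabilities for the best of 1000 hypothetical comparisons.
for k in (10, 20, 40, 80):
    print(k, max_sf(k, 1000, score_sf))
```

The `log1p`/`expm1` form matters for p-values: computing `1 - F(k)**n` directly rounds to zero long before the true tail probability does.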

If anyone reading this blog is a statistician and can provide me pointers to understandable literature on choosing discrete distributions, or can tell me what discrete distributions might be approximately lognormal, I’d appreciate knowing about them.