Gas station without pumps

2012 December 10

Not normal, but what is it?

Filed under: Uncategorized — gasstationwithoutpumps @ 10:53

Robert W. Jernigan, in his “statpics” blog, often posts pictures of normal (and other) distributions from the real world, and almost equally often posts pictures that someone else claimed to be normal distributions but that clearly aren’t.  In the post statpics: Not Normal Either and Why, he shows a picture of cup lid stacks that someone had labeled a normal distribution, when it was not a distribution at all.  He then shows a similar-looking picture of stacks of scallop shells:

Copied from Graphical Methods for Presenting Facts by  Willard Cope Brinton (1914), as scanned by Google in Google Books.

He blithely asserts that this isn’t a normal distribution either, since the number of ribs is discrete, though it is a nice pictorial representation of a histogram. I wanted him to go further than just saying “not normal” and talk about what sort of distribution the scallop-shell rib counts might come from. Binomial? Poisson? How do you choose a good discrete distribution when you don’t know the underlying mechanism?
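One crude first check, offered as a sketch and not as an answer to the question, is the variance-to-mean ratio of the counts: it is 1 for a Poisson distribution, below 1 for a binomial, and above 1 for a negative binomial. The function names, the tolerance, and the rib counts below are all invented for illustration; they are not Brinton's data.

```python
import statistics


def dispersion_index(counts):
    """Return the variance-to-mean ratio of a sample of nonnegative counts."""
    mean = statistics.fmean(counts)
    return statistics.pvariance(counts, mu=mean) / mean


def suggest_family(counts, tolerance=0.2):
    """Name a candidate family from the dispersion index (a heuristic only)."""
    d = dispersion_index(counts)
    if d < 1.0 - tolerance:
        return "binomial-like (underdispersed)"
    if d > 1.0 + tolerance:
        return "negative-binomial-like (overdispersed)"
    return "Poisson-like"


# Hypothetical rib counts, made up for this example.
rib_counts = [15] * 3 + [16] * 11 + [17] * 20 + [18] * 12 + [19] * 4
print(dispersion_index(rib_counts), suggest_family(rib_counts))
```

A ratio far below 1, as in this made-up sample, rules out a Poisson model but says little more. A binomial with a small success probability has a ratio close to 1, so the heuristic cannot separate it from a Poisson.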

As a bioinformatician, I often need to come up with distributions for modeling inherently discrete random variables, and I find the statistical literature unhelpful—there seem to be a few well-studied examples, but little or no guidance for applications outside those few classic examples.  Perhaps I’m just too ignorant in the field to be finding the right sources.

For example, I often approximate length distributions for protein or DNA sequences with lognormal distributions.

Example of a lognormal distribution fit to some DNA sequence lengths from a simulation. The fit is even better with a million samples, rather than 100000. Real sequence-length data is not quite this clean, but often fits a lognormal distribution fairly well.

When I am looking at fitting distributions, I’m usually most interested in one of the tails (for computing p-values or E-values) rather than the peak of the distribution, so plotting probability on a log scale is very helpful.
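A minimal sketch of this kind of fit, assuming the simplest estimator (the mean and standard deviation of the log lengths) and using synthetic lengths in place of real or simulated sequence data. The function names and parameter values are illustrative and are not the ones used for the plot above.

```python
import math
import random
import statistics


def fit_lognormal(lengths):
    """Estimate (mu, sigma) as the mean and std. dev. of the log lengths."""
    logs = [math.log(x) for x in lengths]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs, mu=mu)
    return mu, sigma


def lognormal_tail(x, mu, sigma):
    """Return P(X >= x) under a lognormal(mu, sigma) model."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def log10_tail(x, mu, sigma):
    """Return log10 of the upper-tail probability, for plotting on a log scale."""
    return math.log10(lognormal_tail(x, mu, sigma))


# Synthetic stand-in for sequence lengths (assumed parameters, fixed seed).
rng = random.Random(0)
lengths = [rng.lognormvariate(5.0, 0.5) for _ in range(20000)]
mu_hat, sigma_hat = fit_lognormal(lengths)
print(mu_hat, sigma_hat, log10_tail(1000.0, mu_hat, sigma_hat))
```

Using `math.erfc` for the upper tail, instead of computing one minus the CDF, keeps small tail probabilities from being lost to rounding, which matters when the tail is the quantity of interest.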

Now, lognormal distributions are inherently continuous, and the sequence lengths are inherently discrete, but I don’t know any discrete distributions that are approximately lognormal, in the way the binomial distributions are approximately normal. A lot of sequence-length distributions appear to be approximately lognormal, but I have no idea why. Even knowing how I generated the sequences in the plot above does not give me a clue why they should fit a lognormal distribution so well.
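One workaround, which sidesteps the question instead of answering it, is to discretize the continuous distribution: treat an integer length n as the ceiling of a lognormal variate, so that P(N = n) is the lognormal mass on the interval (n - 1, n]. The sketch below assumes that construction; the function names and parameters are illustrative, and nothing here explains why lengths should look lognormal in the first place.

```python
import math


def lognormal_sf(x, mu, sigma):
    """Return the survival function P(X > x) of a lognormal(mu, sigma)."""
    if x <= 0:
        return 1.0
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def discretized_lognormal_pmf(n, mu, sigma):
    """Return P(N = n) for integer n >= 1, where N = ceil(X), X lognormal."""
    if n < 1:
        return 0.0
    # A difference of survival functions keeps precision in the upper tail.
    return lognormal_sf(n - 1, mu, sigma) - lognormal_sf(n, mu, sigma)


def discretized_lognormal_tail(n, mu, sigma):
    """Return P(N >= n), the quantity needed for a p-value."""
    return lognormal_sf(n - 1, mu, sigma)


total = sum(discretized_lognormal_pmf(n, 5.0, 0.5) for n in range(1, 5001))
print(total)
```

Because the lognormal puts no mass at or below zero, these probabilities sum to 1 over the positive integers without any renormalization.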

We also often use Gumbel distributions as the null model for the results of searches, because they are a limiting distribution for “max” in the way that normal distributions are a limiting distribution for “sum”. But Gumbel distributions are continuous, and we are often taking maxes of nonnegative-integer-valued scores. What is the appropriate discrete analog for Gumbel distributions?
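For one special case the comparison can be worked out exactly, which may at least show what a discrete analog looks like. Assume (purely for illustration) that the n scores are independent and geometric, with P(score >= j) = q^j for j = 0, 1, 2, and so on. Then P(max <= k) = (1 - q^(k+1))^n, which is close to exp(-n q^(k+1)), and that is a Gumbel CDF evaluated at the integer k. The function names below are made up for this sketch, and real search scores need not be geometric.

```python
import math


def max_geometric_cdf(k, n, q):
    """Exact P(max <= k) for n iid scores with P(score >= j) = q**j."""
    if k < 0:
        return 0.0
    return (1.0 - q ** (k + 1)) ** n


def gumbel_cdf(x, mu, beta):
    """Return the Gumbel CDF exp(-exp(-(x - mu) / beta))."""
    return math.exp(-math.exp(-(x - mu) / beta))


def matching_gumbel(n, q):
    """Return (mu, beta) such that gumbel_cdf(k) equals exp(-n * q**(k + 1))."""
    beta = 1.0 / math.log(1.0 / q)
    mu = beta * math.log(n) - 1.0
    return mu, beta


def discretized_gumbel_pmf(k, mu, beta):
    """Return the Gumbel mass on (k - 1, k], used as P(max = k)."""
    return gumbel_cdf(k, mu, beta) - gumbel_cdf(k - 1, mu, beta)


n, q = 10000, 0.5
mu, beta = matching_gumbel(n, q)
worst = max(abs(max_geometric_cdf(k, n, q) - gumbel_cdf(k, mu, beta))
            for k in range(60))
print(mu, beta, worst)
```

In this case a Gumbel CDF sampled at the integers is already a good approximation to the exact discrete distribution of the max, so differencing the CDF gives a usable probability mass function.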

If anyone reading this blog is a statistician and can provide me pointers to understandable literature on choosing discrete distributions, or can tell me what discrete distributions might be approximately lognormal, I’d appreciate knowing about them.

