Gas station without pumps

2013 December 11

Population pyramids

Filed under: Uncategorized — gasstationwithoutpumps @ 22:48

I’ve always liked “population pyramids” as a way of showing the age and gender distribution of a population.  I recently found a very nice set of them at Population Pyramid of WORLD in 2010.  Here is a snapshot of the world population in 2010:

This is a static snapshot, but the web page it links to allows you to change which 5-year period (including projections into the future up to 2100) and the geographic region (most countries, plus several larger geographic areas, such as South America, or Southern Europe).  It also uses a mouseover so that you can read what % of population each point on the graph corresponds to, and provides a plot of total population as a function of time.


Several countries in the developed world have “top-heavy” pyramids, with fewer children under 5 than adults in the baby-boom generation.  For example, Germany has 8.7% of its population in the 45–49 age range, but only 4.1% in the 0–4 age range. The German population is predicted to have already passed its peak, with increased longevity not compensating for the reduced birth rate (by 2100 the population is expected to drop to about 68% of its 2005 peak).  Japan is in a similar situation, with a 2100 population predicted to be about 66% of the 2010 peak.  Its largest cohort is older: 60–64 year olds are 7.7% of the population, and children 0–4 are only 4.3%.

The US has remarkably little variation from decade to decade: the “baby boom” and the “boom echo” are visible, but they are tiny ripples compared to the enormous variations in some other populations.  The peak of the baby boom is 7.3% in 45–49 year olds (you’d expect it to be 50–54 year olds, but the leading edge of the baby boom is starting to die out) and the smallest younger cohort is still 6.5% (for either 0–4 or 35–39).  The distribution of ages is predicted to get very flat, with population growth occurring mainly due to longer life spans rather than high birthrate.

Other parts of the world have the more classic “pyramid” shape that comes from high birth and death rates.  Western Africa, for example, has 17.2% of its population in the 0–4 age range, tapering down smoothly to about 2.2% in the 50–54 age range.

All the populations are expected to get more uniform with time, as death rates and birth rates drop, but I fear that this may be optimism on the part of the United Nations—major wars can carve big chunks out of the young adult population, and famines can send death rates soaring.

2013 October 10

xkcd: Null Hypothesis

Filed under: Uncategorized — gasstationwithoutpumps @ 22:10

I have a number of my favorite xkcd comics on the bulletin board outside my office, including this one:

(Click on the image to go to the website to get the mouseover—I don’t have it on my bulletin board, but it is worth the effort of clicking.)

Today I had a student ask me to explain the joke, or more precisely, to explain what “the null hypothesis” was.  I did so, of course, explaining that a p-value, which measures how likely a result is “by chance”, needs a formal definition of “chance”: the null model or null hypothesis.  I even explained that all statistical tests do is allow you to reject (or not reject) the null hypothesis, and that they tell you nothing about the hypothesis you are actually testing.

Given the casual nature of the question, I did not go into detail about how important it is to choose or construct good null models: ones that contain all the explanations other than the hypothesis you hope to test.  I normally spend a full lecture on that in my bioinformatics course, as well as one of the weekly homework assignments, having the students program different null models for an open reading frame that result in very different p-values for a protein-coding gene detector.
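To make the dependence on the null model concrete, here is a minimal sketch (not the homework assignment itself). It assumes the simplest possible null models, independent bases with fixed frequencies, and asks how likely a run of 300 stop-free codons is in one reading frame. The 70% GC base frequencies, the 300-codon length, and the function names are all made up for illustration.

```python
# Hypothetical sketch: two null models for how long an open reading frame
# can appear "by chance", assuming independent, identically distributed bases.

def stop_codon_probability(base_freq):
    """Probability that a random codon is a stop codon (TAA, TAG, or TGA)."""
    total = 0.0
    for codon in ("TAA", "TAG", "TGA"):
        p = 1.0
        for base in codon:
            p *= base_freq[base]
        total += p
    return total

def orf_p_value(n_codons, base_freq):
    """P(no stop codon in n_codons consecutive random codons of one frame)."""
    return (1.0 - stop_codon_probability(base_freq)) ** n_codons

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}  # assumed 70% GC genome

for name, freq in [("uniform", uniform), ("GC-rich", gc_rich)]:
    print(name, orf_p_value(300, freq))
```

Because stop codons are AT-rich, the GC-rich null model makes long stop-free runs far more likely, so the same 300-codon ORF gets a p-value thousands of times larger than under the uniform model.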

Normally, I enjoy this sort of conversation with students—I like students who are curious and who are unafraid to ask questions to clear up things that confuse them.  Today I was a little disturbed by the question, as the student had been in my office to get a signature on an approval form for a senior thesis in bioengineering.  How had the student gotten that far in our program without having learned what a null hypothesis is?  Where is the hole in our curriculum that allows that, and how can we fix it in the curriculum redesign this year?

I realize that no curriculum design can completely cure the cram-and-forget disease that infects many college students, but I did not get the impression that this was a student who had known it once but forgotten.  Rather, I had the impression that the concept was a new one, though the name might have appeared before.

On looking over the bioengineering curriculum, I see that students can take a probability course without any statistics course; perhaps that is what happened here.  Unfortunately, biomolecular experimentalists have to be very familiar with null hypotheses and statistical tests, so I think we have to patch the curriculum to make sure that all the students get statistics.

2013 June 5

Poisson Petals

Filed under: Uncategorized — gasstationwithoutpumps @ 22:37

Robert Jernigan, on his statpics blog, has a couple of nice posts based on an image he posted of cherry blossoms on paving blocks:

statpics: Poisson Petals gives a cursory analysis of the mean number of petals on the large blocks, divided by the mean number of the petals on the smaller blocks.

statpics: Testing Poissonness Petals provides a much more detailed analysis of the same data (the number of petals on each block), showing first that the data is well described as a Poisson process, then that the ratio of means (1.56) is not significantly different from the ratio of the areas of the blocks (1.5) with a p-value of 0.39.

I had not seen the Poissonness plot [David C. Hoaglin. A Poissonness Plot. The American Statistician. 34(3):146–149, 1980] before, but it looks like a handy technique.  Because the expected frequency of k petals on a block is N e^{-\lambda} \lambda^k / k!, if we plot \ln(x_k) +\ln(k!) vs. k (where x_k is the observed number of blocks with k petals), we should get a straight line from a Poisson process, with the slope of the line being \ln(\lambda) and the intercept being \ln(N) - \lambda.  Jernigan’s data makes a very nice straight line.
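A minimal sketch of the Poissonness-plot computation, assuming the data arrives as a list where counts[k] is the number of blocks with exactly k petals. The counts below are made up (exact expected frequencies for λ = 2 and N = 100), not Jernigan's data, and the function names are hypothetical.

```python
import math

def poissonness_points(counts):
    """Return (k, ln(x_k) + ln(k!)) for each k with x_k > 0.

    counts[k] is the number of blocks observed with exactly k petals.
    """
    return [(k, math.log(x) + math.lgamma(k + 1))  # lgamma(k + 1) = ln(k!)
            for k, x in enumerate(counts) if x > 0]

def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_k = sum(k for k, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((k - mean_k) ** 2 for k, _ in points)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_k

# Made-up counts: exact Poisson expected frequencies, so the points are collinear.
lam, n_blocks = 2.0, 100
counts = [n_blocks * math.exp(-lam) * lam ** k / math.factorial(k)
          for k in range(8)]

slope, intercept = fit_line(poissonness_points(counts))
print(math.exp(slope))  # estimate of lambda recovered from the slope
```

With real counts the points scatter around the line, and how straight they look is the informal check of Poissonness.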

His significance test does an exact computation of the probability distribution for the ratio of the means, but I didn’t quite follow how he set up the computation.  The connection I missed was between

The larger stones have a length of 9.375. The shorter, square stones have a length of 6.25, for a ratio of 1.5.


We take these as the null parameters of two independent Poisson distributions.

I’m not sure what he is taking as “the null parameters”, as the length of the paving blocks is not the λ of the Poisson process, though they are linearly related.  Is he scaling by the overall petals per area measure?

Assuming that he does some scaling like that to get two parameters \lambda_{\mbox{\scriptsize large}} = 1.5 \lambda_{\mbox{\scriptsize small}}, then the rest of his computation makes sense to me.  He gets a Poisson distribution with \lambda = 21 \lambda_{\mbox{\scriptsize large}} for the total number of petals on large blocks and one with \lambda= 12 \lambda_{\mbox{\scriptsize small}} for the total number of petals on small blocks.  A simple nested loop going out to sufficiently high values would enumerate the probabilities for all pairs of petal counts, and from that the discrete probability distribution for the ratio of the means can be computed.
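The nested loop might look something like the following sketch. It is my guess at the shape of the computation, not Jernigan's code: the per-small-block rate lam_small = 3.0 is an assumed value, the test is taken to be one-sided, and pairs with no petals on the small blocks are skipped because the ratio is undefined there.

```python
import math

def poisson_pmf_table(mean, max_count):
    """P(X = k) for k = 0..max_count, built iteratively to avoid huge factorials."""
    pmf = [math.exp(-mean)]
    for k in range(1, max_count + 1):
        pmf.append(pmf[-1] * mean / k)
    return pmf

def ratio_tail_probability(lam_small, observed_ratio,
                           n_large=21, n_small=12, area_ratio=1.5,
                           max_count=400):
    """P(mean_large / mean_small >= observed_ratio) under the null hypothesis
    that the per-block rate on large blocks is area_ratio * lam_small."""
    pmf_large = poisson_pmf_table(n_large * area_ratio * lam_small, max_count)
    pmf_small = poisson_pmf_table(n_small * lam_small, max_count)
    tail = 0.0
    for t_large, p_large in enumerate(pmf_large):
        # Start at 1: the ratio is undefined when the small blocks have no petals.
        for t_small in range(1, max_count + 1):
            ratio = (t_large / n_large) / (t_small / n_small)
            if ratio >= observed_ratio:
                tail += p_large * pmf_small[t_small]
    return tail

# lam_small = 3.0 is an assumption for illustration only.
print(ratio_tail_probability(lam_small=3.0, observed_ratio=1.56))
```

The cutoff max_count only needs to be far enough into both tails that the omitted probability is negligible; 400 is generous for totals with means near 95 and 36.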

I wonder if a Bayesian approach with a Gamma conjugate prior would give a different posterior distribution for the ratio of the means.
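One way to explore that question is by sampling, sketched below under loudly stated assumptions: a Gamma(1, 1) prior on each per-block rate (an arbitrary choice) and made-up petal totals of 98 and 36, picked only so that the ratio of sample means is about 1.56. With a Gamma(shape, rate) prior and Poisson counts, the posterior for each rate is Gamma(shape + total, rate + number of blocks).

```python
import random

def posterior_ratio_samples(total_large, total_small,
                            n_large=21, n_small=12,
                            prior_shape=1.0, prior_rate=1.0,
                            n_samples=20000, seed=0):
    """Sample the posterior of (rate per large block) / (rate per small block)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        lam_large = rng.gammavariate(prior_shape + total_large,
                                     1.0 / (prior_rate + n_large))
        lam_small = rng.gammavariate(prior_shape + total_small,
                                     1.0 / (prior_rate + n_small))
        samples.append(lam_large / lam_small)
    return samples

# Made-up totals, not the real petal counts.
samples = posterior_ratio_samples(total_large=98, total_small=36)
fraction_above = sum(s > 1.5 for s in samples) / len(samples)
print(fraction_above)  # posterior probability that the ratio exceeds 1.5
```

The result is a posterior probability about the ratio itself rather than a p-value, so the two numbers answer different questions and need not agree.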

2013 April 29

Scientists need math

Filed under: Uncategorized — gasstationwithoutpumps @ 14:28

At the beginning of April (but not on April Fool’s Day), the Wall Street Journal published an essay by E.O. Wilson (a famous biologist): Great Scientists Don’t Need Math. The gist of the article is that Dr. Wilson never learned much math and did well in biology, so others can do so also:

Wilson’s Principle No. 1: It is far easier for scientists to acquire needed collaboration from mathematicians and statisticians than it is for mathematicians and statisticians to find scientists able to make use of their equations.

Wilson’s Principle No. 2: For every scientist, there exists a discipline for which his or her level of mathematical competence is enough to achieve excellence.

The first principle is probably true, but is more a sociological statement than one inherent to the disciplines: applied mathematicians and statisticians welcome collaborations with all sorts of scientists and are happy to learn about and work on real problems that come up elsewhere, while biologists (particularly old-school ones like Dr. Wilson) tend not to be interested in anything outside their own labs and those of their close collaborators and competitors.

The second principle is possibly also true, though much less so than in the past.  Biology used to be a major refuge for innumerate scientists, but modern biology requires a really strong foundation in statistics, far more than most biology students are trained in. The number of positions for innumerate scientists is rapidly shrinking, while the supply of innumerate biology PhDs is growing rapidly.  In the highly competitive job market for biology research, those who follow E. O. Wilson’s advice have a markedly smaller chance of getting the jobs they desire. Of course, Dr. Wilson seems to be unaware of the decades-long oversupply of biology researchers:

During my decades of teaching biology at Harvard, I watched sadly as bright undergraduates turned away from the possibility of a scientific career, fearing that, without strong math skills, they would fail. This mistaken assumption has deprived science of an immeasurable amount of sorely needed talent. It has created a hemorrhage of brain power we need to stanch.

An undergrad degree in biology (even from Harvard) has not gotten many students much more than low-level technician jobs for most of that time (admission to grad school is the better option, as biology PhDs have been able to get temporary postdoc positions at least).  Perhaps Dr. Wilson considers a dead-end job at little more than minimum wage a suitable scientific career—many others do not.

Dr. Wilson does make one unsubstantiated claim that I agree with:

The annals of theoretical biology are clogged with mathematical models that either can be safely ignored or, when tested, fail. Possibly no more than 10% have any lasting value. Only those linked solidly to knowledge of real living systems have much chance of being used.

Biology is a data-driven science, not a model-driven science (a distinction that physicists trying to jump into the field often miss).  Most of “mathematical biology” has been an attempt to apply physics-like models in places where they don’t really fit.  But there has been a big change in the past 10–15 years, as high-throughput experiments have become common in biology.  Now mathematics (mainly statistics) is needed to make any sense out of the experimental results, and biologists with inadequate training in statistics end up making ludicrously wrong conclusions from their experiments, often claiming high significance for random noise.  To understand the data requires more than Wilson’s “intuition”—it requires a solid understanding of the statistics of big data and multiple hypotheses, as humans are very good at perceiving patterns in random noise.
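The multiple-hypothesis problem is easy to illustrate with a little arithmetic. The sketch below assumes independent tests in which every null hypothesis is true, and uses the Bonferroni correction as one simple (conservative) remedy; the function names and the test counts are just for illustration.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one p-value below alpha) when every null hypothesis is true
    and the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

# 20000 is a made-up stand-in for a genome-scale experiment.
for n in (1, 20, 20000):
    print(n, chance_of_false_positive(n), bonferroni_threshold(n))
```

With only 20 tests at the usual 0.05 threshold, pure noise produces at least one "significant" result about 64% of the time, and with thousands of tests a false positive is essentially guaranteed.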

I was pointed to Dr. Wilson’s WSJ essay by Iddo Friedberg’s post Terrible advice from a great scientist, which has a somewhat different critique of the essay. He accuses Wilson of “not recognizing the generalization from an outlier cannot serve as a viable model, or even an argument to support his position.”  Iddo makes several other points, some of them the same as mine—go read his post! Of course, like me, Dr. Friedberg is a bioinformatician and so sees the central role of statistics in 21st century biology.  Perhaps the two of us are wrong, and innumerate biologists will again have glorious scientific careers, but I think the odds are against it.

2013 April 23

statpics: Venn Disease

Filed under: Uncategorized — gasstationwithoutpumps @ 08:18

One blog I follow is the statpics blog, in which Robert Jernigan posts pictures related to statistics (like wear patterns on doors showing the distribution of where people touch it, or examples of people abusing the notion of a bell curve).  Recently he posted the following Venn diagram from the NY Times as statpics: Venn Disease:

I was going to complain about the Venn diagram as being useless here, as it did not include the number who had none of the conditions, thus not allowing the viewer to determine the probability of each condition separately, which is essential to making any real sense of the figure (are the conditions correlated?).

I did not complain on his blog for two reasons:

  • He requires commenters  to sign in with a Google account, and I prefer to leave blog comments using my WordPress account, so that people can find my blog from the comments.
  • I went back to the original source and found that the NY Times writer or artist had not been quite so cavalier with the data—there was another circle adjacent to the Venn diagram that included all those with none of the conditions.  (I was actually quite surprised to see that Jernigan had omitted an important part of the figure, as he is usually quite sensitive about probability distributions, so truncating a figure to omit one category seems out of character for him.)

I looked a bit at the pairwise comparisons on the last page of the NYTimes article, and decided that this way of presenting the data violated many of the principles of good data presentation.

First, it takes a huge amount of space to present just 3 numbers (the pairwise comparison shows percentages for conditions A&B, A&not B, B&not A).

Second, it is not possible to look at two different comparisons at the same time.

Third, the NYTimes Venn diagrams have rather distracting pointless animation, which is not visible in the static image I copied from Jernigan’s blog.

Fourth, the Venn diagram often implies a correlation (look how often these conditions co-occur!), when the probabilities of the conditions appear to be essentially independent in many cases.  For example, Alzheimer’s and high blood pressure co-occur in 24% of the nursing home residents in the sample, but with probabilities of 46% for Alzheimer’s and 57% for high blood pressure, one would expect about 26% to have both if they were independent conditions.
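The independence check in that last point is just a product of the two marginal probabilities, as in this small sketch (the variable and function names are mine; the three percentages are the ones quoted above):

```python
def expected_joint_if_independent(p_a, p_b):
    """P(A and B) when A and B are independent."""
    return p_a * p_b

p_alzheimers = 0.46
p_high_bp = 0.57
observed_both = 0.24

expected_both = expected_joint_if_independent(p_alzheimers, p_high_bp)
print(f"expected {expected_both:.1%}, observed {observed_both:.1%}")
# prints "expected 26.2%, observed 24.0%"
```

The observed overlap is slightly below the independent expectation, which is the opposite of what the overlapping circles suggest at a glance.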

The basic point of the original story is that people in assisted living facilities have very high probabilities of a debilitating medical condition (well, duh! that’s why they’re in assisted living, and not a lower-cost housing option) and that multiple conditions are common. One of their main points is that 9% of residents of assisted-living facilities have all three of dementia, heart disease, and high blood pressure, and that “treating these patients is extremely difficult because of complicated drug regimes and numerous side effects.”

Within the assisted living population the conditions seem to be nearly independent (though that is hard to tell from the Venn diagrams—they don’t give the sizes of all the parts in the 3-variable Venn diagram, and I did not click through all the pairs to check pairwise independence from the 2-variable Venn diagrams). But that near-independence may mean that multiple conditions are more common than a naive prediction based on independence in the overall population would suggest. To determine whether the conditions are correlated, one would have to look at the whole population at a given age, rather than just at the selected population in assisted living, since that selection probably under-represents those with no debilitating conditions. (I also wonder how “assisted-living facility” is defined, since I know that the definitions are quite different in California and Colorado, with a much looser definition in Colorado that would include many of the “independent-living” facilities in California.)

Doing a proper analysis of the data would require going back to the original study, which the byline-less NYTimes article only refers to vaguely as “the study, by the National Center for Health Statistics in 2010”.   I’m not interested enough to search for that study and check whether it has enough information to tell whether any of the co-occurrences are really surprising.
