One blog I follow is the statpics blog, in which Robert Jernigan posts pictures related to statistics (like wear patterns on doors showing the distribution of where people touch them, or examples of people abusing the notion of a bell curve). Recently he posted the following Venn diagram from the NY Times as statpics: Venn Disease:
I was going to complain about the Venn diagram as being useless here, as it did not include the number who had none of the conditions, thus not allowing the viewer to determine the probability of each condition separately, which is essential to making any real sense of the figure (are the conditions correlated?).
I did not complain on his blog for two reasons:
- He requires commenters to sign in with a Google account, and I prefer to leave blog comments using my WordPress account, so that people can find my blog from the comments.
- I went back to the original source and found that the NY Times writer or artist had not been quite so cavalier with the data—there was another circle adjacent to the Venn diagram that included all those with none of the conditions. (I was actually quite surprised to see that Jernigan had omitted an important part of the figure, as he is usually quite sensitive about probability distributions, so truncating a figure to omit one category seems out of character for him.)
I looked a bit at the pairwise comparisons on the last page of the NYTimes article, and decided that this way of presenting the data violated many of the principles of good data presentation.
First, it takes a huge amount of space to present just 3 numbers (the pairwise comparison shows percentages for conditions A & B, A & not B, B & not A).
Second, it is not possible to look at two different comparisons at the same time.
Third, the NYTimes Venn diagrams have rather distracting pointless animation, which is not visible in the static image I copied from Jernigan’s blog.
Fourth, the Venn diagram often implies a correlation (look how often these conditions co-occur!), when the conditions appear to be essentially independent in many cases. For example, Alzheimer’s and high blood pressure co-occur in 24% of the nursing home residents in the sample, but with probabilities of 46% for Alzheimer’s and 57% for high blood pressure, one would expect about 26% (0.46 × 0.57) to have both if they were independent conditions.
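The independence check above is simple enough to write out. Here is a minimal Python sketch using only the three percentages quoted in the previous paragraph; the variable names are mine, and the "neither" figure is derived by inclusion-exclusion rather than read off the NY Times graphic.

```python
# Percentages quoted above for the Alzheimer's / high blood pressure pair.
p_alz = 0.46          # P(Alzheimer's)
p_hbp = 0.57          # P(high blood pressure)
observed_both = 0.24  # P(both), as shown in the Venn diagram

# If the two conditions were independent, P(both) would be the product.
expected_both = p_alz * p_hbp

# The remaining cells of the 2x2 table follow from the three numbers.
alz_only = p_alz - observed_both
hbp_only = p_hbp - observed_both
neither = 1.0 - p_alz - p_hbp + observed_both  # inclusion-exclusion

# Ratio of observed to expected: 1.0 means exactly independent.
ratio = observed_both / expected_both

print(f"expected if independent: {expected_both:.1%}")
print(f"observed:                {observed_both:.1%}")
print(f"observed / expected:     {ratio:.2f}")
print(f"Alzheimer's only {alz_only:.0%}, HBP only {hbp_only:.0%}, neither {neither:.0%}")
```

A ratio a bit below 1 means slightly fewer co-occurrences than independence would predict, which is the opposite of what the overlapping circles suggest at a glance.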
The basic point of the original story is that people in assisted living facilities have very high probabilities of a debilitating medical condition (well, duh! that’s why they’re in assisted living, and not a lower-cost housing option) and that multiple conditions are common. One of their main points is that 9% of residents of assisted-living facilities have all three of dementia, heart disease, and high blood pressure, and that “treating these patients is extremely difficult because of complicated drug regimes and numerous side effects.”
Within the assisted living population the conditions seem to be nearly independent (though that is hard to tell from the Venn diagrams—they don’t give the sizes of all the parts in the 3-variable Venn diagram, and I did not click through all the pairs to check pairwise independence from the 2-variable Venn diagrams). But that near-independence may mean that multiple conditions are more common than a naive prediction based on independence in the overall population would suggest. To determine whether the conditions are correlated, one would have to look at the whole population at a given age, rather than just at the selected population in assisted living, since that selection probably under-represents those with no debilitating conditions. (I also wonder how “assisted-living facility” is defined, since I know that the definitions are quite different in California and Colorado, with a much looser definition in Colorado that would include many of the “independent-living” facilities in California.)
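To illustrate the selection argument, here is a small Python sketch with entirely made-up numbers: the population prevalences and the admission probabilities are hypothetical assumptions of mine, not figures from the article or the study. It starts from two conditions that are exactly independent in the overall population, then selects people into assisted living with a much lower probability when they have neither condition.

```python
# HYPOTHETICAL population prevalences (not from the article or the study).
p_a = 0.20
p_b = 0.30

# Conditions are exactly independent in the overall population.
population = {
    "both": p_a * p_b,
    "a_only": p_a * (1 - p_b),
    "b_only": (1 - p_a) * p_b,
    "neither": (1 - p_a) * (1 - p_b),
}

# HYPOTHETICAL chance of ending up in assisted living for each group:
# people with no condition are heavily under-represented.
admit = {"both": 0.50, "a_only": 0.50, "b_only": 0.50, "neither": 0.05}

# Distribution within the selected (assisted-living) group.
weights = {k: population[k] * admit[k] for k in population}
total = sum(weights.values())
selected = {k: w / total for k, w in weights.items()}

p_a_sel = selected["both"] + selected["a_only"]
p_b_sel = selected["both"] + selected["b_only"]
both_sel = selected["both"]
expected_sel = p_a_sel * p_b_sel  # what independence would predict in-sample

print(f"in sample: P(A)={p_a_sel:.1%}, P(B)={p_b_sel:.1%}")
print(f"in sample: observed both={both_sel:.1%}, "
      f"expected if independent={expected_sel:.1%}")
```

With these assumed numbers, conditions that are independent in the population look strongly negatively associated in the selected group. So seeing near-independence inside assisted living is at least consistent with a positive association in the overall population, though the size of the effect depends entirely on the assumed selection probabilities.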
Doing a proper analysis of the data would require going back to the original study, which the byline-less NYTimes article only refers to vaguely as “the study, by the National Center for Health Statistics in 2010”. I’m not interested enough to search for that study and check whether it has enough information to tell if any of the co-occurrences are really surprising.