There was a recent report about how many students were taking AP CS exams, breaking out the information by gender, race, and state, which has been released in a few different forms. Mark Guzdial’s blog post provides pointers to the data collected by Barbara Ericson. Some of the comments on that post show an appalling lack of statistical reasoning (like comparing states by subtracting percentages of different things).
So what are the interesting questions to ask of the data and how should they be handled statistically?
Most of the “gee-whiz” statements are about how few people in some group or other took (or passed) the AP CS exam:
 No females took the exam in Mississippi, Montana, and Wyoming.
 11 states had no Black students take the exam: Alaska, Idaho, Kansas, Maine, Mississippi, Montana, Nebraska, New Mexico, North Dakota, Utah, and Wyoming.
Some people pointed out that some of these numbers may be nothing more than a small-sample effect (no one took the exam in Wyoming, so having zero female test takers is not surprising). How can we best state that a number is interesting?
Generally, this is done by creating a null model: one that computes the probability of different outcomes based on everything except the hypothesis being tested. Then you look at how surprising the observed outcome is given the null model. Exactly how the null model is constructed is crucial, as all that the statistical tests tell you is how badly your null model fits the data.
What sort of mathematical model should we be using for assigning probabilities to numbers of test takers (or numbers passing the test)? One convenient one is a binomial distribution. The binomial distributions are a family of distributions over nonnegative integers with two parameters, N and p. They are good for modeling the count of independent events, each of which occurs with some fixed probability. If we think of each high school student in a state as having some (small) probability of taking the exam, then the number of exam takers can be modelled as a binomial distribution whose N value is the number of students and whose p is the probability that each one takes the exam. When N is large (as it would be for the number of high school students in a state) and Np is reasonably large, the binomial distribution can be approximated by a normal distribution with mean Np and variance Np(1-p), but an even better approximation is the Poisson distribution with mean Np, which is what I’ll use here. The probability of zero test takers: $P(0) = e^{-Np}$.
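To get a feel for how close the Poisson approximation is in this setting, here is a minimal sketch comparing the exact binomial probability of zero test takers, (1-p)^N, with the Poisson value e^(-Np). The student count and per-student probability are made-up numbers chosen only for illustration; they are not taken from the AP data.

```python
import math

def binomial_p_zero(n, p):
    """Exact binomial probability of zero successes in n independent trials."""
    return (1.0 - p) ** n

def poisson_p_zero(mean):
    """Poisson probability of zero events when the expected count is `mean`."""
    return math.exp(-mean)

# Hypothetical values chosen only for illustration (not from the AP data).
n_students = 50_000
p_take_exam = 0.00004
expected_takers = n_students * p_take_exam  # Np

print(f"expected test takers: {expected_takers:.2f}")
print(f"binomial P(0): {binomial_p_zero(n_students, p_take_exam):.6f}")
print(f"Poisson  P(0): {poisson_p_zero(expected_takers):.6f}")
```

With N this large and p this small, the two probabilities differ only far past the decimal places that matter for a significance test.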
So all we need to set the parameters of our null model is an expected number of test takers based on everything except what we wanted to test. For example, if we wanted to test whether black test takers were underrepresented in Maine, we would need a model that predicted how many black students would take the test, perhaps using the probability that students in Maine would take the test independent of race and the fraction of students in Maine that are black. For Maine, there were 161 test takers, and 0 black test takers. I don’t know the racial mix of high school students in Maine, but Wikipedia gives the black fraction of the whole state population as 1.03%. Thus the expected number of black test takers is 161 × 0.0103 ≈ 1.658, and we can use $e^{-1.658} \approx 0.19$ as the probability of seeing zero black test takers by chance.
UPDATE: 2014 Feb 1. Some values in the following table corrected, due to clerical errors in copying from spreadsheet (I’m not sure which I hate worse, spreadsheets or HTML tables—they’re both awful formats).
state         # test takers  state % black  expected black test takers  underrep p<
Alaska        21             4.27%          0.897                       0.41
Idaho                        0.95%
Kansas                       6.15%
Maine         161            1.03%          1.658                       0.19
Mississippi                  37.3%
Montana                      0.67%
Nebraska                     4.50%
New Mexico                   2.97%
North Dakota                 1.08%
Utah                         1.27%
Wyoming                      1.29%
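The two rows of the table above that list a test-taker count (Alaska and Maine) can be reproduced with a few lines of Python. This is only a sketch of the null model described in the text; the function names are my own, and the state's overall black population fraction stands in for the unknown fraction among high school students.

```python
import math

# (state, number of AP CS test takers, black fraction of state population)
# Only the rows with a test-taker count in the table are included.
rows = [
    ("Alaska", 21, 0.0427),
    ("Maine", 161, 0.0103),
]

def expected_count(test_takers, group_fraction):
    """Expected group members among the test takers under the null model."""
    return test_takers * group_fraction

def p_zero(expected):
    """Poisson probability of seeing zero when `expected` are predicted."""
    return math.exp(-expected)

for state, takers, fraction in rows:
    expected = expected_count(takers, fraction)
    print(f"{state}: expected {expected:.3f}, P(zero) = {p_zero(expected):.2f}")
# Alaska: expected 0.897, P(zero) = 0.41
# Maine: expected 1.658, P(zero) = 0.19
```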
Even before we do a correction for having 51 hypotheses (50 states plus the District of Columbia), none of these “no black students” states shows significant underrepresentation of black students. None of the states had so few students that a black test taker would have been surprising (except Wyoming).
One can do similar computations to show that the lack of women in Mississippi, Montana, and Wyoming is not surprising. Montana looks surprising if treated as a single hypothesis (p<0.004), but not after multiple-hypothesis correction (E-value=0.21). Even combining all three states (which increases the number of hypotheses enormously and would call for a stronger multiple-hypothesis correction), the underrepresentation of women in those states is not statistically significant.
There are states that do have significant underrepresentation of women: for example, Utah had 103 test takers, only 4 of whom were women. With an expected number of about 51.5, this is p<1.4E-16. Even with 51× multiple-hypothesis correction, this underrepresentation is hugely significant. Looking nationwide, total counts were 5485 female test takers out of 29555 total test takers. That’s p<1.4E-1677. The highest percentage of female test takers was in Tennessee, with 73 out of 251, which is p<2.6E-7, again highly significant.
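Tail probabilities this small underflow ordinary floating point (the nationwide one is far below the smallest positive double), so here is a sketch that sums the Poisson lower tail in log space and reports log10(p). It assumes the null model implied by the expected counts above, namely that half of the test takers would be female; the function names are mine, and this is not necessarily how the figures in the text were computed.

```python
import math

def poisson_log_pmf(k, mean):
    """Natural log of the Poisson probability of exactly k events."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)

def poisson_log10_lower_tail(k, mean):
    """log10 of P(X <= k) for X ~ Poisson(mean), summed stably in log space."""
    log_terms = [poisson_log_pmf(i, mean) for i in range(k + 1)]
    largest = max(log_terms)
    log_sum = largest + math.log(sum(math.exp(t - largest) for t in log_terms))
    return log_sum / math.log(10)

N_HYPOTHESES = 51  # 50 states plus the District of Columbia

# (label, female test takers, total test takers), figures quoted in the text
cases = [
    ("Utah", 4, 103),
    ("Tennessee", 73, 251),
    ("Nationwide", 5485, 29555),
]

results = {}
for label, females, total in cases:
    expected = total / 2  # assumed null model: half of test takers are female
    results[label] = poisson_log10_lower_tail(females, expected)
    print(f"{label}: expected {expected:.1f}, log10(p) = {results[label]:.2f}")

# Multiplying p by the number of hypotheses is adding log10(51) in log space.
utah_log10_e_value = results["Utah"] + math.log10(N_HYPOTHESES)
print(f"Utah: log10(E-value) = {utah_log10_e_value:.2f}")
```

The log10(p) values that come out should sit at or below the bounds quoted above, and adding log10(51) barely moves them, which is why the multiple-hypothesis correction does not change the conclusion for Utah.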
Tennessee also had a high proportion of black test takers with 25 out of 251. With an expected number of 42.12, this is p<0.003 (still significantly underrepresented). To see if black students were underrepresented nationwide, one would have to add up the expected numbers for each state and see how the actual number compared with the expected number. (I’m certain that the underrepresentation is hugely significant since even the states with high numbers of black test takers are underrepresented, but I’m too lazy to do the multiplication and addition needed.)
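The adding-up step can be sketched as follows. Because a sum of independent Poisson counts is itself Poisson with the summed mean, the combined observed count can be tested against the combined expected count. The sketch uses only the three states whose black test-taker figures appear in this post, so the result illustrates the method and says nothing about the nationwide total, which would need a row for every state.

```python
import math

def poisson_lower_tail(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(
        math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
        for i in range(k + 1)
    )

# (state, expected black test takers, observed black test takers)
# Illustration only: just the three states with figures given in this post.
rows = [
    ("Alaska", 21 * 0.0427, 0),
    ("Maine", 161 * 0.0103, 0),
    ("Tennessee", 42.12, 25),
]

total_expected = sum(expected for _, expected, _ in rows)
total_observed = sum(observed for _, _, observed in rows)
p_value = poisson_lower_tail(total_observed, total_expected)

print(f"expected {total_expected:.2f}, observed {total_observed}, p = {p_value:.4f}")
```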
The case can clearly be made for female and black students being underrepresented, though pointing to the states with 0 female or 0 black test takers is not the way to do it. (From a marketing standpoint, rather than a statistical one, shouting “no black test takers in these states” or “no female test takers in these other states” may be exactly the right way to get attention, even though the real story about blacks and females is in the states where there were enough test takers to say something about them after dividing them into subgroups.)
A case could also be made for some states having far fewer CS AP test takers than others. One would need to come up with an expected number of test takers from some model (for example, by state population as a share of national population, or by number of total AP test takers in state as share of national total AP test takers). The second model would correct for state-to-state differences in age distribution or in popularity of AP exam taking in general. One could also base predictions on some other STEM test, such as AP Calculus, if one wanted to control for different amounts of STEM instruction in different states.
Let’s look at the states with no black test takers again, to see if they are significantly underrepresented in CS. There were 29555 AP CS tests taken nationwide and 3,824,691 AP tests nationwide total, so we would expect the CS tests taken in a state to be 0.77% of the total for the state.
state         # CS test takers  # all test takers  expected CS test takers  p <     E-value
Alaska        21                4570               35.31                    0.0066  0.34
Idaho                           9723               75.13
Kansas                          15339              118.53
Maine         161               14051              108.58                   0.9999
Mississippi                     9032               69.79
Montana                         4868               37.62
Nebraska                        11117              85.91
New Mexico                      13365              103.28
North Dakota                    2295               17.73
Utah                            35721              276.03
Wyoming                         2050               15.84
Of these eleven states, eight appear to be underrepresented in CS test takers (Maine is significantly overrepresented in CS test takers). When I do the multiple-hypothesis correction for having 51 different “states” (including the District of Columbia), the mild underrepresentation in Alaska and North Dakota is no longer significant, but the other eight are.
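For the two states whose CS test-taker counts appear in the table (Alaska and Maine), the expected count, the lower-tail p-value, and the E-value can be recomputed as below. This is a sketch under the model described above: the expected CS count is the state's total AP test count times the national CS share, and the E-value is simply p times 51. The function and variable names are my own.

```python
import math

def poisson_lower_tail(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(
        math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
        for i in range(k + 1)
    )

CS_TESTS_NATIONWIDE = 29555
ALL_AP_TESTS_NATIONWIDE = 3_824_691
N_HYPOTHESES = 51  # 50 states plus the District of Columbia

cs_share = CS_TESTS_NATIONWIDE / ALL_AP_TESTS_NATIONWIDE  # about 0.77%

# (state, AP CS test takers, all AP test takers)
rows = [("Alaska", 21, 4570), ("Maine", 161, 14051)]

results = {}
for state, cs_takers, all_takers in rows:
    expected = all_takers * cs_share
    p_value = poisson_lower_tail(cs_takers, expected)
    e_value = p_value * N_HYPOTHESES  # only meaningful when p is small
    results[state] = (expected, p_value, e_value)
    print(f"{state}: expected {expected:.2f}, p = {p_value:.4f}, E-value = {e_value:.2f}")
```

Alaska comes out at roughly p = 0.0066 and E-value = 0.34, matching the table: surprising as a single hypothesis, but not after the correction for 51 hypotheses.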
So the zero black AP CS test takers for the nine states can be fairly confidently attributed to the lack of AP CS test takers, and in Maine to the shortage of black students. For Alaska, the lack of black AP CS test takers is probably due to the shortage of AP CS test takers in the state.
One can generalize the techniques here to any method of predicting the mean number of students in some category, to see whether the observed number is significantly smaller than the predicted number. When the predicted number is small, even 0 students may not be statistically significant underrepresentation.