# Gas station without pumps

## 2014 January 17

### CS commenters need to learn statistics

There was a recent report about how many students were taking AP CS exams, breaking out the information by gender, race, and state, which has been released in a few different forms.  Mark Guzdial’s blog post provides pointers to the data collected by Barbara Ericson.  Some of the comments provided on that post shows an appalling lack of statistical reasoning (like comparing states by subtracting percentages of different things).

So what are the interesting questions to ask of the data and how should they be handled statistically?

Most of the “gee-whiz” statements are about how few people in some group or other took (or passed) the AP CS exam:

• No females took the exam in Mississippi, Montana, and Wyoming.
• 11 states had no Black students take the exam: Alaska, Idaho, Kansas, Maine, Mississippi, Montana, Nebraska, New Mexico, North Dakota, Utah, and Wyoming.

Some people pointed out that some of these numbers may not be more than a small sample effect (no one took the exam in Wyoming, so having zero female test takers is not surprising).  How can we best state that a number is interesting?

Generally , this is done by creating a null model—one that computes the probability of different outcomes based on everything except the hypothesis being tested.  Then you look at how surprising the observed outcome is given the null model.   Exactly how the null model is constructed is crucial, as all that the statistical tests tell you is how badly your null model fits the data.

What sort of mathematical model should we be using for assigning probabilities to numbers of test takers (or numbers passing the test)?  One convenient one is a binomial distribution.  The binomial distributions are  a family of distributions over non-negative integers with two parameters N and p.  They are good for modeling the count of a number of independent events each of which occurs with some fixed probability.  If we think of each high school student in a state as having some (small) probability of taking the exam, then the number of exam takers can be modelled as a binomial distribution whose N value is the number of students and p the probability that each one takes the exam.  When N is large (as it would be for the number of high school students in a state) and Np is reasonably large, then the binomial distribution can be approximated by a normal distribution with mean Np and variance Np(1-p), but an even better approximation is to use the Poisson distribution with mean Np, which is what I’ll use here. The probability of zero test takers: $P(0)= \binom{n}{0} p^0 (1-p)^{n-0} = (1-p)^n \approx e^{-np}$.

So all we need to set the parameters of our null model is an expected number of test takers based on everything except what we wanted to test.  For example, if we wanted to test whether black test takers were under-represented in Maine, we would need a model that predicted how many black students would take the test, perhaps using the probability that students in Maine would take the test independent of race and the fraction of students in Maine that are black.  For Maine, there were 161 test takers, and 0 black test takers.  I don’t know the racial mix of high school students in Maine, but Wikipedia gives the black fraction of the whole state population as 1.03%.  Thus the expected number of black test takers is 1.658, and we can use $e^{-1.658}$ as the probability of seeing zero black test takers by chance.

UPDATE: 2014 Feb 1.  Some values in the following table corrected, due to clerical errors in copying from spreadsheet (I’m not sure which I hate worse, spreadsheets or HTML tables—they’re both awful formats).

state # test takers state % black expected black test takers under-rep p<
Idaho 6 47  0.95%  0.086 0.447  0.92 0.64
Kansas 12 47  6.15%  0.738 2.891  0.48 0.056
Maine  161  1.03%  1.658  0.19
Mississippi 2 1  37.3%  0.746 0.373  0.47 0.69
Montana 0 11  0.67%  0 0.074  1 0.93
Nebraska 12 46  4.50%  0.540 2.070  0.58 0.126
New Mexico 7 57  2.97%  0.208 1.693  0.81 0.184
North Dakota 1 9  1.08%  0.011 0.097  0.99 0.91
Utah 11 103  1.27%  0.140 1.308  0.87 0.27
Wyoming 2 0  1.29%  0.026 0  0.97 1

Even before we do a correction for having 51 hypotheses (50 states plus District of Columbia), none of these “no black students” states shows significant under-representation of black students. In fact, it would have been significantly surprising if the test taker in North Dakota had been black. None of the states had so few students that a black test taker would have been surprising (except Wyoming).

One can do similar computations to show that the lack of women in Mississippi, Montana, and Wyoming is not surprising.  Montana looks surprising if treated as a single hypothesis (p<0.004), but not after multiple-hypothesis correction (E-value=0.21). Even combining all three states (which increases the number of hypotheses enormously and would call for a stronger multiple-hypothesis correction), the under-representation of women in those states is not statistically significant.

There are states that do have significant under-representation of women: for example, Utah had 103 test takers, only 4 of whom were women. With an expected number of about 51.5, this is p<1.4E-16. Even with 51× multiple hypothesis correction, this under-representation is hugely significant.  Looking nationwide, total counts were 5485 female test takers out of 29555 total test takers.  That’s p< 1.4E-1677. The highest percentage of female test takers was in Tennessee, with 73 out of 251, which is  p< 2.6E-7, again highly significant.

Tennessee also had a high proportion of black test takers with 25 out of 251.  With an expected number of 42.12, this is p<0.003 (still significantly under-represented).  To see if black students were under-represented nationwide, one would have to add up the expected numbers for each state and see how the actual number compared with the expected number.  (I’m certain that the under-representation is hugely significant since even the states with high numbers of black test takers are under-represented,  but I’m too lazy to do the multiplication and addition needed.)

The case can clearly be made for female and black students being under-represented, though pointing to the states with 0 female or 0 black test takers is not the way to do it. (From a marketing standpoint, rather than a statistical one , shouting “no black test takers in these states”, “no female test takers in these other states” may be exactly the right way to get attention, even though the real story about blacks and females is in the states where there were enough test takers to say something about them after dividing them into subgroups.)

A case could also be made for some states having far fewer CS AP test takers than others.  One would need to come up with an expected number of test takers from some model (for example, by state population as a share of national population, or by number of total AP test takers in state as share of national total AP test takers).  The second model would correct for state-to-state differences in age distribution or in popularity of AP exam taking in general.  One could also base predictions on some other STEM test, such as AP Calculus, if one wanted to control for different amounts of STEM instruction in different states.

Let’s look at the states with no black test takers again, to see if they are significantly under-represented in CS.  There were 29555 AP CS tests taken nationwide and 3,824,691 AP tests nationwide total, so we would expect the CS tests taken in a state to be 0.77% of the total for the state.

state #  CS test takers # all test takers expected CS test takers p < E-value
Alaska 21 4570 35.31 0.0066 0.34
Idaho 6 47 9723 75.13 6.3E-25 3.3E-4 3E-23 1.7E-4
Kansas 12 47 15339 118.53 5.95E-36 6.25E-14 3E-34 3.2E-12
Maine 161 14051 108.58 0.9999
Mississippi 2 1 9032 69.79 1.23E-27 3.5E-29 6E-26 1.8E-27
Montana 0 11 4868 37.62 4.59E-17 3.4E-7 2E-15 1.7E-5
Nebraska 12 46 11117 85.91 1.9e-23 1.7E-6 1E-21 8.8E-7
New Mexico 7 57 13365 103.28 3.7E-35 4.7E-7 2E-33 2.4E-5
North Dakota 1 9 2295 17.73 3.7E-7 0.018 2E-5 0.91
Utah 11 103 35721 276.03 2.4E-101 5.6E-23 1E-99 2.8E-21
Wyoming 2 0 2050 15.84 1.9E-5 1.3E-7 0.00096 6.7E-6

Of these eleven states, eight appear to be under-represented in CS test takers (Maine is significantly over-represented in CS test takers).  When I do the multiple-hypothesis correction for having 51 different “states” (including the District of Columbia), the mild under-representation in Alaska and North Dakota is no longer significant, but the other nine eight are.

So the zero black AP CS test takers for the nine states can be fairly confidently attributed to the lack of AP CS test takers, and in Maine to the shortage of black students.  For Alaska, the lack of black AP CS test takers is probably due to the shortage of AP CS test takers in the state.

One can generalize the techniques here to any method of predicting the mean number of students in some category, to see whether the observed number is significantly smaller than the predicted number.  When the predicted number is small, even 0 students may not be statistically significant under-representation.