# Gas station without pumps

## 2016 May 16

### DASL Updated

Filed under: Uncategorized — gasstationwithoutpumps @ 08:25

Tim Erickson, a statistics teacher, announced on his blog, A Best-Case Scenario, in the post “DASL Updated. Mostly improved.”:

The Data and Story Library, originally hosted at Carnegie-Mellon, was a great resource for data for many years. But it was unsupported, and was getting a bit long in the tooth. The good people at Data Desk have refurbished it and made it available again.

Here is the link. If you teach stats, make a bookmark:  http://dasl.datadesk.com/

It looks like there are a number of good small data sets there, suitable for toy problems in statistics classes.

## 2016 February 8

### New statistics game

Filed under: Uncategorized — gasstationwithoutpumps @ 19:56

http://guessthecorrelation.com/ is a game (with very retro graphics) to guess correlations from scatter plots. It is surprisingly difficult to do well, especially since Pearson’s r is so heavily dominated by the outliers, while our visual perception is more attuned to the group.
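To see how strongly a single outlier can dominate Pearson’s r, here is a minimal sketch. The data points are made up for illustration (they are not from the game), and `pearson_r` is a hand-written helper rather than a library call.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical cluster with essentially no linear trend.
cluster_x = [1, 2, 3, 4, 5]
cluster_y = [2, 1, 3, 1, 2]

r_cluster = pearson_r(cluster_x, cluster_y)
# Add one far-away point; the cluster itself is unchanged.
r_with_outlier = pearson_r(cluster_x + [20], cluster_y + [20])

print(f"r for the cluster alone: {r_cluster:.3f}")
print(f"r with one outlier added: {r_with_outlier:.3f}")
```

The eye still sees a shapeless cluster plus one stray point, but the computed correlation jumps from about zero to nearly one.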

I’d like to thank Robert Jernigan for his post http://statpics.blogspot.com/2016/02/correlation-guessing.html that pointed me to the game.  (My current high score is 163.)

## 2015 November 16

### How scientists fool themselves

Filed under: Uncategorized — gasstationwithoutpumps @ 21:01

Nature News & Comment has just published a good comment:
How scientists fool themselves—and how they can stop

The comment goes through a number of the standard ways people fool themselves, but skirts around the most important one in modern biology: failure to correct for testing multiple hypotheses. They mention “p-hacking” as a problem, but their prescription is just “don’t do it” rather than explaining how one corrects for testing many hypotheses.

I think that the comment could have been much stronger if they had gotten some statisticians to provide the real corrective measures needed, rather than just moralizing about how people fool themselves.

## 2014 April 11

### Arthur Benjamin: Teach statistics before calculus!

I rarely have the patience to sit through a video of a TED talk—like advertisements, I rarely find them worth the time they consume. I can read a transcript of the talk in 1/4 the time, and not be distracted by the facial tics and awkward gestures of the speaker. I was pointed to one TED talk (with about 1.3 million views since Feb 2009) recently that has a message I agree with: Arthur Benjamin: Teach statistics before calculus!

The message is a simple one, though it takes him 3 minutes to make: calculus is the wrong summit for k–12 math to be aiming at.

Calculus is a great subject for scientists, engineers, and economists—one of the most fundamental branches of mathematics—but most people never use it. It would be far more valuable to have universal literacy in probability and statistics, and leave calculus to the 20% of the population who might actually use it someday.  I agree with Arthur Benjamin completely—and this is spoken as someone who was a math major and who learned calculus about 30 years before learning statistics.

Of course, to do probability and statistics well at an advanced level, one does need integral calculus, even measure theory, but the basics of probability and statistics can be taught with counting and summing in discrete spaces, and that is the level at which statistics should be taught in high schools.  (Arthur Benjamin alludes to this continuous vs. discrete math distinction in his talk, but he misleadingly implies that probability and statistics is a branch of discrete math, rather than that it can be learned in either discrete or continuous contexts.)

If I could overhaul math education at the high school level, I would make it go something like

1. algebra
2. logic, proofs, and combinatorics (as in applied discrete math)
3. statistics
4. geometry, trigonometry, and complex numbers
5. calculus

The STEM students would get all 5 subjects, at least by the freshman year of college, and the non-STEM students would stop with statistics or trigonometry, depending on their level of interest in math. I could even see an argument for putting statistics before logic and proof, though I think it is easier to reason about uncertainty after you have a firm foundation in reasoning without uncertainty.

I made a comment along these lines in response to the blog post by Jason Dyer that pointed me to the TED talk. In response, Robert Hansen suggested a different, more conventional order:

1. algebra
2. combinatorics and statistics
3. logic, proofs and geometry
5. calculus

It is common to put combinatorics and statistics together, but that results in confusion on students’ part, because too many of the probability examples are then uniform-distribution counting problems. It is useful to have some combinatorics before statistics (so that counting problems are possible examples), but mixing the two makes it less likely that non-uniform probability (which is what the real world mainly has) will be properly developed. We don’t need more people thinking that if there are only two possibilities, they must be equally likely!

I’ve also always felt that putting proofs together with geometry does damage to both. Analytic geometry is much more useful nowadays than Euclidean-style proofs, so I’d rather put geometry with trigonometry and complex numbers, and leave proof techniques and logic to an algebraic domain.

## 2014 January 17

### CS commenters need to learn statistics

There was a recent report about how many students were taking AP CS exams, breaking out the information by gender, race, and state, which has been released in a few different forms. Mark Guzdial’s blog post provides pointers to the data collected by Barbara Ericson. Some of the comments provided on that post show an appalling lack of statistical reasoning (like comparing states by subtracting percentages of different things).

So what are the interesting questions to ask of the data and how should they be handled statistically?

Most of the “gee-whiz” statements are about how few people in some group or other took (or passed) the AP CS exam:

• No females took the exam in Mississippi, Montana, and Wyoming.
• 11 states had no Black students take the exam: Alaska, Idaho, Kansas, Maine, Mississippi, Montana, Nebraska, New Mexico, North Dakota, Utah, and Wyoming.

Some people pointed out that some of these numbers may not be more than a small sample effect (no one took the exam in Wyoming, so having zero female test takers is not surprising).  How can we best state that a number is interesting?

Generally, this is done by creating a null model: one that computes the probability of different outcomes based on everything except the hypothesis being tested. Then you look at how surprising the observed outcome is given the null model. Exactly how the null model is constructed is crucial, as all that the statistical tests tell you is how badly your null model fits the data.

What sort of mathematical model should we be using for assigning probabilities to numbers of test takers (or numbers passing the test)? One convenient one is a binomial distribution. The binomial distributions are a family of distributions over non-negative integers with two parameters, N and p. They are good for modeling the count of a number of independent events, each of which occurs with some fixed probability. If we think of each high school student in a state as having some (small) probability of taking the exam, then the number of exam takers can be modelled as a binomial distribution whose N value is the number of students and whose p is the probability that each one takes the exam. When N is large (as it would be for the number of high school students in a state) and Np is reasonably large, the binomial distribution can be approximated by a normal distribution with mean Np and variance Np(1-p), but when p is small an even better approximation is the Poisson distribution with mean Np, which is what I’ll use here. The probability of zero test takers is $P(0)= \binom{N}{0} p^0 (1-p)^{N-0} = (1-p)^N \approx e^{-Np}$.
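As a quick check on how close the Poisson approximation is for the zero-count case, here is a minimal sketch. The number of students and the per-student probability are hypothetical values chosen only to make Np small; they are not taken from the AP data.

```python
import math

def binomial_p_zero(n: int, p: float) -> float:
    """Exact binomial probability of zero successes in n independent trials."""
    return (1.0 - p) ** n

def poisson_p_zero(n: int, p: float) -> float:
    """Poisson approximation to the same probability, using mean n * p."""
    return math.exp(-n * p)

# Hypothetical state: 100,000 students, each with a 0.002% chance of
# taking the exam, so the expected number of test takers is 2.
n_students = 100_000
p_take = 0.00002

exact = binomial_p_zero(n_students, p_take)
approx = poisson_p_zero(n_students, p_take)
print(f"exact binomial P(0) = {exact:.6f}")
print(f"Poisson approx P(0) = {approx:.6f}")
```

With a large number of students and a small per-student probability, the two values agree to several decimal places, which is why the rest of the computation can work from the expected count alone.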

So all we need to set the parameters of our null model is an expected number of test takers based on everything except what we wanted to test.  For example, if we wanted to test whether black test takers were under-represented in Maine, we would need a model that predicted how many black students would take the test, perhaps using the probability that students in Maine would take the test independent of race and the fraction of students in Maine that are black.  For Maine, there were 161 test takers, and 0 black test takers.  I don’t know the racial mix of high school students in Maine, but Wikipedia gives the black fraction of the whole state population as 1.03%.  Thus the expected number of black test takers is 1.658, and we can use $e^{-1.658}$ as the probability of seeing zero black test takers by chance.

UPDATE: 2014 Feb 1.  Some values in the following table corrected, due to clerical errors in copying from spreadsheet (I’m not sure which I hate worse, spreadsheets or HTML tables—they’re both awful formats).

| state | # test takers | state % black | expected black test takers | under-rep p< |
|---|---|---|---|---|
| Idaho | 47 | 0.95% | 0.447 | 0.64 |
| Kansas | 47 | 6.15% | 2.891 | 0.056 |
| Maine | 161 | 1.03% | 1.658 | 0.19 |
| Mississippi | 1 | 37.3% | 0.373 | 0.69 |
| Montana | 11 | 0.67% | 0.074 | 0.93 |
| Nebraska | 46 | 4.50% | 2.070 | 0.126 |
| New Mexico | 57 | 2.97% | 1.693 | 0.184 |
| North Dakota | 9 | 1.08% | 0.097 | 0.91 |
| Utah | 103 | 1.27% | 1.308 | 0.27 |
| Wyoming | 0 | 1.29% | 0 | 1 |
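The table entries can be recomputed with a few lines of code. This is a sketch under the same null model as the text (expected count = test takers × black fraction of the state population, p = probability of a zero count under a Poisson distribution); the function and variable names are my own, and the inputs are the corrected values from the table.

```python
import math

# (state, AP CS test takers, black fraction of the state population)
rows = [
    ("Idaho", 47, 0.0095),
    ("Kansas", 47, 0.0615),
    ("Maine", 161, 0.0103),
    ("Mississippi", 1, 0.373),
    ("Montana", 11, 0.0067),
    ("Nebraska", 46, 0.0450),
    ("New Mexico", 57, 0.0297),
    ("North Dakota", 9, 0.0108),
    ("Utah", 103, 0.0127),
    ("Wyoming", 0, 0.0129),
]

def zero_count_probability(test_takers: int, fraction: float):
    """Return (expected count, Poisson probability of seeing zero)."""
    expected = test_takers * fraction
    return expected, math.exp(-expected)

table = {state: zero_count_probability(n, frac) for state, n, frac in rows}

for state, (expected, p_zero) in table.items():
    print(f"{state:13s} expected = {expected:6.3f}   P(0) = {p_zero:.3g}")
```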

Even before we do a correction for having 51 hypotheses (50 states plus District of Columbia), none of these “no black students” states shows significant under-representation of black students. None of the states had so few students that a black test taker would have been surprising (except Wyoming, which had no test takers at all).

One can do similar computations to show that the lack of women in Mississippi, Montana, and Wyoming is not surprising.  Montana looks surprising if treated as a single hypothesis (p<0.004), but not after multiple-hypothesis correction (E-value=0.21). Even combining all three states (which increases the number of hypotheses enormously and would call for a stronger multiple-hypothesis correction), the under-representation of women in those states is not statistically significant.
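The multiple-hypothesis correction used here is simply the single-hypothesis p-value multiplied by the number of hypotheses tested. A minimal sketch for Montana follows, assuming (as the Utah example does) that half of the test takers would be expected to be female; the helper names are hypothetical.

```python
import math

NUM_HYPOTHESES = 51  # 50 states plus the District of Columbia

def zero_count_p_value(expected: float) -> float:
    """Poisson probability of observing zero when `expected` are predicted."""
    return math.exp(-expected)

def e_value(p_value: float, num_hypotheses: int = NUM_HYPOTHESES) -> float:
    """Expected number of results this extreme across all hypotheses tested."""
    return p_value * num_hypotheses

# Montana: 11 test takers, none of them female.
# The 50% expectation is an assumption of the null model.
montana_expected_women = 11 * 0.5
montana_p = zero_count_p_value(montana_expected_women)
montana_e = e_value(montana_p)

print(f"p = {montana_p:.4f}, E-value = {montana_e:.2f}")
```

An E-value of about 0.21 means that roughly one state in five sets of 51 would look this extreme by chance, which is why Montana stops being surprising once all the states are considered together.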

There are states that do have significant under-representation of women: for example, Utah had 103 test takers, only 4 of whom were women. With an expected number of about 51.5, this is p<1.4E-16. Even with 51× multiple hypothesis correction, this under-representation is hugely significant.  Looking nationwide, total counts were 5485 female test takers out of 29555 total test takers.  That’s p< 1.4E-1677. The highest percentage of female test takers was in Tennessee, with 73 out of 251, which is  p< 2.6E-7, again highly significant.
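For counts above zero, the p-value is the lower tail of the Poisson distribution, and for the nationwide numbers it is far too small for ordinary floating point. Here is a sketch that works in log space, assuming a null model in which half of the test takers are female; the function name is my own.

```python
import math

def log10_poisson_lower_tail(observed: int, expected: float) -> float:
    """log10 of P(X <= observed) for X ~ Poisson(expected), with expected > 0.

    Each term is computed as a logarithm and the terms are combined with
    the log-sum-exp trick, so tiny probabilities do not underflow to zero.
    """
    log_terms = [
        -expected + k * math.log(expected) - math.lgamma(k + 1)
        for k in range(observed + 1)
    ]
    biggest = max(log_terms)
    total = sum(math.exp(t - biggest) for t in log_terms)
    return (biggest + math.log(total)) / math.log(10)

# Tennessee: 73 female test takers out of 251.
tennessee_log10_p = log10_poisson_lower_tail(73, 251 / 2)
# Nationwide: 5485 female test takers out of 29555.
nationwide_log10_p = log10_poisson_lower_tail(5485, 29555 / 2)

print(f"Tennessee:  log10(p) = {tennessee_log10_p:.2f}")
print(f"Nationwide: log10(p) = {nationwide_log10_p:.2f}")
```

Reporting log10(p) rather than p keeps numbers with exponents in the thousands representable.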

Tennessee also had a high proportion of black test takers with 25 out of 251.  With an expected number of 42.12, this is p<0.003 (still significantly under-represented).  To see if black students were under-represented nationwide, one would have to add up the expected numbers for each state and see how the actual number compared with the expected number.  (I’m certain that the under-representation is hugely significant since even the states with high numbers of black test takers are under-represented,  but I’m too lazy to do the multiplication and addition needed.)

The case can clearly be made for female and black students being under-represented, though pointing to the states with 0 female or 0 black test takers is not the way to do it. (From a marketing standpoint, rather than a statistical one, shouting “no black test takers in these states”, “no female test takers in these other states” may be exactly the right way to get attention, even though the real story about blacks and females is in the states where there were enough test takers to say something about them after dividing them into subgroups.)

A case could also be made for some states having far fewer CS AP test takers than others.  One would need to come up with an expected number of test takers from some model (for example, by state population as a share of national population, or by number of total AP test takers in state as share of national total AP test takers).  The second model would correct for state-to-state differences in age distribution or in popularity of AP exam taking in general.  One could also base predictions on some other STEM test, such as AP Calculus, if one wanted to control for different amounts of STEM instruction in different states.

Let’s look at the states with no black test takers again, to see if they are significantly under-represented in CS.  There were 29555 AP CS tests taken nationwide and 3,824,691 AP tests nationwide total, so we would expect the CS tests taken in a state to be 0.77% of the total for the state.

| state | # CS test takers | # all test takers | expected CS test takers | p < | E-value |
|---|---|---|---|---|---|
| Alaska | 21 | 4570 | 35.31 | 0.0066 | 0.34 |
| Idaho | 47 | 9723 | 75.13 | 3.3E-4 | 1.7E-2 |
| Kansas | 47 | 15339 | 118.53 | 6.25E-14 | 3.2E-12 |
| Maine | 161 | 14051 | 108.58 | 0.9999 | |
| Mississippi | 1 | 9032 | 69.79 | 3.5E-29 | 1.8E-27 |
| Montana | 11 | 4868 | 37.62 | 3.4E-7 | 1.7E-5 |
| Nebraska | 46 | 11117 | 85.91 | 1.7E-6 | 8.8E-5 |
| New Mexico | 57 | 13365 | 103.28 | 4.7E-7 | 2.4E-5 |
| North Dakota | 9 | 2295 | 17.73 | 0.018 | 0.91 |
| Utah | 103 | 35721 | 276.03 | 5.6E-23 | 2.8E-21 |
| Wyoming | 0 | 2050 | 15.84 | 1.3E-7 | 6.7E-6 |
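The expected counts, p-values, and E-values in this table follow from the national totals quoted above. The sketch below recomputes a few rows, assuming a Poisson null model whose mean is the state’s total AP tests times the national CS share; the function names are hypothetical, and the state figures are the corrected table values.

```python
import math

NATIONAL_CS_TESTS = 29555
NATIONAL_AP_TESTS = 3824691
NUM_HYPOTHESES = 51  # 50 states plus the District of Columbia

def poisson_lower_tail(observed: int, expected: float) -> float:
    """P(X <= observed) for X ~ Poisson(expected), by direct summation."""
    term = math.exp(-expected)  # P(X = 0)
    total = term
    for k in range(1, observed + 1):
        term *= expected / k    # P(X = k) from P(X = k - 1)
        total += term
    return total

def cs_under_representation(cs_takers: int, all_ap_tests: int):
    """Return (expected CS test takers, p-value, E-value) for one state."""
    expected = all_ap_tests * NATIONAL_CS_TESTS / NATIONAL_AP_TESTS
    p_value = poisson_lower_tail(cs_takers, expected)
    return expected, p_value, p_value * NUM_HYPOTHESES

# (CS test takers, all AP tests) for three of the states in the table.
states = {
    "Idaho": (47, 9723),
    "Mississippi": (1, 9032),
    "Wyoming": (0, 2050),
}
results = {name: cs_under_representation(*counts) for name, counts in states.items()}

for name, (expected, p_value, e_val) in results.items():
    print(f"{name:12s} expected = {expected:7.2f}  p = {p_value:.2g}  E = {e_val:.2g}")
```

Direct summation is fine at this scale; for means in the thousands the first term underflows, and the sum would have to be done with logarithms instead.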

Of these eleven states, eight appear to be under-represented in CS test takers (Maine is significantly over-represented in CS test takers). When I do the multiple-hypothesis correction for having 51 different “states” (including the District of Columbia), the mild under-representation in Alaska and North Dakota is no longer significant, but the other eight are.

So the zero black AP CS test takers for the eight states can be fairly confidently attributed to the lack of AP CS test takers, and in Maine to the shortage of black students. For Alaska and North Dakota, the lack of black AP CS test takers is probably due to the shortage of AP CS test takers in the state.

One can generalize the techniques here to any method of predicting the mean number of students in some category, to see whether the observed number is significantly smaller than the predicted number.  When the predicted number is small, even 0 students may not be statistically significant under-representation.
