Our machines are designed for reliably transcribing lots of numbers; our brains aren’t.

Good eyes, Glen! I did indeed make copying errors transferring the numbers from the spreadsheet to the HTML table (two awful formats). I believe I have now corrected the post to get the numbers right. The conclusions do not change substantially, but it is embarrassing to have made so many copying errors.

If you are asking the question “does Illinois have under-representation?” then there is only one hypothesis being tested.

But if you are asking “Do states have under-representation?” and pull out the states with the lowest stats, then you are asking 51 different questions, and need stronger evidence to claim that there is under-representation. It is the standard search problem that trips up a lot of biologists in modern high-throughput biology: if you examine 10,000 genes for over-expression, some of them will appear over-expressed just by chance due to the noise in the data. You need a strong enough signal to separate it from the random noise. The E-value is a (somewhat conservative) way of doing this correction—it is the expected number of states that look this bad just by chance. The E-value correction is similar to the Bonferroni correction, but easier to explain to biologists, and easier to interpret when you are looking for enrichment rather than proof (it is much easier to explain “we expected to find about 10 this good by chance” than “the probability that this arose by chance is 0.9999”).
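The arithmetic of the E-value correction is just the raw p-value scaled by the number of tests. A minimal sketch, with made-up p-values (the state codes and numbers below are hypothetical, not from the post):

```python
# E-value correction sketch: the E-value of a test is its raw p-value
# multiplied by the number of tests performed, i.e. the expected number
# of results at least this extreme under the null hypothesis.

N_TESTS = 51  # one test per state (plus DC)

# Hypothetical raw p-values for the three most extreme-looking states.
raw_p = {"IL": 0.0005, "TX": 0.02, "OH": 0.04}

e_values = {state: p * N_TESTS for state, p in raw_p.items()}

for state, e in sorted(e_values.items(), key=lambda kv: kv[1]):
    # E < 1 means we expect fewer than one such result by chance alone;
    # E > 1 means results this extreme are unsurprising in a 51-way scan.
    print(f"{state}: E = {e:.3f}")
```

With these numbers only IL survives the correction (E ≈ 0.03); the other two are the sort of thing you expect to see once or twice by chance in 51 tries.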

All the corrections you mentioned (for racial composition of the state, age distribution, socioeconomic status, …) come in the model for how many test takers you expect to find. That is often done by linear regression, though bioinformaticians often use simpler (naive Bayes) or more complicated (SVM regression, neural nets, hidden Markov models,…) ways of constructing the null model. People who use regression often forget to leave out the data for the point that they are testing, which distorts the results. Physicists tend to be particularly bad at that, because they are used to strong models where the model is much more believed than the data, and they have trouble dealing with fields where the models are weak as well as the data being noisy (like biology and sociology).
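The leave-one-out point is easy to get wrong in practice, so here is a minimal sketch of what it looks like for a simple linear null model (all data below is invented for illustration; the real model would use the demographic covariates mentioned above):

```python
# Leave-one-out null model: when predicting the expected count for a
# state, fit the regression on every *other* state, so the state being
# tested cannot pull the model toward itself.

def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def loo_prediction(xs, ys, i):
    """Predict point i from a line fit to all the other points."""
    rest_x = xs[:i] + xs[i + 1:]
    rest_y = ys[:i] + ys[i + 1:]
    a, b = fit_line(rest_x, rest_y)
    return a * xs[i] + b

# Hypothetical: x = eligible students (thousands), y = test takers.
xs = [100.0, 200.0, 300.0, 400.0, 500.0]
ys = [12.0, 22.0, 33.0, 41.0, 20.0]  # the last state looks low

pred = loo_prediction(xs, ys, 4)
print(f"expected about {pred:.1f} test takers, observed {ys[4]:.0f}")
```

Fitting on all five points would drag the line down toward the suspect state and understate how anomalous it is; leaving it out gives an honest expected count to compare against.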

The approach I used here was a naive Bayes technique, as that is the easiest to explain and makes the assumptions very clear.
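The key assumption in a naive Bayes null model is that the covariates act independently, so their effects on the expected rate multiply. A toy sketch of that structure (the base rate, group factors, and population cells are all hypothetical):

```python
# Naive-Bayes-style expected count: assume race and income affect the
# test-taking rate independently, so per-cell rates are the national
# base rate times a factor for each group.

BASE_RATE = 0.05  # hypothetical national test-taking rate
RACE_FACTOR = {"A": 1.2, "B": 0.8}
INCOME_FACTOR = {"low": 0.7, "high": 1.3}

# Hypothetical state population broken into (race, income) cells.
state_pop = {("A", "low"): 1000, ("A", "high"): 2000,
             ("B", "low"): 3000, ("B", "high"): 500}

# Independence assumption: factors combine multiplicatively per cell.
expected = sum(n * BASE_RATE * RACE_FACTOR[r] * INCOME_FACTOR[s]
               for (r, s), n in state_pop.items())
print(f"expected test takers: {expected:.1f}")
```

The assumption is visible right in the formula, which is why this kind of model is easy to explain: anyone can check each multiplier and argue with it separately.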

Kathi
