Gas station without pumps

2014 October 25

Grading based on a fixed “percent correct” scale is nonsense

Filed under: Uncategorized — gasstationwithoutpumps @ 10:12

On the mailing list for parents home-schooling high schoolers to prepare for college, parents occasionally discuss grading standards.  One parent commented that grading scales can vary a lot, with the example of an edX course in which 80% or higher was an A, while they were used to scales like those reported by Wikipedia, which gives

The most common grading scales for normal courses and honors/Advanced Placement courses are as follows:

         “Normal” courses           Honors/AP courses
Grade    Percentage   GPA           Percentage   GPA
A        90–100       3.67–4.00     93–100       4.5–5.0
B        80–89        2.67–3.33     85–92        3.5–4.49
C        70–79        1.67–2.33     77–84        2.5–3.49
D        60–69        1.0–1.33      70–76        2.0–2.49
E / F    0–59         0.0–0.99      0–69         0.0–1.99

Because exams, quizzes, and homework assignments can vary in difficulty, there is no reason to suppose that 85% on one assessment has any meaningful relationship to 85% on another assessment.  At one extreme we have driving exams, which are often set up so that 85% right is barely passing: people are expected to get close to 100%.  At the other extreme, we have math competitions: the AMC 12 math exams have a median score around 63 out of 150, and the AMC 10 exams have a median around 58 out of 150.  Getting 85% of the total points on the AMC 12 puts you well inside the top 1% of test takers.  (AMC statistics from ) The Putnam math prize exam is even tougher: the median score is often 0 or 1 out of 120, with top scores in the range 90 to 120. (Putnam statistics from ) The point of the math competitions is to make meaningful distinctions among the top 1–5% of test takers in a relatively short time, so questions that the majority of test takers can answer are just time wasters.
I’ve never seen the point of having a fixed percentage correct used institution-wide for setting grades. The only thing such a standard does is tell teachers how hard to make their test questions.  Saying that 90% or 95% should represent an A merely says that test questions must be easy enough that top students don’t have to work hard, and that distinctions among top students must be buried in the test-measurement noise.  Putting the pass level at 70% means that most of the test questions are being used to distinguish between different levels of failure, rather than different levels of success. My own quizzes and exams are intended to have a mean around 50% of possible points, with a wide spread, to maximize the amount of information I get about student performance at all levels. I tend to err on the side of making the exams a little too tough (35% mean) rather than much too easy (85% mean), so I generally learn more about the top half of the class than the bottom half.
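To make the "mean around 50%" target concrete, here is a minimal sketch that treats a single right/wrong question as a coin flip answered correctly by a fraction p of the class. That simplification, and the function names item_variance and item_entropy_bits, are illustrative assumptions, not how any real exam is scored. Both the variance p(1 − p) and the Shannon entropy of the outcome peak at p = 0.5.

```python
import math


def item_variance(p):
    """Variance of a right/wrong item answered correctly by a fraction p of students."""
    return p * (1.0 - p)


def item_entropy_bits(p):
    """Shannon entropy, in bits, of a right/wrong outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # everyone right or everyone wrong: the question tells us nothing
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Fractions correct taken from the percentages discussed in the post.
for p in (0.35, 0.50, 0.85, 0.95):
    print(f"p = {p:.2f}: variance = {item_variance(p):.4f}, "
          f"entropy = {item_entropy_bits(p):.3f} bits")
```

Under this toy model, a question that 35% of the class gets right has variance 0.2275, close to the maximum of 0.25, while one that 85% get right has variance 0.1275. That is consistent with erring on the tough side costing less information than erring on the easy side.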
I’m ok with knowing more about the top half than the bottom half, but my exams also have a different problem: too often the distribution of results is bimodal, with a high correlation between the points earned on different questions. The questions are all measuring the same thing, which is good for measuring overall achievement, but which is not very useful for diagnosing what things individual students have learned or not learned.  This result is not very surprising, since I’m not interested in whether students know specific factoids, but in whether they can pull together the knowledge that they have to solve new problems.  Those who have developed that skill often can show it on many rather different problems, and those who haven’t struggle on any new problem.

Lior Pachter, in his blog post “Time to end letter grades”, points out that different faculty members have very different understandings of what letter grades mean, resulting in noticeably different distributions of grades for their classes. He looked at very large classes, where one would not expect enormous differences in the abilities of students from one class to another, so large differences in grading distributions are more likely due to differences in the meaning of the grades than to differences between the cohorts of students. He suggests that some sort of normalization be applied, so that raw scores are translated in a professor- and course-specific way to a common scale that has a uniform meaning.  (That may be possible for large classes that are repeatedly taught, but is unlikely to work well in small courses, where year-to-year differences in student cohorts can be huge. I get large year-to-year variance in my intro grad class of about 20 students, with the top of the class some years being only at the performance level of the median in other years.)  His approach at least recognizes that the raw scores themselves are meaningless out of context, unlike people who insist on “90% or better is an A”.
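One simple form such a normalization could take is converting each section's raw scores to z-scores (distance from the section mean, in standard deviations). The sketch below is only a stand-in to show the idea, not the method Pachter proposes. The function name standardize and the two lists of scores are made up for illustration, and the approach assumes the cohorts are comparable, which is exactly what fails in small classes.

```python
from statistics import mean, pstdev


def standardize(raw_scores):
    """Convert one section's raw scores to z-scores (mean 0, standard deviation 1)."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)  # population standard deviation of this section
    if sigma == 0:
        raise ValueError("all scores are identical; there is no spread to normalize")
    return [(x - mu) / sigma for x in raw_scores]


# Hypothetical raw percentages from two sections graded on very different scales.
lenient_section = [78, 85, 88, 92, 97]
harsh_section = [31, 45, 50, 62, 77]

z_lenient = standardize(lenient_section)
z_harsh = standardize(harsh_section)

print("top of lenient section:", round(z_lenient[-1], 2))
print("top of harsh section:  ", round(z_harsh[-1], 2))
```

With these made-up numbers, a raw 97 in one section and a raw 77 in the other both land roughly one and a half standard deviations above their section means, even though the raw percentages look a full letter grade or two apart.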

 People who design large exams professionally generally have training in psychometrics (or should, anyway).  Currently, the most popular approach to designing exams that need to be taken by many people is item-response theory (IRT), in which each question gets a number of parameters expressing how difficult the question is and (for the most common 3-parameter model) how good it is at distinguishing high-scoring from low-scoring people and how much to correct for guessing.  Fitting the 3-parameter model for each question on a test requires a lot of data (certainly more than could be gathered in any of my classes), but provides a lot of information about the usefulness of a question for different purposes.  Exams for go/no-go decisions, like driving exams, should have questions that are concentrated in difficulty near the decision threshold, and that distinguish well between those above and below the threshold.  Exams for ranking large numbers of people with no single threshold (like SAT exams for college admissions in many different colleges) should have questions whose difficulty is spread out over the range of thresholds.  IRT can be used for tuning a test (discarding questions that are too difficult, too easy, or that don’t distinguish well between high-performing and low-performing students), as well as for normalizing results to be on a uniform scale despite differences in question difficulty.  With enough data, IRT can be used to get uniform scale results from tests in which individuals don’t all get presented the same questions (as long as there is enough overlap in questions that the difficulty of the questions can be calibrated fairly), which permits adaptive testing that takes less testing time to get to the same level of precision.  
Unfortunately, the model fitting for IRT is somewhat sensitive to outliers in the data, so very large sample sizes are needed for meaningful fitting, which means that IRT is not a particularly useful tool for classroom tests, though it is invaluable for large exams like the SAT and GRE.
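For readers who have not seen it, the 3-parameter logistic model mentioned above can be written down in a few lines. This is a minimal sketch using the conventional symbols, which are not defined in the post itself: theta for the test taker's ability, a for discrimination, b for difficulty, and c for the guessing floor. The parameter values in the example are hypothetical, and fitting them from real response data is the hard part that needs the large samples.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Probability that a test taker of ability theta answers the item correctly.

    a: discrimination (how sharply the item separates abilities near b)
    b: difficulty (ability at which the curve is halfway between c and 1)
    c: guessing floor (chance of a correct answer at very low ability)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta (standard textbook form)."""
    p = p_correct_3pl(theta, a, b, c)
    return a ** 2 * ((p - c) ** 2 / (1.0 - c) ** 2) * ((1.0 - p) / p)


# Hypothetical item: moderately discriminating, average difficulty, 20% guessing floor.
a, b, c = 1.5, 0.0, 0.2
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}: P(correct) = {p_correct_3pl(theta, a, b, c):.3f}, "
          f"information = {item_information_3pl(theta, a, b, c):.3f}")
```

At theta equal to b the probability is c + (1 − c)/2, which is 0.6 for this item. The information is concentrated at abilities near the difficulty b, so a go/no-go exam wants items with b near the threshold, and a ranking exam wants b spread over the range of abilities of interest.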
The bottom line for me is that the conventional grading scales used in many schools (with 85% as a B, for example) are uninterpretable nonsense that does nothing to convey useful information to teachers, students, parents, or anyone else.  Without a solid understanding of the difficulty of a given assessment, the scores on it mean almost nothing.
