Gas station without pumps

2014 October 25

Grading based on a fixed “percent correct” scale is nonsense

Filed under: Uncategorized — gasstationwithoutpumps @ 10:12

On the hs2coll@yahoogroups.com mailing list for parents home-schooling high schoolers to prepare for college, grading standards come up for discussion occasionally.  One parent commented that grading scales can vary a lot, citing an edX course in which 80% or higher was an A, in contrast to scales like those reported by Wikipedia, which gives

The most common grading scales for normal courses and honors/Advanced Placement courses are as follows:

         “Normal” courses          Honors/AP courses
Grade    Percentage   GPA          Percentage   GPA
A        90–100       3.67–4.00    93–100       4.5–5.0
B        80–89        2.67–3.33    85–92        3.5–4.49
C        70–79        1.67–2.33    77–84        2.5–3.49
D        60–69        1.0–1.33     70–76        2.0–2.49
E / F    0–59         0.0–0.99     0–69         0.0–1.99
Because exams, quizzes, and homework assignments can vary in difficulty, there is no reason to suppose that 85% on one assessment has any meaningful relationship to 85% on another assessment.  At one extreme we have driving exams, which are often set up so that 85% right is barely passing—people are expected to get close to 100%.  At the other extreme, we have math competitions: the AMC 12 math exams have a median score around 63 out of 150, and the AMC 10 exams a median around 58 out of 150.  Getting 85% of the total points on the AMC 12 puts you well inside the top 1% of test takers.  (AMC statistics from http://amc-reg.maa.org/reports/generalreports.aspx) The Putnam math prize exam is even tougher—the median score is often 0 or 1 out of 120, with top scores in the range 90 to 120. (Putnam statistics from http://www.d.umn.edu/~jgallian/putnam.pdf) The point of the math competitions is to make meaningful distinctions among the top 1–5% of test takers in a relatively short time, so questions that the majority of test takers can answer are just time wasters.

I’ve never seen the point of having a fixed percentage correct used institution-wide for setting grades—the only point of such a standard is to tell teachers how hard to make their test questions.  Saying that 90% or 95% should represent an A merely says that test questions must be easy enough that top students don’t have to work hard, and that distinctions among top students must be buried in the test-measurement noise.  Putting the pass level at 70% means that most of the test questions are being used to distinguish between different levels of failure, rather than different levels of success. My own quizzes and exams are intended to have a mean around 50% of possible points, with a wide spread, to maximize the amount of information I get about student performance at all levels, but I tend to err on the side of making the exams a little too tough (35% mean) rather than much too easy (85% mean), so I generally learn more about the top half of the class than the bottom half.

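As a rough illustration of why a mean near 50% maximizes information, treat a single question as a right/wrong item that a fraction p of the class answers correctly; the score variance p(1−p) peaks at p = 0.5, so mid-difficulty questions spread students out the most. A minimal sketch in Python:

    # Score variance of one right/wrong question, as a function of the
    # fraction p of students who answer it correctly.  The variance
    # p*(1-p) peaks at p = 0.5: very easy and very hard questions
    # barely separate students at all.
    for p in [0.05, 0.25, 0.50, 0.75, 0.95]:
        print(f"p = {p:.2f}   variance = {p * (1 - p):.4f}")
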
I’m ok with knowing more about the top half than the bottom half, but my exams also have a different problem: too often the distribution of results is bimodal, with a high correlation between the points earned on different questions. The questions are all measuring the same thing, which is good for measuring overall achievement, but which is not very useful for diagnosing what things individual students have learned or not learned.  This result is not very surprising, since I’m not interested in whether students know specific factoids, but in whether they can pull together the knowledge that they have to solve new problems.  Those who have developed that skill often can show it on many rather different problems, and those who haven’t struggle on any new problem.

Lior Pachter, in his blog post Time to end letter grades, points out that different faculty members have very different understandings of what letter grades mean, resulting in noticeably different distributions of grades for their classes. He looked at very large classes, where one would not expect enormous differences in the abilities of students from one class to another, so large differences in grading distributions are more likely due to differences in the meaning of the grades than to differences between the cohorts of students. He suggests applying some sort of normalization, so that raw scores are translated in a professor- and course-specific way to a common scale with a uniform meaning.  (That may be possible for large classes that are repeatedly taught, but is unlikely to work well in small courses, where year-to-year differences in student cohorts can be huge—I get large year-to-year variance in my intro grad class of about 20 students, with the top of the class some years being only at the performance level of the median in other years.)  His approach at least recognizes that raw scores are meaningless out of context, unlike the approach of those who insist that “90% or better is an A”.
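
One simple version of such a normalization would be to z-score each course’s raw totals before mapping them onto the common scale. The sketch below is my reading of the general idea, not Pachter’s specific proposal, and it assumes sections large enough that cohorts are comparable:

    import statistics

    def zscore(raw_scores):
        """Map one course's raw scores onto a common scale by subtracting
        the course mean and dividing by the course standard deviation."""
        mu = statistics.mean(raw_scores)
        sigma = statistics.stdev(raw_scores)
        return [round((x - mu) / sigma, 2) for x in raw_scores]

    # Two hypothetical sections with very different raw distributions land
    # on the same scale, so a given z-score means roughly the same thing:
    print(zscore([55, 60, 65, 70, 75]))   # hard grader
    print(zscore([85, 88, 91, 94, 97]))   # easy grader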

People who design large exams professionally generally have training in psychometrics (or should, anyway).  Currently, the most popular approach to designing exams that need to be taken by many people is item-response theory (IRT), in which each question gets a number of parameters expressing how difficult the question is and (for the most common 3-parameter model) how good it is at distinguishing high-scoring from low-scoring people and how much to correct for guessing.  Fitting the 3-parameter model for each question on a test requires a lot of data (certainly more than could be gathered in any of my classes), but provides a lot of information about the usefulness of a question for different purposes.  Exams for go/no-go decisions, like driving exams, should have questions concentrated in difficulty near the decision threshold that distinguish well between those above and below it.  Exams for ranking large numbers of people with no single threshold (like SAT exams for college admissions at many different colleges) should have questions whose difficulty is spread out over the range of thresholds.

IRT can also be used for tuning a test (discarding questions that are too difficult, too easy, or that don’t distinguish well between high-performing and low-performing students), as well as for normalizing results onto a uniform scale despite differences in question difficulty.  With enough data, IRT can produce uniform-scale results from tests in which individuals are not all presented the same questions (as long as there is enough overlap that the difficulty of the questions can be calibrated fairly), which permits adaptive testing that takes less testing time to reach the same precision.  Unfortunately, the model fitting for IRT is somewhat sensitive to outliers in the data, so very large sample sizes are needed for meaningful fitting, which means that IRT is not a particularly useful tool for classroom tests, though it is invaluable for large exams like the SAT and GRE.
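
For concreteness, here is the response function of the standard 3-parameter logistic (3PL) model; the formula is the textbook one, but the parameter values in the example are made up for illustration:

    import math

    def p_correct(theta, a, b, c):
        """3PL item-response function: probability that a test taker of
        ability theta answers this item correctly.
        a = discrimination, b = difficulty, c = guessing floor."""
        return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

    # An easy, sharply discriminating item (driving-test style) versus a
    # hard one (competition style), for weak, average, and strong students:
    for theta in (-2.0, 0.0, 2.0):
        easy = p_correct(theta, a=2.0, b=-1.5, c=0.25)
        hard = p_correct(theta, a=1.0, b=2.0, c=0.20)
        print(f"theta={theta:+.1f}  easy: {easy:.2f}  hard: {hard:.2f}")

The hard part in practice is estimating a, b, and c from response data, which is where the large sample sizes come in; the formula itself only shows what the fitted parameters mean.
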
The bottom line for me is that the conventional grading scales used in many schools (with 85% as a B, for example) are uninterpretable nonsense, doing nothing to convey useful information to teachers, students, parents, or anyone else.  Without a solid understanding of the difficulty of a given assessment, the scores on it mean almost nothing.

4 Comments

  1. Several years ago my school abandoned letter grades and went to a pure percentage system. A 90% means 90%, not A or B or whatever. The kids still want to know what an A is, but they are adjusting. The staff adjusted in about a week. Let the kids worry about what an A is; we just worry about helping them get the highest percentage they want to work for. Parents worry about what the colleges will do with a percentage grade. The colleges all seem to have their own conversion scales, so not a problem.

    Comment by Garth — 2014 October 26 @ 07:57

    • But “90%” doesn’t mean anything, just as “A” doesn’t mean anything. In both cases, you have to know what the questions are (or at least how difficult they are) and what standards are being applied in the grading. Substituting one meaningless measure for another is a rather pointless exercise, typical of bureaucrats.

      Colleges seem to make up their own rules for how to interpret grades from different schools, but the rules are often nonsense, bearing little relationship to the achievements of the students. Grade inflation has gotten so bad that the average incoming GPA at UCSB is 4.13 on a 4-point scale (average SAT is 1944), at UCB the numbers are 4.18 and 2071, at UCSC 3.82 and 1783. No correction is done for different grade inflation at different schools, so students from schools that give out A grades for simple attendance have a better chance of getting in than students from schools that take education seriously—a big flaw in the university admission process.

      Comment by gasstationwithoutpumps — 2014 October 26 @ 11:10

  2. Yes, this is another example of the types of thinking/reasoning (math, engineering, CS, etc.) you’ve written about elsewhere. An ‘A’ is a somewhat arbitrary partition of a percentage scale that is a percentage of something somewhat arbitrary. It reminds me of countless comments I’ve heard in the context of investing about how $12/share stock in Company A is “more expensive” than $10/share stock in Company B. As I struggle to come up with an analogy to explain why that isn’t necessarily true in any important sense, they will often go on to conclude that the $12/share company must be more successful because it is worth more.

    I’m tempted to just shout, “How long is a rope?!”, but the people who need this “explanation” most wouldn’t understand it. In the case of grades, it’s “How long is 90% of a rope?”

    Comment by Glen — 2014 October 31 @ 12:56

  3. My own experience suggests that the observation at the top of your article should be pointing out one of the dangers of Honors/AP classes: you can get a 5.0 grade on a 4.0 scale for failing the class (i.e. failing the AP test). In college, or as a dual-enrolled student, you will get an F for that class on your college transcript. (I don’t know what a dual-enrolled student gets for the HS “course” when they fail the college class they are taking.)

    I’ve had mixed results with dual-enrolled students with a home-school background. Some do all of their math above Algebra I at the college and roll into calculus and physics with better preparation than kids with AP calculus credit who suddenly have to do 15 weeks of calculus in 15 weeks instead of 36+ weeks. Others struggle with the independence of learning on their own, perhaps because someone guided their every minute, or perhaps because they don’t develop any study partners. (I don’t blame the latter on home schooling; it is common for some personality types.)

    As for grading scales, my own philosophy for introductory (gateway) classes is that my threshold for a C grade is tied closely to the score I expect of someone who gets all of the “minimum knowledge” questions perfectly correct and earns “minimum knowledge” partial credit on A-level questions. IMHO, that is easier to do if a top grade is close to 100 than if the highest grade is 50 and a C/D boundary has to be determined by curving noise. I learned that from a grad-school prof who gave comprehensive exam level problems where a B grade (but certainly not 85%) corresponded to a passing performance on the comps.

    Comment by CCPhysicist — 2014 November 2 @ 11:08

