A post on *Overthinking my teaching* by Christopher Danielson, *Those aren’t numbers, so don’t treat them as though they were!*, led me to do some thinking about grading systems. The post itself was about converting standards-based grading (SBG) schemes into point-based schemes, so that you got the “right” results when you took percentages and converted back to A, B, C, D, F grades.

Of course, I don’t understand why so many users of SBG systems insist on using numbers for their categories in the first place. I guess that the “grades-are-evil” meme has so taken root in teacher training that teachers can’t envision using letters for their category names, even when their categories correspond very closely to the original uses of the grade letters. Mr. Danielson says

Maybe you think of it this way:

4 Accomplished

3 Proficient

2 Beginning

1 Struggling

0 Missing

which looks to me exactly the same as saying

A Accomplished

B Proficient

C Beginning

D Struggling

F Missing

He then goes on to assign points for each category (A=40, B=34, C=30, D=26, F=0), to get the percentage-based gradebook he is forced to use to report grades the way he wants (he wants to make sure that students make up any missing assessments, and will fail if they don’t).

I rarely give tests these days, since the skills I’m interested in assessing are ones that take time to display (being able to write and revise detailed papers or to design and code moderately complicated programs). When I do assess students, I end up using letter grades, which have meanings more like the standards-based grading categories than percentages. When I need to average, I see no problem with assigning A+=4.3, A=4.0, A-=3.7, B+=3.3, … and averaging those values. But I teach at a university, where the individual faculty have full control over the grades. (If a student feels a grade has been unfairly given, the most anyone other than the original faculty member can do is change the course to a pass/fail or remove the course from the record—only the person who issued the grade can change it.)
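That kind of letter-grade averaging is easy to make concrete. Here is a minimal sketch in Python; the table extends the A+ = 4.3, A = 4.0, A− = 3.7, B+ = 3.3, … pattern above with the usual 4-point-scale conventions for the remaining grades:

```python
# Conventional 4-point-scale values for letter grades, following the
# A+ = 4.3, A = 4.0, A- = 3.7, B+ = 3.3, ... pattern described above.
GRADE_POINTS = {
    "A+": 4.3, "A": 4.0, "A-": 3.7,
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D+": 1.3, "D": 1.0, "D-": 0.7,
    "F": 0.0,
}

def average_grade(letters):
    """Average a list of letter grades on the 4-point scale."""
    points = [GRADE_POINTS[g] for g in letters]
    return sum(points) / len(points)

print(average_grade(["A", "B+", "A-"]))  # (4.0 + 3.3 + 3.7) / 3 ≈ 3.67
```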

The problem that I see is not with the SBG categories nor with assigning points to get the desired outcome, but with the notion of fixed meanings for percentages. I have never understood why so many educators insist that all assessments (homework, tests, quizzes, exams, … ) must be designed so that 90% right is an A, 80–90% is a B, 70–80% is a C, and 60–70% is a D.

From an information-theoretic standpoint, such an assessment provides much less information than one in which the scores are uniformly distributed over 0–100%. In fact, a test on which almost all students get 50% or more could be half as long (eliminating the easiest questions) with almost no loss of information. Or, if the test were kept the same length, it could be made less sensitive to random measurement error by testing the material more thoroughly. Standardized tests like the SAT, ACT, and AP exams are not designed so that everyone gets at least 50% right; every question is written so that some fraction of the test takers get it wrong.
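To make the information-theoretic point concrete, one can compare the Shannon entropy of a uniform score distribution with that of a top-heavy one; the particular bins and proportions below are made up for illustration:

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Scores uniformly spread over 11 bins (0%, 10%, ..., 100%):
uniform = [1 / 11] * 11

# Scores bunched into the top three bins (80%, 90%, 100%), as on a
# test where nearly everyone scores high (illustrative proportions):
bunched = [0, 0, 0, 0, 0, 0, 0, 0, 0.3, 0.4, 0.3]

print(entropy_bits(uniform))  # ≈ 3.46 bits per student
print(entropy_bits(bunched))  # ≈ 1.57 bits per student
```

The uniform distribution carries more than twice as much information per student, which is the sense in which an easy test wastes measurement capacity.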

Item-response theory, one of the standard methods for designing tests, suggests that the probability of a student getting a question right follows a sigmoidal function of the student’s underlying ability, and that each question can be characterized by its difficulty (the ability level at which a student has a 50% probability of getting the question right) and the slope of the probability function at that point. A test consists of a mixture of questions with different characteristics.
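The sigmoidal curve described here is often written as a two-parameter logistic (2PL) function; a minimal sketch (the slope values in the examples are just illustrative discrimination parameters):

```python
import math

def p_correct(ability, difficulty, slope):
    """Two-parameter logistic (2PL) item-response curve: the
    probability that a student of the given ability answers the
    item correctly.  `difficulty` is the ability level at which
    the probability is 50%, and `slope` (the discrimination)
    controls how sharply the curve rises at that point."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))

# At ability == difficulty the probability is exactly 0.5,
# regardless of the slope:
print(p_correct(0.0, 0.0, 1.7))  # 0.5

# A steeper slope makes the item discriminate more sharply
# around its difficulty:
print(p_correct(1.0, 0.0, 1.7) > p_correct(1.0, 0.0, 0.5))  # True
```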

If all the questions are of about the same difficulty, then the test will be very good at distinguishing students below that level from students above it, but will tell us very little about students who are not near the threshold. (This is good design for a certification test, which is scored on a strict pass/fail basis.) Tests designed to tell us something about all the students, however, need a mixture of questions of different difficulty, covering the range of abilities of the students. Requiring a median score around 85%, though, ensures that most of the questions have to be easy, so that a lot of information is provided about how the bottom of the class is doing and very little about how the top is doing. This fits in well with the current political insistence on focusing all educational resources on the students at the bottom (see *Debate about how schools treat gifted students*), but not with an educational model that insists that all students be taught.

I have seen one family of tests that does a good job of extracting nearly maximal information: the AMC math contests (AMC-8, AMC-10, AMC-12). Those tests are designed with very few easy questions and a few very hard ones. The average score on the AMC-8 is typically around 40%, and the median for the AMC-10 is around 55%. The tails of the distribution go out to the edges of scoring on the AMC-8, with a small number of students getting all the questions right and a smaller number getting them all wrong. (The AMC-10 has too many easy questions, and so has too few students getting the lowest scores.) Since one purpose of the test is to identify those students on the far right tail who are worth coaching for more difficult math contests, a lower mean score might be useful, to get more accurate measurement at the top end.

One possible explanation for requiring assessments to have such a high percentage right for passing is that, for the people who make these rules, the point of an exam is not to get useful information about the students, but to reassure students about how much they know: to bolster their self-esteem and confidence. Personally, I don’t see a lot of pedagogic value in that. Students pay more attention to their mistakes than to success on trivial problems, so they are more likely to learn from a difficult test than from an easy one. (I’ve also met too many students who have a very inflated view of their abilities, partly as a result of never having been given anything remotely challenging.)

Oh, yes, the percentage thing is completely arbitrary, and so many teachers don’t get that. It’s all numbers, so it must be accurate and consistent! Hyeah, right. Arbitrariness expressed in numbers is still arbitrary. (And don’t get me started on the people who ask why 80% is a B, given that an 80% success rate in the real world would be terrible: surgeons killing a fifth of their patients, and so on.)

By the way, there are a couple of typos in the grade numbers: you’ve got 4.7 for an A minus, and 3.7 for a B plus, and I assume you meant something more like 3.7 for an A minus and 3.3 for a B plus.

Comment by HelenS — 2011 October 11 @ 13:37 |

Thanks for catching the typos. I thought I had fixed them before publishing, but I guess not. They should be fixed now.

Your argument about 80% not being a B goes in the opposite direction from mine. You are assuming that tasks must be so well within capabilities that failure at them is rare (1% or less), while I take the approach that during learning, students should be near the limits of what they can do, and that tests should focus on finding (and pushing) those limits. It’s true that after a student has mastered a subject thoroughly, they should be able to perform with only rare failure, but testing is not usually about finding out what students learned years earlier and can now do automatically.

There is not much point in giving a test in which most students are expected to get 99% correct—it just wastes everyone’s time.

Comment by gasstationwithoutpumps — 2011 October 11 @ 15:03 |

No, no, I totally agree with you. I was saying that the surgeon argument was silly.

Comment by HelenS — 2011 October 12 @ 11:24 |

Right. I was not reading as closely as I should have. Sorry about the misinterpretation.

Comment by gasstationwithoutpumps — 2011 October 12 @ 11:30 |

The AP exams are another familiar and respected example of a “test family” that prioritizes extracting information over hewing to “traditional notions” of grade levels. And at least every few years (at least for AP Calculus, and at least a while ago), they release the free-response questions and, with them, the raw-score cuts for scale scores of 5, 4, 3, 2, 1.

Comment by revuluri — 2014 October 29 @ 21:13 |

[…] https://gasstationwithoutpumps.wordpress.com/2011/10/10/designing-tests/ […]

Pingback by Sabbatical leave report | Gas station without pumps — 2015 September 8 @ 22:23 |