# Gas station without pumps

## 2011 October 10

### Designing tests

Filed under: Uncategorized — gasstationwithoutpumps @ 16:26

A post on Overthinking my teaching by Christopher Danielson, “Those aren’t numbers, so don’t treat them as though they were!”, led me to do some thinking about grading systems.  The post itself was about converting standards-based grading (SBG) schemes into point-based schemes so that you get the “right” results when you take percentages and convert back to A, B, C, D, F grades.

Of course, I don’t understand why so many users of SBG systems insist on using numbers for their categories in the first place.  I guess that the “grades-are-evil” meme has so taken root in teacher training that teachers can’t envision using letters for their category names, even when their categories correspond very closely to the original uses of the grade letters. Mr. Danielson says:

Maybe you think of it this way:
4 Accomplished
3 Proficient
2 Beginning
1 Struggling
0 Missing

which looks to me exactly the same as saying
A Accomplished
B Proficient
C Beginning
D Struggling
F Missing

He then goes on to assign points for each category (A=40, B=34, C=30, D=26, F=0), so that the percentage-based gradebook he is forced to use reports grades the way he wants (he wants to make sure that students make up any missing assessments, and will fail if they don’t).
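
The arithmetic behind those point values can be sketched in a few lines of Python. This is only an illustration: the 40-point maximum is read off the A value, the 90/80/70/60 cutoffs are assumed to be the ones his gradebook applies, and the function names are made up for this example.

```python
# Point values quoted above for each category.
POINTS = {"A": 40, "B": 34, "C": 30, "D": 26, "F": 0}
MAX_POINTS = 40  # assumption: an A earns full marks


def percentage(letter):
    """Percentage a percentage-based gradebook would record for one category."""
    return 100.0 * POINTS[letter] / MAX_POINTS


def letter_from_percentage(pct):
    """Convert a percentage back to a letter using assumed 90/80/70/60 cutoffs."""
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if pct >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    for letter in POINTS:
        pct = percentage(letter)
        print(letter, pct, letter_from_percentage(pct))
```

Under those assumptions, each passing category lands in the middle of its percentage band (85% for a B, 75% for a C, 65% for a D), and a missing assessment counts as 0% rather than as something just below 60%.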

I rarely give tests these days, since the skills I’m interested in assessing are ones that take time to display (being able to write and revise detailed papers or to design and code moderately complicated programs).  When I do assess students, I end up using letter grades, which have meanings more like the standards-based grading categories than percentages.  When I need to average, I see no problem with assigning A+=4.3, A=4.0, A-=3.7, B+=3.3, … and averaging those values.  But I teach at a university, where the individual faculty have full control over the grades.  (If a student feels a grade has been unfairly given, the most anyone other than the original faculty member can do is change the course to a pass/fail or remove the course from the record—only the person who issued the grade can change it.)

The problem that I see is not with the SBG categories nor with assigning points to get the desired outcome, but with the notion of fixed meanings for percentages.  I have never understood why so many educators insist that all assessments (homework, tests, quizzes, exams, … ) must be designed so that 90% or more right is an A, 80–90% is a B, 70–80% is a C, and 60–70% is a D.

From an information-theoretic standpoint, such an assessment provides much less information than one in which the scores are uniformly distributed over 0–100%.  In fact, a test in which almost all students get 50% or more could be half as long (eliminating the easiest questions) with almost no loss of information. Or, if the test were kept the same length, it could be made less sensitive to random measurement error, by testing the material more thoroughly. The standardized tests like the SAT, ACT, and AP exams are not designed so that everyone gets at least 50% right—every question is written so that some fraction of the test takers get it wrong.
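
A rough way to see the "half as long" claim is to compare the Shannon entropy of score distributions. The sketch below makes simplifying assumptions that are not in the argument above: scores are whole percentage points, they are uniformly distributed over whatever range the students actually occupy, and the entropy of the score distribution is taken as an upper bound on what the test can tell us.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def uniform_scores(low, high):
    """Uniform distribution over the integer scores low..high inclusive."""
    n = high - low + 1
    return [1.0 / n] * n


if __name__ == "__main__":
    full_range = entropy_bits(uniform_scores(0, 100))   # scores spread over 0-100
    top_half = entropy_bits(uniform_scores(50, 100))    # everyone scores 50 or more
    half_length = entropy_bits(uniform_scores(0, 50))   # a test half as long, fully used
    print(full_range, top_half, half_length)
```

With these assumptions the full-range test carries about 6.66 bits per student, while the test on which everyone scores at least 50% carries about 5.67 bits, exactly the same as a test half as long whose whole range is used.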

Item-response theory, one of the standard methods for designing tests, suggests that the probability of a student getting a question right follows a sigmoidal function of the student’s ability being measured, and that each question can be characterized by the difficulty of the question (the student ability level which leads to a 50% probability of getting the question right) and the slope of the probability function at that point.  A test consists of a mixture of questions with different characteristics.
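
One common concrete form of that sigmoidal function is the two-parameter logistic model. The sketch below assumes that model (the description above does not commit to a particular sigmoid), and the parameter names are chosen for this example.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic item response function.

    Returns the probability that a student of the given ability answers
    the question correctly. The probability is exactly 0.5 when ability
    equals difficulty, and the slope of the curve at that point is
    discrimination / 4.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


if __name__ == "__main__":
    # One hypothetical question, answered by students of increasing ability.
    for ability in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(ability, round(p_correct(ability, difficulty=0.0, discrimination=1.5), 3))
```

Ability and difficulty are on the same arbitrary scale here, so only their difference matters.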

If all the questions are about the same difficulty, then the test will be very good at distinguishing students below that level from students above that level, but tell us very little about students who are not near the threshold.  (This is good design for a certification test, which is scored on a strict pass/fail basis.) Tests designed to tell us something about all the students, however, need to have a mixture of questions of different difficulty, covering the range of abilities of the students.  Requiring a median score around 85% ensures that most of the questions have to be easy, so that there is a lot of information provided about how the bottom of the class is doing, and very little about how the top is doing. This fits in well with the current political insistence on focusing all educational resources on the students at the bottom (see Debate about how schools treat gifted students), but not with an educational model that insists that all students be taught.

I have seen one family of tests that does a good job of extracting nearly maximal information: the AMC math contests (AMC-8, AMC-10, AMC-12). Those tests are designed with very few easy questions and a few very hard ones.  The average score on the AMC-8 is typically around 40%, and the median for the AMC-10 is around 55%.  The tails of the distribution go out to the edges of scoring on the AMC-8, with a small number of students getting all the questions right and a smaller number getting them all wrong.  (The AMC-10 has too many easy questions, and so has too few students getting the lowest scores.) Since one purpose of the test is to identify those students on the far right tail who are worth coaching for more difficult math contests, a lower mean score might be useful, to get more accurate measurement at the top end.

One possible explanation for requiring assessments to have such a high percentage right for passing is that, for the people who make these rules, the point of an exam is not to get useful information about the students, but to reassure students about how much they know, bolstering their self-esteem and confidence. Personally, I don’t see a lot of pedagogic value in that.  Students pay more attention to their mistakes than to success on trivial problems, so they are more likely to learn from a difficult test than an easy one.  (I’ve also met too many students who have a very inflated view of their abilities, partly as a result of having never been given anything remotely challenging.)