Gas station without pumps

2012 June 3

Planning ahead for tests

Filed under: home school — gasstationwithoutpumps @ 22:22

I was entering some of next year’s events on my calendar for planning purposes (like reserving the classroom I’ll be teaching in for the fall—we use a seminar room that is not under the control of the Registrar, so faculty are responsible for using Google Calendar to reserve the room).

I was particularly interested in figuring out when my son would be taking various scheduled exams: pSAT/NMSQT, SAT, ACT, SAT Subject tests, AP tests, … I’ll probably have to make sure that the tests get ordered sufficiently far in advance, and those deadlines tend to sneak up on me.

I had no trouble finding when the ACT tests are for the next two years (ACT Registration : Test Dates in the U.S., U.S. Territories, and Canada), but the SAT schedule was a little harder to find.  As usual, the College Board search box was useless, but Google found the information on Test Dates & Deadlines, though those are still “anticipated dates”. It seems that the College Board is not willing to commit too strongly to a schedule. I think I want my son to take both an ACT test and an SAT test during the year. I’ve heard that the writing components are quite different, and I don’t know which one he’ll do better on. It might also be best to do these tests early in the year, before his schedule gets too packed. The end of April and beginning of May was crazy this year, so (other than the AP exams) we’d like to stay away from extras around then.

The 2013 AP Exam Dates seem more definite, which is too bad, because they’ve scheduled two of the tests my son was planning to take (Computer Science A and Spanish Language) for the same time slot. That means he’ll have to request a late testing slot to resolve the conflict (taking the CS exam on May 23).

The pSAT is also scheduled for the next 2 years, and this taking of the test will be the one that counts for National Merit Scholarships. Unless he is very ill on the test dates, I expect he’ll do fairly well: he missed only one question on the pSAT this year.

2012 May 5

Grading scales

Filed under: Uncategorized — gasstationwithoutpumps @ 21:43

I have seen a lot of teacher blogs and forum posts in which some percentage⇒grade scale is taken as obvious or mandated.  These generally follow a simple pattern, with A, B, C, and D each taking up 10% of the range, and <60% being an F (see this web page, for example).  There is nothing sacred about this way of distributing grades—it is simply a traditional way in the US for designing tests so that most questions are very easy.

The consequence of this test design strategy is that most of the questions on the test tell you nothing about most of the students, and only a tiny fraction of the questions distinguish the good students from the excellent ones.  From an information-theoretic standpoint, this sort of test design only makes sense if all you care about is separating the failures from the rest of the students—otherwise it is a really stupid design choice.

Not all tests are designed in this stupid way.  For example, the AP tests, which are intended to group the test takers into 5 groups (5, 4, 3, 2, and 1, roughly corresponding to the traditional meanings of A, B, C, D, and F grades), use a scale in which getting about half the points will result in a “3”.  Here are the raw scores (out of 150 points) and percentages that I’ve been told are needed for the released 2008 AP Biology test (in the new scoring system, which requires students to guess stuff they don’t know, as there is no correction for random guessing):

  1. 0–57   0%–38%
  2. 58–68   39%–45%
  3. 69–80   46%–53%
  4. 81–94   54%–63%
  5. 95–150  63%–100%
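The table can be turned into a small lookup to see how wide each band is. This is only a sketch: the function names are my own, and the 150-point maximum is read off the top of the "5" band rather than taken from any official scoring document.

```python
# Minimum raw score needed for each AP score, as quoted in the table above.
MAX_POINTS = 150  # assumed maximum, taken from the top of the "5" band
MIN_RAW_FOR_SCORE = {1: 0, 2: 58, 3: 69, 4: 81, 5: 95}

def ap_score(raw):
    """Return the 1-5 AP score for a raw point total."""
    if not 0 <= raw <= MAX_POINTS:
        raise ValueError("raw score out of range")
    # Highest AP score whose minimum raw score has been reached.
    return max(s for s, lo in MIN_RAW_FOR_SCORE.items() if raw >= lo)

def band_width(score):
    """Number of distinct raw scores that map to the given AP score."""
    lo = MIN_RAW_FOR_SCORE[score]
    hi = MIN_RAW_FOR_SCORE[score + 1] - 1 if score < 5 else MAX_POINTS
    return hi - lo + 1

for s in range(1, 6):
    lo = MIN_RAW_FOR_SCORE[s]
    print(f"AP {s}: starts at {lo} points ({lo / MAX_POINTS:.0%}), "
          f"{band_width(s)} raw scores wide")
```

With these cutoffs the bands for 1 and 5 are 58 and 56 raw scores wide, while the bands for 2, 3, and 4 are only 11, 12, and 14 wide.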

Note that there is a symmetry here, with about as many points allocated for 1s as for 5s (rather than the 6-to-1 ratio of F-to-A in the traditional US scale).  There are still a lot of easy questions, but there are also a lot of hard ones.  (Each AP test is scaled differently, as the difficulty of the questions is not precisely matched, so calibration is done to try to make the final 5-point scores have comparable meanings from year to year.) Knowing any 2/3 of the AP Bio material thoroughly is enough to get a 5, as is knowing all of the material pretty well, but knowing less than half the material will result in a failure.

Getting a 4 on an AP exam is not a matter of making one or two careless mistakes (as a B often is on traditionally scaled exams), but of not being able to answer a substantial fraction of the questions.

Although I like the AP scale much better than the traditional US scale, I don’t think that it is an optimal design.

Most people are interested in the border between 2s and 3s (since that is where many colleges put the no-credit/credit boundary), but placement in college courses may depend on thresholds above that (the 3-4 or 4-5 thresholds).  No one much cares about accurate placement of the 1-2 threshold, as both are considered failing (like an F or a D).  If I were designing a test to be used the way AP scores are used, I’d want a lot of questions to be near each of the difficulty levels associated with the important boundaries, with the 2-3 boundary getting the most attention.  I think that such a test would result in a score distribution more like

  1. 0%–5%
  2. 5%–30%
  3. 30%–60%
  4. 60%–80%
  5. 80%–100%

Note that this would reduce the ability of the test to distinguish at the top end slightly, but those distinctions are being lost when the raw score is converted  to the 5-point scale anyway.

Such a test would have fewer easy questions than the current AP tests, and either fewer hard questions or fewer total questions (depending on whether the current difficulty of getting a 5 is in the difficulty of the questions or simply in the time pressure of the test).

Of course, it may be quite difficult to design such a test, as it is difficult to identify “easy” questions to eliminate.  I suspect that the exam writers have already discarded all the questions that most students scoring a 3 or better get right. A very informative question will split the test takers into two roughly equal-sized groups, but different questions will split the test takers differently.  The problem is that different students have learned different subsets of the many different types of knowledge and skills being tested.  (Note: I’ve simplified a bit here, as the 50-50 split is only maximally informative if the scores for the questions are independent of each other. We actually want to select conditionally independent questions, conditioned on what we are actually trying to measure, which can result in needing questions at easier and harder levels than the 50-50 level.)
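To put a number on "very informative", here is a minimal sketch that treats a single right/wrong question as a coin flip and computes its Shannon entropy in bits. The function name is my own, and the calculation makes the same simplifying assumption as the note above: it looks at one question in isolation and ignores any dependence between questions.

```python
import math

def question_information(p_correct):
    """Shannon entropy, in bits, of one right/wrong question that a
    fraction p_correct of the test takers answer correctly."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    if p_correct in (0.0, 1.0):
        return 0.0  # everyone gets the same result, so nothing is learned
    q = 1.0 - p_correct
    return -(p_correct * math.log2(p_correct) + q * math.log2(q))

for p in (0.5, 0.75, 0.95, 1.0):
    print(f"{p:.0%} correct: {question_information(p):.2f} bits")
```

A question that splits the test takers 50-50 carries a full bit, while one that 95% of them get right carries well under a third of a bit.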

The premise of having a single scale score is that there is one thing being measured. In the case of AP Bio tests, this “construct” is most likely the fraction of the required material that has been learned.  To estimate this for a population of students who have learned different random subsets to different levels of mastery, you need to sample widely across all the material.  A weak student may have learned a small portion of the material very well, but that should not be sufficient to pass the test. Although questions that are easy for almost everyone can be eliminated, you will still have a lot of questions that many of the weak students get right, because the questions happen to be in the subset of the material that they learned.

In that case, you can end up with a scale like the one this AP Bio test uses, where you need to get about 46% of the points to pass. Using more difficult questions and lowering the threshold (as I proposed earlier in this post) would mean that students with a deeper understanding of a smaller fraction of the material could pass.  On tests that cover a very cohesive body of material (like Calculus or Physics C: Mechanics), my idea might be good, but for survey courses of broader, more disparate fields (like AP Bio or any of the history tests), it may make more sense to test for wider coverage and less depth, as they currently do.

2012 April 22

New Yorkers opposed to nonsense stories

Filed under: Uncategorized — gasstationwithoutpumps @ 05:54

Anemona Hartocollis reports in the NY Times about a Daniel Pinkwater story used in New York’s standardized English tests: Standardized Testing Is Blamed for Question About a Sleeveless Pineapple.

The story itself is a good nonsense story, and only one of the questions should be tricky for 8th graders (asking for the motivation of the characters in the story).  If you want to see whether students can read and understand text, using a nonsense story is an excellent strategy, since students can’t rely on prior knowledge to answer the questions.  (That is unlike a reading test my son took in kindergarten, which relied on prior knowledge of who Amelia Earhart was.)

True to form, New Yorkers were up in arms about the questions: “And by Friday afternoon, the state education commissioner had decided that the questions would not count in students’ official scores.”

Antitesting advocates have decided to make this story into a symbol for their mission of eliminating testing.  Because the story is a good one, with a memorable punch line (“Pineapples don’t have sleeves”), I’m sure the antitesting advocates will succeed in getting their message out.  Unfortunately, I think that they are attacking the wrong target, and their efforts are likely to make tests worse, not better.

If any reading that is amusing and memorable is going to be attacked by activists, and the education commissioner is going to throw out any questions about such reading, then the tests are going to be populated with the driest, most boring passages the test makers can find.  This is not the direction I want testing to go.

I suspect that this story and set of questions, which got students discussing the story after the test, may have been the most effective literature prompt that the students have gotten all year.

Quite frankly, I think that the education commissioner, John B. King Jr., has acted here as a spineless politician, throwing out all the questions regardless of whether they are measuring what the test is supposed to be measuring.  Of course, it gets his name before the voters in a way that makes him look like he is defending the integrity of the tests, when what he is actually doing is pandering to the loudest and most ignorant activists. This sort of brainless “leadership” is what has made our government so dysfunctional in the past few decades.

My take on this flap is that the story and questions did a very good job of determining what the test writers were asked to determine—whether students could read and interpret literature that they had not previously seen.

Deborah Meier, a founder of a “progressive” school, appears very opposed to close reading of the text, believing that only very vague questions (like “Is this a spoof? Is it intended to make sense?”) are reasonable.  That is an appropriate question for a 4th grade test, perhaps, but I expect more of 8th graders than that sort of superficial question that could be answered without reading most of the story.

I wonder whether the anti-testing advocates actually read the story, or just decided that “nonsense” on tests is bad.  Certainly Ms. Meier’s comment that the story is “an outrageous example of what’s true of most of the items on any test, it’s just blown up larger” does not suggest that she understood what the testing is supposed to be measuring.  She just wanted to be outraged, and people who want to be outraged find any excuse for it.

2011 October 10

Designing tests

Filed under: Uncategorized — gasstationwithoutpumps @ 16:26

A post on Overthinking my teaching by Christopher Danielson, Those aren’t numbers, so don’t treat them as though they were!, led me to do some thinking about grading systems.  The post itself was about converting standards-based grading (SBG) schemes into point-based schemes so that you got the “right” results when you took percentages and converted back to A, B, C, D, F grades.

Of course, I don’t understand why so many users of SBG systems insist on using numbers for their categories in the first place.  I guess that the “grades-are-evil” meme has so taken root in teacher training that teachers can’t envision using letters for their category names, even when their categories correspond very closely to the original uses of the grade letters. Mr. Danielson says

Maybe you think of it this way:
4 Accomplished
3 Proficient
2 Beginning
1 Struggling
0 Missing

which looks to me exactly the same as saying
A Accomplished
B Proficient
C Beginning
D Struggling
F Missing

He then goes on to assign points for each category (A=40, B=34, C=30, D=26, F=0), to get the percentage-based gradebook he is forced to use to report grades the way he wants (he wants to make sure that students make up any missing assessments, and will fail if they don’t).
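The round trip can be checked in a few lines. This is a sketch under two assumptions of mine: each assessment is out of 40 points, and the gradebook applies the usual 90/80/70/60 cutoffs. The function names are made up for illustration and are not from Mr. Danielson's post.

```python
# Points per category as quoted above (A=40, B=34, C=30, D=26, F=0).
POINTS = {"A": 40, "B": 34, "C": 30, "D": 26, "F": 0}
MAX_POINTS = 40  # assumed maximum per assessment

def letter_from_percent(percent):
    """Traditional US scale: 90+ is A, 80s B, 70s C, 60s D, below 60 F."""
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if percent >= cutoff:
            return letter
    return "F"

def gradebook_letter(categories):
    """Average the points for a list of category letters, the way a
    percentage-based gradebook would, and convert back to a letter."""
    earned = sum(POINTS[c] for c in categories)
    percent = 100.0 * earned / (MAX_POINTS * len(categories))
    return letter_from_percent(percent)

for category in POINTS:
    print(category, "->", gradebook_letter([category]))
```

Each category lands in the middle of its band (34/40 is 85%, 30/40 is 75%, 26/40 is 65%), so a single assessment converts back to the letter it started from.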

I rarely give tests these days, since the skills I’m interested in assessing are ones that take time to display (being able to write and revise detailed papers or to design and code moderately complicated programs).  When I do assess students, I end up using letter grades, which have meanings more like the standards-based grading categories than percentages.  When I need to average, I see no problem with assigning A+=4.3, A=4.0, A-=3.7, B+=3.3, … and averaging those values.  But I teach at a university, where the individual faculty have full control over the grades.  (If a student feels a grade has been unfairly given, the most anyone other than the original faculty member can do is change the course to a pass/fail or remove the course from the record—only the person who issued the grade can change it.)

The problem that I see is not with the SBG categories nor with assigning points to get the desired outcome, but with the notion of fixed meanings for percentages.  I have never understood why so many educators insist that all assessments (homework, tests, quizzes, exams, … ) must be designed so that 90% or more right is an A, 80–90% is a B, 70–80% is a C, and 60–70% is a D.

From an information-theoretic standpoint, such an assessment provides much less information than one in which the scores are uniformly distributed over 0–100%.  In fact, a test in which almost all students get 50% or more could be half as long (eliminating the easiest questions) with almost no loss of information. Or, if the test were kept the same length, it could be made less sensitive to random measurement error, by testing the material more thoroughly. The standardized tests like the SAT, ACT, and AP exams are not designed so that everyone gets at least 50% right—every question is written so that some fraction of the test takers get it wrong.

Item-response theory, one of the standard methods for designing tests, suggests that the probability of a student getting a question right follows a sigmoidal function of the student’s ability being measured, and that each question can be characterized by the difficulty of the question (the student ability level which leads to a 50% probability of getting the question right) and the slope of the probability function at that point.  A test consists of a mixture of questions with different characteristics.
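As an illustration, here is a minimal sketch of one common sigmoidal choice, the two-parameter logistic model. The choice of the logistic curve and the parameter names `difficulty` and `discrimination` are my assumptions; the description above only requires some sigmoid with a 50% point and a slope.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """Probability of a correct answer under a two-parameter logistic
    model: 50% when ability equals the question's difficulty."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

def slope_at_difficulty(discrimination):
    """Slope of the logistic curve at the 50% point, where p * (1 - p) = 1/4."""
    return discrimination / 4.0

# One question of difficulty 0 seen by students of increasing ability.
for ability in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"ability {ability:+.0f}: "
          f"P(correct) = {p_correct(ability, 0.0, discrimination=2.0):.2f}")
```

A larger `discrimination` gives a steeper curve, so the question separates students just below its difficulty from students just above it more sharply.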

If all the questions are about the same difficulty, then the test will be very good at distinguishing students below that level from students above that level, but tell us very little about students who are not near the threshold.  (This is good design for a certification test, which is scored on a strict pass/fail basis.) Tests designed to tell us something about all the students, however, need to have a mixture of questions of different difficulty, covering the range of abilities of the students.  Requiring a median score around 85% ensures that most of the questions have to be easy, so that there is a lot of information provided about how the bottom of the class is doing, and very little about how the top is doing. This fits in well with the current political insistence on focusing all educational resources on the students at the bottom (see Debate about how schools treat gifted students), but not with an educational model that insists that all students be taught.

I have seen one family of tests that does a good job of extracting nearly maximal information: the AMC math contests (AMC-8, AMC-10, AMC-12). Those tests are designed with very few easy questions and a few very hard ones.  The average score on the AMC-8 is typically around 40%, and the median for the AMC-10 is around 55%.  The tails of the distribution go out to the edges of scoring on the AMC-8, with a small number of students getting all the questions right and a smaller number getting them all wrong.  (The AMC-10 has too many easy questions, and so has too few students getting the lowest scores.) Since one purpose of the test is to identify those students on the far right tail who are worth coaching for more difficult math contests, a lower mean score might be useful, to get more accurate measurement at the top end.

One possible explanation for requiring assessments to have such a high percentage right for passing is that the point of an exam, for the people who make these rules, is not to get useful information about the students, but to reassure students about how much they know: to bolster their self-esteem and confidence. Personally, I don’t see a lot of pedagogic value in that.  Students pay more attention to their mistakes than to success on trivial problems, so they are more likely to learn from a difficult test than an easy one.  (I’ve also met too many students who have a very inflated view of their abilities, partly as a result of having never been given anything remotely challenging.)

2011 July 25

Testing insanity

Filed under: Uncategorized — gasstationwithoutpumps @ 14:15

John T. Spencer has just posted Testing Insanity: Amount of Days Spent Testing containing pie charts about allocation of the scarcest resource for teachers: instructional time.  (For grammar mavens out there, I point out that “amount” should only be used with uncountable nouns—I assume he mixed “amount of time” with “number of days”.)

If his numbers are correct, and I have no reason to doubt them, his school spends fully 28% of their instructional time on testing, not counting the time wasted on test prep.  That seems excessive.

For comparison, I computed how much time is spent on testing at the university.  Comparisons between middle school and college courses are always misleading, because of the difference in how student time is structured.  A middle-school student is expected to spend 30–35 hours a week at school and another 5–6 hours a week on homework, while a college student is expected to spend 9–10 hours a week in classes and another 30–40 hours on homework.  This reversal of time allocation makes comparisons of homework loads and in-class time allocation tricky (and is often the hardest adjustment for new college students to make).

With that caveat, here is my calculation.  A typical 5-unit course at UCSC has 35 hours of lecture plus 3 hours of final exam.  Some courses have discussion sections as well, but these are usually optional and only lightly attended, so they can be regarded as supplementary help rather than primary instruction time for most classes (for classes with mandatory discussion sections, the class time increases from 35 to about 46 hours).  A lot of faculty give up one or two lectures for mid-term exams; that is 1.17, 1.75, 2.33, or 3.5 hours, depending on how many lectures are used and how long each one is.  So the highest exam load for a course is 6.5 hours (3.5 hours of mid-terms plus the 3-hour final) out of a total of 38 hours, or 17%.  Quizzes and clicker questions could bring that as high as 25%, but I don’t think that John Spencer was including quizzes and teacher-generated assessments in his count, just exams imposed from outside.
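The same arithmetic, written out as a short sketch so that the other mid-term options can be compared. The variable and function names are mine; the hours are the ones given above, and discussion sections are left out.

```python
LECTURE_HOURS = 35.0       # includes any lectures given up for mid-terms
FINAL_EXAM_HOURS = 3.0
MIDTERM_OPTIONS = (0.0, 1.17, 1.75, 2.33, 3.5)  # 0.0 means no mid-term

def exam_fraction(midterm_hours):
    """Fraction of total class time (lectures plus final) spent on exams."""
    total_hours = LECTURE_HOURS + FINAL_EXAM_HOURS
    return (midterm_hours + FINAL_EXAM_HOURS) / total_hours

for midterm_hours in MIDTERM_OPTIONS:
    print(f"{midterm_hours:.2f} h of mid-terms: "
          f"{exam_fraction(midterm_hours):.0%} of class time on exams")
```

A course with no mid-terms spends about 8% of its class time on exams, and the heaviest case reaches the 17% quoted above.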

Many faculty, including me, see exams as a poor way of assessing what students have learned in a course—particularly in courses intended to teach skills like computer programming, electronic design, lab skills, writing, or research skills.  For these courses, projects, term papers, and programming assignments done outside of class time are the primary assessments, and little or no class time is used for assessment.  (See Skills at the Center for more discussion of teaching based on skills rather than on testable factoids.)

I do see a need for standardized exams to let parents and colleges know how much the kids are really learning, but the cost in instructional time has gotten ridiculously large.  Continual testing is no substitute for teaching and learning.
