On one of the mailing lists I read, someone asked:
Can anyone explain to me how a raw score of 75 and a raw score of 59 are both 800’s on the scale for the physics test? That seems a huge spread. I see similar stuff on other tests, but nothing spread quite this far.
One can find the distribution of scale scores as percentiles on the College Board web site at http://media.collegeboard.com/digitalServices/pdf/research/SAT-Subject-Tests-Percentile-Ranks-2013.pdf, but finding information about raw scores is harder. The College Board says
The raw score is converted to the College Board 200- to 800-point scaled score by a statistical process called equating. Equating adjusts for slight differences in difficulty between test editions and ensures that:
- A student’s score does not depend on the specific test edition she took.
- A student’s score does not depend on how well others did on the same edition of the test.
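Equating is easier to see with a toy example. The sketch below does simple linear (mean/sigma) equating, which is only a stand-in for the College Board's actual, more elaborate procedure; the raw-score lists are made up for illustration, not real test data.

```python
import statistics

def linear_equate(x, new_scores, ref_scores):
    """Map raw score x from a new test edition onto the reference
    edition's raw-score metric by matching means and standard
    deviations (linear equating). A toy sketch, not the College
    Board's actual method."""
    mu_new, sd_new = statistics.mean(new_scores), statistics.pstdev(new_scores)
    mu_ref, sd_ref = statistics.mean(ref_scores), statistics.pstdev(ref_scores)
    return mu_ref + (sd_ref / sd_new) * (x - mu_new)

# Made-up raw scores: this new edition ran about five points harder
# across the board, so a raw 59 on it equates to a raw 64 on the
# reference edition -- the same raw score "means more" on a hard form.
new_edition = [40, 50, 59, 62, 70]
reference   = [45, 55, 64, 67, 75]
print(round(linear_equate(59, new_edition, reference)))  # 64
```

Once raw scores from different editions are mapped onto a common metric like this, a single raw-to-scaled conversion can be applied, which is how a raw 75 on an easy form and a raw 59 on a hard form can both come out as 800.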
I’ve found raw-score-to-scaled-score conversions for one version of the SAT test at http://media.collegeboard.com/digitalServices/pdf/research/SAT-RAW-Score-to-Scaled-Score-Ranges-2013.pdf, but I’ve not found them for SAT 2 tests. I don’t know whether the person asking had access to more data than I’ve been able to find on the College Board site, or just had a couple of data points for students who both got a scaled score of 800.
The scale score on the SAT subject test, like other scale scores, is intended to have the same meaning from year to year, despite differences in the underlying test questions. Initially, tests are written so that questions span a range of difficulty, with some easy questions and some hard ones. Depending on the purpose of the test, the questions may cluster around a particular level of difficulty—if the test is intended as a pass/no-pass test, the questions hover around the pass threshold. Think of a driving exam, where the questions are intended to separate those who can drive safely from those who can’t. There is no point to asking esoteric questions that even good drivers can’t answer, nor trivial ones that even bad drivers do well at.
When the point is to spread students out without a single boundary (as for college admissions), the questions need to span a wider range of difficulty. You need some that are so difficult that few get them, some that are so easy that few miss them, and everything in between. There also need to be several difficult questions, so that a student who randomly guesses correctly on one difficult question doesn’t get an inflated score.
Because the scale scores are supposed to mean the same thing from year to year, and the scales are arbitrarily capped at 200 and 800, drift in student education or in who takes the exams can make students pile up at one end of the scale or the other, even if the end scores were initially very rare. That has happened in math (Level 1 has few students with 800, but 9% of Level 2 students get an 800), in physics (8% of students get an 800), and in most of the language exams, where mostly native speakers take the test (for Chinese, lots of students get an 800). Student selection seems to play more of a role than education (people who expect to do poorly don’t pay to take the test), so the bottom end of the scale is rarely used, while students pile up at the top end.
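This pile-up at a capped endpoint is easy to simulate. In the sketch below the normal ability distribution and its parameters are invented for illustration, not College Board data; the point is only that shifting the test-taking population upward pushes a growing share of students against a fixed 800 ceiling.

```python
import random

random.seed(0)

def share_at_ceiling(mean, sd, cap=800, n=100_000):
    """Fraction of simulated scaled scores clipped at the cap when the
    uncapped 'true' scores are normally distributed. The parameters are
    illustrative assumptions, not real test data."""
    return sum(random.gauss(mean, sd) >= cap for _ in range(n)) / n

# As self-selection raises the mean ability of the test-taking
# population, more of the distribution spills past the fixed cap
# and piles up at 800.
for mean in (550, 650, 700):
    print(mean, round(share_at_ceiling(mean, sd=100), 3))
```

With these made-up parameters the share at the ceiling climbs from well under 1% to several percent and beyond as the population mean rises, which is the pattern described above for Math Level 2, physics, and the language exams.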
A more sensible system would not cap the scale scores (Lexiles, for example, are not capped), so that drifts in student population do not pile people up at one point on the scale. Many of the underlying SAT subject tests have the ability to measure more at the top end, but the political or marketing (not scientific) decision to cap scores at 800 limits the utility of the tests for making distinctions.
Note: the ACT has similar artificial score limitations and suffers the same sort of unnecessary ceiling, though it is not clear whether the underlying questions have the ability to make distinctions at the top end—when you get down to differences of one or two questions right, you are looking at noise, not signal.