# Gas station without pumps

## 2010 August 29

### Sustained performance and standards-based grading

Filed under: Uncategorized — gasstationwithoutpumps @ 09:16

I have been looking at the latest fad sweeping education, standards-based grading (SBG), and trying to see if it is something I should incorporate in my own grading practices.

My first post on SBG looked at some of the assumptions and guiding principles of SBG, concluding that it looked like a good idea if you took a reductionist view of education, where you could split your objectives into separately assessable standards.

My second post on SBG looked at the unspoken assumption that assessment is cheap, something that is not the case in many of my classes.

Another problem I have with SBG is that for a lot of standards, the goal is sustained performance, not one-shot success. It isn’t enough to get a comma right once—you have to get almost every comma right, every time you write. Similarly, it isn’t enough to use evidence from primary and secondary sources and cite them correctly once—you have to do it in every research paper you write.

If you forget to include the “sustained performance” or “automaticity” (to use a buzzword that elementary math teachers seem fond of) components of the standards, you get a sloppy implementation that reinforces the do-it-once-and-forget-it phenomenon that makes students unable to do more advanced work.

SBG aficionados believe in instantaneous noise-free measures of achievement.  If a student takes a long time before they “get it”, but then demonstrates mastery, that’s fine.  This results in the practice of replacing the recorded grade for a standard with the most recent one.  I think that is ok, as long as the standard keeps being assessed, but if you stop assessing a standard as soon as students have gotten a good enough score (which seems to be the usual way to handle it), then you have recorded their peak performance, not the best estimate of their current mastery.  Think about the fluctuations in stock prices:  the high for the year is rarely a good estimate of the current price, even if prices have generally been going up.

If you want to measure sustained performance, you must assess the same standard repeatedly over the time scale for which you want the performance sustained (or as close as you can come, given the duration of the course and the opportunity costs of assessment).  The much-derided average is intended precisely for this purpose: to get an accurate estimate of the sustained performance of a skill.

SBG tries to measure whether students have mastered each of a number of standards, under the assumption that mastery is essentially a step function (or, at least, a non-decreasing function of time). Under this assumption, the maximum skill ever shown in an assessment is a good estimate of their current skill level. There is substantial anecdotal evidence that this is a bad assumption: students cram and forget.  Indeed, the biggest complaint of university faculty is that students often seem to have learned nothing from their prerequisite courses.

Conventional average-score grading makes a very different assumption: that mastery is essentially a constant (that students learn nothing).  While cynics may view this as a more realistic assumption, it does make measuring learning difficult.  One of the main advantages of averages is that they reduce the random noise from the assessments, but at the cost of removing any signal that does vary with time.

The approach used in financial analysis is the moving-window average, in which averages are taken over fixed-duration intervals.  This smooths out the noisy fluctuations without eliminating the time-dependent variation.  (There are better smoothing kernels than the rectangular window, but the rectangular window is adequate for many purposes.)  If you look at a student’s transcript, you get something like this information, with windows of about a semester length.  Each individual course grade may be making the assumption that student mastery was roughly constant for the duration of the course, but upward and downward trends are observable over time.
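To make the moving-window idea concrete, here is a minimal Python sketch (the function name, window length, and sample scores are mine, purely for illustration) of a rectangular-window average over a series of assessment scores:

```python
def moving_window_average(scores, window=3):
    """Rectangular moving-window average of a list of assessment scores.
    Early entries are averaged over however many scores are available."""
    averages = []
    for i in range(len(scores)):
        start = max(0, i + 1 - window)
        chunk = scores[start:i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages

# A student whose scores trend upward: the windowed average rises too,
# while a whole-course average would lag far behind.
print(moving_window_average([60, 65, 70, 80, 90], window=3))
```

A wider window smooths out more noise but responds more slowly to real improvement, which is exactly the trade-off described above.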

Can SBG be modified to measure sustained performance?  Certainly the notion of having many separate standards that are individually tracked is orthogonal to the latest-assessment/average-assessment decision.  Even student-initiated reassessment, which seems to be a cornerstone of SBG practice, is separate from the latest/average decision, though students are more likely to ask for reassessment if it will move their grade a lot, and the stakes are higher with a “latest” record. Student-initiated reassessment introduces a bias into the measurement, as noise that introduces a negative fluctuation triggers a reassessment, but noise that introduces a positive fluctuation does not.

Perhaps a hybrid approach, in which every standard is assessed many times and the most recent n assessments (for some n>1) for each standard are averaged, would allow measuring sustained performance without the assumption that it is constant over the duration of the course.  If the last few assessments in the average are scheduled, teacher-initiated assessments, not triggered by low scores, then the bias of reassessing only the low-scorers is reduced.
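A sketch of that hybrid record, again in Python (the choice of n and the sample scores are arbitrary):

```python
def last_n_average(scores, n=3):
    """Average of the most recent n assessments of a standard
    (or of all of them, if fewer than n have been recorded)."""
    recent = scores[-n:]
    return sum(recent) / len(recent)

# An early peak (90) no longer dominates the record; only recent
# evidence of sustained performance counts.
print(last_n_average([90, 55, 70, 75, 80], n=3))  # → 75.0
```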

1. You’re dead on in all of your points. I’ve always felt unsatisfied by the practice of taking the high score. This year I’m going to try using a grade program that uses a “power law formula” to predict the likely score on the next assessment.

One argument for keeping the high score depends on the math history of the school. If you’re dealing with a population that has had little actual success in math, hitting that high mark is a great first step towards the kind of rigor you’d like to see in your students. I’ve had so many students who gave up before ever completing a skill successfully that only having to hit the target once is a much better carrot than any vague promises of later success.

Comment by hillby — 2010 August 31 @ 21:53

2. Fascinating analysis. Your analysis is simple math, but I keep being surprised by the reliance on quantitative metrics without thinking through all these issues. It’s a naive belief that once you have a number, it somehow becomes a magically precise and accurate measure of a complicated phenomenon (learning or whatever).

Comment by bj — 2010 September 2 @ 02:44

3. I would be really interested to learn about the “power law formula” that hillby mentions. As a Geometry teacher and a remediation teacher for the state-mandated test, I feel like almost all of the material in math builds on itself, so it is continuously covered and assessed, but that may just be me. I am planning to assess regularly on previously covered topics, separately from assessing the new topics, to see whether my students are retaining the information. If they don’t do as well, I plan to enter the new score and always keep the most recent score. You bring up a very valid topic, and I’m looking forward to hearing everyone else’s opinions.

Comment by SarahKM — 2010 September 4 @ 19:35

4. I’m not sure precisely what hillby meant by “power law formula”. Usually a “power law” is a function of the form $f(x) = a x^b$, but predicting the next assessment grade should be based on the previous several assessments. There are many different regression formulas that can be used, depending on your theory about what the assessments are measuring.

The most common predictor is simply the average of previous assessments, which assumes that the kids have a certain level of skill and diligence, that this level does not change significantly from assessment to assessment, and that there is a noise component added to each assessment. (If you assume multiplicative noise instead of additive noise, you can work with the log of the scores or use geometric means, but 0s in the data make the multiplicative model a very poor one.)
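The additive-noise and multiplicative-noise predictors can be sketched like this (a rough illustration, not anyone’s actual gradebook software; the zero check shows why 0s break the multiplicative model):

```python
import math

def arithmetic_mean(scores):
    """Additive-noise model: predict the next score as the plain average."""
    return sum(scores) / len(scores)

def geometric_mean(scores):
    """Multiplicative-noise model: average the logs of the scores.
    A single 0 drives the whole prediction to 0."""
    if any(s == 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))
```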

Another predictor that can be used is a “decaying average”, which assumes that kids’ skill levels change, but slowly, and that the assessments are a noisy reflection of that slowly changing level. A decaying average of a time-series $d(t)$ is just a weighted average: $f(t) = \frac{\sum_{\tau \le t} a^{t-\tau} d(\tau)}{\sum_{\tau \le t} a^{t-\tau}}$. It is usually computed incrementally as $f(t+1) \leftarrow a f(t) + (1-a) d(t)$. This is probably a more realistic model, especially as assessments are really measuring a multi-dimensional skill space, and assessments that are close in time tend to measure skills that are closely related. A very fast decay ($a \approx 0$) is the same as taking the most recent score.
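The incremental computation can be sketched as follows (seeding with the first score is my own assumption; other initializations are possible):

```python
def decaying_average(scores, a=0.5):
    """Exponentially weighted ("decaying") average, computed with the
    recurrence f <- a*f + (1-a)*d.  A value of a near 1 gives a long
    memory; a near 0 makes the most recent score dominate."""
    f = scores[0]
    for d in scores[1:]:
        f = a * f + (1 - a) * d
    return f

print(decaying_average([60, 70, 80, 90], a=0.5))  # → 81.25
print(decaying_average([60, 70, 80, 90], a=0.0))  # → 90.0 (latest score only)
```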

One can fit other curves (power laws, polynomials, …) to the data, but there is no reason to believe that these will do a better job at prediction than the decaying average. They are usually based on the assumption of a lot of structure in the data and low noise, neither of which are particularly true of grades.

One can also build models that try to do factor analysis of the data, modeling the assessment space as multi-dimensional. This can be useful for figuring out what happened at the end of the course—how the various skills cluster—but does not do much good for predicting the score on the next assessment, unless you know where in the multi-dimensional space the assessment fits.

Comment by gasstationwithoutpumps — 2010 September 5 @ 08:19

5. I’ve been thinking about an “average the last n” approach for a couple of months. You mention it cursorily; do you have anything else to say about it?

You also mention the cost of assessment, and indeed in my classroom a full 20% of our time was given to taking tests (written, solve-for-x kinds of tests). What amount of class time do you think is reasonable to spend entirely on assessment?

Your example about commas illustrates an important point about the validity of assessments used to grade students. A single sentence, perfectly commaed, might not indicate that a student has total mastery of commas. Analogously, a student who can calculate 2+3 may or may not be able to calculate 2.5+1/2, and may not even be able to calculate 5+10. Who knows? But, at some point, most teachers decide they have enough data to make a judgement. SBG does not address the issue of validity or reliability at all; it is only a different way of organizing the learning targets for which teachers are aiming.

I agree with you, I think, that some assessments may be so volatile or unreliable that we (the teachers) won’t be convinced by a single score. However, if I could assess a student’s skill with commas in a single assessment, I think that would be better than averaging multiple assessments that were administered in drastically different time periods (more than a month apart). If a student didn’t use a single comma on any of the first 8 papers of the year, or if he used a comma after every single word on all 8, but punctuated the 9th paper perfectly, I would be inclined to say that he mastered the comma and send him off with top marks (at least in commas).

Comment by Riley — 2010 September 19 @ 21:43

6. If things were as clear-cut as switching from no commas to perfect comma use, then single-point measures would be fine. What I see in writing and in programming is usually incomplete mastery, with a gradually decreasing frequency of error over time.

Since my courses are short (only 10 weeks with 35 lecture hours and supposedly 100 hours outside class for reading and doing course work), I don’t have the luxury of dedicating a lot of time to assessing and re-assessing students.

In fact, I’ve given up on assessments that are purely to assess (a.k.a. exams): they take too much time away from learning. Instead, I assess students on the work they do while they are learning (programs and papers they have written). The primary reason for this choice is that this comes much closer to what I want assessed: their ability to write programs, do research, and write up the programs and the research. I am not so concerned whether they have gotten all the details of Markov chains or of specific models for DNA sequences, as whether they have the ability to move from the description of a stochastic model to an implementation and to use that implementation to test a hypothesis.

So in one important sense, all my assessments are applied to all my real goals for the course. I do not rely on any one assignment to assess these skills, but on the cumulative effect of several assignments (6 programs and 3 writing assignments). I do weight the later assignments more heavily than the earlier ones, in recognition that some of the earlier assignments are warmups for the later ones.

The only way I could do SBG in this course is to give multiple grades for each assignment, corresponding to the different things I’m looking for. I’m thinking of doing this, but there is no way that “programming skill” or “writing skill” is going to be measurable by just the most recent assignment. Some of the other things being tested (understanding of Markov chains, for example) will only be covered by one assignment.

Comment by gasstationwithoutpumps — 2010 September 19 @ 22:08

7. If you don’t think you could trust a single measurement, then you have to average. But how can you include a measurement from the very first programming assignment in your course in your calculation of a student’s skill at the END of your course, if your course is intended to improve programming skill? A student who comes in with a low skill level is doomed from the start. And maybe he’s doomed to a C anyway, because starting with a low skill level correlates to ending with a low skill level, but shouldn’t he have the chance to learn and slough off that early assessment? Why does that early assessment even matter!?

When you’re giving him a final grade in “programming skill,” how would you feel about only averaging the last three programs instead of all 6?

Still, I go back and forth on this myself. Is a student less likely to retain this programming skill _next_ year if she only did half of the programs, and isn’t practice almost as important as understanding at a single moment?

Comment by Riley — 2010 September 21 @ 06:40

8. […] idea of throwing out old assessment scores.  The most convincing criticism I’ve read is at GSWP (alternate link, scroll to bottom):SBG aficionados believe in instantaneous noise-free measures of […]

Pingback by Letting Go of the Past « Point of Inflection — 2010 September 21 @ 08:40

9. Well, I’d have a lot more students failing the course if I graded on only the last 3 assignments. The assignments get harder during the quarter, and grades tend to drop correspondingly. If a C means that the student sort of gets the point, then getting B,B-,C+,C,C,F probably deserves a C rather than a D (which turns into an F, since we don’t have Ds for final grades).

The course is not supposed to be improving programming skill—they are supposed to have that as a prereq. Many don’t, but some of them manage to acquire enough skill to do ok on the programs. As a gateway class to the grad program, the course needs to evaluate students on stuff they were supposed to have learned already elsewhere. That is, to a large extent my job is to assess the programming skills that they arrive with (or can acquire very quickly) while teaching them the content that is supposed to be new. Thus the entire 10-week class can be regarded as roughly a single time point for programming skill. The six programming assignments of gradually increasing complexity help pin down exactly where their skills lie. The same for writing skill on the written assignments—the class is not a writing class, but students are assessed on their writing skills and given feedback so that they can improve those skills. To a large extent, the programming and writing aspects of the class are feedback to let students know if they need to work on these skills on their own, with help, or in other classes.

The ostensible content of the course (models and algorithms for bioinformatics) is assessed with separate topics in each assignment. Only the prereq skills (programming and writing) are really repeated throughout the quarter. Some concepts (like the notion of a null model and E-values) are repeated, and I might be able to pull those out for SBG-style evaluation, but it sometimes gets difficult to figure out exactly why a student is failing on a more complicated assignment. SBG assumes that diagnostic testing with easily separable standards is easy, and I have not found this to be the case for the material I teach.

Comment by gasstationwithoutpumps — 2010 September 21 @ 08:51

10. […] presents Sustained performance and standards-based grading posted at Gas station without pumps, saying, “I’ve been thinking about implementing […]

Pingback by TEACHING|chemistry» Blog Archive » Standards Based Grading Gala #3 — 2010 October 21 @ 06:23

11. […] Teachers frequently bemoan the fact that students don’t seem to be interested in learning, but just in getting points. Teachers try to find ways to make grading schemes more meaningful, so that students will care more about learning. Currently fashionable is Standards-Based Grading, which is good for a reductionist analysis of topics, but not so strong on synthesis. SBG also has trouble measuring sustained performance. […]

Pingback by Experience Points for classes « Gas station without pumps — 2010 October 23 @ 22:51
