Gas station without pumps

2010 August 16

Value-added teacher ratings

Filed under: Uncategorized — gasstationwithoutpumps @ 15:18

Yesterday the LA Times published a story about assessing all the LAUSD teachers by a value-added approach.  They will be publishing a database of rankings for 6,000 elementary school teachers later this month, after giving teachers a brief period in which to comment on their rating.

Unfortunately, the article does not give the technical details of how the “value added” is calculated.  There are hints that it is done by looking at the average difference between the achieved score for the students and the expected score for the same students based on prior achievement scores (perhaps using percentiles).  It makes some difference what scale the scores are on and what the ceilings are for the tests.
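
As a rough illustration of what that kind of calculation might look like, here is a minimal sketch. This is my guess at the general shape of the method, not the Times' actual model, and every number and variable name below is made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical district-wide data: last year's and this year's test scores.
prior_d = rng.normal(50, 10, size=5000)
current_d = 0.8 * prior_d + 12 + rng.normal(0, 5, size=5000)

# District-wide expectation: predict this year's score from last year's.
slope, intercept = np.polyfit(prior_d, current_d, 1)

# One hypothetical teacher's class of 25 students, whose scores come out
# about 3 points above what the district expectation predicts.
prior_c = rng.normal(50, 10, size=25)
current_c = 0.8 * prior_c + 12 + 3.0 + rng.normal(0, 5, size=25)

# "Value added": average amount by which the class beat the expectation.
expected_c = slope * prior_c + intercept
value_added = np.mean(current_c - expected_c)
print(f"estimated value added: {value_added:+.2f} points")
```

Whatever the details of the real model, the value-added number is essentially a class-average residual from some expectation like this, which is why the choice of score scale matters so much.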

For low-ceiling tests (as most state tests are), a teacher of gifted students who are already hitting the maximum scores on the tests will always be seen as having no value added, because the students can’t show their improvement.

Quite predictably, the teachers’ union is outraged over the LA Times publishing the data, going so far as to call for a boycott of the newspaper, though the Times appears to have acquired the data quite legally through Freedom of Information Act requests.  The union is worried that teachers will be penalized for poor performance on the rating system, and that the seniority-based system on which teacher promotions have historically been made is in danger.  They do have a good point that the tests used are a far from adequate measure of how much learning has taken place, but they are much better than having no measure of student performance, and much better than the cronyism of personal evaluations by the principal, which is the current system.  Tests before and after an intervention (teaching the students for a year) are the standard way to determine whether an intervention is successful.

One reasonable critique of the method used is the assumption of causality:  “Teacher A’s students on average advance more than Teacher B’s students” implies “Teacher A is a better teacher than Teacher B”.  This is a reasonable assumption, but is not guaranteed to be true.  There are a lot of reasons why a class may perform better or worse that are not related to the teacher.  Still, the approach of averaging over at least three years and only looking at large differences should eliminate a lot of the one-time artifacts and minor statistical fluctuations.  Systematic biases in the assignment of students (for example, if one teacher gets a lot of hard cases and another gets a lot of teacher-pleasers) can certainly distort the picture.

Teacher bloggers will soon be ranting and raving over this move by the LA Times (see, for example, Rational Mathematics).  I expect most will question the validity of the tests, as that is the easiest target.  Personally, I think that the non-random nature of the selection of students for each class is likely to be a bigger source of error.  The response in Education Week’s Teacher Beat is more measured, pointing out some of the other conclusions (which are well supported by other studies), such as that who the teacher is matters more than which school, and that paper qualifications have little correlation with effectiveness (measured in this value-added way).

I see one other danger, and that is that any ranking system will always put someone on top and someone on the bottom.  If there are huge differences in teacher effectiveness (as there seem to be between the extremes), this is not a major problem, but the risk of amplifying small differences in effectiveness to large differences in rank (particularly in the middle of the pack) is high.
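
A toy simulation (invented numbers, not the Times' data) shows how measurement noise comparable to the true spread of effectiveness scrambles the middle of a ranking more than the extremes:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1000 imaginary teachers; measurement noise is as large as the true spread
# of effectiveness, which is roughly what single-year estimates look like.
n = 1000
true_effect = rng.normal(0.0, 1.0, size=n)
measured = true_effect + rng.normal(0.0, 1.0, size=n)

true_rank = true_effect.argsort().argsort()       # 0 = worst, n-1 = best
measured_rank = measured.argsort().argsort()

middle = (true_rank >= 400) & (true_rank < 600)
top = true_rank >= 950
print("mean rank shift, middle fifth:",
      round(float(np.mean(np.abs(measured_rank[middle] - true_rank[middle])))))
print("mean rank shift, top 5%:",
      round(float(np.mean(np.abs(measured_rank[top] - true_rank[top])))))
```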

Does anyone have more detailed information about the methods used in the LA Times ranking system?  I suppose the thing to do is to contact “Richard Buddin, a senior economist and education researcher at Rand Corp., who conducted the statistical analysis as an independent consultant for The Times.”  The RAND staff page for him gives his contact information as well as his CV.  His credentials certainly look good, and this is not the first time he has analyzed student performance to measure the effectiveness of teachers.

31 Comments »

  1. This statement,
    “They do have a good point that the tests used are a far from adequate measure of how much learning has taken place, but they are much better than having no measure of student performance, and much better than the cronyism of personal evaluations by the principal, which is the current system.”
    is an empirical claim, and evaluating it requires analyzing the data.

    I’ve recently been immersing myself in this literature, though I’ve only read a few papers so far. One I thought was interesting was a paper presented at the annual meetings of the American Economic Association by Hanushek and Rivkin (2010). The confounds include the inability to fully correct for student differences (even within a classroom), the significant variation among teachers’ “Value Added Measurements” (VAMs) over different years, the relatively narrow range of teacher quality differences, and the huge range of student differences. Even the strongest data-driven advocates of VAM measures fully accept that teacher differences contribute only small differences in student success, compared to other student, environment, and family characteristics. The advocates still want to use VAM measures because it is a potentially controllable characteristic of schooling, but we have to realize that intervention effects are small compared to other effects.

    I am spending my time slogging through some of the papers (they are meaningful studies to look at). I have no particular ideological interest, but I do worry about the use of high-stakes testing. In some other countries, high-stakes testing is rigidly used to evaluate teachers. Even the best studies rely on low-stakes testing to examine teacher effectiveness differences. Using those tests in a high-stakes way would provide strong incentives for teachers to game the test in a way that might substantially undermine the meaningfulness of relatively small teacher effectiveness measures.

    I hadn’t realized the LA Times was going to make a rankable database available — even the strongest advocates don’t suggest that one can rank-order teachers (especially across different schools) based on VAM scores. The precision of the measurements is simply not high enough, and, as you point out, there are a variety of ceiling effects and confounds (Schochet & Chiang, July 2010, IES NCEE report).

    I would very much like to see you consider the VAM measures in more depth. I think it’s an interesting statistical problem (even ignoring the more conceptual problems like the observer effect of knowing how the tests will be used).

    Comment by bj — 2010 August 16 @ 20:32 | Reply

  2. Oh, there’s a lot of good information (skewed towards supporting VAM measures) at the Urban Institute’s “CALDER” project. I haven’t heard anything specific about the methods being used in LA, but there are good descriptions of the kinds of calculations used to produce VAM scores.

    Comment by bj — 2010 August 16 @ 20:33 | Reply

  3. I’m happy to send along some bibliographies I’m working from, if you are interested.

    Comment by bj — 2010 August 16 @ 20:36 | Reply

  4. Technical details for their study are at http://www.latimes.com/media/acrobat/2010-08/55538493.pdf. We get the dead-tree version, and page AA1 had an FAQ article that I didn’t find online in a quick look. The FAQ included the PDF link.

    Comment by Yves — 2010 August 16 @ 21:17 | Reply

    • Yves, thanks for that link! I’ve downloaded it and have read it.

      I was still wondering what scale was used for the student scores. Was it raw points on the test or a standardized scale? The linearity assumption of Buddin’s model is an assumption that it is equally easy to move a score up by a fixed amount at any point in the scale, so the details of how that scale is computed are important. Is there any evidence that the raw data used satisfied that assumption? How was it checked?

      The simple linear model used assumes that students can move up or down, but for students near the floor or ceiling of the test instrument, this assumption is dubious. Since students are sometimes assigned to teachers based on the students being near one end or the other of the achievement spectrum, this would certainly distort the measurement of teacher effectiveness.

      Buddin looked for patterns relating the estimates of teacher effectiveness and prior student achievement, but only within the context of his linear model. Strong effects may not appear in his analysis, because the effects may be far from linear, but examining the top 5% and the bottom 5% of classes (based on prior student achievement) might be revealing.

      Buddin did some looking at “better prepared” students based on including the average “lagged” score as a teacher-specific observable (that is, the average score of the students in the year before they were in the teacher’s class), but that still assumes linearity, and the effects in the middle may be quite different from the effects on the tails.

      Buddin did also look at the proportion of gifted students assigned to the teacher. Of course, given how random the “gifted” designation is in most school districts, it might have been better to use the “lagged” scores to select the top 10%, rather than the LAUSD gifted label.

      Comment by gasstationwithoutpumps — 2010 August 16 @ 21:28 | Reply

      • Buddin sent me e-mail explaining that the scale used was “z-scores by grade and by year. This is fairly standard practice in value-added analysis.” While I believe him that it is standard, I have my doubts about its reasonableness in a linear model. Since the z-score is a linear transform, it does not change the assumption that adding a point to the score is equally easy at all score levels.

        Buddin also said that there is not much evidence for ceiling or floor effects on the CST exams, which surprises me, because my son certainly hit the ceilings. He did give me a citation for a paper by Koedel and Betts:
        http://ideas.repec.org/p/umc/wpaper/0807.html
        which in turn points to the actual paper (wp0807_koedel.pdf).

        The abstract says
        Value-added measures of teacher quality may be sensitive to the quantitative properties of the student tests upon which they are based. This paper focuses on the sensitivity of value-added to test-score-ceiling effects. Test-score ceilings are increasingly common in testing instruments across the country as education policy continues to emphasize proficiency-based reform. Encouragingly, we show that over a wide range of test-score-ceiling severity, teachers’ value-added estimates are only negligibly influenced by ceiling effects. However, as ceiling conditions approach those found in minimum-competency testing environments, value-added results are significantly altered. We suggest a simple statistical check for ceiling effects.

        On reading the paper, I find that it is a simulation study done by simulating ceiling effects on a data set from an exam that showed no ceiling effects. Their abstract summarizes their results well: once you get down to minimum-competency tests (like the state tests), the value-added measures become rather bogus. Their simulation also does not take into account the small number of all-gifted classes, which would not throw off the overall validity of the value-added measure, despite grossly mismeasuring the value-added of the teachers of those classes. They used Stanford-9 tests, whose scale is deliberately designed to be as linear as possible (with a 1-point gain meaning more or less the same thing across the entire score range). They observed that the San Diego Unified School District has only modest amounts of sorting students by prior achievement (that is, almost no gifted classes), so their results may not generalize to districts that do placement by achievement.
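
        A crude sketch of that kind of ceiling simulation (not Koedel and Betts’ actual procedure, just an illustration with made-up numbers): generate gains for a class of already-high-scoring students, impose a test ceiling, and watch the apparent gain shrink.

```python
import numpy as np

rng = np.random.default_rng(2)

# A class of 25 already-high-scoring students who truly gain 6 points,
# measured on a test whose scores are capped at a ceiling.
prior = rng.normal(85, 5, size=25)
true_gain = 6.0
current = prior + true_gain + rng.normal(0, 3, size=25)

for ceiling in (1000, 100, 92):                 # 1000 = effectively no ceiling
    capped_prior = np.minimum(prior, ceiling)
    capped_current = np.minimum(current, ceiling)
    apparent_gain = np.mean(capped_current - capped_prior)
    print(f"ceiling {ceiling:4d}: apparent gain {apparent_gain:4.1f} "
          f"(true gain {true_gain})")
```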

        Comment by gasstationwithoutpumps — 2010 August 17 @ 10:11 | Reply

  5. I don’t know if I’ll have time to read “bibliographies” worth of material. One or two carefully selected articles would be nice, particularly review articles that analyze several studies (critical analysis, not just cherry-picking to support a pet theory).

    I’ve seen a number of statements that gifted students gain the least each year, and have always been bothered by those statements, because they reflect a limitation of the measurement technique. Gifted students may learn a lot in a year, but show no improvement on a low-ceiling standard achievement test. Rarely does this obvious flaw in the measurement method seem to bother the people repeating the claim. I worry that “value-added measurements” may have similar major flaws that are not being discussed adequately.

    The LA Times claimed that their data showed the teacher effect to be bigger than most of the confounding variables, but they did not give us any data or analysis method for us to judge the validity of their claims. Journalists are not used to providing rigorous, data-supported arguments—their job is more to report what others have told them. It will be interesting to see whether the LA Times can pull off anything more than a political stunt with their analysis of the LAUSD data. Perhaps Buddin will publish a more detailed analysis for the scientific community (unless his contract with the LA Times prohibits that). [The technical paper that Yves provided the link to is probably all we’re going to get—it is about as detailed as most scientific papers, though it does show some signs of not having been peer-reviewed.]

    I agree that high stakes on a test can result in distortion of what is tested (I understand that there have been some big cheating scandals in Texas). On the other hand, huge numbers of low-stakes tests result in students not giving a damn about the tests, and inaccurate measurements for a different reason.

    Comment by gasstationwithoutpumps — 2010 August 16 @ 21:26 | Reply

  6. Here’s the FAQ article that contains the link to the PDF cited above: http://www.latimes.com/news/local/la-me-qanda-20100816,0,4120439.story

    Comment by Yves — 2010 August 16 @ 21:31 | Reply

  7. Saw another interesting post at http://schoolfinance101.wordpress.com/2010/07/28/rolldice/ which discusses the dangers of using value-added measurements for teacher firing, as well as expressing concern about the non-random placement of students (which strikes me as a major source of error).

    The non-random placement of students could, perhaps, be controlled for in the model by including a term for the improvement made in the preceding year by the student. Of course, this limits the applicability of the model even more, since you’d need 2 years of prior testing for the student, not just one.
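
    In regression terms, that would mean adding the student’s prior-year gain as another predictor. A hypothetical sketch (the variable names and the data are mine, not Buddin’s):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch: predict this year's score from last year's score AND
# last year's gain, so students already on a steep (or flat) trajectory are
# not credited to (or blamed on) the current teacher.
def fit_expectation(score_last_year, gain_last_year, score_this_year):
    X = np.column_stack([np.ones_like(score_last_year),
                         score_last_year,
                         gain_last_year])
    coefs, *_ = np.linalg.lstsq(X, score_this_year, rcond=None)
    return X @ coefs                     # expected score for each student

# Made-up data: three consecutive years of scores for 2000 students.
y2 = rng.normal(50, 10, 2000)            # two years ago
y1 = y2 + rng.normal(5, 4, 2000)         # last year
y0 = y1 + rng.normal(5, 4, 2000)         # this year
expected = fit_expectation(y1, y1 - y2, y0)

# A teacher's value added is then the class-average residual.
class_students = np.arange(30)           # indices of one hypothetical class
print("value added:", round(float(np.mean((y0 - expected)[class_students])), 2))
```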

    Comment by gasstationwithoutpumps — 2010 August 16 @ 22:54 | Reply

  8. I’ll update if I find the “one or two” carefully selected articles. I’ve found some that I think might qualify, but I need broader reading before I can know that these are indeed good critical reviews. I have an annotated bibliography being provided by our school district (which is hoping to use VAM measures in the next teachers’ contract), but that one is clearly “cherry-picked.” So, I’m working through those articles and the articles those articles cite to get the big picture.

    My guess is that I’m going to find that there are methodological constraints that make it iffy to use these measures except in the broadest terms (i.e. group teachers into big groups), and that used in those broad terms they aren’t going to provide useful information for decision making. In other words, it is the same problem that we end up having with many diagnostic tests: even what appear to be relatively small error rates (both type 1 & 2) end up making the test useless for diagnostic purposes, even if it might be interesting for research purposes. The administrators are trying to develop VAM scores for diagnostic purposes and that’s a tough standard.

    I agree that the “rolled dice” (linked as a similar post at your blog) seems like a good article. There are articles using the repeated years measures that seem informative: “The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy” from the CALDER group. The bottom line from that summary says, for example, that “if bonuses were allotted to teachers ranked in the top 20% based on VAM, at most 1/3rd would get bonuses in two years in a row.”

    And, there’s a cute paper on the “Rothstein effect,” which examines the predictability of 4th grade scores based on the VAM scores of the 4th graders’ 5th grade teachers (i.e. an effect that cannot be a measure of true value added by the teacher, but instead arises from other non-measured correlations among students).

    I’m having difficulty figuring out the peer-review standard in the field — it’s different from the sciences, where anonymous reviews + journal publication are the standard, and relying on that standard might mean ignoring meaningful articles. Some of these studies are eventually published in journals, but they also seem to have something more like the open comment archives they have in physics (but mitigated by the political/policy implications of the work). So I am still finding my way.

    Comment by bj — 2010 August 17 @ 09:32 | Reply

  9. Seems like you could ask Buddin your linearity question — he may have an answer justifying his linearity assumption throughout the testing scale.

    The other questions you raise are more complicated, but the first should be something he’s considered.

    Comment by bj — 2010 August 17 @ 09:50 | Reply

  10. Since 1999 I’ve had a layman’s anthology of value-added material on my website at http://www.tagpdx.org/tvaas.htm

    It’s a page on my website for local parents of talented and gifted students.

    Comments are welcome

    Margaret

    Comment by Margaret DeLacy — 2010 August 17 @ 11:04 | Reply

    • Thanks for the links to the TVAAS (Tennessee Value Added Assessment System) reports.

      I particularly liked http://www.sas.com/resources/asset/Response_to_Criticisms_of_SAS_EVAAS_11-13-09.pdf which addresses many of the concerns I have with value-added estimation.

      They claim that only one of the tests they have used (one that has been discontinued) had a ceiling problem. I don’t know if this is because the tests really have high ceilings, or because they had few classes in which many students were near the ceiling. If there is a bias against grouping by achievement in placement, then there may not be any clusters of gifted kids that would show the ceiling effect clearly.

      Comment by gasstationwithoutpumps — 2010 August 17 @ 17:29 | Reply

  11. By z-scores, does Buddin mean the score up to 600 that parents get? From my son’s data, it seems to me that CST scores have big step sizes, at least near the ceiling. For example, missing one question out of 75 on the 5th grade language arts exam gets you a score of 543 out of 600 (it may depend on which question). Doesn’t seem like that step size could continue linearly.

    Comment by Yves — 2010 August 17 @ 21:30 | Reply

    • A Z-score is a linear scaling of another score (often the raw score) by subtracting off the mean and dividing by the standard deviation. The result is a score whose mean is zero and whose standard deviation is 1. Z-scores are often treated as if they were normally distributed, but this is only valid if the original data was normally distributed. Their use is discouraged in bioinformatics, because we rarely deal with normally distributed data (it is much more common for us to be dealing with stuff that follows a Gumbel distribution, which has a much fatter tail, with about as much stuff above 6 standard deviations as a normal distribution has above 3 standard deviations).

      The problem with Z-scores for value-added computations is that they don’t change the shape of the underlying distribution, which may not be suitable for linear modeling. The TVAAS studies were done with a scale designed to be as linear as possible, which I don’t think is the case with CST scores that Buddin used. Rescaling them with Z-scores does nothing other than make sure that the score and the lagged score have the same mean and standard deviation, which aids slightly in interpreting the numbers, but does nothing to improve the power of the linear model.
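
      For concreteness, the transformation is just a shift and rescale (the scores below are made up):

```python
import numpy as np

# Z-scoring is just a shift and rescale; it does not change the shape of
# the score distribution, only its mean (to 0) and spread (to 1).
raw = np.array([543.0, 600, 480, 350, 600, 515])   # made-up CST-like scale scores
z = (raw - raw.mean()) / raw.std()
print(np.round(z, 2))
```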

      Comment by gasstationwithoutpumps — 2010 August 17 @ 22:45 | Reply

  12. Our district is exploring value-added testing with the MAP test from NWEA, which is an adaptive test with material from various grade levels. Unfortunately (a) they’re not using the data very well, and (b) the MAP doesn’t appear to be as reliable as one would like (a great pity as it seemed a promising test design). One thing the use of the MAP has brought out very clearly is that teachers don’t want to be judged on out-of-level material, especially above-level material that they can’t be expected to teach. Some teachers were actually upset that some of their elementary-level students had gotten far enough to get questions about trig or sonnets. One teacher wrote “This test ensures that each child will feel like a failure as it takes them to the level that they do fail at to assess their ability.” (Long discussion at http://saveseattleschools.blogspot.com/2010/03/open-thread.html?showComment=1268283400702#c7122908847552940214)

    Comment by Helen — 2010 August 18 @ 09:12 | Reply

    • I actually like adaptive tests. The only ones my son has had are the Scholastic Reading Inventory tests, which put his reading level at high high-school level in 3rd grade. Grade-level tests are pretty useless for testing gifted students.

      The adaptive tests do give a much more accurate view of what students’ achievement levels are, but are not really suitable for measuring whether teachers have been effective at teaching what they are supposed to teach. As for the students feeling like a failure—that depends on what they are told about the test ahead of time. If the adaptive nature of the test is described (that it keeps trying to adjust the level until the student is getting about half right), the students will not feel like failures. If this is a major problem, the adaptation strategy could be modified to aim for a level where students get 80% right, though this would result in longer testing sessions for the same amount of information from the test.

      Comment by gasstationwithoutpumps — 2010 August 18 @ 10:34 | Reply

  13. I have a paper for you now: http://www.brookings.edu/papers/2006/04education_gordon.aspx.
    It details a plan to implement a VAM measure in teacher hiring. It uses data from LAUSD. A take-home point comes from the graphs in Figures 1 & 2, which suggest that the percentile rank of the average student (in a classroom, corrected for baseline scores, demographics, and program participation) is a much better predictor of the performance of next year’s classes than the credentialing of the teacher. Another, broader paper, on education reform in general, is the following: http://www.brookings.edu/papers/2007/0214education_bendor.aspx

    I’d be intrigued to see if you find the paper convincing.

    I’ve skimmed the teacher evaluation paper, and my gut instinct is that there is circularity in the evaluation that they’re not resolving. In Figure 2, the assumption is that after the correction for known student characteristics, any remaining differences in performance can be attributed to the teacher. I’ve been trying to figure out how to validate that assumption, and I think I’ve figured out an experiment I’d suggest: Let’s try keeping the students the same, and changing the teachers. So, divide all the classrooms into quartiles based on year 1, and then assign the teachers from the top quartile to teach the students who were in bottom-quartile classes, and the teachers from the bottom quartile to teach the students in the top-quartile classes (this would mean keeping the students constant over two years). It also presumes that teachers can be reassigned, but they could be within a school (if not across schools).

    Another point that jumps out at me is the dismissal of evidence that credentialling produces a 1% positive effect on student performance (because there’s big within-group variation), but then support of firing the bottom quintile of teachers (because this would create a net 1.2% positive effect on student performance). This kind of logic leads me to question whether these evidence-based practitioners are really practicing evidence-based decision making, as scientists usually do (or at least hope to).

    As I slog through the literature, on VAM scores & on reform measures, I’m becoming concerned that this is a broader problem, an internal unwillingness to actually accept the evidence when the evidence doesn’t come out right for the reasonable hypotheses they designed (for example, small schools, charters, or vouchers as three examples where the evidence has been mixed or negative). I find I can detect this problem internally within both of those papers (from a policy group at the Brookings Institution). In spite of arguing for evidence-based practices, the folks writing these papers seem to be surprisingly unwilling to accept the evidence and change their views when it comes in (instead, they’re waiting for “better” evidence that supports their hypotheses). I’m beginning to understand Diane Ravitch — she seems to be an evidence-based education policy maker who has actually let the evidence sway her opinions about the appropriate reforms to make. These systems are complex, and sometimes very reasonable hypotheses (vouchers, letting parents choose schools for their children) don’t play out the way they should (just like lowering blood sugar with medication doesn’t always prevent the adverse consequences of high blood sugar). When that happens, we have to give up our hypotheses and move on, and I find the two papers don’t indicate that these authors are sufficiently open to evidence of falsification.

    (OK, long thoughts, but sometimes I just need to write them down. Helen, BTW, I am also in your school district, and these papers are the ones that Tim Burgess linked to in his Crosscut article).

    Comment by bj — 2010 August 23 @ 10:53 | Reply

  14. I read most of the Gordon paper but almost none of the Bendor paper. Both are policy papers with pointers to research papers, but no research in the policy papers themselves. Neither is a critical review of the literature, but just statements of what they believe should be done, based on the papers they choose to cite. Given that they have very specific policies they want to recommend, it is hard to believe that they did not cherry-pick the studies to cite.

    I’m not all that interested in reading policy papers—I want to read scientific review papers, which look at most of the studies in the field and try to come up either with a consensus result (a meta-analysis) or point out the methodological differences that lead to conflicting results.

    Comment by gasstationwithoutpumps — 2010 August 23 @ 11:50 | Reply

  15. “I’m not all that interested in reading policy papers—I want to read scientific review papers, which look at most of the studies in the field and try to come up either with a consensus result (a meta-analysis) or point out the methodological differences that lead to conflicting results.”

    Dan Goldhaber and Michael Hansen, “Assessing the Potential of Using Value-Added Estimates of Teacher Job Performance for Making Tenure Decisions,” CALDER working paper, revised February 2010, conclude that using value-added metrics to assess teachers and determine tenure is a reasonable option because it is a better predictor than “observed” characteristics such as degree. Their analysis is based on asking what would happen if the bottom 25% of teachers were not retained. Their study is based on a large set of data from North Carolina. However, what they don’t explain is that the bigger the percentage of teachers who are fired, the smaller the effect would be, because as you go up the VAM ladder you get more and more effective teachers. So presumably, firing the bottom 1%, 5%, or even 10% would have a bigger effect on the affected students and wreak less havoc on teacher morale.

    I’d describe the result as interesting, but it still isn’t at the level of a meta-analysis. I’ve added it to my collection.

    By the way, Helen, thanks for the link to the Seattle discussion of MAP. You deserve a medal for patience!

    Margaret

    (Working paper PDF: CALDERWorkPaper_31-updated.pdf)

    Comment by Margaret DeLacy — 2010 August 23 @ 16:49 | Reply

  16. Hello Gas Station, BJ, Margaret and all,

    I think some congratulations are due all around here– to the amount of careful, logical thought represented here, and to me, who managed to read through and comprehend most of your comments at 6:30 AM EST without coffee. (I’m getting up to make some now.)

    I am a 7th grade teacher in upstate New York with a mission: synthesizing scientific material for lay folk. I wondered if all of you could help me do this for VAM. (I blog at the website listed above– had the honor this year of being cited by the Washington Post for the blog, just to provide some evidence for gravitas.)

    There’s an incredible amount of important statistical detail in your comments, and I wonder if you wouldn’t mind responding (also in the comments) with any version of this information you would find appropriate for, say, an intelligent senior in high school. Dan Willingham’s video provides a nice baseline

    http://seattletimes.nwsource.com/html/health/2002257729_healthlaughter01.html

    but your comments move beyond his six criticisms, and I’d like to capture them as well.

    Thanks for any assistance you can give.

    Comment by Dina — 2010 August 26 @ 04:10 | Reply

    • I can’t speak for the others, but I’m willing to help you and your students understand the statistical analysis, to the extent that we can follow it. We don’t have access to the original data or the analyses of it, just to the publicly released reports about those analyses.

      Your link to the article “Laugh yourself skinny” seems irrelevant.

      Did you mean
      http://voices.washingtonpost.com/answer-sheet/daniel-willingham/willingham-the-huge-problem-wi.html
      http://voices.washingtonpost.com/answer-sheet/daniel-willingham/willingham-3-key-factors-in-te.html

      Willingham correctly points out that an assumption of value-added analysis is that the students are assigned at random, but neglects to talk about how the use of the student’s former test scores controls for this effect reasonably well (except at the extremes of highly-gifted classes or special classes for students far below average intelligence).

      He does not analyze value-added studies, but just states flat-out that they are inadequate and that teachers can evaluate themselves better. He may be correct, but blatant assertion is not a particularly convincing argument.

      Comment by gasstationwithoutpumps — 2010 August 26 @ 09:48 | Reply

  17. Whoops! Sorry, all. Posted an old link on my clipboard. As I said, reading and writing before coffee. Try this one.

    Comment by Dina — 2010 August 26 @ 09:53 | Reply

    • Thanks for the corrected link. I don’t generally watch videos for information, as the information transfer rate is usually so low (50-60 wpm, about 1/5th to 1/10th my reading speed) and noisy. Videos are good when there is motion that needs to be conveyed (like explaining how a molecular motor works), but for simple verbal communication, I find writing a far superior medium.

      Has Willingham written anything contentful on value-added evaluation of teaching? The two Washington Post blogs/articles were not ones that led me to think he had much to add.

      [Update: 27 Aug 2010. I watched the video. It had very little content, and could have been delivered in one paragraph of text.]

      Comment by gasstationwithoutpumps — 2010 August 26 @ 11:33 | Reply

  18. “I am a 7th grade teacher in upstate New York with a mission: synthesizing scientific material for lay folk. I wondered if all of you could help me do this for VAM.”

    Margaret replies: I’m impressed to learn that you are doing this AND teaching middle school–my daughter is a teacher and it is a more-than-full-time job.

    I’m a mom, not a researcher (in this area) and NOT a statistician. I compiled some research but I don’t generate it and don’t pretend to have the answers, so my comments aren’t what you will need to engage or refute an expert.

    There is a very clear explanation of Value-Added assessments (as opposed to status-based assessments) in a paper by real experts at the Northwest Evaluation Association. The NWEA developed the computer-adaptive MAP test that is used by many districts. I have no connection to them except that they are located nearby. I’m not sure of the reading level, but I’ve given it to politicians so it should be accessible for high-schoolers! They will find it more relevant to their own lives than many topics.

    Individual Growth and School Success.
    http://www.nwea.org/research/national.asp

    I’ll come out of the closet and share some of my thoughts about this:

    IMHO the biggest downsides to Value-Added measurement are:

    (1) most tests are closely pegged to grade level resulting in anomalies at the ends of the score distribution (this is a mistake made by NCLB)
    (2) many teachers in most districts don’t teach tested or testable subjects. Testing writing skill or social studies is much harder than testing math or reading, and testing art, music, geology, theater etc. is hopeless.
    (3) test information is least informative (for a variety of reasons) where we need it most: in early elementary and in High School.
    (4) proper use requires complex formulas, creating a “black box” that resists informed criticism and is hard for the public to understand
    (5) everything hangs on the consistency of RIT intervals from level to level (and they aren’t)
    (6) VAM is based on real performance, not on ideal performance. Sometimes that means ratifying an unacceptable status quo and
    (7) anything can be ruined by school districts.

    My biggest gripe is that VAM formulas usually presume that high-achieving students will make lower gains than other students, thus producing a self-fulfilling prophecy. Schools where these students make minuscule gains still get a pass when they shouldn’t–because other schools do an equally bad job. Appropriate programs for these students can produce much higher gains but such programs are so rare they don’t get factored into the formulas.

    My own state managed to produce a complete and utter mess when it tried to incorporate progress-based data into its reporting system.

    Despite all this, VAM, done well, can be a very useful component of a district’s accountability process for evaluating teachers, schools, curriculum, and programs as long as it uses 3-year moving averages, not one-year scores.

    The LA Times shouldn’t publish teachers’ names–especially not if they are attached to just one year’s scores–it’s too much of a random walk and it may well discourage good teachers from going there. School scores are fair game. Schools are ranked by their average test scores (not VAM–just regular scores) all the time. As the Times pointed out, that is grossly inaccurate and unfair. VAM is at least less unfair–and as the Times showed, it can reveal some real gems in low-income neighborhoods.

    I feel that standardized student assessment has a place in education. In the past, too many schools graduated illiterate and innumerate children and got away with it. Results do count and they last a lifetime: I don’t want a compassionate and lovable surgeon who amputates the wrong limb and I don’t want kind and lovable teachers who don’t actually teach children what they will need to know.

    If the Times articles increase public understanding of assessment, on the whole that’s a good thing.

    Margaret

    Comment by Margaret DeLacy — 2010 August 26 @ 22:26 | Reply

    • The Washington State experiments with value-added testing use the MAP tests, which have a uniform scale that spans grade levels and was developed with some of the best psychometric methods available. The computer-based testing there automatically adjusts the difficulty of the questions based on the student responses, so the problems of ceilings and floors on tests are not too serious, and the scores are based on the difficulty of the individual test questions, not a simple percent correct.

      The CST tests used for the LA Times studies are purely grade-level tests. According to the post-test guide, “CST scale scores for the same content area may not be compared across grades because CST scale scores are not vertically scaled, or scaled across grades.” This means that the CST scales are not quite the sort of linear scale assumed by the value-added linear model that was used. The difference in scale score from 2nd to 3rd grade does not mean the same thing as the difference in scale score from 3rd to 4th grade, so you can only compare “value-added” between teachers who are teaching the same grade level. Buddin did do this correctly, reporting that teachers were only compared within a grade level. The scales are attempting to be linear within a grade level, but the crude scoring mechanism used (translation from percent correct, with no adjustment for varying difficulty of the different items) limits the ability to make a reasonable scale.

      One interesting result of the LA Times study was that school-based effects were much smaller than teacher-based effects, so reporting how much value-added there is for a school is rather useless: it tells a parent almost nothing about how their child would do at the school. Reporting individual teacher scores is more informative, but is also much noisier data: the numbers may be way off for any particular teacher, even if the overall scoring is fairly predictive of future classes for the teachers.

      The LA Times study, like many such studies, did not test the predictive value of the models. Each teacher was assigned a “value-added” score based on a few years of tests, but so far as I can see, no study was done here of how reliable the teacher score is. Other studies have done those tests, and found that the values assigned to the teachers have a rather low reliability (sorry, I’ve lost that citation).

      What comes up consistently in all studies is that the “teacher effect” is large (who the teacher is matters) and that the conventional measures of teacher qualification are nearly useless. The measurement of individual teachers is pretty noisy though, and I would be very reluctant to see hiring and firing decisions made based on such a noisy measure. (One-time bonuses are less problematic, as awards and prizes are always somewhat random, and a lottery for bonuses is not so terrible, as long as there is some correlation between who gets the bonuses and who deserves them.)
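
      One simple check of that noisiness (a sketch of the idea, not anything done in the Times report) would be to compute each teacher’s score separately from two different years of data and correlate the two sets of estimates; a toy version with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(4)

# 500 imaginary teachers: a stable true effect plus large year-specific noise.
true_effect = rng.normal(0, 1, 500)
estimate_year1 = true_effect + rng.normal(0, 2, 500)
estimate_year2 = true_effect + rng.normal(0, 2, 500)

# Year-to-year correlation of the estimates is a cheap reliability check;
# with noise this large it comes out around 0.2, i.e. mostly noise.
r = np.corrcoef(estimate_year1, estimate_year2)[0, 1]
print(f"year-to-year correlation of teacher estimates: {r:.2f}")
```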

      Comment by gasstationwithoutpumps — 2010 August 27 @ 06:27 | Reply

  19. One of the better critiques I’ve read of the LA Times study is at SchoolFinance101

    Comment by gasstationwithoutpumps — 2010 August 29 @ 17:55 | Reply

  20. I should have checked the link before hitting “send”. The Northwest Evaluation Association report “Individual Growth and School Success” has been moved to:

    (Full report PDF: Individual_Growth_and_School_Success_Complete_Report_0.pdf)

    The short version is here:

    (Executive summary PDF: Individual_Growth_and_School_Success_Exec_Summary_0.pdf)

    and other reports from the same group are here:
    http://www.nwea.org/our-research/national-education-research

    Margaret

    Comment by Margaret DeLacy — 2010 August 31 @ 14:06 | Reply

  21. […] Value-added teacher ratings August 2010 28 comments, 414 views. This post, prompted by an LA Times article, was one of many in the blogosphere about the advantages and disadvantages of value-added teacher ratings.  I was fortunate enough to get some thoughtful comments from viewers, which made for an interesting discussion without the knee-jerk responses of much of the blogging on the subject. […]

    Pingback by 2010 in review « Gas station without pumps — 2011 January 2 @ 12:52 | Reply

  22. […] Value-added teacher ratings […]

    Pingback by Blogoversary « Gas station without pumps — 2011 June 5 @ 10:51 | Reply

  23. […] Value-added teacher ratings […]

    Pingback by Second Blogoversary « Gas station without pumps — 2012 June 2 @ 18:15 | Reply

