How Teachers are Graded

Table 1. How teacher value-added has fared: a timeline
Can student test scores measure a teacher’s impact? Amid controversy, researchers assess teacher value-added models.

Teachers matter. Primary and secondary education is one of the most important human capital investments Americans will ever make, and teachers are crucial to the quality of that education. But how do we measure teacher effectiveness, and exactly how much do teachers affect their students’ long-term future?

One measure that evaluates teachers based on improving student achievement, the “value-added” model, has become prevalent in recent years. It is even mandated by a majority of states in teacher evaluations. Nevertheless, value-added models are controversial, and teachers’ organizations and other critics have challenged them in public debate and sometimes even in court. Interestingly, this pushback has coincided with a growing body of research suggesting that value-added models, when used carefully, are potentially valuable.

This policy brief provides an overview of recent research on value-added models, explaining how they are measured, their strengths and weaknesses, and recent work linking them to students’ career success. Some recent research suggests that having a good teacher for just one year can add as much as $7,000 to a student’s lifetime earnings.

I’ll also look at how value-added models have recently been applied and how they may best be used in the near future.

How value-added models work
Value-added models are designed to measure a teacher’s impact on student learning. Most often, standardized test scores are the measure of learning, because tests are objective and simple to administer and interpret. This is common enough that sometimes the term “value-added” is used synonymously with test score gains, although the value-added model can be applied to other student outcomes, too.

At their most basic, value-added models predict students’ test scores (or other outcomes) based on everything known about the students except their teacher—most important, their previous test scores. Demographic data and school and district information are also used, subject to availability. A student’s actual test scores are compared with predicted scores, and the difference between the two is the teacher’s value-added. In other words, value-added is intended to answer the question: Do the teacher’s students perform better or worse than expectations?
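To make the predicted-versus-actual idea concrete, here is a minimal sketch in Python. It is not the model any particular state or researcher uses: it assumes the only predictor is the previous year's score, fits a straight line to all students, and averages each teacher's prediction errors. The teacher names and scores are invented for illustration, and real models add demographic and school controls.

```python
from collections import defaultdict
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples.

    Predicts each current score from the prior score alone (an
    illustrative simplification), then averages actual minus
    predicted within each teacher's class.
    """
    prior = [r[1] for r in records]
    current = [r[2] for r in records]
    a, b = fit_line(prior, current)
    residuals = defaultdict(list)
    for teacher, p, c in records:
        residuals[teacher].append(c - (a + b * p))
    return {t: mean(rs) for t, rs in residuals.items()}

# Hypothetical data: both classes start from the same prior scores.
records = [
    ("Smith", 40, 50), ("Smith", 60, 70), ("Smith", 80, 90),
    ("Jones", 40, 44), ("Jones", 60, 64), ("Jones", 80, 84),
]
scores = value_added(records)
print(scores)
```

In this made-up example, Mrs. Smith's students finish 3 points above their predicted scores on average and Mr. Jones's finish 3 points below, so their value-added estimates are +3 and -3.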

Are value-added models reliable?
There are a few natural questions about such a measure. Before discussing how to apply it, we first want to know that it works. Is it precise enough to measure teaching effectiveness? And is it biased against or in favor of certain types of teachers?

Let’s look at the precision question first. One way to study this is to measure how well teachers’ value-added scores in one year predict their value-added the next year. Many studies have examined this year-to-year relationship, and a typical estimate of the correlation between a teacher’s value-added in one year and in the next is about .40. This is not considered very strong, and at first it may seem that value-added is too influenced by luck to be useful. But there are at least two reasons to persist. One is that most easily observable teacher characteristics, such as whether the teacher holds an advanced degree or the teacher’s experience, are even less helpful at predicting effectiveness (with the exception that most brand-new teachers are less effective in their first couple of years). The second is that luck starts to even out over time, and value-added models become more useful when averaging teacher scores over a period of three or more years.
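One rough way to see how averaging helps is the Spearman-Brown formula from test theory, sketched below. This is my illustration, not a calculation from the studies cited here. It assumes the .40 year-to-year correlation can be read as the reliability of a single year's score and that each year's luck is independent of the others, neither of which holds exactly for real teachers.

```python
def reliability_of_average(r_single, n_years):
    """Spearman-Brown: reliability of the mean of n equally noisy scores.

    r_single is the reliability of one year's score (assumed here to
    equal the year-to-year correlation); noise is assumed independent
    across years.
    """
    return n_years * r_single / (1 + (n_years - 1) * r_single)

for n in (1, 3, 5):
    print(n, round(reliability_of_average(0.40, n), 2))
```

Under those assumptions, a three-year average would have a reliability of about .67, compared with .40 for a single year.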

A 2010 Brookings Institution report on teacher evaluation made this point by comparing the correlation in year-to-year value-added to measures in other contexts. My favorite example is that the level of correlation is about the same as that in year-to-year comparisons of Major League Baseball players’ batting averages. No baseball fan would make a definitive statement about a hitter’s merit based solely on his batting average in one year. At the same time, that fan wouldn’t ignore the player’s average, either. If the player consistently hits well for several seasons, we believe that his record reflects his real talent. In the same way, it would not be appropriate to say that Mrs. Smith is a better math teacher than Mr. Jones because she had a higher value-added score in one year. But if Mrs. Smith’s students do well for three or four years in a row, we can be more confident that she is doing something right. At least, we can be confident as long as the measure is not systematically biased.

Are value-added models biased?
A second question is whether our models are actually able to distinguish teaching from other inputs to learning. The reason value-added models include as much student and school information as possible is to try to separate all other effects from the teacher’s impact. But economists are always worried about accounting for possible bias. We know that rational actors make decisions based on all the relevant information they have, and we also know that sometimes, decision-makers have information that an outside researcher cannot see.

In the case of value-added models, one worry is that school administrators do not assign students randomly to teachers. Researchers can see some of this—for example, if Mrs. Smith teaches an advanced-track class, we would know that those students had higher previous test scores, and the model would account for that. But there is a lot we never see. Suppose the principal knows that Mr. Jones is a better disciplinarian than Mrs. Smith and makes sure their classes are mostly the same, but the biggest troublemaker in the grade always ends up in Mr. Jones’s classroom. Then Mr. Jones’s assignment is always tougher than it looks to an outside researcher, and Mrs. Smith’s is always easier. If we are not careful, we might end up concluding that Mr. Jones is a poor teacher, even though it is actually a positive attribute of his that makes the data look that way. There has been a lively academic debate over the past decade about how much bias in teacher value-added might be due to students being sorted in unobservable ways.

One encouraging piece of evidence comes from the Measures of Effective Teaching (MET) project, funded by the Bill & Melinda Gates Foundation. In the 2010–11 school year, hundreds of Grade 4–9 classes at six large school districts across the U.S. were randomly assigned to teachers participating in the MET study at their schools. The project was explicitly designed to get rid of any link between unobservable characteristics and the teachers to whom students were assigned. This allowed researchers to compare teacher value-added estimates from 2010–11 with value-added measures for the same teachers in 2009–10, when there was no randomization. The researchers found year-to-year changes similar to findings in other studies that did not randomize. This suggests that any selection on unobservable characteristics is modest, and traditional value-added measures do not suffer from large biases.

The MET project also used classroom video observations and student surveys to supplement achievement gains from test scores in evaluating teacher effectiveness. This was partly to address a third common critique of value-added models, which is that they encourage teaching to the test. MET researchers found that incorporating these alternate measures along with value-added based on test scores created a better overall measure. The combined score was more stable over the two years. It was also better at predicting student outcomes from measures other than state test scores, such as self-reported student effort and enjoyment of class.

As states and school districts increasingly use value-added models in teacher evaluations, it is especially important to think about using multiple measures in order to limit the incentive to manipulate test scores. Relying on test scores occasionally leads to cheating, but more commonly it leads to over-emphasis on test preparation at the possible expense of other classroom activities. This may indeed increase scores, but then value-added is no longer measuring teaching quality as intended. In economics, this concept is known as Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

Do high value-added teachers lead to good economic outcomes?
A key question, then, is whether high value-added teachers affect economic outcomes. If Mrs. Smith is really good at teaching to the test, she might get a high value-added rating, but she might not be helping her students in the long run. If she helps them understand mathematical principles or improves their study habits, however, it could help them in future classes and ultimately could make a difference in their lives. This is what both parents and policymakers (not to mention economists) ultimately care about.

The best evidence we have so far on this point comes from economists Raj Chetty and John Friedman of Harvard University and Jonah Rockoff of Columbia University. They were given access to (anonymous) student data that matched administrative school data in Grades 4–8 to federal tax records. In a pair of studies published in 2014, they found that students of high value-added teachers are more likely to attend college and have higher adult wages. Their results have been subjected to the same arguments of potential statistical bias discussed earlier in this brief, but the validity of their primary estimation strategy has since been defended by others who replicated the economists’ value-added results using different data.

Their most interesting estimate is that if a student has a teacher one standard deviation better than average in one year, that student’s earnings at age 28 are about $300 higher. (One standard deviation above average would put the teacher in the 84th percentile of performance.) Since the database includes over 600,000 observations, this estimate is highly statistically significant. The authors, as well as many others, have since spent a lot of effort trying to characterize the size of that effect.
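The 84th percentile figure follows from the normal distribution, as this short check shows. It assumes teacher effectiveness is approximately normally distributed, which is my simplification for illustration.

```python
from statistics import NormalDist

# Share of a standard normal distribution that lies below +1 standard deviation.
percentile = NormalDist().cdf(1.0) * 100
print(round(percentile, 1))
```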

The first way they did this was to assume that the earnings gain at age 28 was characteristic of the gain at each point during the student’s career. An extra $300 represents an increase of about 1.4 percent over mean earnings of American 28-year-olds, so what would be the present value of 1.4 percent of expected earnings for an entire career? They calculated a benchmark that one year of exposure to a high value-added teacher is worth about $7,000 in future wages for each student in her class. The returns for a class of 20–30 students might then be $140,000–$210,000, which is a massive public return on investment compared with a teacher’s salary or bonuses that might help retain excellent teachers.
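The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the numbers quoted above. It does not reproduce the authors' present-value calculation, which depends on discounting and earnings-growth assumptions not given here, and the variable names are mine.

```python
GAIN_AT_28 = 300          # dollars, estimated earnings gain at age 28
GAIN_SHARE = 0.014        # the same gain as a share of mean earnings
PV_PER_STUDENT = 7_000    # dollars, the authors' lifetime benchmark

# Mean earnings of 28-year-olds implied by the two figures above.
implied_mean_earnings = GAIN_AT_28 / GAIN_SHARE

def class_return(n_students, per_student=PV_PER_STUDENT):
    """Total present-value gain for one class in one year."""
    return n_students * per_student

print(round(implied_mean_earnings))
print(class_return(20), class_return(30))
```

The two figures together imply mean earnings of roughly $21,400 at age 28, and the class totals come to $140,000 and $210,000 for 20 and 30 students.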

A second benchmark compares this with other types of wage returns. Labor economists have estimated that an additional year of schooling is worth about a 9 percent increase in annual wages. That’s more than six times larger than the 1.4 percent effect of the better teacher. Looked at another way, it suggests that having a high value-added teacher is like having a 16 percent longer school year. This is admittedly a back-of-the-envelope analysis, but it helps contextualize the research findings and provides us with a sanity check. It may sound strange to think of making more money because of a great sixth-grade teacher, but “how much more did I learn than usual when I had a great teacher?” is an easier question to think about.
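The back-of-the-envelope comparison works out as follows, using only the two percentages quoted above. The variable names are mine.

```python
SCHOOLING_RETURN = 0.09   # wage gain from one more year of schooling
TEACHER_EFFECT = 0.014    # wage gain from one year with a better teacher

# How many times larger the schooling return is than the teacher effect.
ratio = SCHOOLING_RETURN / TEACHER_EFFECT

# The teacher effect expressed as a fraction of a school year.
equivalent_year_share = TEACHER_EFFECT / SCHOOLING_RETURN

print(round(ratio, 1), round(equivalent_year_share * 100))
```

The ratio is about 6.4, and the teacher effect equals about 16 percent of a year of schooling.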

Putting value-added models to work in schools
If value-added models are a useful measure of effective teaching and if they have long-term impacts, what next? A natural answer is that if we truly believe that value-added models are good and meaningful measures, we should try to incorporate them into teacher evaluations and decision-making. This has been the position of many politicians on both sides of the aisle. Spurred by the U.S. Department of Education, 30 states as of 2014 had policies or legislation that required teacher evaluations to make significant use of value-added models. To put it mildly, this has caused a great deal of controversy, with objections both from those who think that value-added models promote bad incentives for teachers and those who do not believe that value-added models produce good estimates.

One particular flashpoint came in 2010 when the Los Angeles Times published online the value-added scores for thousands of Los Angeles elementary school teachers. The newspaper was careful to say the scores “capture only one aspect of a teacher’s work,” and it included both the number of students the rating was based on and the level of uncertainty in the estimate. Still, with only one measure to focus on, the temptation to overstate its importance can be irresistible, and many teachers were understandably upset.

The publication of these numbers also called attention to their flaws. For example, many teachers in “departmentalized” programs were the teacher of record for multiple subjects but in fact taught only one subject in the department they supervised. Because of the limitations of the data, these teachers had value-added scores posted for subjects they did not actually teach. This possibility was noted in the Times’s FAQ and technical documentation (and it would be known to any principal who was using teacher value-added in evaluations), but it still looked bad and may have undermined the credibility of the analysis to casual observers.

This is one example of a larger backlash that has led to a recent pullback in public policies that assign importance to value-added models. In 2010, the Department of Education was very clear about its preference for value-added models, making value-added approaches a criterion in evaluating over $400 million worth of its Teacher Incentive Fund grants. In contrast, a 2015 fact sheet from the DOE admitted that over-testing is a problem and encouraged state discretion in how heavily to weight standardized test results in teacher evaluations. In recent years, teachers in several states have brought lawsuits against state policies that use value-added models in teacher evaluations, further casting doubt on the implementation of these policies.

As research advances, is value-added in decline?
This has created a strange situation where value-added-based policies are in retreat at the same time that evidence for the models’ potential usefulness is mounting. A review paper published in the Economics of Education Review last year characterized recent research as strikingly consistent in finding that students stand to gain from the use of value-added models in personnel policies. This growing body of work rejects one longtime concern with value-added models, that they are by their nature biased or too imprecise to be useful.

Researchers and critics may actually be moving in the same direction in addressing the other major concern, which is how to combat the distortion in test scores brought on by the use of value-added models in teacher evaluation. A balance must be found where the information in value-added models can be used without distorting teachers’ incentives enough that they prioritize test scores to students’ detriment. States are keeping value-added models in their teacher evaluation practices but lowering their weight or delaying their implementation. For example, the state of Ohio, which previously required 50 percent of a teacher’s evaluation to be based on student growth measures such as value-added, decided last August not to use value-added ratings from 2014–15 or 2015–16.

In the long run, value-added will still be used, but its weight will be reduced, and school districts will be able to allocate some of that weight to alternative measures such as student surveys or peer reviews. At the same time, studies like the MET project are helping us better understand how several different types of measures can be used together. These papers and policies both intend to provide a well-defined, transparent measure that better serves teachers and students.
