# Predictive effects of teachers and schools on test scores, college attendance, and earnings


Contributed by Gary E. Chamberlain, August 20, 2013 (sent for review April 18, 2013)

## Significance

This study measures the predictive effect of teachers on adult outcomes. The data are based on elementary and middle school classrooms, grades four through eight. For a classroom, there is an average score based on a math or reading test given near the end of the school year. There are also later outcome measures for that classroom. These measures include the fraction of the classroom that is attending college at age 20 and the average earnings of the classroom at age 28. The predictive effects are based on observing multiple classrooms with the same teacher.

## Abstract

I studied predictive effects of teachers and schools on test scores in fourth through eighth grade and outcomes later in life such as college attendance and earnings. For example, predict the fraction of a classroom attending college at age 20 given the test score for a different classroom in the same school with the same teacher and given the test score for a classroom in the same school with a different teacher. I would like to have predictive effects that condition on averages over many classrooms, with and without the same teacher. I set up a factor model that, under certain assumptions, makes this feasible. Administrative school district data in combination with tax data were used to calculate estimates and do inference.

The outcome data are based on elementary and middle school classrooms, grades four through eight. For a classroom, there is an average score based on a math or reading test given near the end of the school year. There are also later outcome measures for that classroom. These measures include the fraction of the classroom that is attending college at age 20 and the average earnings of the classroom at age 28. The classrooms can be grouped by schools, and, within a school, can be grouped by teacher.

The goal of the paper is to provide predictive effects of teachers and schools on these outcomes. For example, predict the fraction of a classroom attending college at age 20 given the test score for a different classroom in the same school with the same teacher and given the test score for a classroom in the same school with a different teacher. Or predict the fraction of a classroom attending college at age 20 given the fraction attending college for a different classroom with the same teacher and given the fraction attending college for a classroom in the same school with a different teacher. I would like to have predictive effects that condition on averages over many classrooms, with and without the same teacher. I set up a factor model that, under certain assumptions, makes this feasible. Then I can define teacher and school factors based on test score data and measure the predictive effect of the teacher factor on college attendance. More directly, I can define teacher and school factors based on the college attendance data and measure the predictive effect of the teacher factor on college attendance.

These predictive effects can be based on residuals, where first we form predictions based on observed variables (*X*) such as class size, years of teacher experience, lagged test scores, and parent characteristics. The residuals are the prediction errors. Then the teacher and school effects that I measure in these residuals correspond to unmeasured (latent) variables or, more precisely, to the parts of those latent variables that are not predictable using the observed variables in *X*. I am interested in these latent variables because they may be related to unmeasured characteristics of teachers that have a causal effect on outcomes, in the sense of unmeasured inputs in a production function. After setting up the factor model, I discuss how it could be related, under random assignment assumptions, to a production function.
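As a rough illustration of the residual step, the sketch below forms prediction errors from a least-squares linear predictor of a classroom outcome given observed controls. The function name `residualize`, the simulated variables, and the use of ordinary (unweighted) least squares are assumptions made for this example; the paper's own definition of the predictor is given in Methods.

```python
import numpy as np

def residualize(y, X):
    """Return prediction errors from the least-squares linear predictor of y given X.

    y: (n,) classroom outcomes (e.g. classroom mean test scores).
    X: (n, k) observed controls; a constant column is added here.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

# Hypothetical data: 200 classrooms with two observed controls.
rng = np.random.default_rng(0)
class_size = rng.integers(15, 35, size=200).astype(float)
experience = rng.integers(0, 30, size=200).astype(float)
score = 0.02 * experience - 0.01 * class_size + rng.normal(size=200)

u = residualize(score, np.column_stack([class_size, experience]))
```

By construction the residuals `u` have mean zero and are uncorrelated with the controls, so any teacher or school pattern left in them is not linearly predictable from the observed variables.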

Rivkin et al. (1) noted that students and parents refer often to differences in teacher quality and act to ensure placement in classes with specific teachers. Existing empirical evidence, however, does not find a strong role for measured characteristics of teachers (such as teacher experience, education, and test scores of teachers) in the determination of academic achievement of students. This lack of a strong role for measured characteristics motivates interest in unmeasured characteristics of teachers that have a causal effect on academic achievement. Related literature on estimating teacher effects on test scores includes refs. 2–10. A typical finding is that a 1-SD increase in the teacher factor corresponds to an increase in individual scores on the order of 0.1, where the units are SDs in the distribution of scores for individual students.

In the Tennessee Student/Teacher Achievement Ratio experiment, known as Project STAR, children entering kindergarten were randomly assigned to class types, which were randomly assigned to teachers. The random assignment was within schools (e.g., ref. 11). It may be plausible to assume that the double random assignment of students and teachers applied also to specific classrooms (12, 13). Chetty et al. (13) were able to obtain data on later outcomes for these children, such as college attendance and earnings, which could be combined with the test score data in Project STAR. These data make it possible to study classroom effects (including teacher effects and peer effects) on later outcomes. The advantage of the random assignment is that prekindergarten characteristics of children are not correlated within a kindergarten class. In the STAR data, however, each kindergarten teacher is only observed teaching a single kindergarten class, making it difficult to separate out the part of the classroom effect due to the teacher. A strength of the data used in my paper is that teachers are observed in multiple classrooms. However, we do not have the random assignment, so there is a concern that within a classroom, there is correlation across the students in characteristics that existed before the class. A teacher effect may in part reflect sorting of students to teachers, with persistent differences across teachers in characteristics of the students entering their classes. A motivation for using residuals is that it is more plausible to make random assignment assumptions within a school when working with residuals. I recognize that the available control variables may not be adequate to justify “as if” random assignment within schools; for example, the parent characteristics do not include parents’ education. 
Nevertheless, it is useful to ask what would be identified under within-school random assignment, and that analysis provides some guidance in presenting and interpreting the predictive effects.

To anticipate my results, using the full set of controls in *X*, when the factors are constructed using test score data, the predictive effect on college attendance of a 1-SD increase in the teacher factor is 0.13 percentage points. When the factors are constructed using data on college attendance, the predictive effect of a 1-SD increase in the teacher factor is 0.79 percentage points. Under the assumption (for residuals) of random assignment of students and teachers within schools, the 0.79 estimate has a structural interpretation based on a production function, and the 0.13 estimate provides a lower bound.

## Methods

Let $Y^{h}_{ij}$ denote outcome *h* for classroom *j* in school *i*. Let $X_{ij}$ denote a vector of predictor variables such as class size, years of teacher experience, and an average of test scores from a previous year for members of the classroom. We shall work with residuals of the form $U^{h}_{ij} = Y^{h}_{ij} - X_{ij}'\beta^{h}$, where $\beta^{h}$ is defined to solve a prediction problem, which will be discussed below. Let $U_{ij}$ denote the vector formed from the outcome residuals for classroom *j* in school *i*. Components of $U_{ij}$ are the residuals based on outcomes such as classroom average test score (*ts*), the fraction of the classroom attending college at age 20 (*co*), and the average earnings of the classroom at age 28 (*ea*).

I treat the schools as if they were a random sample from some unknown distribution, so that the schools are exchangeable. I only use a school *i* if there is at least one pair of classrooms with the same teacher and at least one pair of classrooms with different teachers. Within school *i*, form the set of classrooms such that for each one there is at least one other with the same teacher. Assign equal probability to each of these classrooms, choose one at random, and denote it by *A*. Assign equal probability to each of the other classrooms that have the same teacher as *A*. Choose one at random and denote it by *B*. Assign equal probabilities to all of the classrooms that have teachers different from that of classroom *A*. Choose one at random and denote it by *C*. The prediction problems I consider fit into the following framework:

$$\theta = \arg\min_{a} E\left[W_i\, g(U_{iA}, U_{iB}, U_{iC}, a)\right], \tag{1}$$

where *g* is a given function. For example,

$$g(U_{iA}, U_{iB}, U_{iC}, a) = \left[U^{co}_{iA} - a_0 - a_1 U^{ts}_{iB} - a_2 (U^{ts}_{iB})^2 - a_3 U^{ts}_{iC} - a_4 (U^{ts}_{iC})^2\right]^2, \tag{2}$$

with $U^{co}_{iA}$ equal to the residual corresponding to attending college at age 20 and $U^{ts}_{iB}$ equal to the residual corresponding to the test score. Then, *θ* gives the coefficients in the (weighted) minimum mean-square-error linear predictor

$$E^{*}\left[U^{co}_{iA} \mid 1, U^{ts}_{iB}, (U^{ts}_{iB})^2, U^{ts}_{iC}, (U^{ts}_{iC})^2\right] = \theta_0 + \theta_1 U^{ts}_{iB} + \theta_2 (U^{ts}_{iB})^2 + \theta_3 U^{ts}_{iC} + \theta_4 (U^{ts}_{iC})^2. \tag{3}$$

An alternative could use the absolute value of the error instead of the squared error in Eq. **2**, in which case *θ* would give the coefficients in the (weighted) minimum mean absolute error linear predictor. The nonnegative scalar *W*_{i} allows for a weight in forming the moments. $W_i = 0$ unless school *i* has at least two classrooms with the same teacher and at least two classrooms with different teachers, so that the random vector $(U_{iA}, U_{iB}, U_{iC})$ is well defined. The nonzero values of *W*_{i} could, for example, be the number of classrooms in school *i* with teachers who have at least two classrooms.
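The random selection of classrooms *A*, *B*, and *C* within one school can be sketched as follows. The function name `draw_abc` and the dictionary representation of a school (classroom id mapped to teacher id) are hypothetical choices for this illustration; the function returns `None` for a school that lacks either a same-teacher pair or a different-teacher pair.

```python
import random
from collections import defaultdict

def draw_abc(teacher_of, rng):
    """Draw classrooms (A, B, C) for one school.

    teacher_of: dict mapping classroom id -> teacher id.
    rng: a random.Random instance.
    Returns (A, B, C), or None if the school is not usable.
    """
    by_teacher = defaultdict(list)
    for room, teacher in teacher_of.items():
        by_teacher[teacher].append(room)

    # Classrooms that have at least one other classroom with the same teacher.
    candidates_a = [r for rooms in by_teacher.values() if len(rooms) >= 2 for r in rooms]
    if not candidates_a or len(by_teacher) < 2:
        return None

    a = rng.choice(candidates_a)                      # equal probability over candidates
    b = rng.choice([r for r in by_teacher[teacher_of[a]] if r != a])
    c = rng.choice([r for r, t in teacher_of.items() if t != teacher_of[a]])
    return a, b, c

school = {"c1": "t1", "c2": "t1", "c3": "t2", "c4": "t3", "c5": "t3"}
a, b, c = draw_abc(school, random.Random(0))
```

In this example school, classroom `c3` can never be chosen as *A* or *B* because its teacher has only one classroom, but it can be chosen as *C*.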

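My estimator for *θ* is a sample counterpart of the minimization problem in Eq. **1**. To make this explicit, let $\mathbb{N}$ denote the positive integers, and let $S_i \subset \mathbb{N}$ denote the set of classrooms in school *i*. For each classroom $j \in S_i$, there is a teacher, denoted by $T_i(j)$. We can partition $S_i$ into subsets with the same teacher: $S_i = \bigcup_{t} S_{it}$, where $S_{it} = \{j \in S_i : T_i(j) = t\}$. Use iterated expectations to evaluate the expectation in Eq. **1** and simplify notation by dropping the *i* subscript:

$$E\left[W\, g(U_A, U_B, U_C, a)\right] = E\left\{W\, E\left[g(U_A, U_B, U_C, a) \mid S, T, \{U_j\}_{j \in S}\right]\right\}.$$

The outer expectation corresponds to our treatment of the schools as a random sample from some unknown distribution [so that $(S_i, T_i, \{U_{ij}\}_{j \in S_i})$ is independent and identically distributed from some unknown distribution]. We shall evaluate explicitly the inner expectation, which is over classes within the same school, given outcomes for each of the classes. Conditional on $S = s$ and $T$:

$$\Pr(A = j) = \frac{1}{\sum_{t':\, |s_{t'}| \ge 2} |s_{t'}|}, \qquad \Pr(B = k \mid A = j) = \frac{1}{|s_t| - 1} \;\; (k \in s_t,\ k \ne j), \qquad \Pr(C = m \mid A = j) = \frac{1}{|s| - |s_t|} \;\; (m \in s \setminus s_t),$$

with $t = T(j)$, where |*s*| denotes the number of elements in the set *s*, so that |*s*_{t}| is the number of classes taught by teacher *t*. Only condition on values for *t* such that $|s_t| \ge 2$. Only condition on values for *s* such that there is at least one pair of classrooms with different teachers, so that $|s| - |s_t| \ge 1$.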
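Because the selection of *A*, *B*, and *C* uses equal probabilities at each step, the inner expectation for one school can be computed exactly by enumerating every admissible triple instead of sampling. The sketch below is one way to do that under the sampling scheme described in the text; the names `triple_probabilities` and `inner_expectation` are hypothetical, and exact fractions are used only to make the probabilities easy to inspect.

```python
from collections import Counter
from fractions import Fraction

def triple_probabilities(teacher_of):
    """Map each admissible triple (a, b, c) to its selection probability.

    teacher_of: dict mapping classroom id -> teacher id for one school.
    """
    counts = Counter(teacher_of.values())                   # |s_t| for each teacher t
    n_total = len(teacher_of)                               # |s|
    n_eligible = sum(n for n in counts.values() if n >= 2)  # candidates for A
    probs = {}
    for a, ta in teacher_of.items():
        if counts[ta] < 2:
            continue
        for b, tb in teacher_of.items():
            if b == a or tb != ta:
                continue
            for c, tc in teacher_of.items():
                if tc == ta:
                    continue
                probs[(a, b, c)] = (
                    Fraction(1, n_eligible)
                    * Fraction(1, counts[ta] - 1)
                    * Fraction(1, n_total - counts[ta])
                )
    return probs

def inner_expectation(g, u, teacher_of):
    """Expectation of g(U_A, U_B, U_C) over the random choice of (A, B, C)."""
    probs = triple_probabilities(teacher_of)
    return sum(float(p) * g(u[a], u[b], u[c]) for (a, b, c), p in probs.items())

school = {"c1": "t1", "c2": "t1", "c3": "t2", "c4": "t3", "c5": "t3"}
probs = triple_probabilities(school)
```

Averaging these school-level expectations (times the school weight) across schools gives a sample analog of the objective function that involves no simulation noise from the random draw.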

Apply iterated expectations:

Now we can use these results to form our estimator. Let be defined to solve a prediction problem such asThe sample analog for Eq. **4** isproviding the estimated residuals . The sample analog for Eq. **1** is*Computation* shows how the computation simplifies in a special case, which includes Eqs. **2** and **3**. For inference, I shall use bootstrap methods, based on treating the schools as a random sample from some unknown distribution. This approach does not impose any structure on the covariances within a school.
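A minimal sketch of bootstrap inference that treats schools as the sampling unit: whole schools are resampled with replacement and the estimator is recomputed on each resample, so no structure is imposed on dependence among classrooms within a school. The function name `school_bootstrap_se`, the number of replications, and the toy estimator are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def school_bootstrap_se(estimator, schools, n_boot=500, seed=0):
    """Bootstrap standard error of `estimator`, resampling schools with replacement.

    estimator: callable taking a list of school-level data objects.
    schools: list with one entry per school (all classrooms of that school together).
    """
    rng = np.random.default_rng(seed)
    n = len(schools)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # draw n schools with replacement
        draws.append(estimator([schools[i] for i in idx]))
    return np.std(np.asarray(draws, dtype=float), axis=0, ddof=1)

# Hypothetical example: each school is an array of classroom residuals.
rng = np.random.default_rng(1)
schools = [rng.normal(loc=rng.normal(), size=int(rng.integers(3, 10))) for _ in range(200)]
mean_of_school_means = lambda sample: np.mean([s.mean() for s in sample])
se = school_bootstrap_se(mean_of_school_means, schools, n_boot=400, seed=2)
```

Keeping every classroom of a drawn school together is what lets the within-school covariances remain unrestricted.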

Within a school, we can form a partition of the classrooms, , for example by subject and grade. We can apply our analysis separately within each cell of the partition. It may be useful to have a compact summary of the results. One way to do this is to define for each cell of the partition. Assign a nonnegative weight to cell *l* in school *i*, which is zero unless contains at least one pair of classrooms with the same teacher and one pair of classrooms with different teachers. For the nonzero values, we could useOnly use a school *i* if . If , form the set of classrooms in such that for each one there is at least one other with the same teacher. Assign equal probability to each of these classrooms, choose one at random, and denote it by . Assign equal probability to each of the other classrooms in that have the same teacher as . Choose one at random and denote it by . Assign equal probabilities to all of the classrooms in that have teachers different from that of classroom . Choose one at random and denote it by . [ is undefined if ]. The new prediction problem is

### Factor Model.

These predictive effects condition on a single score for a different classroom with the same teacher and a single score for a classroom with a different teacher. A factor model can provide predictive effects that condition on averages over many classrooms, with and without the same teacher, and can provide a limit as the number of such classrooms tends to infinity. This factor model has the advantage of getting rid of the noise that comes from using data on only a few classrooms. Let denote unmeasured characteristics of the teacher for classroom *A* in school *i*, and let denote unmeasured characteristics of the school. Definewhere is a given function. For example, could select a component of and raise it to the power *n*: . Assume thatThis assumption follows from the assumption of exchangeability across schools and the random selection of classrooms *A*, *B*, and *C*.

Assume that and are independent conditional on the latent variables and . A motivation for this assumption is that, without conditioning on additional information, we can regard the random variables corresponding to different classrooms for the same teacher (within a school) as exchangeable. If they could be embedded in an infinite sequence of exchangeable random variables, then conditional independence would follow from de Finetti’s theorem (14). A richer analysis could exploit additional information, such as the temporal ordering of the classrooms for a given teacher, where patterns of serial correlation could emerge. I shall not pursue that here. The conditional independence implies thatLikewise, assume that and are independent conditional on , which implies thatNote thatso that , which implies thatTherefore, we can obtain the moments and from

Let *M* be a subset of . Note thatTherefore, the slope coefficients in the linear predictor can be obtained from and for .
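The role of the conditional independence assumptions can be illustrated with a small simulation. It assumes, purely for illustration, an additive Gaussian design in which each residual is a school effect plus a teacher effect plus independent classroom noise, with *A* and *B* sharing a teacher and *C* having a different teacher in the same school. In that design the cross moment of *A* and *B* recovers the variance of the combined teacher and school factor, the cross moment of *A* and *C* recovers the variance of the school factor, and their difference recovers the within-school variance of the teacher factor. All names and parameter values below are hypothetical.

```python
import numpy as np

def simulate_triples(n_schools, sd_school, sd_teacher, sd_noise, seed=0):
    """Simulate residuals (U_A, U_B, U_C) for one triple of classrooms per school."""
    rng = np.random.default_rng(seed)
    eta = rng.normal(0.0, sd_school, n_schools)        # school effect
    tau_a = rng.normal(0.0, sd_teacher, n_schools)     # teacher of A and B
    tau_c = rng.normal(0.0, sd_teacher, n_schools)     # different teacher, for C
    u_a = eta + tau_a + rng.normal(0.0, sd_noise, n_schools)
    u_b = eta + tau_a + rng.normal(0.0, sd_noise, n_schools)
    u_c = eta + tau_c + rng.normal(0.0, sd_noise, n_schools)
    return u_a, u_b, u_c

def factor_moments(u_a, u_b, u_c):
    """Cross moments that identify the factor variances in the additive design."""
    m_same_teacher = np.mean(u_a * u_b)   # school variance + teacher variance
    m_diff_teacher = np.mean(u_a * u_c)   # school variance only
    return m_same_teacher, m_diff_teacher, m_same_teacher - m_diff_teacher

u_a, u_b, u_c = simulate_triples(50_000, sd_school=0.2, sd_teacher=0.1, sd_noise=0.3)
m_ab, m_ac, teacher_var = factor_moments(u_a, u_b, u_c)
teacher_sd = np.sqrt(max(teacher_var, 0.0))
```

The classroom-level noise drops out of both cross moments because it is independent across classrooms, which is why conditioning on many classrooms can be replaced by these pairwise moments.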

### Production Function.

There are connections between the factor model and a production function, under random assignment assumptions. To be specific, consider the college attendance outcome , and let *g* denote the production functionThe inputs and are, as above, unmeasured characteristics of the teacher and the school for classroom *A* in school *i*. There is an additional input, , which corresponds to unmeasured characteristics of the students in classroom *A*. Simplify notation by writing the function asLet denote the domain of the input arguments. We shall condition on and consider counterfactual outcomes as varies over . At any such point, is a random variable withStill conditioning on , consider counterfactual outcomes as varies over , averaging over the conditional distribution of given :There is a structural function interpretation for : within a school with , we can obtain potential expected output for various assigned values of the teacher input , holding constant the distribution of classroom characteristics (at the conditional distribution of given ).

If, within schools, students and teachers are randomly assigned to classrooms (as in Project STAR), then is independent of conditional on . In that caseIn addition, random assignment within schools implies that and are independent conditional on , so thatAs above, define the factor . We have shown that if students and teachers are randomly assigned to classrooms within schools, thenproviding a connection between this factor and the production function. A motivation for the choice of *X* in forming residuals *U* is to make this random assignment assumption more plausible.

Because the random assignment assumption is within schools, I am interested in the variation in expected output that corresponds to the variation in the teacher input within a school. A convenient summary measure isAs above, define the factor . ThenandLikewise, a convenient summary measure for cross-school variation isWith random assignment only within schools, does not have a structural interpretation.

Now suppose that data on later outcomes are not (yet) available for a teacher, but data on test scores for multiple classrooms with that teacher are available. How can we connect to the test score data? Define the factorswhere is a given function, such as . Then the linear predictor of given these factors equals the linear predictor of :This result provides a connection between the production function for and a linear predictor based on factors derived from test scores.

This linear predictor is flexible in that we can choose a variety of functions in defining the factors. This flexibility suggests finding a lower bound on the mean square error for linear prediction of from factors based on test scores. For notation, useDefineThenThis result implies the following lower bound:The second minimization is over (square-integrable) functions and . Under suitable assumptions, we can construct a sequence of functions so that as .

Note thatwhich implies that[because and are uncorrelated with for and ]. Therefore,In the empirical work, I shall focus on and onwhich has a structural role in providing a lower bound for . A convenient summary measure based on cross-school variation isWith random assignment only within schools, does not have a structural interpretation.

## Empirical Results

The work of Chetty et al. (15) is the first to measure teacher effects on later outcomes such as college attendance and earnings. They combine two databases: administrative school district records and information on those students and their parents from US tax records. The school records are for a large, urban school district, covering the school years 1988–1989 through 2008–2009 and grades 3–8. Test scores are available for English language arts and for math from spring 1989 to spring 2009. The scores are normalized within the year and grade to have a mean of 0 and SD of 1. The student records are linked to classrooms and teachers. Individual earnings data are obtained from W-2 forms, which are available from 1999 to 2010. College attendance is based on 1098-T forms, which colleges and other postsecondary institutions are required to file for reporting tuition payments and scholarships for every student.
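The within-year-and-grade normalization of scores mentioned above can be sketched as a grouped z-score. The function name `normalize_within_groups` and the group labels are hypothetical, and the sketch assumes every group contains scores that are not all identical.

```python
import numpy as np

def normalize_within_groups(scores, groups):
    """Rescale scores to mean 0 and SD 1 within each group (e.g. year-by-grade cell)."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(scores)
    for g in np.unique(groups):
        mask = groups == g
        # Assumes the group's SD is nonzero.
        out[mask] = (scores[mask] - scores[mask].mean()) / scores[mask].std()
    return out

z = normalize_within_groups(
    [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    ["1995-4", "1995-4", "1995-4", "1996-4", "1996-4", "1996-4"],
)
```

After this step, a score difference of 0.1 means one-tenth of an SD of individual student scores in the relevant year and grade, which is the unit used when teacher effects on test scores are reported.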

Chetty et al. conducted most of their analysis of long-term impacts using a dataset collapsed to class means. This dataset with class means was used to obtain the results below. is the average test score for the class. is the percent of the classroom attending college at age 20, and is the average earnings of the classroom at age 28, expressed in 2010 dollars.

I shall use (weighted) minimum mean square error linear predictors, as in Eqs. **2** and **3**. The partition in Eq. **1′** is by subject (math and reading) and grade (4–8), giving cells, with weights as in Eq. **5**. In the lower grades, students may have the same teacher for math and reading, so putting math and reading classes in separate cells helps to ensure that different classes do not have students in common. Likewise, different classes could have students in common because, for example, there is overlap between a fourth grade class in one year and a fifth grade class in the following year. We avoid this overlap by only making comparisons for classrooms within the same subject and grade.

There are 118,439 classrooms in 917 schools. Of these schools, 866 satisfy the condition that . Consider the linear predictor for college attendance in Eq. **3**:If includes only a constant , then the estimates (with SEs in parentheses) areDropping the quadratic terms, the coefficients (SEs) are 13.86 (0.38) on and 7.97 (0.31) on . I shall rely on the factor model (below) for my discussion of the magnitudes of predictive effects.

The coefficients *θ* are defined as solutions to the minimization problem in Eq. **1′**. The minimized value of the objective function provides a population value for mean square error. Likewise, there is a mean square error using just a constant to form the linear predictor . Let denote the ratio of these mean square errors, so that gives the proportional reduction in mean square error due to including a quadratic in and a quadratic in in the linear predictor for . The estimate (with SE) is .
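The proportional reduction in mean square error can be sketched as follows for a set of already selected triples: fit the weighted linear predictor with a quadratic in each of the two test score residuals, fit the constant-only predictor, and compare the weighted mean square errors. This is a simplified illustration that works with one triple per observation and hypothetical names (`weighted_mse`, `proportional_mse_reduction`); the paper's estimator averages over all admissible triples within each school.

```python
import numpy as np

def weighted_mse(y, Z, w):
    """Weighted least-squares fit of y on Z; returns (weighted MSE, coefficients)."""
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    resid = y - Z @ coef
    return np.sum(w * resid**2) / np.sum(w), coef

def proportional_mse_reduction(u_a, x_b, x_c, w):
    """1 - MSE(quadratic predictor) / MSE(constant-only predictor)."""
    ones = np.ones_like(u_a)
    Z_full = np.column_stack([ones, x_b, x_b**2, x_c, x_c**2])
    mse_full, coef = weighted_mse(u_a, Z_full, w)
    mse_const, _ = weighted_mse(u_a, ones[:, None], w)
    return 1.0 - mse_full / mse_const, coef

# Hypothetical data: outcome residual for A, test score residuals for B and C.
rng = np.random.default_rng(0)
x_b = rng.normal(size=300)
x_c = rng.normal(size=300)
w = rng.uniform(1.0, 5.0, size=300)
u_a = 0.5 * x_b + 0.3 * x_c + rng.normal(size=300)
reduction, theta_hat = proportional_mse_reduction(u_a, x_b, x_c, w)
```

The reduction lies between 0 and 1 because the constant-only predictor is nested in the quadratic one.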

Now let be the baseline control vector used by Chetty et al. It was chosen following previous work, in particular that of Kane and Staiger (6). It includes the following classroom-level variables: school year and grade indicators, class-type indicators (honors, remedial), class size, indicators for teacher experience, and cubics in class and school-grade means of lagged test scores in math and English each interacted with grade. It also includes class and school-year means of the following student characteristics: ethnicity, sex, age, lagged suspensions, lagged absences, and indicators for grade repetition, special education, and limited English. This baseline control vector giveswith . Dropping the quadratic terms, the coefficients (SEs) are 1.26 (0.21) on and 0.86 (0.16) on .

The controls matter a lot. This sensitivity relates to the difficulty in attaching causal interpretations to these results. This point has been emphasized in Rothstein (16). The issue has been addressed in Kane and Staiger (6), using a dataset with random assignment of teachers to classrooms, and in Chetty et al. (15), who look at effects based on changes in teaching staff.

These results condition on a single score for a different classroom with the same teacher and a single score for a classroom with a different teacher. I would like to have predictive effects that condition on averages over many classrooms, with and without the same teacher, and consider a limit as the number of such classrooms tends to infinity. This goal is feasible under the assumptions of the factor model. For notation, let

where denotes characteristics of the teacher of classroom *A*, and denotes characteristics of the school of classroom *A*. As in the production function discussion, I construct an index corresponding to variation in teacher inputs within a schooland use it to obtain a predictive effect in SD units:Likewise, I construct an index corresponding to variation across schoolsand use it to obtain a predictive effect in SD units:

With the baseline controls in *X*, the factor model estimates givewith predictive effectsand . An SD increase in the teacher factor based on the test score index implies a predicted increase in college attendance for each student in class *A* of 0.16 percentage points. If *X* includes only a constant, then this estimate increases from 0.16 percentage points to 5.81 percentage points.

Thus far, we used a (quadratic) function of the test score in predicting college attendance. We can also use college attendance for other classes, and the factor model provides a way to condition on averages over many classrooms, with and without the same teacher. For notation, letThen corresponds to an average of over many classrooms other than *A* that share a teacher with *A*, and corresponds to an average of over many classrooms that do not share a teacher with *A* but are in the same school. The optimal linear predictor for college attendance isWith the production function interpretationThe predictive effects in SD units are

With the baseline controls in *X*, the factor model estimates imply the predictive effectsand . An SD increase in corresponds to an increase of 0.99 percentage points for college attendance of class *A*. It is clear that basing the predictions for college attendance just on the test scores loses a great deal of information.

In parallel with the optimal linear predictor of college attendance, the optimal linear predictor for the test score isThe predictive effects areWith the baseline controls in *X*, the estimates areand . An SD increase in corresponds to a predicted increase in score for each student in class *A* of 0.087, where the score units are SDs in the distribution of scores for individual students.

Now consider using the quadratic specification in Eq. **3** to obtain a linear predictor for , the residuals corresponding to earnings at age 28:With the baseline controls in *X*, the estimates arewith . Dropping the quadratic terms, the coefficients (SEs) are 688 (269) on and 308 (176) on . These results are based on fewer classrooms, 14,236 instead of 118,439, because only some of the students reached the age of 28 by 2010. There are 524 schools, of which 364 satisfy the condition that .

For notation in the factor model, letwhere the factors are based on the test score, as in Eq. **7**. With the baseline controls in *X*, the factor model estimates givewith predictive effectsAn SD increase in the teacher factor based on the test score index implies a predicted increase in earnings of $186. This estimate, however, lacks precision. This lack of precision becomes more serious when I try to define a teacher factor based directly on the earnings data, and I shall not pursue that here.

Chetty et al. linked students to their parents by finding the earliest 1040 form from 1996 to 2010 on which the student was claimed as a dependent. They constructed an index of parent characteristics by using fitted values from a regression of test scores on mother’s age at child’s birth, indicators for parent’s 401(k) contributions and home ownership, and an indicator for the parent’s marital status interacted with a quartic in parent’s household income. A second index is constructed in the same way, using fitted values from a regression of college attendance on parent characteristics. Repeating the analysis above with these two measures of parent characteristics added to the baseline control vector gives the following predictive effects for college attendance based on test scoreswhich are somewhat lower than the results above using the baseline controls. The predictive effects for earnings areCompared with the results using the baseline controls, the teacher effect of $196 is about the same (before: $186), but the school effect of $282 is substantially lower (before: $400).

With the parent characteristics added to the baseline control vector, the predictive effects for college attendance based on the college attendance of other classes areand . The teacher effect is reduced from 0.99 to 0.79 percentage points. There are substantial reductions in the school effect and in . The predictive effects for test scores based on the test scores of other classes areand . Here the results are not affected by adding the parent characteristics.

I have repeated the analysis without using the quadratic terms, so that the linear predictor for conditions on and , dropping and . With the baseline controls in *X*, this givesTherefore, the teacher effect is still 0.16 percentage points. (The school effect is lower: 0.74 vs. 1.19 percentage points.)

Now consider a partition in Eq. **1′** just by subject (math and reading) instead of by subject and grade. There are cells with weights as in Eq. **5**. With the baseline controls in *X* and without using the quadratic terms, this givesThis partition gives a substantially higher teacher effect: 0.30 vs. 0.16 percentage points (and a lower school effect). I prefer the estimates that partition on subject and grade.

Finally, consider predictive effects in the factor model that do not partial on the school. Therefore, in predicting college attendanceWith the baseline controls in *X*, without the quadratic terms, with the partition on subject and grade, this givesThe predictive effect on college attendance of 0.51 percentage points is considerably larger than the effect based on within school variation: percentage points. I prefer the estimate of 0.16 percentage points.

## Conclusion

With the baseline controls, using the factor model, an SD increase in the teacher factor based on test scores has a predictive effect on college attendance of 0.16 percentage points. With parent characteristics added to the baseline controls, the predictive effect is 0.13 percentage points. These estimates are lower bounds on the predictive effect of an SD increase in the teacher factor (*G*_{co}) based directly on college attendance. With the baseline controls, the predictive effect for *G*_{co} on college attendance is 0.99 percentage points. The *R*^{2} estimate is 0.13, whereas basing the predictions just on test scores gives an *R*^{2} estimate of 0.01. The teacher effect of 0.99 percentage points could reflect skills that are relevant for college attendance but are not measured by the test scores. These skills could be some combination of skills students bring to the class (not captured in *X*) and skills developed during the class, in part due to the contribution of the teacher. With the parent characteristics added to the baseline controls, the predictive effect is 0.79 percentage points.

The factor model provides a predictive effect for individual test scores of a 1-SD increase in the teacher factor (*G*_{1}) based directly on test scores. This effect is 0.087, where the units are SDs in the distribution of scores for individual students. This result is not affected by adding the parent characteristics to the baseline controls. The result is consistent with the related literature (1–10), where a typical finding is that a 1-SD increase in the teacher factor corresponds to an increase in individual scores on the order of 0.1 SDs (in the distribution of scores for individual students).

## Computation

Suppose that Eq.**1** has the following form:where and are given functions. For example, and is a quadratic polynomial. Then *θ* satisfies the linear equationNow suppose that the components of have the formThis form holds if is a polynomial. In this case, the expectations in Eq. **9** require evaluating terms of the formwhere , , , and the *q*s are given functions. The sample analog for a term of this form is[with, for example, ]. The triple sum over can be simplified as
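A numerical check of why a triple sum of this kind simplifies. When the summand is a product of a function of classroom *A*, a function of classroom *B*, and a function of classroom *C*, the sum over admissible triples for teacher *t* factors into (sum of q1 times sum of q2, minus the sum of q1·q2, over that teacher's classrooms) times (the sum of q3 over all other classrooms). The selection probability depends only on the teacher, so it can be applied once per teacher. The sketch assumes the equal-probability selection described in Methods; the function names and test data are hypothetical.

```python
import numpy as np

def triple_sum_brute(q1, q2, q3, teacher):
    """Expectation of q1(U_A) q2(U_B) q3(U_C) by looping over all triples."""
    teacher = list(teacher)
    n = len(teacher)
    counts = {t: teacher.count(t) for t in set(teacher)}
    n_eligible = sum(c for c in counts.values() if c >= 2)
    total = 0.0
    for j in range(n):
        n_t = counts[teacher[j]]
        for k in range(n):
            if k == j or teacher[k] != teacher[j]:
                continue
            for m in range(n):
                if teacher[m] != teacher[j]:
                    prob = 1.0 / (n_eligible * (n_t - 1) * (n - n_t))
                    total += prob * q1[j] * q2[k] * q3[m]
    return total

def triple_sum_fast(q1, q2, q3, teacher):
    """Same quantity using per-teacher sums instead of a triple loop."""
    q1, q2, q3 = (np.asarray(q, dtype=float) for q in (q1, q2, q3))
    teacher = np.asarray(teacher)
    n = len(teacher)
    labels, counts = np.unique(teacher, return_counts=True)
    n_eligible = counts[counts >= 2].sum()
    total_q3 = q3.sum()
    out = 0.0
    for t, n_t in zip(labels, counts):
        if n_t < 2 or n_t == n:
            continue  # teacher contributes no admissible triple
        mask = teacher == t
        same_teacher = q1[mask].sum() * q2[mask].sum() - (q1[mask] * q2[mask]).sum()
        other_teachers = total_q3 - q3[mask].sum()
        out += same_teacher * other_teachers / (n_eligible * (n_t - 1) * (n - n_t))
    return out

rng = np.random.default_rng(0)
teacher = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
q1, q2, q3 = rng.normal(size=(3, len(teacher)))
```

The brute-force version costs on the order of the cube of the number of classrooms in a school, whereas the factored version needs only a few sums per teacher.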

## Acknowledgments

I thank Alberto Abadie, Moshe Buchinsky, Raj Chetty, John Friedman, Bryan Graham, Guido Imbens, Maximilian Kasy, Michael Rothschild, and Jesse Rothstein for comments and discussions. I thank Raj Chetty for assistance in implementing the methods developed here to estimate teacher effects on the administrative data constructed at the Internal Revenue Service.

## Footnotes

- ^{1}E-mail: gary_chamberlain@harvard.edu.

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected in 2011.

Author contributions: G.E.C. wrote the paper.

The author declares no conflict of interest.

## References

- (6) Kane T, Staiger D (2008) Estimating teacher impacts on student achievement: An experimental evaluation. *NBER Working Paper* (National Bureau of Economic Research, Cambridge, MA).
- (14) Billingsley P
- (15) Chetty R, Friedman J, Rockoff J (2011) The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood. *NBER Working Paper* (National Bureau of Economic Research, Cambridge, MA).

## Article Classifications

- Social Sciences
- Economic Sciences