*Subject*: Re: "Value-added" testing schemes
*From*: Sherman Dorn
*Date*: Tue, 23 Feb 1999 15:02:17 -0500

Monty Neill wrote February 9, 1999:

> Please post references to this list -- I have not seen detailed critiques of

> the specific Tenn. plan. "Generic" critiques apply -- testing all kids, that

> the test is all multiple-choice and is norm-referenced (TN uses a CTB test, I

> think a version of CTBS but maybe the CAT) and all the consequences that flow

> from these points. But it would be good to have a detailed critique, if any

> have been done. Monty

And, two weeks later, I finally get around to responding. To remind

those who may have forgotten the lapsed conversation, William Sanders is

a statistician at the University of Tennessee-Knoxville, and for the

past seven years he has had a contract with the state of Tennessee

to conduct what he calls value-added analysis of annual state test

results (the scale scores from the norm-referenced part of the state

assessments). He uses what he describes as a mixed system of modeling

different "effects" of teacher, school, and system using longitudinal

test scores of students. I'll talk about the statistics a bit later,

but for now I'll note that he asserts that, by including only prior

scores, one can effectively dismiss concerns anybody might have that

test score gains are associated with poverty, ethnicity, sex, etc. --

all one needs to know is the "teacher effect" or "school effect" and

compare it with national norms. His evidence, as far as I can gather,

is ecological (e.g., he looks at school effects -- really, school

residuals -- and graphs them against proportion white), and while the

relationship is much more muted than with raw scores, at least with the

few graphs I've seen there still is an association. (As far as I can

gather, Sanders has not published results in a peer-reviewed journal -- he

apparently has a chapter in a new Jason Millman-edited volume on

high-stakes testing, which I haven't yet read.)

As Harvey Goldstein, who has done considerable work on value-added

statistics in Britain, would say, this is no holy grail of

accountability. For many years, people have criticized the use of

scale-score test results for evaluating schools, and the argument that

we should look at gains is conceptually appealing -- so much so that

when Anthony Bryk was discussing his own version of value-added analysis

at last year's meeting of the American Educational Research Association,

three prominent policy folks -- Jennifer O'Day, Priscilla Wohlstetter,

and Susan Fuhrman -- all talked in glowing terms about the possibilities

of using his "productivity" analysis. Then Bryk returned to the podium

and explained how the data analysis seemed to "crumble" at every turn

and how tentative it was. (Afterwards, I spoke with one of the

statisticians on Bryk's staff, and he explained that they had thrown out

outliers and some other scores that were problematic for the

statistics. You can do that with research but not with official data.)

My main concern with value-added testing is not the concept of looking

at student growth -- that makes considerable sense in general. Rather,

I worry that, as in Tennessee, legislators will see it as a technocratic

miracle for accountability. In Tennessee, in particular, the details of

the statistics are obscured in the black box of "let us explain how to

interpret the results." And, having some statistical training (though

sociological and demographic, not psychometric) and a good foundation in

math, I know how much devil is in the detail.

One could point out much that's problematic with Tennessee's specific

system, and even a relatively uncritical review by R. Darrell Bock and

Richard Wolfe (A Review and Analysis of the Tennessee Value-Added

Assessment System, Nashville, Tennessee Comptroller of the Treasury,

1996) found several. Primary among them is that the confidence

intervals around estimates of school effects are too large to rank

schools with any precision. This is similar to the conclusion reached

by Harvey Goldstein, et al., in "A Multilevel Analysis of School

Examination Results," Oxford Review of Education 19 (1993), 425-433, and

is consistent with graphs I've found in a 1997 "Graphical Summary"

produced by the Tennessee value-added contract office (available on the

web at http://www.shearonforschools.com/summary/GRAPH-SUM.HTML) that

show that, with confidence intervals, one can only statistically

distinguish the very highest from the lowest ranking teachers. The

point bears repeating for anyone who wishes to rank schools based on

value-added analysis: you can't do sports-league type rankings that

have any statistical meaning.
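
To make the point about confidence intervals concrete, here is a minimal
sketch in Python. It is not Sanders' model or the Bock and Wolfe
analysis: the school names, effect estimates, and standard errors are
invented for illustration, and it assumes normal-theory 95% intervals.
Treating two schools as distinguishable only when their intervals do
not overlap is a deliberately simple (and conservative) rule, not a
formal test.

```python
from itertools import combinations

Z_95 = 1.96  # normal critical value for a two-sided 95% interval


def confidence_interval(estimate, std_error, z=Z_95):
    """Return (lower, upper) bounds of a normal-theory interval."""
    return (estimate - z * std_error, estimate + z * std_error)


def distinguishable_pairs(effects):
    """List pairs of schools whose confidence intervals do not overlap.

    effects maps a school name to (estimate, standard error).
    """
    intervals = {
        name: confidence_interval(est, se)
        for name, (est, se) in effects.items()
    }
    pairs = []
    for a, b in combinations(sorted(intervals), 2):
        lo_a, hi_a = intervals[a]
        lo_b, hi_b = intervals[b]
        if hi_a < lo_b or hi_b < lo_a:
            pairs.append((a, b))
    return pairs


# Hypothetical "school effects" (estimate, standard error), invented
# for illustration only; these are not Tennessee data.
effects = {
    "A": (6.0, 2.0),
    "B": (2.5, 2.0),
    "C": (0.5, 2.0),
    "D": (-1.0, 2.0),
    "E": (-5.0, 2.0),
}

print(distinguishable_pairs(effects))
```

With these made-up numbers, only the top school (A) and the bottom
school (E) have intervals that fail to overlap; every other pairing is
ambiguous, so a full rank ordering of the five would not be supported.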

A second problem with Sanders' activities is his devil-may-care approach

to the actual content of the test and whether a test measures anything.

He has created a private organization, Educational Value-Added

Assessment Services, which now offers his value-added assessment to

private schools (through Independent School Counsel in Atlanta). The

website I found (http://www.isc-erh.com/vaas.html) says that at the

heart of the data needed for value-added assessment is "At least three

years of annual Standardized Testing Report Data (Test version

indicated)," with the following footnote: "Any nationally normed

achievement test will suffice. Examples include California, Iowa,

Metropolitan, Stanford, and ERB." The relationship between the test and

the curriculum offered by the school is apparently irrelevant; all you

need for meaningful conclusions is a database of norm-referenced test

scores. Some of the criticisms Bock and Wolfe directed at the Tennessee

system revolved around the carelessness with which Sanders had treated

the different forms of the assessment. (I do not know if NCME ethics

guidelines on tests include guidelines on the statistical analysis of

those tests.)

Then there are the "general" criticisms which Monty Neill mentions

above: any value-added analysis that seeks to track individual children

in each year using annual testing requires a substantial investment of

time (and money) in the actual testing and the use either of scaled

scores or norm-referenced test results. (Measurement folks here can say

whether one can construct a scale from criterion-referenced test items.)

There is little research in general on the performance of children with

disabilities on these tests and nothing I am aware of on whether

value-added analysis of test results on children with disabilities has

any meaning. The most that Harvey Goldstein can say about the

accountability use of value-added assessment (which is called "school

effectiveness research" in England) is that it can point to some schools

or teachers to investigate further. As research, the statistics are

fascinating. As high-stakes testing, value-added analysis threatens to

put the statistics into a black box that no one can peer into.

Additional references:

Harvey Goldstein, "Methods in School Effectiveness Research," School

Effectiveness and School Improvement 8 (1997): 369-95.

Harvey Goldstein's web page at http://www.ioe.ac.uk/hgoldstn/ which

includes explanations of multi-level modeling and papers for downloading.

--

Yours,

Sherman Dorn

University of South Florida

http://www.coedu.usf.edu/~dorn

