
Re: "Value-added" testing schemes

Monty Neill wrote February 9, 1999:

> Please post references to this list -- I have not seen detailed critiques of
> the specific Tenn. plan. "Generic" critiques apply -- testing all kids, that
> the test is all multiple-choice and is norm-referenced (TN uses a CTB test, I
> think a version of CTBS but maybe the CAT) and all the consequences that flow
> from these points. But it would be good to have a detailed critique, if any
> have been done. Monty

And, two weeks later, I finally get around to responding. To remind
those who may have forgotten the lapsed conversation, William Sanders is
a statistician at the University of Tennessee-Knoxville, and for the
past seven years he has had a contract with the state of Tennessee
to conduct what he calls value-added analysis of annual state test
results (the scale scores from the norm-referenced part of the state
assessments). He uses what he describes as a mixed system of modeling
different "effects" of teacher, school, and system using longitudinal
test scores of students. I'll talk about the statistics a bit later,
but for now I'll note that he asserts that, by including only prior
scores, one can effectively dismiss concerns anybody might have that
test score gains are associated with poverty, ethnicity, sex, etc. --
all one needs to know is the "teacher effect" or "school effect" and
compare it with national norms. His evidence, as far as I can gather,
is ecological (e.g., he looks at school effects -- really, school
residuals -- and graphs them against the proportion of white students),
and while the relationship is much more muted than with raw scores, at
least in the few graphs I've seen there is still an association. (As far as I can
gather, Sanders has not published results in a peer-reviewed journal -- he
apparently has a chapter in a new Jason Millman-edited volume on
high-stakes testing, which I haven't yet read.)
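To make the mechanics concrete, here is a minimal sketch in Python of the basic idea behind such "teacher effect" estimates. This is invented data and a deliberately simplified model, not Sanders' actual system: a teacher's effect is her students' mean gain, expressed as a deviation from the overall mean and shrunk toward zero in proportion to its unreliability (the empirical-Bayes shrinkage a mixed model performs).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 teachers, 25 students each, scale scores in two years.
n_teachers, n_students = 20, 25
true_effect = rng.normal(0.0, 2.0, n_teachers)           # teacher "effect" on gains
prior = rng.normal(500.0, 30.0, (n_teachers, n_students))
gain = 10.0 + true_effect[:, None] + rng.normal(0.0, 12.0, (n_teachers, n_students))
current = prior + gain

# Value-added in its simplest form: each teacher's mean student gain,
# as a deviation from the overall mean gain.
gains = current - prior
raw_effect = gains.mean(axis=1) - gains.mean()

# A mixed model shrinks these noisy raw means toward zero in proportion
# to their sampling error (method-of-moments empirical-Bayes shrinkage).
var_within = gains.var(axis=1, ddof=1).mean() / n_students
var_between = max(raw_effect.var(ddof=1) - var_within, 0.0)
shrinkage = var_between / (var_between + var_within)
shrunk_effect = shrinkage * raw_effect
```

Note that nothing in this computation looks at what the test measures or whom the students are; that silence is exactly where the criticisms below enter.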

As Harvey Goldstein, who has done considerable work on value-added
statistics in Britain, would say, this is no holy grail of
accountability. For many years, people have criticized the use of
scale-score test results for evaluating schools, and the argument that
we should look at gains is conceptually appealing -- so much so that
when Anthony Bryk was discussing his own version of value-added analysis
at last year's meeting of the American Educational Research Association,
three prominent policy folks -- Jennifer O'Day, Priscilla Wohlstetter,
and Susan Fuhrman -- all talked in glowing terms about the possibilities
of using his "productivity" analysis. Then Bryk returned to the podium
and explained how the data analysis seemed to "crumble" at every turn
and how tentative it was. (Afterwards, I spoke with one of the
statisticians on Bryk's staff, and he explained that they had thrown out
outliers and some other scores that were problematic for the
statistics. You can do that with research but not with official data.)

My main concern with value-added testing is not the concept of looking
at student growth -- that makes considerable sense in general. Rather,
I worry that, as in Tennessee, legislators will see it as a technocratic
miracle for accountability. In Tennessee, in particular, the details of
the statistics are obscured in the black box of "let us explain how to
interpret the results." And, having some statistical training (though
sociological and demographic, not psychometric) and a good foundation in
math, I know how much devil is in the detail.

One could point out much that's problematic with Tennessee's specific
system, and even a relatively uncritical review by R. Darrell Bock and
Richard Wolfe (A Review and Analysis of the Tennessee Value-Added
Assessment System, Nashville, Tennessee Comptroller of the Treasury,
1996) found several. Primary among them is that the confidence
intervals around estimates of school effects are too large to rank
schools with any precision. This is similar to the conclusion reached
by Harvey Goldstein, et al., in "A Multilevel Analysis of School
Examination Results," Oxford Review of Education 19 (1993), 425-433, and
is consistent with graphs I've found in a 1997 "Graphical Summary"
produced by the Tennessee value-added contract office (available on the
web at http://www.shearonforschools.com/summary/GRAPH-SUM.HTML) that
show that, with confidence intervals, one can only statistically
distinguish the very highest-ranking teachers from the very lowest. The
point bears repeating for anyone who wishes to rank schools based on
value-added analysis: you can't do sports-league type rankings that
have any statistical meaning.
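A toy simulation makes the point about rankings. The numbers below are invented, not Tennessee's, but the setup mirrors the problem: when the sampling error of each school's estimate is comparable to the spread of the true effects, the 95% confidence intervals of adjacently ranked schools overlap almost everywhere, and only the extremes can be told apart.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical school effects: 50 schools whose true effects differ
# modestly, each estimated with sampling error comparable in size to
# the spread of the effects themselves.
n_schools = 50
true_effect = rng.normal(0.0, 1.0, n_schools)
se = 1.0                                    # standard error of each estimate
estimate = true_effect + rng.normal(0.0, se, n_schools)

# 95% confidence intervals around each estimate.
lo, hi = estimate - 1.96 * se, estimate + 1.96 * se

# Rank schools by estimate, then count how many adjacent pairs in the
# ranking can actually be distinguished (non-overlapping intervals).
order = np.argsort(estimate)
distinct = int(np.sum(lo[order][1:] > hi[order][:-1]))
print(f"{distinct} of {n_schools - 1} adjacent pairs are statistically distinct")

# Only the extremes separate: compare the top school with the bottom one.
top, bottom = order[-1], order[0]
print("top vs. bottom distinct:", lo[top] > hi[bottom])
```

In runs like this, essentially no adjacent pair separates; a published league table of the 50 schools would be ordering noise.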

A second problem with Sanders' activities is his devil-may-care approach
to the actual content of the test and whether a test measures anything.
He has created a private organization, Educational Value-Added
Assessment Services, which now offers his value-added assessment to
private schools (through Independent School Counsel in Atlanta). The
website I found (http://www.isc-erh.com/vaas.html) says that at the
heart of the data needed for value-added assessment is "At least three
years of annual Standardized Testing Report Data (Test version
indicated)," with the following footnote: "Any nationally normed
achievement test will suffice. Examples include California, Iowa,
Metropolitan, Stanford, and ERB." The relationship between the test and
the curriculum offered by the school is apparently irrelevant; all you
need for meaningful conclusions is a database of norm-referenced test
scores. Some of the criticisms Bock and Wolfe directed at the Tennessee
system revolved around the carelessness with which Sanders had treated
the different forms of the assessment. (I do not know whether the NCME
ethics guidelines on tests include guidance on the statistical analysis
of those tests.)

Then there are the "general" criticisms which Monty Neill mentions
above: any value-added analysis that seeks to track individual children
in each year using annual testing requires a substantial investment of
time (and money) in the actual testing and the use of either scaled
scores or norm-referenced test results. (Measurement folks here can say
whether one can construct a scale from criterion-referenced test items.)
There is little research in general on the performance of children with
disabilities on these tests, and nothing I am aware of on whether
value-added analysis of test results for children with disabilities has
any meaning. The most that Harvey Goldstein can say about the
accountability use of value-added assessment (which is called "school
effectiveness research" in England) is that it can point to some schools
or teachers to investigate further. As research, the statistics are
fascinating. As high-stakes testing, value-added analysis threatens to
put the statistics into a black box that no one can peer into.

Additional references:

Harvey Goldstein, "Methods in School Effectiveness Research," School
Effectiveness and School Improvement 8 (1997): 369-95.

Harvey Goldstein's web page at http://www.ioe.ac.uk/hgoldstn/, which
includes explanations of multi-level modeling and papers for downloading.


Sherman Dorn
University of South Florida

To unsubscribe from the ARN-L list, send command SIGNOFF ARN-L