“While graduation decisions were not a consideration when the NECAP program was designed, the NECAP instruments are general achievement measures that are reliable at the student level”
First of all, it is interesting to speculate why such a letter would be sent at this particular time, well after setting the policy requiring the use of NECAP for graduation decisions. I speculate that the letter was requested to reassure a restive Board of Regents, but that is just my guess.
Still, if this is intended as reassurance from Measured Progress, it can only be read as tepid. First, the letter acknowledges that the NECAP was never designed to measure the learning of individual students. It was, instead, designed as a general achievement measure. Unspoken is the reality that, if the NECAP had been designed to measure the learning of individual students, it would have been designed much differently. But that question, which drags in issues of test validity, was not asked and was not addressed.
There is not a word about test validity in the letter. That is, there is no claim that the test provides information that predicts “college and career” readiness any better than a large number of other contending measures: grades, recommendations, work or leadership experience, portfolios, senior projects, or socio-economic background.
Actually, test scores track socio-economic background so closely that it would be difficult to do a good job of distinguishing the two in a validity study.
So, there is no claim in the letter that the test is more useful than information that is already available. But there is the important claim that the test is reliable at the student level. And, after all, it is the reliability of the NECAP score that contributes so much to its attraction: the simplicity of reducing a complex history of learning into two numbers, one for reading and one for math. After all, what could be more objective than a single number? Like the current balance of a bank account, this number tells us how much reading and math the student knows.
But the test score number is not like the current balance of a bank account, which is an exact number. Instead, it is an estimate of how much a student knows. Part of the test score is what the student really knows–the true score–and part of the test score is the mistakes the student makes–getting something wrong that he/she really knows, or getting something right that he/she really does not know. These mistakes create error in the test score: the more error in the test score, the less reliable it is.
When testing companies like Measured Progress talk about reliability, they talk about the reliability of the test. They mean that, using different analytical techniques, they can tell how much measurement error the test contributes to the score of a student.
Using a camera as an analogy, this is like telling someone how much the lens distorts a picture. In photography, where the subject doesn’t contribute distortion to the picture, this is all you need to know. If, to pick a number, the test is reliable at the .85 level for students, that means that .15, or 15%, of the score variance is error.
The usual way to deal with the error is to turn it into an error band around the reliable portion of the score. Thus, when RIDE creates a cut-score for graduation, it puts an error band around it and takes the score at the bottom of the error band as the cut-score. Voila, fair and true cut scores!
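The arithmetic behind such an error band is straightforward. A minimal sketch, using the conventional standard-error-of-measurement formula; the numbers here (reliability, score spread, cut-score) are illustrative placeholders, not RIDE's or Measured Progress's actual values:

```python
import math

# Illustrative numbers only -- not the actual NECAP parameters.
reliability = 0.85      # test reliability as reported by the vendor
sd = 10.0               # standard deviation of the scale scores
nominal_cut = 50.0      # the cut-score before any adjustment

# Standard error of measurement: SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)

# A one-SEM band around the nominal cut; taking the bottom of the
# band as the operative cut-score gives borderline students the
# benefit of the doubt.
lower_cut = nominal_cut - sem

print(round(sem, 2))        # -> 3.87
print(round(lower_cut, 2))  # -> 46.13
```

Note that the band is only as trustworthy as the reliability figure fed into it: if reliability is overstated, the SEM is understated, and the band is too narrow.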
But in testing, the person tested has long been acknowledged as a source of distortion, or variation, or measurement error (see Thorndike, 1951). Beyond the test itself, the person tested contributes random variation based on “health, motivation, mental efficiency, concentration, forgetfulness, carelessness, subjectivity or impulsiveness in response and luck in random guessing”.
If you ask teachers, parents, or anyone else who actually knows students, one of the first things they bring up is how differently students behave from day to day. They worry about whether a student will have a good day or a bad day when they take the NECAP. They assert as commonplace knowledge that the same student can get very different scores on the same test on different days. This kind of variation is called test-retest error.
Yet there is no reporting on this source of measurement error in the NECAP Technical Report. Partly, this is because getting test-retest reliability entails serious logistical problems—large numbers of students need to take parallel forms of a test in a relatively short period of time. It’s difficult and prohibitively expensive.
But recent improvements in techniques for analyzing tests (Boyd, Lankford & Loeb, 2012) have changed this and, all of a sudden, we can begin to understand the reliability of students when they take “general achievement measures”, i.e., standardized achievement tests.
To return to our camera analogy, in addition to understanding how much distortion the lens produces, we can now begin to understand how much distortion the subject of the photograph causes. Now, instead of one layer of error, we have two layers, and they compound as multipliers. If, for example, the lens is .85, or 85%, reliable, and the subject is also .85, or 85%, reliable, the total reliability is .85 × .85, or .72.
Reliability of .72 means that more than a quarter of the score (28%) is error. In other words, taking the student into account, the test is a lot less reliable than we thought it was when we only took the test into account. As the authors cited above report:
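The compounding described above is simple multiplication. A quick sketch, reusing the illustrative .85 figures from the analogy (not vendor-reported values):

```python
# Illustrative figures, not vendor-reported values.
test_reliability = 0.85      # error contributed by the instrument itself
person_reliability = 0.85    # day-to-day variation in the test-taker

# The two reliability layers compound as multipliers.
total_reliability = test_reliability * person_reliability
error_share = 1 - total_reliability

print(round(total_reliability, 2))  # -> 0.72
print(round(error_share, 2))        # -> 0.28
```

The point of the arithmetic: a test that looks 85% reliable when you examine only the instrument can drop to roughly 72% once the test-taker's own variability enters the picture.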
“we estimate the overall extent of test measurement error is at least twice as large as that reported by the test vendor…”
The test referred to by the authors–developed by CTB/McGraw-Hill–is very similar to the NECAP.
All of this casts stronger doubt on the wisdom of making the NECAP a graduation requirement. Not only is the NECAP flawed in the several ways discussed in this column before–it discourages students, victimizes the weaker students in the system, constricts curriculum, and degrades teaching and learning–but one of its chief virtues, its reliability, is seriously oversold.
Overestimating test reliability is bad enough for a student graduation requirement, but we should also consider the impact on the whole accountability structure. Teacher evaluations are based not on just one student test but several, so increases in unreliability put the evaluation system in doubt. The same goes for accountability measures attached to schools: the measures defining Priority Schools, school progress, and gap closing, to name a few. The whole house of cards is now exposed to a stiff breeze.