National Assessment: Annex 1
1. Measurement error and the problems with
overlaying levels onto marks
This does not refer to human error or mistakes
in the administration of tests but to the issue of intrinsic measurement
error. Contemporary standards in the US lead to the expectation
that error estimates are printed alongside individual results:
such as " . . . this person has 3,592 (on a scale going to
5,000 marks) and the error on this test occasion means that their
true score lies between 3,582 and 3,602 . . . ". Is this too
difficult for people to handle (ie interpret)? In the current
climate of increasing statistical literacy in schools, it should
not be. Indeed, results could be presented in many innovative
ways which better convey where the "true score" of someone
is likely to lie.
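To make this concrete, the following minimal sketch (in Python, with the function name and report wording invented for illustration) formats a result together with its error band; the assumed standard error of 10 scale points reproduces the 3,582-3,602 band in the example above.

```python
def report_with_error(score, sem=10):
    """Format a scaled score with the band in which the true score
    is likely to lie (one standard error of measurement either side)."""
    return (f"Scaled score: {score:,} out of 5,000. Allowing for "
            f"measurement error, the true score is likely to lie "
            f"between {score - sem:,} and {score + sem:,}.")

print(report_with_error(3592))
```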
Error data are not provided for national tests
in England, and both the Statistics Commission and commentators
(eg Wiliam, Newton, Oates, Tymms) have raised questions as to
why this international best practice is not adopted.
Of course, error can be reduced by changing
the assessment processes, which most often results in a dramatic
increase in costs. Note "reduce", not "remove": the
latter is unfeasible in mass systems. For example, double marking
might be adopted and would increase the technical robustness of
the assessments. However, it is impractical in respect of timeframes,
and it is already difficult to maintain existing numbers of markers.
Error can be reduced by increased expenditure, but is escalating
cost appropriate in the current public sector policy climate?
One key point to bear in mind is that the error must
be significantly less than the
performance gains which one is expecting from the system,
from schools, and indeed from teachers within the schools.
Unfortunately, a 1-2% improvement lies within the bounds of error: get
the level thresholds wrong by two marks either way (see the
section on KS3 Science below) and the results of 16,000 pupils
(ie just over 2% of the cohort) could be moved.
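The arithmetic can be sketched as follows; the cohort size and score distribution below are assumed purely for illustration and are not official data, but they show how moving a threshold by two marks either way shifts a non-trivial percentage of a cohort across a level boundary.

```python
import random

random.seed(0)
COHORT = 600_000  # assumed cohort size, illustration only
# Assumed marks out of 120, roughly bell-shaped around 60
scores = [min(120, max(0, round(random.gauss(60, 18)))) for _ in range(COHORT)]

def pct_at_or_above(cut):
    """Percentage of the cohort at or above a given cut score."""
    return 100 * sum(s >= cut for s in scores) / COHORT

for cut in (46, 48, 50):  # threshold moved two marks either way
    print(f"cut score {cut}: {pct_at_or_above(cut):.1f}% reach the level")
```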
Measurement error becomes highly significant
when national curriculum levels (or any other grade scale) are
overlaid onto the scores. If the error is as above, but a cut
score for a crucial level is 48 (out of 120 total available marks)
then getting 47 (error range 45-49) would not qualify that person
for the higher level, even though the error means that their true
score could easily be above the level threshold. In some cases
the tests are not long enough to provide information to justify
choosing cut scores between adjacent marks, even though the difference
between adjacent marks can have a significant effect on the percentage
of the cohort achieving particular levels. There are problems
with misclassification when levels are applied. Wiliam reports that
"it is likely that the proportion of students awarded a level
higher or lower than they should be because of the unreliability
of the tests is at least 30% at Key Stage 2 and may be as high
as 40% at Key Stage 3".
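The mechanism behind such misclassification can be illustrated with a minimal Monte Carlo sketch, assuming (hypothetically) that the measurement error behaves like a normal distribution with a standard deviation of two marks, around the cut score of 48 used above.

```python
import random

random.seed(0)
CUT, SD, TRIALS = 48, 2.0, 100_000  # assumed cut score and error SD

def misclassification_rate(true_score):
    """Share of test occasions on which the observed score falls on
    the wrong side of the cut relative to the true score."""
    wrong = 0
    for _ in range(TRIALS):
        observed = random.gauss(true_score, SD)
        if (observed >= CUT) != (true_score >= CUT):
            wrong += 1
    return wrong / TRIALS

for true in (46, 47, 48, 49, 50):
    rate = misclassification_rate(true)
    print(f"true score {true}: wrong level {rate:.0%} of the time")
```

Under these assumptions, pupils whose true score sits within a mark or two of the threshold are awarded the wrong level on roughly a third of test occasions.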
Criterion referencing fails to work well since
question difficulty is not solely determined by curriculum content.
It can also be affected by "process difficulty" and/or
"question or stimulus difficulty" (Pollitt et al).
It is also difficult to allocate curriculum content to levels, since questions
testing the same content can cover a wide range of difficulty.
It is believed that error could be communicated
meaningfully to schools, children, parents and the press, and
would enhance both the intelligence available to ministers and the educational
use of the data from national tests.
The current practice of overlaying levels onto
the scores brings serious problems and it is clear that the use
of levels should be reviewed. One key issue: consider three
children, A, B and C. Both Child B and Child C are at Level 5, but in
fact Children A and B are closer in performance, despite A being at
Level 4 and B at Level 5. Further, if Child A progresses to
the position of Child B over a period of learning, they have increased
by one level. However, if Child B progresses to the same position
as Child C, they have progressed further than Child A over the
same time, but they do not move up a level. Introducing sub-levels
has helped in some ways (4a, 4b etc) but the essential problem remains.
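A hypothetical mark-to-level mapping (thresholds and marks invented for illustration, echoing the cut score of 48 used earlier) makes the anomaly concrete:

```python
def level(mark):
    # Hypothetical level thresholds, invented for illustration
    if mark >= 48:
        return "Level 5"
    if mark >= 30:
        return "Level 4"
    return "below Level 4"

marks = {"Child A": 47, "Child B": 49, "Child C": 71}  # assumed marks
for child, mark in marks.items():
    print(f"{child}: {mark} marks -> {level(mark)}")

# A to B is a 2-mark gain yet counts as a whole level; B to C is a
# 22-mark gain yet produces no change of level at all.
```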
2. QCA practice in test development
Pretesting items is extremely helpful; it enables
the performance characteristics of each item to be established
(particularly how relatively hard or easy the item is). This is
vital when going into the summer level setting exercise: it
is known what is being dealt with in setting the mark thresholds
at each level. But subject officers and others involved in the
management of the tests have had a history of continuing to change
items after the second pretest, which compromises the data available
to the level setting process, and thus impacts on maintaining
standards over time. In addition, the "pretest effect"
also remains in evidence: learners are not necessarily as
motivated when taking "non-live" tests, and they may not
be adequately prepared for the specific content of the tests.
This places a limit on the pretest as a predictor
of the performance of the live test.
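By way of illustration, the most basic statistic a pretest yields is the facility of each item, ie the proportion of pupils answering it correctly; the response matrix below is invented, and real item analysis is of course far richer.

```python
# Pretest responses: pupil -> right (1) / wrong (0) for each item
responses = {  # assumed data, for illustration only
    "pupil_1": [1, 1, 0, 1],
    "pupil_2": [1, 0, 0, 1],
    "pupil_3": [0, 1, 0, 1],
}

n_items = len(next(iter(responses.values())))
for item in range(n_items):
    facility = sum(r[item] for r in responses.values()) / len(responses)
    print(f"item {item + 1}: facility {facility:.2f}")  # lower = harder
```

It is precisely this per-item evidence that is compromised when items are changed after the second pretest.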
3. Borderlining
The decision was taken early in national assessment
to re-mark all candidates who fall near to a level threshold.
QCA publish the mark range which qualifies children for a re-mark.
However, the procedure has been applied only to those below the
threshold and who might move up, and not to those just above,
who might move down. This has had a very distorting effect on
the distributions. Although done in the name of fairness, the
practice is seriously flawed. For years, arguments around changing
the procedure or removing borderlining completely foundered on
the fact that this would effect a major (downward) shift in the
numbers gaining each level, and therefore could not be sanctioned
politically. A poorly-designed and distorting practice therefore
continued. Current practice is unjustifiable and would not be
sanctioned in other areas of public awarding (eg GCSE and A/AS).
It has now been agreed between QCA and the DfES
that borderlining will be removed in 2008, when the marking contract
passes from Pearson to the American ETS organization. At this
point, a recalibration of standards could be effected to mask
the effect of correction and this standard could be carried forward,
or a clear declaration could be made on how removal of borderlining
affects the "fairness" of the test and has resulted
in a change in the numbers attaining a given level. First identified
by Quinlan and Scharaskin in 1999, this issue has been a long-running
systemic problem. Again, it is a change in practice (alongside
changes in the accessibility of the tests, the inclusion of mental
arithmetic, etc) which compromises the ability of the tests to
track change in attainment standards over time.
4. Fluctuations in Science at KS3
At Levels 6 and above, standards of attainment
have moved up and down in an implausible fashion:
[Table: % of children gaining Levels 6 and 7, 2002-04]
The movement over the three-year period 2002-04 has involved
a 6% increase followed by a 5% decrease: a movement of 11%
over two years. This is implausible, and points to problems in
the tests and level setting, and not to a real change in underlying
standards or in the cohort taking the tests. Significantly, when
interviewed on causes, officials and officers gave very different
explanations for the effect; in other words, the true cause
has not been established with precision.
5. The Massey Report and Tymms' analysis
The Massey report used a highly robust method to triangulate
national tests from 1996 to 2001, and yields solid evidence that attainment
standards rose over that period, but not to the extent in
all subjects and all key stages that has been argued by DfES and
ministers. Tymms' less robust method and research synthesis suggest
broadly the same. Massey made a series of recommendations, some
of which have been adopted by QCA, such as equating a number of
years' tests and not just the preceding year. However, a
consistent triangulation method is still lacking, and the
Massey recommendation that standards should be held for five
years and then publicly recalibrated has not been adopted.
6. Ofsted's over-dependence on national test outcomes
The new Ofsted inspection regime is far more dependent on
the use of national assessment data than previously. This delivers
putative economies since Ofsted feels it can better identify problematic
and successful schools, and can use the data to target areas of
schools, eg weak maths departments or poor science departments.
The revised regime is broadly welcomed by schools, and has a sound
emphasis on each school delivering on its stated policies. But
the regime fails to acknowledge the weaknesses of the data which
lies at the heart of the pre-inspection reports and which guides
Ofsted on the performance of schools. The greatly increased structural
dependence on data which is far less accurate than is implied
is problematic. The new regime delivers some valuable functions, but
the misapprehension of the real technical rigour of the assessment
data is a very serious flaw in arrangements.
7. Assessment overload accusations whilst using many other tests
This is an interesting phenomenonthe optional tests
are liked, the statutory tests are frequently disliked (QCA).
KS2 scores are mistrusted (ATL). The use of "commercial"
CAT tests and CEM's tests (MIDYIS etc) is widespread. CAT scores
are trusted by teachers because the results are more stable over
time in comparison with national curriculum tests; this reflects
the different purpose of the respective instruments. Children
say "I did SATs today" when they do a statutory Key
Stage test. They also frequently say that when they have taken
a CAT test. There is widespread misunderstanding of the purpose
of the range of tests which are used. QCA was lobbied over a five-year
period to produce guidance on the function of different tests, not
least to clarify the exact purpose of national testing. However,
no such guidance has been produced. As a result of this, the arguments
regarding "over-testing" are extremely confused, and
muddy the waters in respect of policy.
8. Is the timing right?
Changing the timing of the tests would require a change in
primary legislation. However, it is an enhancement of testing
which should be considered very seriously. In the final report
of the Assessment Review Group in Wales, Daugherty (2004) recommends
that "serious consideration should be given to changing the
timing of Key Stage 3 statutory assessment so that it is completed
no later than the middle of the second term of Year 9". The
Group believed the current timing to be unhelpful in relation
to a process that could, in principle, inform choices, and noted that "one
source of information that would be of use potentially to pupils
and their parents is not available until after the choice of pathway
for Year 10 and beyond has been made". There are also implications
for the potential use of Key Stage 1 and 2 data for transition
between phases. "School ownership" (taking the
outcomes very seriously in managing learning) would be likely
to increase with this re-scheduling of the tests.
9. The reliability of teacher assessment
Particularly in the high stakes context of performance tables,
we feel that relying on teacher assessment, as currently operated,
is not a robust option. Work in 2000 by the QCA Research Team showed
a completely unstable relationship between TA and test scores
over time at school level. This is compelling evidence against
an over-dependence on teacher assessment. There are means of delivering
moderated teacher assessment for reporting to parents, and bolstering
accountability not by testing but by regional inspection based
on high expectations and school improvement models (see recommendations
below). National standards in underlying attainment could be monitored
through a light sampling model (with matrix sampling to cover
all key content of the national curriculum). This would enable
a valid answer to the ministerial question " . . . nationally,
what's happening to standards in English?".
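A minimal sketch of matrix sampling under stated assumptions (the item bank, booklet design and sample size below are invented for illustration): each sampled pupil sits only one short booklet, yet across the sample every curriculum item is covered, so national attainment can be estimated without testing every child on everything.

```python
ITEMS = [f"item_{i:02d}" for i in range(30)]  # assumed item bank
BOOKLETS = [ITEMS[i::3] for i in range(3)]    # 3 short booklets, 10 items each
pupils = [f"pupil_{p}" for p in range(12)]    # assumed light national sample

# Rotate booklets across sampled pupils so every item is taken by someone
assignment = {pupil: BOOKLETS[i % len(BOOKLETS)]
              for i, pupil in enumerate(pupils)}

covered = set().union(*assignment.values())
print(f"each pupil sits {len(BOOKLETS[0])} items; "
      f"the sample covers {len(covered)} of {len(ITEMS)} items")
```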
10. Teaching to the test
The lobbying by Baroness Susan Greenfield
and eminent colleagues is merely the most recent critique of the
problems of teaching to the test. The "Texas Test Effect"
(Wiliam, Oates) is well known but poorly presented to Government.
Work by Bill Boyle (CFAS) provides the latest empirical study of the adverse
effects of teaching to the test and its almost universal domination
of educational purposes in the English school system. It is a
very serious issue, and it may be one significant factor (not
the sole one) lying behind the "plateau effect" associated
with the majority of innovations such as the Primary Literacy
and Numeracy Strategies. In other words, a succession of
well-intended and seemingly robust initiatives repeatedly runs
out of steam.