RELIABILITY
51. Professors Black, Gardner and Wiliam argued that
the reliability of national tests and testing systems is limited;
that the results of such systems are misused; and that the effects
of such misuse would be reduced if test developers were required
to inform the public of the margins of error inherent in these
testing systems. They stressed that limited reliability of testing
systems is systemic and inevitable and does not imply lack of
competence or professionalism on the part of test developers.[66]
The results of any assessment system are subject to measurement
error because they are based on a limited sample of a candidate's
attainment. In order that the testing system should be manageable
and affordable, only a limited number of questions can be set,
to be answered in a limited time and on a given day. Variations
in results for a given candidate will arise out of the particular
topics and skills tested in the particular test instrument and
out of the performance of the candidate on the day. Other evidence
has suggested that children aged ten or eleven exhibit increased
tension and stress when facing a week of examinations in which
they are expected to demonstrate "the full extent of their
learning from seven years of education".[67]
This may affect examination performance. Black et al stated that
the 'true score' of a candidate can never be known because it
is practically impossible to test more than a limited sample of
his or her abilities.[68]
Indeed, their evidence was that up to 30% of candidates in any
public examination in the UK will receive the wrong level or grade,
a statistical estimate which has also been quoted by others in
evidence.[69] Dr Boston
of the QCA accepted that error in the system exists, but said
he was surprised by a figure as high as 30%.[70]
Jon Coles, Director of 14-19 Reform at the DCSF, told us that:
[…] I simply do not accept that there is
anything approaching that degree of error in the grading of qualifications,
such as GCSEs and A-levels. The OECD has examined the matter at
some length and has concluded that we have the most carefully
and appropriately regulated exam system in the world.[71]
[…] I can say to you without a shadow of
a doubt - I am absolutely convinced - that there is nothing
like a 30% error rate in GCSEs and A-levels.[72]
52. We suspect that the strength of this denial stemmed
from a misunderstanding of the argument made by Black et al.
Their argument assumes that tests are competently developed and
that marking errors are minimal.[73]
The inherent unreliability of the tests stems from the limited
knowledge and skills tested by the assessment instrument and variations
in individuals' performance on the day of the test.[74]
This does not impugn the work of the regulator or the test development
agencies and very little can be done to enhance reliability whilst
maintaining a manageable and affordable system. The NFER gave
similar evidence that the current Key Stage tests:
[…] have good to high levels of internal
consistency (a measure of reliability) and parallel form reliability
(the correlation between two tests). Some aspects are less reliable,
such as the marking of writing, where there are many appeals/reviews.
However, even here the levels of marker reliability are as high
as those achieved in any other written tests where extended writing
is judged by human (or computer) graders. The reliability of the
writing tests could be increased but only by reducing their validity.
This type of trade off is common in assessment systems with validity,
reliability and manageability all in tension.[75]
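The scale of error being debated here can be made concrete with a simple simulation. The sketch below is purely illustrative: the mark scale, the reliability of 0.9 and the level boundaries are assumed values, not figures from the evidence. It shows how two sittings of competently constructed parallel tests can correlate highly and yet still place a substantial minority of candidates on different sides of a level boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions only (not figures from the evidence):
# a 100-mark scale, true attainment normally distributed, a test
# reliability of 0.9, and level boundaries at 45, 60 and 75 marks.
n_pupils = 100_000
true_score = rng.normal(60, 12, n_pupils)
reliability = 0.9
error_sd = 12 * np.sqrt(1 / reliability - 1)   # error SD implied by that reliability

def sit_test(true):
    """One sitting: the true score plus random error from topic sampling and the day."""
    return true + rng.normal(0, error_sd, true.shape)

def level(score, boundaries=(45, 60, 75)):
    """Convert a mark into a level using illustrative cut-off scores."""
    return np.searchsorted(boundaries, score)

test_a = sit_test(true_score)
test_b = sit_test(true_score)

# Parallel-form reliability: the correlation between two equivalent sittings.
print("parallel-form correlation:", round(float(np.corrcoef(test_a, test_b)[0, 1]), 2))

# How often the awarded level differs between sittings, and from the 'true' level.
print("level differs between sittings:", round(float((level(test_a) != level(test_b)).mean()), 2))
print("level differs from true level: ", round(float((level(test_a) != level(true_score)).mean()), 2))
```

With these assumed values, the simulated proportion of candidates whose level changes between sittings is of the same order as the figure quoted by Black et al, which illustrates their point that such error reflects the limits of any short test rather than poor test construction or marking.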
53. Black et al identify a number of ways in which reliability
could theoretically be enhanced:
- Narrowing the range of question
types, topics and skills tested; but the result would be less
valid, and misleading in the sense that users of that information
would have only a very limited estimate of the candidates' attainments.
- Increasing the testing time to augment the sample
of topics and skills tested; however, reliability increases only
marginally with test length (a diminishing return sketched after this list).[76]
For example, to reduce the proportion of pupils wrongly classified
in a Key Stage 2 test to within 10%, it is estimated that 30 hours
of testing would be required. (The NFER expressed the view that
the present tests provide as reliable a measurement of individuals
as is possible in a limited amount of testing time.[77])
- Collating and using information that teachers
have about their pupils. Teachers have evidence of performance
on a range of tasks, in many different topics and skills and on
many different occasions.
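The diminishing return from longer tests can be illustrated with the classical Spearman-Brown relationship. The sketch below assumes a starting reliability of 0.85; this is an illustrative value rather than a figure from the evidence.

```python
# Spearman-Brown prophecy formula: if a test of unit length has reliability r,
# an otherwise comparable test k times as long has reliability k*r / (1 + (k-1)*r).
# The starting reliability of 0.85 is an assumption for illustration only.
def spearman_brown(r: float, k: float) -> float:
    return k * r / (1 + (k - 1) * r)

for k in (1, 2, 4, 10, 30):
    print(f"test length x{k:>2}: reliability about {spearman_brown(0.85, k):.3f}")
```

Doubling the test under these assumptions lifts reliability only from 0.85 to about 0.92, and even a thirty-fold increase in testing time leaves residual error.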
54. Black et al conclude this part of their argument
by stating that, when results for a group of pupils are aggregated,
the result for the group will be closer to the 'true score' because
random errors for individuals - which may result in either
higher or lower scores than their individual 'true score' - will
average out to a certain extent.[78]
The NFER went further, stating that aggregated results over large
groups such as reasonably large classes and schools give an "extremely
high" level of reliability at the school level.[79]
Nevertheless, Black et al argue that not enough is known about
the margins of error in the national testing system. Professor
Black wrote to the QCA to enquire whether there was any research
on reliability of the tests which it develops:
The reply was that "there is little research
into this aspect of the examining process", and [the QCA]
drew attention only to the use of borderline reviews and to the
reviews arising from the appeals system. We cannot see how these
procedures can be of defensible scope if the range of the probable
error is not known, and the evidence suggests that if it were
known the volume of reviews needed would be insupportable.[80]
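The "range of the probable error" referred to here has a standard expression in classical test theory. The formulae below are generic, not estimates for the Key Stage tests; they show only how a margin of error for an individual result could be stated, and why, as noted in paragraph 54, aggregation over a group shrinks that margin.

```latex
% Classical test theory sketch: observed score X = true score T + error E.
% With test reliability \rho and observed-score standard deviation \sigma_X,
\[
  \mathrm{SEM} = \sigma_X \sqrt{1 - \rho}, \qquad
  X \pm 1.96\,\mathrm{SEM} \quad \text{(approximate 95\% band for one candidate)}
\]
% and, for the mean of a group of n pupils with broadly independent errors,
\[
  \mathrm{SE}(\bar{X}) \approx \frac{\mathrm{SEM}}{\sqrt{n}},
\]
% which is why aggregated class- or school-level results are far more
% reliable than any individual result.
```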
55. Black et al go on to argue that it is profoundly
unsatisfactory that a measure of the error inherent in our testing
system is not available, since important decisions are made on
the basis of test results, decisions which will be ill-judged
if it is assumed that these measures are without error. In particular,
they argue that current policy is based on the idea that test
results are reliable and teachers' assessments are unreliable.
They consider that reliability could, in fact, be considerably
enhanced by combining the two effectively and that work leading
in this direction should be prioritised.[81]
Black et al conclude that:
[…] the above is not an argument against
the use of formal tests. It is an argument that they should be
used with understanding of their limitations, an understanding
which would both inform their appropriate role in an overall policy
for assessment, and which would ensure that those using the results
may do so with well-informed judgement.[82]
56. Some witnesses have emphasised what they see
as a tension between validity and consistency in results. The
argument is that, over time, national tests have been narrowed
in scope and marking schemes specified in an extremely detailed
manner in order to maximise the consistency of the tests. In other
words, candidates displaying the same level of achievement in
the test are more likely to be awarded the same grade since there
is less room for the discretion of the examiner. However, it is
argued further that this comes at the expense of validity, in
the sense that the scope of the tests is narrowed so much that
they test very little of either the curriculum or the candidate's
wider skills.[83] Sue
Hackman, Chief Adviser on School Standards at the DCSF, recognised
this trade-off. However, she also told us that in relation to
Key Stage tests the Department, together with the QCA, has tried
to include a range of questions in test papers, some very narrow
and others rather wider. In this way, she considered that a compromise
has been reached between "atomistic and reliable questions,
and wide questions that allow pupils with flair and ability to
show what they can do more widely".[84]
57. Many witnesses have called for greater emphasis
on teacher assessment in order to enhance both the validity and
the reliability of the testing system.[85]
A move towards a better balance between regular, formative teacher
assessment and summative assessments - the latter drawn from
a national bank of tests, to be externally moderated - would
provide a more rounded view of children's achievements, and many
have criticised the reliance on a 'snapshot' examination at a
single point in time.[86]
58. We consider that the over-emphasis on the
importance of national tests, which address only a limited part
of the National Curriculum and a limited range of children's skills
and knowledge, has resulted in teachers narrowing their focus.
Teachers who feel compelled to focus on that part of the curriculum
which is likely to be tested may feel less able to use the full
range of their creative abilities in the classroom and find it
more difficult to explore the curriculum in an interesting and
motivational way. We are concerned that the professional abilities
of teachers are, therefore, under-used and that some children
may suffer as a result of a limited educational diet focussed
on testing. We feel that teacher assessment should form a significant
part of a national assessment regime. As the Chartered Institute
of Educational Assessors states, "A system of external testing
alone is not ideal and government's recent policy initiatives
in progress checks and diplomas have made some move towards addressing
an imbalance between external testing and internal judgements
made by those closest to the students, i.e. the teachers, in line
with other European countries".[87]
Information for the public
59. The National Foundation for Educational Research
stated that no changes should be made to the national testing
system without a clear statement of the purposes of that system
in order of priority. The required levels of validity and
reliability should be elucidated, and it should be made clear how
these requirements would be balanced against the need for manageability
and cost-effectiveness.[88]
The NFER commented that Key Stage testing in particular:
[…] is now a complex system, which has developed
many different purposes over the years and now meets each to a
greater or lesser extent. It is a tenet of current government
policy that accountability is a necessary part of publicly provided
systems. We accept that accountability must be available within
the education system and that the assessment system should provide
it. However, the levels of accountability and the information
to be provided are open to considerable variation of opinion.
It is often the view taken of these issues which determines the
nature of the assessment system advocated, rather than the technical
quality of the assessments themselves.[89]
60. Cambridge Assessment criticised agencies, departments
and Government for exaggerating the technical rigour of national
assessment. It continued:
[…] any attempts to more accurately describe
its technical character run the risk of undermining both the departments
and ministers; '[…] if you're saying this now, how is it
that you said that, two years ago […]'. This prevents rational
debate of problems and scientifically-founded development of arrangements.[90]
Cambridge Assessment stated further that international
best practice dictates that information on the measurement error
intrinsic to any testing system should be published alongside
test data and argued that this best practice should be adopted
by the Government.[91]
Professor Peter Tymms of Durham University similarly argued that:
[…] it would certainly be worth trying providing
more information. I think that the Royal Statistical Society's
recommendation not to give out numbers unless we include the uncertainties
around them is a very proper thing to do, but it is probably a
bit late.[92]
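A minimal sketch of what following the Royal Statistical Society's recommendation might look like in practice is given below; the scores, the standard error of 4 marks and the group size are invented for illustration and are not results from the evidence.

```python
# A minimal sketch of reporting a figure together with its uncertainty, in the
# spirit of the Royal Statistical Society recommendation quoted above. The
# scores, the standard error of 4 marks and the group size are invented.

def report_with_uncertainty(label: str, score: float, sem: float, z: float = 1.96) -> str:
    """Format a result with an approximate 95% margin of error."""
    half_width = z * sem
    return f"{label}: {score:.0f} (95% range roughly {score - half_width:.0f} to {score + half_width:.0f})"

print(report_with_uncertainty("Pupil A, Key Stage 2 English (marks)", 62, sem=4.0))
print(report_with_uncertainty("School X, mean mark (60 pupils)", 58, sem=4.0 / 60 ** 0.5))
```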
61. We are concerned about the Government's stance
on the merits of the current testing system. We remain unconvinced
by the Government's assumption that one set of national tests
can serve a range of purposes at the national, local, institutional
and individual levels. We recommend that the Government sets out
clearly the purposes of national testing in order of priority
and, for each purpose, gives an accurate assessment of the fitness
of the relevant test instrument for that purpose, taking into
account the issues of validity and reliability.
62. We recommend further that estimates of statistical
measurement error be published alongside test data and statistics
derived from those data to allow users of that information to
interpret it in a more informed manner. We urge the Government
to consider further the evidence of Dr Ken Boston, that multiple
test instruments, each serving fewer purposes, would be a more
valid approach to national testing.
28  Q287
29  Ev 157
30  Ev 158-159
31  Ev 21
32  Ev 23
33  DES (1988), Task Group on Assessment and Testing: A Report, London, HMSO
34  Ev 23
35  Ev 24
36  Ev 24-25
37  Ev 157
38  Q290
39  Q79
40  Q84
41  Q79; see also Ev 31
42  Q79
43  "System Redesign-2: assessment redesign", David Hargreaves, Chris Gerry and Tim Oates, November 2007, pp 28-29
44  Ev 261; Ev 264; Ev 198; Ev 273; Ev 75; Ev 47; Q134; Q237; written evidence from Association of Science Education, paras 5-6; written evidence from The Mathematical Association, under headings "General Issues" and "National Key Stage Tests"
45  Ev 75
46  Ev 261
47  Ev 264
48  Ev 263
49  Ev 272
50  Ev 273
51  Ev 263
52  Ev 68
53  Ev 69
54  Ev 198
55  Ev 233
56  Ev 110-111
57  Ev 257
58  Ev 257
59  Ev 257-258
60  Ev 32
61  Ev 56; Ev 71; Q128; written evidence from the Advisory Committee on Mathematics Education, paras 18-20; written evidence from Association for Achievement and Improvement through Assessment, para 4
62  Ev 263; Ev 269; written evidence from Barbara J Cook, Headteacher, Guillemont Junior School, Farnborough, Hants
63  Ev 238; Ev 239; written evidence from Doug French, University of Hull, para 2.2; written evidence from Association for Achievement and Improvement through Assessment, para 4
64  Ev 75; Ev 56; written evidence from Doug French, University of Hull, para 1.3
65  Ev 60; Ev 75; Ev 232; Q139; written evidence from Doug French, University of Hull, para 1.3
66  Ev 202-203
67  Written evidence from Heading for Inclusion, Alliance for Inclusive Education, para 2(a)
68  Ev 203
69  Ev 61; Ev 75; Ev 221-222; Ev 226; Q128
70  Q83
71  Q297
72  Q298
73  Ev 202-203
74  Ev 203
75  Ev 257
76  See also Ev 236
77  Ev 257
78  Ev 203-204; see also Ev 226
79  Ev 257
80  Ev 204
81  Ev 204-205
82  Ev 205
83  Ev 226
84  Q324
85  Ev 112; Ev 204; Ev 205; Ev 223; Ev 239; Ev 271; written evidence from the Association for Achievement and Improvement through Assessment, para 5
86  Ev 49; Ev 68; Ev 75; Ev 112; Ev 223; Ev 225; Ev 271; written evidence from Purbrook Junior School, Waterlooville, para 5; written evidence from Association for Achievement and Improvement through Assessment, paras 4-5
87  Ev 222
88  Ev 251
89  Ev 251
90  Ev 251
91  Ev 214-215
92  Q19