Joint memorandum submitted by Paul Black,
Emeritus Professor of Education King's College London, John Gardner,
Professor of Education Queen's University Belfast, and Dylan Wiliam,
Professor and Deputy Director, Institute of Education, London
This submission highlights the limited reliability
of national tests and testing systems and argues that the misuse
of such assessments would be greatly reduced if test developing
agencies and awarding bodies were required to inform the public
fully on reliability matters. It is to be stressed that the limited
reliability is systemic and inevitable, and does not imply any
lack of competence or professionalism on the part of the test
developers and awarding bodies involved.
The basis for our argument is that results of
assessments, whether by tests or other forms of assessment, are
imperfect, ie they are subject to measurement error. The argument
may be summarised by the following six points:
1. That some degree of error in such
results is inescapable.
2. That the best information currently
available indicates that up to 30% of candidates in any public
examination in the UK will receive the wrong level or grade.
3. That sound information on the errors
in examination results is not available from any UK test developers
or awarding bodies for any public examination results, and we
know of no plan or intention to obtain or publish such information.
4. That fully informed decisions using
examination results cannot be made unless those making the decisions
are aware of the chances of error in them.
5. That there are also important issues
concerning the validity of assessment results, which interact
with the issues of reliability, and would broaden the scope, but
not affect the nature, of our arguments.
6. That policy makers, pupils, parents
and the public need to have information about such errors, to
understand the reasons for them, and to consider their implications
for the decisions which are based upon, or are affected by, assessment
results.
These six points are explained in detail below.
1. Error is Inescapable
Errors in test results may arise from many sources.
We are not concerned here with errors in administration and in
marking which we assume to be minimal. The problem we address
arises because any test measure will be in error because it is
bound to be based on a limited sample of any candidate's attainment.
At the end of (say) five years of study for
GCSE in a subject, a candidate could attempt to answer many questions,
or other tasks, different from one another in form, in difficulty,
and in the particular topics and skills which they explore: if
the candidate could attempt all of these, and could do each on
more than one occasion to guard against day to day variations
in performance, the average score from all of these attempts would
represent her/his "true score". In practice this is impossible.
So with any test that is affordable, in terms
of its costs of time and money, the score is based on a limited
sample of the candidate's possible attainments, a sample that
is limited in the types of demand, in the topics and skills explored,
and in the test occasions. This limitation does not arise from
any lack of expertise or care from those doing either the setting,
or the marking, or the administration.
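The sampling argument above can be sketched in a small simulation. All the numbers here are illustrative assumptions of ours, not data from any real examination: a candidate's attainment is modelled as a pool of possible questions, the "true score" is the expected performance over the whole pool, and any affordable test samples only a few of them.

```python
import random

random.seed(1)

# Hypothetical model: a candidate's attainment is a pool of 500 possible
# questions, each with some probability of success for this candidate.
# The "true score" is the expected percentage over the whole pool.
pool = [random.uniform(0.3, 0.9) for _ in range(500)]
true_score = 100 * sum(pool) / len(pool)

def sit_test(n_questions):
    # One affordable test: a random sample of questions from the pool,
    # each answered correctly or not on the day.
    sample = random.sample(pool, n_questions)
    return 100 * sum(1 for p in sample if random.random() < p) / n_questions

short_tests = [sit_test(25) for _ in range(10)]
print(f"true score: {true_score:.1f}")
print("ten sittings of a 25-question test:", [round(s) for s in short_tests])
```

The ten sittings scatter around the true score; no single short test reports it exactly, which is the random sampling error described above.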
Thus the measured score differs from the "true
score" because it is only a limited sample of the possible
evidence. The errors involved are random errors, in that some
candidates' scores may be higher than their true score, others
lower, some may bear a very small error, some a very large one.
For any particular candidate, there is no way of knowing what
this "sampling" error is, because the true score cannot be known.
2. How Big is the Sampling Error?
If a candidate's marks varied hardly at all
from one question to another, we can be confident that increasing
the length of the test would make very little difference: the
sampling error is small. If the marks varied wildly from one question
to another, we can be confident that the sampling error is large:
a far longer test is needed. By analysing the range of such variations
from one question to another, over many candidates, it is possible
to make a statistical estimate of the errors involved. If the
candidates are tested on more than one occasion, then the differences
in their scores between the two occasions give further information
about the likelihood of error. We submit, with this document,
a copy of a chapter from a recent book which reviews publications
in which the results of this type of analysis are reported. The
main conclusions may be summarised as follows:
(i) In such public examinations as Key
Stage tests, 11+ selection tests, GCSE and A-level, the sampling
error is such that between 20% and 30% of candidates will earn
a result which will differ by at least one Level or Grade from
their "true" score.
(ii) There is fairly close agreement
that the errors can be this large between analyses of standardised
tests used in California, the 11+ selection tests used in Northern
Ireland, and tests at Key Stages 2 and 3 in England.
(iii) That a common measure of so-called
"reliability" of a test can be used to indicate the
error in the raw marks, but that the effect of such error on wrong
assignments of levels or grades will depend on how close, in terms
of marks, the boundaries between these levels or grades happen to be.
(iv) That one way in which this sampling
error can be reduced is to narrow the range of the question types,
topics and skills involved in a test; but then the result
will give misleading information in that it will give users a
very limited estimate of the candidates' attainments.
(v) That another way is to increase
the testing time so that the sample can be larger; unfortunately,
the reliability of testing increases only very slowly with the
test length: as far as can be estimated, to reduce the proportion
of pupils wrongly classified in a Key Stage 2 test to within 10%
would require 30 hours of testing.
(vi) That there are no other ways of
composing short tests that could reduce these errors. It would
be possible in principle to produce more reliable results if the
information that teachers have about their pupils could be assembled
and used, because teachers can collect evidence of performance
on many different occasions, and in several tasks which can cover
a range in content and in types of demand.
(vii) That these findings are concerned with
the errors in the results of individual pupils. The mean of a
group of observed scores will be closer to the mean true score
of the group simply because many of the sources of random variation
in the performance of individuals will average out.
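Points (i), (iii) and (vii) above can be illustrated with a minimal simulation. The model is our own assumption, not the authors' data: true scores are normally distributed, observed scores add random error sized to give a reliability coefficient of about 0.85 (a typical published value for tests of this kind), and level boundaries fall every 10 marks, an illustrative choice.

```python
import math
import random

random.seed(42)

# Assumed model: true scores ~ Normal(50, 10); error sd chosen so that
# reliability = var(true) / (var(true) + var(error)) is about 0.85.
N = 20000
true_sd, reliability = 10.0, 0.85
err_sd = true_sd * math.sqrt((1 - reliability) / reliability)

true_scores = [random.gauss(50, true_sd) for _ in range(N)]
observed = [t + random.gauss(0, err_sd) for t in true_scores]

def level(score, band=10):
    # Illustrative level boundaries every `band` marks.
    return int(score // band)

misclassified = sum(level(o) != level(t) for o, t in zip(observed, true_scores))
rate = misclassified / N
print(f"pupils given the wrong level: {100 * rate:.0f}%")

# Point (vii): the random errors largely cancel in a group mean.
mean_gap = abs(sum(observed) / N - sum(true_scores) / N)
print(f"gap between observed and true group means: {mean_gap:.2f} marks")
```

Under these assumptions roughly a quarter to a third of pupils receive the wrong level, consistent with the 20-30% figure above, while the group mean is almost unaffected.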
3. Is enough known about error in our Public Examinations?
The short answer to this question is "no".
One of us (PJB) wrote to the chief executive of the QCA, in January
2005, enquiring whether there were any well researched results
concerning the reliability of those tests for which the QCA had
responsibility. The reply was that "there is little research
into this aspect of the examining process", and drew attention
only to the use of borderline reviews and to the reviews arising
from the appeals system. We cannot see how these procedures can
be of defensible scope if the range of the probable error is not
known, and the evidence suggests that if it were known the volume
of reviews needed would be insupportable.
Of the information quoted in section 2 above,
one source, that for the 11+ selection tests in Northern Ireland,
is based on a full analysis of data from test papers. The estimates
for Key Stage tests and for A-level are based either on measures
provided by others or on well-established data from comparable tests.
4. Do we need Measures of the error in our Public Examinations?
The answer to this question ought not to be
in doubt. It is profoundly unsatisfactory that public examination
data in this country do not conform to the requirements which
have been set out as Standards for Educational and Psychological
Tests in a joint publication from the American Educational Research
Association, the American Psychological Association, and the
National Council on Measurement in Education. In the USA, tests
that do not conform to these requirements are, to all intents
and purposes, indefensible (at least in law).
However, there is a more immediate reason why
the answer to this question must be "yes": the point
is that decisions are made on the basis of these measures, and
such decisions will be ill-judged if those making them assume
that the measures are without error.
Examples of decisions which might be made differently
if the probability of error were taken into account are as follows:
(i) The general argument is that where people
know results are unreliable, they may seek alternative sources
of evidence for confirmation. Where they regard the tests as flawless,
they are more likely to rely entirely on them.
(ii) Public policy at present is based
on the premise that test results are reliable and teachers' assessments
are not, so the best combination will give little or no weight
to teachers' own assessments. Policy ought to be based on using
the optimum combination of the two. Of course, to do this we would
need serious development work and well researched data, to enhance
and measure the reliability of teachers' own assessments. Some
development work of this sort has been done in the past, and some
is being developed now: it deserves greater priority because use
of teachers' own assessments is the only approach available to
overcome the serious limitations to accuracy which we present
in section 2. We draw attention to the policy of the Australian
State of Queensland, where for over 25 years state test certificates
have been based entirely on teachers' own assessments, albeit
within a rigorous system of inter-school collaboration to ensure
authenticity and comparability of these assessments.
(iii) The recent proposals, in the paper
entitled Making Good Progress, to introduce single level
tests available every six months, are a good example of the limitations
that follow from ignoring the errors in test measures. The proposals
pay no attention to the effects of measurement error. For example,
since the proposals allow pupils to make several repeated attempts,
and given the random errors in each attempt, any pupil with a
true score which is only a few marks below a given level is bound
to succeed eventually, so in time standards will appear to have
risen even if the "true" scores have not risen at all.
(iv) The majority of class teachers
and school managements seem to be unaware of the limitations of
test results. Our experience of working with schools is that many,
either from lack of trust in their own judgments, or because parents
pay more attention to test paper scores rather than to the teachers'
own knowledge of the pupils, will rely heavily, even exclusively,
on the results of short formal tests, often using "off-the-shelf"
test papers taken from previous public examinations. The results
of such tests are then used for setting and streaming in later
years, and/or to advise pupils to drop some subjects. It is very
likely that for a significant proportion of pupils, such decisions
and advice may be ill-advised and even unjust.
(v) Pupils themselves, their parents,
and those using test results for selection or recruitment, ought
to have information about the probability of error in these results.
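The re-sit effect described in (iii) above can be sketched as follows. The figures are illustrative assumptions: a pupil whose true score sits 3 marks below a boundary at 50, sitting a test with random error of 4 marks standard deviation as often as allowed until she passes.

```python
import random

random.seed(7)

# Sketch of the repeated-attempts effect: each sitting draws an observed
# score around the fixed true score; the pupil passes as soon as one
# observed score clears the boundary. (All figures are illustrative.)
def attempts_until_pass(true_score=47.0, boundary=50.0, err_sd=4.0):
    attempts = 0
    while True:
        attempts += 1
        if random.gauss(true_score, err_sd) >= boundary:
            return attempts

results = [attempts_until_pass() for _ in range(1000)]
passed_by_third = sum(1 for a in results if a <= 3) / len(results)
print(f"share passing within three attempts: {100 * passed_by_third:.0f}%")
```

Every simulated pupil passes eventually, and around half pass within three attempts, although no true score has risen at all: standards appear to improve purely through random error.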
5. Validity and Reliability
We have, in this paper, limited our arguments
almost entirely to the issue of reliability. It is possible to
have a very reliable test result which is based only on performance
on (say) tests of memory under formal examination conditions.
Since few workplace situations require only such ability, the
results do not tell employers what they need to know: the
results may be reliable, but they are invalid. The two requirements,
of validity and of reliability, are inter-related. One example
of this interaction was pointed out above: narrowing the range
of topics and skills addressed in a test can improve reliability,
but make the test less valid as it is a narrower measure. Another
example is that use of pupils' performance on extended projects,
involving library research, or practical investigations in science,
can enhance validity, and might also improve reliability in basing
the measure on a wider range of activities extending over a longer
time period. Whilst consideration of validity will broaden the
range of the arguments, the recommendations in 6 below would still
apply, but would have to be expanded in scope.
6. So what should be done?
Those responsible for our public examinations should:
(i) review all available evidence about
the reliability of the present tests and examinations;
(ii) set up a continuing process of
research studies to enhance, and keep up to date, evidence of
the best possible quality;
(iii) set up discussions to determine
the optimum policies for obtaining assessment data in the light
of evidence about errors in the various sources and methods available;
such discussions including fresh consideration of the potential
for far greater use of assessments made by teachers, individually
and by and between schools; and
(iv) develop a programme to ensure that
all who use or depend on assessment results are well informed
about the inevitable errors in these results.
We stress that the above is not an argument
against the use of formal tests. It is an argument that they should
be used with understanding of their limitations, an understanding
which would both inform their appropriate role in an overall policy
for assessment, and which would ensure that those using the results
may do so with well-informed judgement.
The issue that we address here cuts across and
affects consideration of almost all of the issues in which the
Committee has expressed interest. In particular it is relevant to:
the accountability of the QCA;
whether Key Stage tests adequately
reflect the performance of children and schools;
the role of teachers in assessment;
whether and how the system of national
tests should be changed;
whether testing and assessment in
"summative" tests (for example, GCSE, AS, A2) is fit
for purpose; and
the appropriateness of changes proposed
in GCSE coursework.