Joint memorandum submitted by Paul Black, Emeritus Professor of Education, King's College London; John Gardner, Professor of Education, Queen's University Belfast; and Dylan Wiliam, Professor and Deputy Director, Institute of Education, London

 

INTRODUCTION

 

This submission highlights the limited reliability of national tests and testing systems and argues that the misuse of such assessments would be greatly reduced if test developing agencies and awarding bodies were required to inform the public fully on reliability matters. It is to be stressed that the limited reliability is systemic and inevitable, and does not imply any lack of competence or professionalism on the part of the test developers and awarding bodies involved.

 

The basis for our argument is that results of assessments, whether by tests or other forms of assessment, are imperfect, i.e. they are subject to measurement error. The argument may be summarised by the following six points:

 

1 That some degree of error in such results is inescapable.

 

2 That the best information currently available indicates that up to 30% of candidates in any public examination in the UK will receive the wrong level or grade.

 

3 That sound information on the errors in examination results is not available from any UK test developers or awarding bodies for any public examination results, and we know of no plan or intention to obtain or publish such information.

 

4 That fully informed decisions using examination results cannot be made unless those making the decisions are aware of the chances of error in them.

 

5 That there are also important issues concerning the validity of assessment results, which interact with the issues of reliability, and would broaden the scope, but not affect the nature, of our arguments.

 

6 That policy makers, pupils, parents and the public need to have information about such errors, to understand the reasons for them, and to consider their implications for the decisions which are based upon, or are affected by, assessment results.

 

These six points are explained below in detail.

 


1 ERROR IS INESCAPABLE.

 

Errors in test results may arise from many sources. We are not concerned here with errors in administration and in marking, which we assume to be minimal. The problem we address arises because any test measure is bound to be based on a limited sample of a candidate's attainment, and will therefore be in error.

 

At the end of (say) five years of study for GCSE in a subject, a candidate could attempt to answer many questions, or other tasks, different from one another in form, in difficulty, and in the particular topics and skills which they explore: if the candidate could attempt all of these, and could do each on more than one occasion to guard against day-to-day variations in performance, the average score from all of these attempts would represent her/his "true score". In practice this is impossible.

 

So with any test that is affordable, in terms of its costs of time and money, the score is based on a limited sample of the candidate's possible attainments, a sample that is limited in the types of demand, in the topics and skills explored, and in the test occasions. This limitation does not arise from any lack of expertise or care on the part of those responsible for the setting, the marking, or the administration.

 

Thus the measured score differs from the "true score" because it is based on only a limited sample of the possible evidence. The errors involved are random errors, in that some candidates' scores may be higher than their true score, others lower; some may bear a very small error, some a very large one. For any particular candidate, there is no way of knowing what this "sampling" error is, because the true score cannot be known.
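
As an illustration only, the sampling effect described above can be sketched in a few lines of Python. The sketch assumes a hypothetical candidate who could answer a fixed proportion of all the questions that might be set, and treats each test as a random draw of questions. The function name, the proportion of 0.6 and the test lengths are assumptions made for the example, not data from any examination.

```python
import random
from statistics import mean, pstdev


def simulated_test_scores(true_proportion, n_questions, n_tests, seed=0):
    """Return the fraction of questions answered correctly on each simulated test.

    Each question is answered correctly with probability `true_proportion`,
    a simplifying assumption standing in for the candidate's "true score".
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_tests):
        correct = sum(rng.random() < true_proportion for _ in range(n_questions))
        scores.append(correct / n_questions)
    return scores


# Hypothetical candidate whose "true score" is 60% of all possible questions.
short_tests = simulated_test_scores(0.6, n_questions=20, n_tests=2000)
long_tests = simulated_test_scores(0.6, n_questions=80, n_tests=2000)

print(f"20-question tests: mean {mean(short_tests):.3f}, spread {pstdev(short_tests):.3f}")
print(f"80-question tests: mean {mean(long_tests):.3f}, spread {pstdev(long_tests):.3f}")
```

Under these assumptions the average over many tests sits close to 0.6, but any single test can land some distance from it, and quadrupling the number of questions only roughly halves the spread.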

 

2 HOW BIG IS THE SAMPLING ERROR?

 

If a candidate's marks vary hardly at all from one question to another, we can be confident that increasing the length of the test would make very little difference: the sampling error is small. If the marks vary wildly from one question to another, we can be confident that the sampling error is large, and that a far longer test is needed. By analysing the range of such variations from one question to another, over many candidates, it is possible to make a statistical estimate of the errors involved. If the candidates are tested on more than one occasion, then the differences in their scores between occasions give further information about the likelihood of error. We submit, with this document, a copy of a chapter from a recent book which reviews publications in which the results of this type of analysis are reported. The main conclusions may be summarised as follows:

 

(i) In such public examinations as Key Stage tests, 11+ selection tests, GCSE and A-level, the sampling error is such that between 20% and 30% of candidates will earn a result which differs by at least one Level or Grade from the one corresponding to their "true" score.

(ii) Analyses of standardised tests used in California, of the 11+ selection tests used in Northern Ireland, and of tests at Key Stages 2 and 3 in England agree fairly closely that the errors can be this large.

(iii) That a common measure of so-called 'reliability' of a test can be used to indicate the error in the raw marks, but that the effect of such error on wrong assignments of levels or grades will depend on how close, in terms of marks, the boundaries between these levels or grades happen to be.

(iv) That one way in which this sampling error can be reduced is to narrow the range of the question types, topics and skills involved in a test - but then the result will give misleading information in that it will give users a very limited estimate of the candidates' attainments.

(v) That another way is to increase the testing time so that the sample can be larger; unfortunately, the reliability of testing increases only very slowly with the test length. For example, as far as can be estimated, reducing the proportion of pupils wrongly classified in a Key Stage 2 test to within 10% would require 30 hours of testing.

(vi) That there are no other ways of composing short tests that could reduce these errors. It would be possible in principle to produce more reliable results if the information that teachers have about their pupils could be assembled and used, because teachers can collect evidence of performance on many different occasions, and in several tasks which can cover a range in content and in types of demand.

(vii) That these findings are concerned with the errors in the results of individual pupils. The mean of a group of observed scores WILL be closer to the mean true score of the group simply because many of the sources of random variation in the performance of individuals will average out.
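
Points (iii) and (v) can be illustrated with a minimal simulation. It is a sketch under stated assumptions, not a reproduction of the analyses reviewed in the submitted chapter: true scores and errors are taken to be normally distributed, reliability is taken in the classical sense as the ratio of true-score variance to observed-score variance, and the Spearman-Brown formula is used to project the reliability of a lengthened test. The function names, the starting reliability of 0.8 and the boundary positions are hypothetical.

```python
import random
from bisect import bisect_right


def misclassification_rate(reliability, boundaries, n_candidates=100_000, seed=0):
    """Estimate the share of candidates placed in a level other than their true one."""
    rng = random.Random(seed)
    # True scores have unit variance, so this error SD gives
    # var(true) / var(observed) = reliability.
    error_sd = ((1.0 - reliability) / reliability) ** 0.5
    wrong = 0
    for _ in range(n_candidates):
        true_score = rng.gauss(0.0, 1.0)
        observed_score = true_score + rng.gauss(0.0, error_sd)
        # A level is the number of boundaries at or below the score.
        if bisect_right(boundaries, true_score) != bisect_right(boundaries, observed_score):
            wrong += 1
    return wrong / n_candidates


def spearman_brown(reliability, length_factor):
    """Projected reliability of a test made `length_factor` times longer."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical level boundaries, in standard deviations of true score.
wide_levels = [-1.0, 1.0]
narrow_levels = [-1.5, -0.5, 0.5, 1.5]

for factor in (1, 2, 4, 8):
    r = spearman_brown(0.8, factor)
    print(
        f"length x{factor}: reliability {r:.3f}, "
        f"wrong level (wide) {misclassification_rate(r, wide_levels):.1%}, "
        f"wrong level (narrow) {misclassification_rate(r, narrow_levels):.1%}"
    )
```

Within this simplified model, closer boundaries produce more wrong assignments for the same reliability, and each doubling of test length buys a progressively smaller gain in reliability. The specific percentages printed depend entirely on the assumed distributions and boundaries and should not be read as estimates for any real examination.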

 

3 IS ENOUGH KNOWN ABOUT ERROR IN OUR PUBLIC EXAMINATIONS?

 

The short answer to this question is "no". One of us (PJB) wrote to the chief executive of the QCA, in January 2005, enquiring whether there were any well-researched results concerning the reliability of those tests for which the QCA had responsibility. The reply stated that "there is little research into this aspect of the examining process", and drew attention only to the use of borderline reviews and to the reviews arising from the appeals system. We cannot see how these procedures can be of defensible scope if the range of the probable error is not known, and the evidence suggests that if it were known the volume of reviews needed would be insupportable.

 

Of the information quoted in section 2 above, one source, that for the 11+ selection tests in Northern Ireland, is based on a full analysis of data from test papers. The estimates for Key Stage tests and for A-level are based either on measures provided by others or on well established data from comparable examination systems.

 

4 DO WE NEED MEASURES OF THE ERROR IN OUR EXAMINATIONS?

 

The answer to this question ought not to be in doubt. It is profoundly unsatisfactory that public examination data in this country do not conform to the requirements which have been set out as Standards for Educational and Psychological Tests in a joint publication from the American Educational Research Association, the American Psychological Association, and the American National Council on Measurement in Education. In the USA, tests that do not conform to these requirements are, to all intents and purposes, indefensible (at least in law).

 

However, there is a more immediate reason why the answer to this question must be "yes": the point is that decisions are made on the basis of these measures, and such decisions will be ill-judged if those making them assume that the measures are without error.

 

Examples of decisions which might be made differently if the probability of error were taken into account are as follows:

 

(i) The general argument is that where people know results are unreliable, they may seek alternative sources of evidence for confirmation. Where they regard test results as flawless, they are more likely to rely entirely on them.

 

(ii) Public policy at present is based on the premise that test results are reliable and teachers' assessments are not, so that the best combination will give little or no weight to teachers' own assessments. Policy ought to be based on using the optimum combination of the two. Of course, to do this we would need serious development work and well-researched data, to enhance and measure the reliability of teachers' own assessments. Some development work of this sort has been done in the past, and some is being done now: it deserves greater priority because use of teachers' own assessments is the only approach available to overcome the serious limitations to accuracy which we present in section 2. We draw attention to the policy of the Australian State of Queensland, where for over 25 years state test certificates have been based entirely on teachers' own assessments, albeit within a rigorous system of inter-school collaboration to ensure authenticity and comparability of these assessments.

 

(iii) The recent proposals, in the paper entitled Making Good Progress, to introduce single level tests available every six months, are a good example of the limitations that follow from ignoring the errors in test measures. The proposals pay no attention to the effects of measurement error. For example, since the proposals allow pupils to make several repeated attempts, and given the random errors in each attempt, any pupil with a true score which is only a few marks below a given level is bound to succeed eventually, so in time standards will appear to have risen even if the 'true' scores have not risen at all.
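
The arithmetic behind this example can be sketched as follows. The sketch is illustrative and rests on assumptions that are not part of the proposals themselves: the error in each attempt is taken to be normally distributed with a fixed standard deviation, the attempts are taken to be independent, and the pupil's true score is taken not to change between attempts. The function names and the figures of 3 marks and 5 marks are hypothetical.

```python
from statistics import NormalDist


def pass_probability(marks_below_boundary, error_sd):
    """Chance that one attempt reaches the boundary through measurement error alone."""
    return 1.0 - NormalDist(mu=0.0, sigma=error_sd).cdf(marks_below_boundary)


def pass_within(attempts, marks_below_boundary, error_sd):
    """Chance of at least one pass in `attempts` independent attempts."""
    p_single = pass_probability(marks_below_boundary, error_sd)
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical pupil: true score 3 marks below the boundary, error SD of 5 marks.
for attempts in (1, 2, 4, 6):
    print(f"{attempts} attempt(s): {pass_within(attempts, 3, 5):.0%} chance of a pass")
```

Under these assumptions the chance of at least one pass rises with every further attempt, even though the pupil's true score stays below the boundary throughout.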

 

(iv) The majority of class teachers and school managements seem to be unaware of the limitations of test results. Our experience of working with schools is that many, either from lack of trust in their own judgments, or because parents pay more attention to test paper scores than to the teachers' own knowledge of the pupils, will rely heavily, even exclusively, on the results of short formal tests, often using 'off-the-shelf' test papers taken from previous public examinations. The results of such tests are then used for setting and streaming in later years, and/or to advise pupils to drop some subjects. It is very likely that, for a significant proportion of pupils, such decisions and advice are ill-founded and even unjust.

 

(v) Pupils themselves, their parents, and those using test results for selection or recruitment ought to have information about the probability of error in these results.

 

5 VALIDITY AND RELIABILITY

 

We have, in this paper, limited our arguments almost entirely to the issue of reliability. It is possible to have a very reliable test result which is based only on performance on (say) tests of memory under formal examination conditions. Since few workplace situations require only such ability, the results do not tell employers what they need to know: the results may be reliable, but they are invalid. The two requirements, of validity and of reliability, are inter-related. One example of this interaction was pointed out above: narrowing the range of topics and skills addressed in a test can improve reliability, but makes the test less valid as it is a narrower measure. Another example is that use of pupils' performance on extended projects, involving library research, or practical investigations in science, can enhance validity, and might also improve reliability by basing the measure on a wider range of activities extending over a longer time period. Whilst consideration of validity will broaden the range of the arguments, the recommendations in section 6 below would still apply, but would have to be expanded in scope.

 

6 SO WHAT SHOULD BE DONE?

 

Those responsible for our public examinations should

 

(i) review all available evidence about their reliability;

 

(ii) set up a continuing process of research studies to enhance, and keep up to date, evidence of the best possible quality;

 

(iii) set up discussions to determine the optimum policies for obtaining assessment data in the light of evidence about errors in the various sources and methods available, such discussions to include fresh consideration of the potential for far greater use of assessments made by teachers, individually and by and between schools;

 

(iv) develop a programme to ensure that all who use or depend on assessment results are well informed about the inevitable errors in these results.

 

CONCLUDING COMMENT

 

We stress that the above is not an argument against the use of formal tests. It is an argument that they should be used with understanding of their limitations, an understanding which would both inform their appropriate role in an overall policy for assessment and ensure that those using the results may do so with well-informed judgement.

 

The issue that we address here cuts across and affects consideration of almost all of the issues in which the committee has expressed interest. In particular it is relevant to

the accountability of the QCA;

whether Key Stage tests adequately reflect the performance of children and schools;

the role of teachers in assessment;

whether and how the system of national tests should be changed;

whether testing and assessment in "summative" tests (for example, GCSE, AS, A2) is fit for purpose;

the appropriateness of changes proposed in GCSE coursework.

 

May 2007

 

 
