Select Committee on Children, Schools and Families Written Evidence


National Assessment—Annex 1

AN OVERVIEW OF THE EVIDENCE

1.  Measurement error and the problems with overlaying levels onto marks

  This does not refer to human error or mistakes in the administration of tests but to intrinsic measurement error. Contemporary standards in the US lead to the expectation that error estimates are printed alongside individual results, for example: "this person has a score of 3,592 (on a scale of 5,000 marks), and the error on this test occasion means that their true score lies between 3,582 and 3,602". Is this too difficult for people to handle (ie interpret)? In the current climate of increasing statistical literacy in schools, it should not be. Indeed, results could be presented in many innovative ways which better convey where someone's "true score" lies.
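  One conventional way of producing such a band, sketched here in standard psychometric notation rather than as a description of how any national test is actually scaled, is to derive a standard error of measurement (SEM) from the test's reliability and report the observed score plus or minus a multiple of it:

```latex
% Standard error of measurement from test reliability (standard convention).
% sigma_X is the spread of scores and r_xx is the test's reliability coefficient.
\[
  \mathrm{SEM} = \sigma_X \sqrt{1 - r_{xx}},
  \qquad
  \text{reported band} \approx X_{\text{obs}} \pm z \cdot \mathrm{SEM}
\]
% In the quoted illustration, a band of plus or minus 10 marks around 3,592
% corresponds to z times SEM being roughly 10, for whatever coverage level
% (eg 68% or 95%) the reporting convention adopts.
```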

  Error data are not provided for national tests in England, and both the Statistics Commission and commentators (eg Wiliam, Newton, Oates, Tymms) have raised questions as to why this international best practice is not adopted.

  Of course, error can be reduced by changing the assessment processes—which most often results in a dramatic increase in costs. Note "reduce" not "remove"—the latter is unfeasible in mass systems. For example, double marking might be adopted and would increase the technical robustness of the assessments. However, this is impractical in respect of timeframes; it is already difficult to maintain existing numbers of markers, etc. Error can be reduced by increased expenditure but is escalating cost appropriate in the current public sector policy climate?

  One key point to bear in mind: the measurement error must be significantly smaller than the performance gains which one is expecting from the system, from schools, and indeed from teachers within the schools. Unfortunately, a 1-2% improvement lies within the bounds of error: get the level thresholds wrong by two marks either way (see the section on KS3 Science below) and the results of 16,000 pupils (ie just over 2% of the cohort) could be moved.
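  The scale of that effect can be illustrated with a simple sketch. Everything in it (the cohort size, the mark distribution and the cut score) is an assumption chosen for demonstration, not a figure taken from any national test; the point is only that a two-mark shift in a threshold moves a material percentage of the cohort across the boundary.

```python
# Illustrative only: how a two-mark shift in a cut score moves pupils across a
# level boundary. Cohort size, mark distribution and cut score are assumptions.
import random

random.seed(1)
COHORT = 600_000                      # assumed cohort size
marks = [min(120, max(0, round(random.gauss(60, 18)))) for _ in range(COHORT)]

def pct_at_or_above(cut):
    """Percentage of the cohort at or above a given mark threshold."""
    return 100 * sum(m >= cut for m in marks) / COHORT

for cut in (46, 48, 50):              # the threshold, and two marks either side
    print(f"cut score {cut}: {pct_at_or_above(cut):.1f}% reach the level")
```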

  Measurement error becomes highly significant when national curriculum levels (or any other grade scale) are overlaid onto the scores. If the error is as above, but the cut score for a crucial level is 48 (out of 120 available marks), then a child scoring 47 (error range 45-49) would not qualify for the higher level, even though the error means that their true score could easily be above the threshold. In some cases the tests are not long enough to provide information which justifies placing cut scores between adjacent marks, even though the difference between adjacent marks can have a significant effect on the percentages of the cohort achieving particular levels. Misclassification of the levels awarded is therefore a real problem. Wiliam reports that "it is likely that the proportion of students awarded a level higher or lower than they should be because of the unreliability of the tests is at least 30% at Key Stage 2 and may be as high as 40% at Key Stage 3".
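  The mechanism behind such misclassification figures can be seen in a minimal simulation. The score distribution, SEM and level thresholds below are illustrative assumptions, and the sketch is not a reconstruction of Wiliam's analysis; it simply shows that whenever observed scores carry error, pupils whose true score sits near a threshold are liable to be placed in the wrong level.

```python
# Minimal simulation of level misclassification caused by measurement error.
# Score distribution, SEM and cut scores are illustrative assumptions only.
import random

random.seed(2)
CUTS = [30, 48, 70, 95]     # assumed level thresholds on a 120-mark test
SEM = 4                     # assumed standard error of measurement
N = 100_000

def level(mark):
    """Number of thresholds reached, ie the level awarded."""
    return sum(mark >= c for c in CUTS)

mismatch = 0
for _ in range(N):
    true = min(120, max(0, random.gauss(60, 20)))
    observed = min(120, max(0, true + random.gauss(0, SEM)))
    if level(observed) != level(true):
        mismatch += 1

print(f"{100 * mismatch / N:.1f}% placed in a different level from their true score")
```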

  Criterion referencing fails to work well since question difficulty is not solely determined by curriculum content. It can also be affected by "process difficulty" and/or "question or stimulus difficulty" (Pollitt et al). It is also difficult to allocate curriculum content to levels, since questions testing the same content can cover a wide range of difficulty.

  It is believed that error could be communicated meaningfully to schools, children, parents and the press, and that doing so would enhance both the intelligence available to ministers and the educational use of the data from national tests.

  The current practice of overlaying levels onto the scores brings serious problems, and it is clear that the use of levels should be reviewed. One key issue is illustrated by the following:
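  (The figure accompanying this example is not reproduced here. The marks below are purely hypothetical values, chosen only to make the relationships between the three children concrete; the Level 5 threshold of 48 out of 120 marks is likewise an assumption.)

```python
# Hypothetical marks illustrating the example discussed below; all values are
# assumptions, including the Level 5 threshold of 48 out of 120 marks.
LEVEL_5_CUT = 48
children = {"A": 47, "B": 49, "C": 69}          # assumed marks

for name, mark in children.items():
    awarded = 5 if mark >= LEVEL_5_CUT else 4
    print(f"Child {name}: {mark} marks -> Level {awarded}")

# Child A (47) and Child B (49) are two marks apart but fall either side of the
# cut, so they are reported a whole level apart; Child B (49) and Child C (69)
# are twenty marks apart yet are reported as the same level.
```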


  Both Child B and Child C are Level 5. But in fact Child A and Child B are closer in performance, despite A being Level 4 and B being Level 5. Further, if Child A progresses to the position of Child B over a period of learning, they have increased by one level. However, if Child B progresses to the same position as Child C, they have progressed further than Child A over the same time, yet they do not move up a level. Introducing sub-levels (4a, 4b etc) has helped in some ways, but the essential problem remains.

2.  QCA practice in test development

  Pretesting items is extremely helpful; it enables the performance characteristics of each item to be established (particularly how relatively hard or easy the item is). This is vital when going into the summer level setting exercise: it is then known what is being dealt with in setting the mark thresholds at each level. But subject officers and others involved in the management of the tests have had a history of continuing to change items after the second pretest, which compromises the data available to the level setting process and thus impacts on maintaining standards over time. In addition, the "pretest effect" also remains in evidence: learners are not necessarily as motivated when taking "non-live" tests; they may not be adequately prepared for the specific content of the tests; etc. This means the pretest cannot be treated as an infallible predictor of the performance of the live test.

3.  Borderlining

  The decision was taken early in national assessment to re-mark all candidates who fall near to a level threshold. QCA publish the mark range which qualifies children for a re-mark. However, the procedure has been applied only to those just below the threshold, who might move up, and not to those just above, who might move down. This has had a very distorting effect on the distributions. Although done in the name of fairness, the practice is seriously flawed. For years, arguments around changing the procedure or removing borderlining completely foundered on the fact that this would effect a major (downward) shift in the numbers gaining each level, and therefore could not be sanctioned politically. A poorly designed and distorting practice therefore continued. Current practice is unjustifiable and would not be sanctioned in other areas of public awarding (eg GCSE and A/AS).
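  The direction of the distortion follows from the asymmetry itself, as the following sketch shows. The marking-error model, threshold and re-mark window are all assumptions chosen for illustration; the only point being made is that re-marking scripts on one side of the boundary while leaving the other side unchecked pushes the pass rate upwards relative to an even-handed procedure.

```python
# Illustrative model of one-sided "borderlining": only scripts just below the
# threshold are re-marked, so marking errors are corrected in one direction only.
# Error model, threshold and re-mark window are assumptions for demonstration.
import random

random.seed(3)
CUT, WINDOW, N = 48, 3, 200_000

def simulate(two_sided):
    passed = 0
    for _ in range(N):
        true = min(120, max(0, random.gauss(60, 20)))
        awarded = true + random.gauss(0, 2)           # marking error
        near_below = CUT - WINDOW <= awarded < CUT
        near_above = CUT <= awarded < CUT + WINDOW
        if near_below or (two_sided and near_above):
            awarded = true                            # re-mark corrects the error
        passed += awarded >= CUT
    return 100 * passed / N

print(f"one-sided borderlining: {simulate(False):.2f}% reach the level")
print(f"two-sided re-marking  : {simulate(True):.2f}% reach the level")
```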

  It has now been agreed between QCA and the DfES that borderlining will be removed in 2008, when the marking contract passes from Pearson to the American ETS organisation. At this point, a recalibration of standards could be effected to mask the effect of the correction, and this standard carried forward; alternatively, a clear declaration could be made on how the removal of borderlining affects the "fairness" of the test and has resulted in a change in the numbers attaining a given level. First identified by Quinlan and Scharaskin in 1999, this issue has been a long-running systemic problem. Again, it is a change in practice (alongside changes in the accessibility of the tests, in the inclusion of mental arithmetic, etc) which compromises the ability of the tests to track change in attainment standards over time.

4.  Fluctuations in Science at KS3

  At Levels 6 and above, standards of attainment have moved up and down in an implausible fashion:


  (% of children gaining Levels 6 and 7)

  2001    33
  2002    34
  2003    40
  2004    35
  2005    37


  The movement over the period 2002-04 has involved a six percentage point increase (2002 to 2003) followed by a five point decrease (2003 to 2004): a swing of 11 percentage points in two years. This is implausible, and points to problems in the tests and the level setting, not to a real change in underlying standards or in the cohort taking the tests. Significantly, when interviewed on causes, officials and officers gave very different explanations for the effect; in other words, the true cause has not been established with precision.

5.  The Massey Report and Tymms' analysis

  The Massey report used a highly robust method to triangulate national tests 1996-2001 and yields solid evidence that attainment standards rose over that period, but not in all subjects and all key stages to the extent that has been argued by DfES and ministers. Tymms' less robust method and research synthesis suggests broadly the same. Massey made a series of recommendations, some of which have been adopted by QCA, such as equating against a number of years' tests and not just the preceding year. However, a consistent triangulation method has not been put in place, and the Massey recommendation that standards should be held for five years and then publicly recalibrated has not been adopted.

6.  Ofsted's over-dependence on national test outcomes

  The new Ofsted inspection regime is far more dependent on the use of national assessment data than previously. This delivers putative economies, since Ofsted feels it can better identify problematic and successful schools, and can use the data to target areas within schools (eg weak maths departments, poor science, etc). The revised regime is broadly welcomed by schools, and has a sound emphasis on each school delivering on its stated policies. But the regime fails to acknowledge the weaknesses of the data which lie at the heart of the pre-inspection reports, and which guide Ofsted on the performance of schools. The greatly increased structural dependence on data which is far less accurate than is implied is problematic. The new regime delivers some valuable functions, but the misapprehension of the real technical rigour of the assessment data is a very serious flaw in the arrangements.

7.  Assessment overload accusations whilst using many other non-statutory tests

  This is an interesting phenomenon: the optional tests are liked, while the statutory tests are frequently disliked (QCA). KS2 scores are mistrusted (ATL). The use of "commercial" CAT tests and CEM's tests (MIDYIS etc) is widespread. CAT scores are trusted by teachers because the results are more stable over time in comparison with national curriculum tests; this reflects the different purposes of the respective instruments. Children say "I did SATs today" when they do a statutory Key Stage test. They also frequently say it when they have taken a CAT test. There is widespread misunderstanding of the purpose of the range of tests which are used. QCA was lobbied over a five-year period to produce guidance on the function of different tests, not least to clarify the exact purpose of national testing. However, no such guidance has been produced. As a result, the arguments regarding "over-testing" are extremely confused, and muddy the waters in respect of policy.

8.  Is the timing right?

  Changing the timing of the tests would require a change in primary legislation. However, it is an enhancement of testing which should be considered very seriously. In the final report of the Assessment Review Group in Wales, Daugherty (2004) recommends that "serious consideration should be given to changing the timing of Key Stage 3 statutory assessment so that it is completed no later than the middle of the second term of Year 9". The Group believed the current timing to be unhelpful in relation to the choices the assessment could, in principle, inform, noting that "one source of information that would be of use potentially to pupils and their parents is not available until after the choice of pathway for Year 10 and beyond has been made". There are also implications for the potential use of Key Stage 1 and 2 data for transition between phases. "School ownership" (taking the outcomes very seriously in managing learning) would be likely to increase with such a re-scheduling of the tests.

9.  The reliability of teacher assessment

  Particularly in the high stakes context of performance tables, we feel that relying on teacher assessment, as currently operated, is not a robust option. Work in 2000 by the QCA Research Team showed a completely unstable relationship between teacher assessment and test scores over time at school level. This is compelling evidence against an over-dependence on teacher assessment. There are means of delivering moderated teacher assessment for reporting to parents, and of bolstering accountability not by testing but by regional inspection based on high expectations and school improvement models (see recommendations below). National standards in underlying attainment could be monitored through a light sampling model (with matrix sampling to cover all key content of the national curriculum). This would enable a valid answer to the ministerial question " . . . nationally, what's happening to standards in English?".
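  A minimal sketch of what such a matrix (light) sampling design might look like follows. The domain names, booklet structure, sample size and scores are all hypothetical assumptions used only to show the principle: each sampled pupil sits one short booklet, yet the booklets collectively cover the whole curriculum, so national estimates can be produced for every domain without testing every child on everything.

```python
# Sketch of matrix (light) sampling for national monitoring. Domain names,
# booklet structure, sample size and the stand-in scores are hypothetical.
import random

random.seed(4)
DOMAINS = ["reading", "writing", "number", "shape", "data"]       # hypothetical
BOOKLETS = [DOMAINS[i:i + 2] for i in range(len(DOMAINS) - 1)]    # short, overlapping booklets

def national_estimate(sample_size=10_000):
    """Rotate booklets across a pupil sample, then pool results by domain."""
    scores = {d: [] for d in DOMAINS}
    for i in range(sample_size):
        booklet = BOOKLETS[i % len(BOOKLETS)]        # each pupil sits one booklet only
        for domain in booklet:
            scores[domain].append(random.gauss(0.6, 0.15))   # stand-in for a real score
    return {d: round(sum(v) / len(v), 3) for d, v in scores.items()}

print(national_estimate())
```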

10.  Teaching to the test

  The lobbying by Baroness Professor Susan Greenfield and eminent colleagues is merely the most recent critique of the problems of teaching to the test. The "Texas Test Effect" (Wiliam, Oates) is well known but poorly presented to Government. Bill Boyle's work (CFAS) is the latest empirical study of the adverse effects of teaching to the test and of its almost universal domination of educational purposes in the English school system. It is a very serious issue, and it may be one significant factor (not the sole one) lying behind the "plateau effect" associated with the majority of innovations such as the Primary Literacy and Numeracy Strategies. In other words, a succession of well-intended and seemingly robust initiatives has repeatedly run out of steam.



 
