Select Committee on Education and Skills Minutes of Evidence

Further supplementary memorandum from the Qualifications and Curriculum Authority

  At the Select Committee's 15 May oral evidence session with QCA, you asked if I would provide a more detailed commentary on Professor Tymms's submission to you about the key stage 2 English tests. I am now pleased to enclose an analysis of the points that Professor Tymms raised.

  For ease of reference, the enclosed analysis seeks to summarise each of the questions contained within Professor Tymms's letter and then provides our comments and response. Where there is a need to refer to particular key stage subjects and their tests to support a point, the analysis commonly refers to key stage 2 English and mathematics, given that it is in relation to these two subjects that concerns are most frequently voiced.

  I should like to reinforce some of the points that I made at the Select Committee hearing.

  There are no absolutely perfect techniques to measure pupils' attainments. The key issues, as we perceive them, are whether QCA's tests are fit for the purpose of measuring pupils' attainment against the national curriculum, and whether they and the procedures for marking and level setting are as high in quality as they can be. QCA seeks to use all available techniques to address these aims: the professional judgement of subject specialists, pre-tests, anchor tests, test scrutiny, statistical comparisons and script scrutiny. Our work means we have extensive experience of dealing with the major technical issues involved, and we are investigating key problems such as pre-test effects. Most importantly, QCA processes do not peg standards in one year solely to the previous year; we reference tightly to an absolute standard set by the national curriculum and carried through all of our tests since 1996.

  There is a second general issue to do with the use of the tests to measure how well pupils have been taught the national curriculum. As I explained, we know of three distinct methods of national testing. Professor Tymms argues for tests which seek to measure underlying ability. Professor Dylan Wiliam argues that the country should set much greater store by teachers' own judgements about their pupils, which he believes provide the most reliable method of recording what an individual pupil knows and can do. Then there is the QCA system of measuring pupils' attainments against the standards of the national curriculum subjects that they have been taught by their teachers. These are three very different systems. The procedures that might work well for tests of underlying ability, such as those used by Professor Tymms, are not appropriate for national curriculum tests.

  I hope that this more detailed analysis is helpful to you and Select Committee colleagues. You will also find enclosed, for reference, copies of our Standards Reports, which we send to schools each year to provide detailed feedback on test performance in the previous May.


  Professor Tymms raises a number of questions about QCA's national curriculum testing arrangements, in particular those for testing English in the national curriculum at the end of key stage 2. This paper summarises the questions raised and provides comment and response from QCA.

Question 1

  Does QCA believe that standards have risen at the end of key stage 2 over the last seven years?

  Yes. QCA is confident that standards have risen. Pupils are now taught the skills, knowledge and understanding set out in the national curriculum better than in 1995 and 1996.

  Our evidence to support this statement comes from a number of sources.

    —  HMI's annual reports, including the most recent report by Her Majesty's Chief Inspector of Schools, Standards and Quality in Education 2000/01 (Feb 2002), have for several years now reported improvements in the quality of teaching and reductions in the proportions of unsatisfactory lessons. In meetings with QCA subject officers, HMI have referred to evidence, from their surveys of subject teaching, of improved teaching of reading in key stages 1 and 2; of improvements in the teaching of Shakespeare; and of improvements in areas of science and mathematics.

    —  Each year QCA conducts its own monitoring activities, as well as carrying out a detailed analysis of pupils' test scripts, drawn from a national sample of schools. This latter analysis in particular shows that pupils' performance in English, mathematics and science has improved since 1996. In addition, it shows that pupils are now clearly better at responding to questions in the key stage 2 English reading test which ask about authorial intentions; they are now much better able to answer mental arithmetic questions at key stage 2; they are also better at showing their working and at presenting graphical information; and in key stage 2 science tests, they can now respond very well to questions that ask them about science experiments involving two variables. QCA, therefore, has detailed information about precisely where pupils' improvements have come in each subject. QCA provides this detailed sort of information about test performance each year to the literacy and numeracy strategy teams and to schools through our Standards Reports.

    —  Each year, QCA carries out an analysis of writing on a sample of scripts at each national curriculum level. At each level, against criteria that remain the same each year, the analysis examines the features of the writing that are characteristic of the sample scripts. The analysis has shown that patterns of performance for each level are similar each year. This is further evidence to demonstrate that the expectations (ie the standards) for achievement of the levels have remained constant.

Question 2

  Do the results reflected in those [key stage 2] tests give a general indication that the basic skills of pupils finishing primary schools are higher than they were six or seven years ago?

  Yes. The evidence referred to in response to the previous question demonstrates the improvements in pupils' skills and their ability to show, through the tests, the knowledge, skills and understanding that are required by the national curriculum. The annual Standards Reports provide considerable detail to schools on test performance in relation to particular aspects of the required curriculum.

Question 3 (a)

  Is it the case that in setting the cut-scores for the levels at the end of key stage 2 every year QCA attempts to keep the standard to that of the previous year?

  QCA does seek to maintain standards from year to year. However, it is not accurate to say that our procedures simply link standards in one year to the previous year. Each summer, QCA convenes committees to take final decisions about the number of marks needed for the award of each level covered by the tests. To ensure that those decisions are fair, professional, and maintain standards over time, QCA:

    —  has statistical procedures which anchor the standard of the new year's tests against the standards used every year since 1996;

    —  uses archive scripts from 1996 and subsequent years to ensure that the assessment experts make their recommendations drawing on the standard of pupils' work from a number of years;

    —  draws on test development teams (with their own considerable expertise and their research and statistical experts), which are largely stable and constant from year to year;

    —  invites an external expert each year to offer independent advice on procedures, the statistics and the decisions being made. In the past QCA has invited Professor Tom Christie of the University of Manchester, Professor Dylan Wiliam of King's College and Dr Gordon Stobart of the Institute of Education. This year Mr Alf Massey from UCLES in Cambridge has been invited as the independent expert.

  In addition, there is a wide range of measurement techniques available to QCA and others engaged in the process of setting and maintaining standards for tests. The organisation is continually investigating and conducting research with the aim of improving procedures wherever possible.

Question 3 (b)

  Why have QCA's own insiders been critical of the standard-setting procedures used by QCA (Quinlan and Scharaschkin, 1999)?

  There is a range of views amongst test developers and educational statisticians about what maintaining test standards involves and how best to achieve it. Part of QCA's responsibility is to be aware of that range of views and to ensure the organisation designs procedures having fully discussed the issues.

  The work carried out by Quinlan and Scharaschkin as QCA researchers is an example of internal research that aims to take a critical look at existing procedures. Their work examined and commented on various refinements to arrangements, which have now been implemented. In the main their work was designed to examine what more QCA should be doing to ensure that procedures are beyond reproach. They made a number of recommendations, the most significant of which were:

    —  The statistical methods used to obtain a set of test results early each summer for a representative sample of pupils should be improved. This has been done.

    —  The use of statistics in level setting should be enhanced—this has been put in place.

    —  The conduct of key meetings should be better codified and documented—in terms of who attends, what documentation is produced, and how advice is provided. This has been done and QCA is now drawing up a detailed Code of Practice to cover this area of our work.

  QCA does not rely solely on its own internal experts to provide advice on procedures. The independent review of A level standards, Maintaining GCE A level standards (January 2002), carried out by Professor Eva Baker, is a recent example of an external review. Similarly, QCA also commissions research on the national curriculum tests. Much of this work takes the form of focused investigations, but the organisation also commissions longer-term research into the issue of maintaining standards. The outcomes of this independent research work are reviewed by a group of independent researchers who form QCA's Advisory Group for Research into Assessment and Qualifications. Some of the research has identified changes that could be introduced but all of the work has concluded that QCA has a robust testing system.

  A further point can be made about QCA's internal research. The Rose Panel, in their report Weighing the Baby (July 1999), drew attention to the fact that QCA's test developers see their role as including a responsibility for improving the quality of the tests where there is evidence to suggest this is needed and where it is clear what should be done. Since 1996, QCA has sought to make improvements to all of the tests, including the key stage 2 English tests. For example, QCA has made changes to make lines of questioning clearer for pupils, and the layout of questions and artwork has been improved. All of these changes have been made in order to improve the tests for pupils. It is important to ensure that the tests are good enough to enable all pupils to show what they know and can do. This work has been most apparent in key stage 2 English and these tests are now much more reliable in enabling pupils to demonstrate their learning.

Question 3 (c)

  Does QCA agree that its procedures are bound to lead to drift?

  No, for the reasons outlined above and summarised as follows.

    —  QCA's procedures provide an anchor over several years, not just to the previous year.

    —  QCA evaluates and monitors the tests and the procedures associated with them.

    —  QCA draws on the advice of leading experts to assure the quality of its procedures and to provide long-term strategic advice.

Question 4

  Is QCA unable to maintain standards over time?

  As the responses to the earlier questions show, QCA is able to and does maintain standards successfully.

  In summary the procedures in place to assure standards over time include:

    —  pre-tests, with anchor tests designed to compare each new test against a common underpinning standard, measured year after year by the anchor test;

    —  script scrutiny meetings in which the most senior and most experienced markers review scripts from previous years (we generally select archive scripts from 3 years, at random), and in which the markers seek to track the standard shown in the archive scripts over a number of years, into the new year's test;

    —  'Angoff'[1] meetings in which we ask teachers to review the difficulty of the tests and to evaluate their views using the national curriculum level descriptions;

    —  statistical reviews of national results, drawn from a sample of schools nationally.

  The evidence from each of these sources is considered annually by the level setting committee, which includes QCA, the test development teams, the marking teams and external observers, including an invited independent expert.

Question 5

  If the cut-off score were shifted by one mark on the test, how much difference would this make to the proportion of pupils getting a level 4 across the country?

  As Professor Tymms indicates, the decisions made by QCA are significant for pupils and their schools. At the level 4 boundary for key stage 2 English and mathematics, the proportion of pupils affected if a threshold were to have been moved by a mark in either direction in 2001 was between one and two per cent.
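  The sensitivity described above can be illustrated with a short sketch. The mark distribution below is invented purely for illustration (it is not QCA data), and the function names are hypothetical. The point is only that moving the threshold by one mark changes the proportion at or above it by the share of pupils who scored exactly on the adjacent mark.

```python
def proportion_at_or_above(marks, cut_score):
    """Share of pupils whose mark is at or above the cut score."""
    return sum(1 for m in marks if m >= cut_score) / len(marks)


def one_mark_sensitivity(marks, cut_score):
    """Change in the proportion reaching the level if the cut score
    moves one mark down or one mark up (both returned as positive shares)."""
    base = proportion_at_or_above(marks, cut_score)
    gain_if_lowered = proportion_at_or_above(marks, cut_score - 1) - base
    loss_if_raised = base - proportion_at_or_above(marks, cut_score + 1)
    return gain_if_lowered, loss_if_raised


# Hypothetical marks out of 100 for 1,000 pupils, spread evenly over 30-89.
# Illustrative only: real mark distributions are not uniform.
marks = [30 + (i * 7) % 60 for i in range(1000)]

gain, loss = one_mark_sensitivity(marks, cut_score=50)
print(f"lower by one mark: +{gain:.1%}; raise by one mark: -{loss:.1%}")
```

  With a real distribution the two figures would generally differ, since they depend on how many pupils sit on each mark either side of the threshold.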

Question 6

  Approximately what range of marks is considered during the discussion when cut-off scores for level 4 are being decided?

  It is possible for the range of marks considered to vary from year to year and from one subject to another. QCA's internal procedures do not prescribe a pre-specified range that should be considered each year when setting cut-off scores. However, the issue is more complex than Professor Tymms's submission suggests.

  For each pre-test that QCA conducts, there are a number of different statistical measures that can be used to equate the standard in the new test to the previous standard, the anchor test and the previous years' tests. The reliability of each of these measures depends on the nature of each of the tests, as well as on the number of pupils involved in the pre-test. However, with only very rare exceptions each statistical measure generally agrees to within one or two marks with the others. There is also clear advice from the test developers' statisticians about which of the statistical measures should be used. There is not a process of splitting the difference.

  In the script scrutiny process there is a prescribed mark-range that the markers are required to investigate. This varies for good reasons from subject to subject but is always at least nine marks. The variation occurs as a result of the nature of the tests and their tiering arrangements[2]. Markers are required to determine within that range which scripts are definitely below the standard of the level, which above it and which on the standard from previous years. This process typically enables the markers to identify a provisional range of two to three marks around a potential cut-off score; they are then required to discuss and evaluate their differences of view (each marker will be clear in their own views which mark best reflects the standard of previous years, but there is generally some disagreement initially around the table about which mark this should be) and make a single recommendation to QCA.

  The 'Angoff' teachers' judgmental process also derives a single mark recommendation.

  Taking these three procedures together, it is usual for there to be some limited difference in the marks that each process generates, which is why QCA operates several processes. In discussing the differences, which are typically of one or two marks, the level setting committee will sometimes identify one indicator as an outlier; that indicator is then treated as unreliable and not considered further.

  The level setting committee does not take a decision based on the average of the indicators arising from the different processes. The meeting weighs the evidence carefully and generally agrees which of the indicators is based on the most reliable evidence.

  Where the standard setting meeting considers that the range of marks to be considered is too wide or is otherwise reluctant to make a decision, the meeting can request further analysis—a re-run of the script scrutiny—before reconvening. This has only happened on two or three occasions across all of the testing system. A decision has then been taken at a reconvened meeting, having considered further evidence.

Question 7 (a)

  What is QCA's estimate of the margin of error on the decisions made about the proportions of pupils achieving level 4 or above at key stage 2?

  Professor Tymms's submission suggests that the margin of error is to be counted in relation to whole marks, that is, that QCA might equally well have selected a higher or lower mark for the level threshold.

  This is not QCA's view. Many of the measures that are considered in the level setting meetings are expressed to one decimal place. This enables QCA to be more confident than Professor Tymms's submission suggests, and certainly QCA would not suggest that the margin of error could be counted as high as 0.5 per cent.

Question 7 (b)

  If we saw, say, a 2 per cent rise in Level 4s in maths, how confident would [QCA] be that the rise represented a real increase in standards and was not within the margin of error?

  It would be unwise for anyone to claim a rise of 2 per cent as clear and unequivocal evidence of an increase in standards, unless there was a similar level of increase for a number of years consecutively. A rise in a single year of more than 2 per cent, or a rise of 1-2 per cent each year for three or more years, would provide unequivocal evidence of a real increase in standards of performance in QCA's view.
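  The rule of thumb in this response can be written down as a small check. This is a sketch of one reading of the paragraph, not a QCA procedure: it assumes the rises are expressed in percentage points, that "three or more years" means consecutive years, and that "1-2 per cent" means at least one point in each of those years. The function name is hypothetical.

```python
def is_unequivocal_rise(yearly_changes):
    """Apply the rule of thumb to year-on-year changes in the share of
    pupils at level 4 or above (percentage points, oldest year first)."""
    # A single-year rise of more than 2 points is treated as sufficient.
    if any(change > 2 for change in yearly_changes):
        return True
    # Otherwise look for three or more consecutive years that each rise
    # by at least 1 point (assumed reading of "1-2 per cent each year").
    run = 0
    for change in yearly_changes:
        run = run + 1 if change >= 1 else 0
        if run >= 3:
            return True
    return False


print(is_unequivocal_rise([2]))          # one year at exactly 2 points
print(is_unequivocal_rise([1, 1.5, 2]))  # three consecutive rises
```

  On this reading a single 2 per cent rise on its own does not meet the test, which matches the caution expressed in the first sentence of the response.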

Question 8 (a)

  How comfortable does QCA feel with next year's test being trialled the previous year?

  QCA is confident that this is the best possible way of pre-testing the tests. One alternative would be not to pre-test: this would mean that untried questions were placed in front of pupils, and the system of external marking could not be run as it is now or to the current tight timescale. Another alternative would be to pre-test in another country, but to be reliable the same curriculum would have to apply, together with broadly the same major educational initiatives. A third option would be to find small groups of pupils to take both the old test and the new test under pre-test conditions (ie, low stakes conditions, with no revision or preparation effects). In practical terms QCA has rejected this option as it would be impossible to determine whether pupils had prior sight of the old test (they are all in the public domain and used by teachers during year 6).

  QCA has concluded that the system of pre-testing currently used is as good as it can be.

Question 8 (b)

  Does QCA make a correction for age difference?

  No. The national curriculum tests do not seek to relate test performance to age, other than at key stage 1. Nor in practice are the age differences material (they are a matter of weeks, not terms).

  Moreover, the pre-test results are not the final data that inform the decisions about where the level thresholds need to be set. They provide the first evidence which is then considered over a 12-month period alongside the other measures described elsewhere in this commentary.

Question 8 (c)

  How does QCA deal with the fact that the next year's test acts as a dress rehearsal for the actual end of key stage test?

  QCA's advice to all schools involved in either the first or second pre-test is to consider providing pupils with the opportunity to practise a previous year's test, so that pupils may be familiar with the requirements of their tests. A pre-test in this sense is therefore just good practice for schools.

  QCA also measures something termed the 'pre-test effect'. This effect has been measured accurately in past years, and the historical data are then used to estimate the effect on the new test. Should that estimate prove inaccurate, this would be clear at the level setting meetings when actual test performance statistics are made available. In terms of QCA procedures these factors do not affect final decisions.

Question 9 (a)

  Has QCA used the same anchor test for each of the past seven years?

  In respect of key stage 2 English, there are two components to the test: a Reading Test and a Writing Test. Each is worth 50 marks out of a total of 100 marks. The Writing Test is absolutely anchored to 1996. This is achieved because the form of the mark scheme for the test has remained stable since 1995. The mark scheme uses identical criteria, year on year, adapted to the range of tasks set in the test.

  For the Reading Test QCA uses a short reading test as an anchor. This was developed in 1995 and has been used in all pre-test administrations since. Pupils involved in a pre-test take their actual statutory key stage 2 test, the anchor test, and either the new reading test or the new writing test.

  For key stage 2 mathematics the anchor test procedures need to be rather different. Here the approach is to select six of the most technically robust items used in previous years' tests and to 'seed' those six items in every pre-test booklet. (A 'robust' item, in this respect, is one that performs very consistently, showing a good match with pupils' total scores on the tests.) This enables QCA's statisticians to use a sound statistical procedure (the most common is a highly technical procedure known as item response theory (IRT) analysis) to compare all new items against the standards of all previous years' tests. Where there is evidence to suggest the need, items are retired from the bank of six anchor items and replaced with more robust items. In practice, one item is retired every other year, giving a largely stable base to the anchor test over time.
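  The notion of a 'robust' item, one that shows a good match with pupils' total scores, can be illustrated with a classical item-total correlation. This is a simplified stand-in chosen for illustration, not the IRT analysis referred to above, and the response data and function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def item_total_correlations(responses):
    """Correlate each item's scores with pupils' total scores.

    responses: one row per pupil, one 0/1 score per item.
    The total includes the item itself (an uncorrected item-total figure).
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[j] for row in responses], totals)
            for j in range(n_items)]


# Hypothetical responses from six pupils to three items (illustrative only).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

correlations = item_total_correlations(responses)
for index, r in enumerate(correlations, start=1):
    print(f"item {index}: item-total correlation {r:.2f}")
```

  An item whose correlation stays high and stable across years would, on this simplified view, be a candidate for the anchor set, while one whose correlation weakens would be a candidate for retirement.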

Question 9 (b)

  Is QCA prepared to share the anchor test data?

  As explained earlier in this commentary, the anchor test is just one piece of evidence considered when setting levels.

  Data from the anchor test are used in different ways by each test development agency, depending on the subject. For example, the Rasch model of statistical equating uses the data from the anchor test as an integral part of the statistical analysis process. However, Rasch is not appropriate for all subjects and is not used by all test developers. Item response theory (IRT), used in mathematics for example, uses the anchor test data in a different way from the Rasch process. Providing the anchor test data out of context, therefore, would not be helpful.

  In English at Key Stage 2, the raw data emerging from the anchor test are subject to statistical methods that equate outcomes on the anchor test with outcomes on the live national curriculum test. In 2002, there was a high correlation of 0.82 between the data from the anchor test and the data from the live national curriculum test.
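  To give a sense of what equating one test to another involves, the sketch below uses simple linear equating for pupils who sat both tests: an anchor score is mapped to the live-test score that lies the same number of standard deviations from the mean. The memorandum does not say which equating method is used for English, so this choice is an assumption, and the paired scores and function name are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(anchor_scores, live_scores):
    """Return a function mapping an anchor-test score onto the live-test
    scale by matching the means and standard deviations of the two tests."""
    mean_anchor, sd_anchor = mean(anchor_scores), pstdev(anchor_scores)
    mean_live, sd_live = mean(live_scores), pstdev(live_scores)
    slope = sd_live / sd_anchor
    return lambda score: mean_live + slope * (score - mean_anchor)


# Hypothetical paired scores for five pupils (illustrative only).
anchor = [10, 12, 15, 18, 20]
live = [40, 46, 52, 60, 62]

to_live_scale = linear_equate(anchor, live)
print(f"anchor score 15 maps to about {to_live_scale(15):.1f} on the live test")
```

  A mapping of this kind is more trustworthy the more closely the two sets of scores track each other, which is why the strength of the correlation between anchor and live results is relevant.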

QCA, June 2002

1   Angoff describes generally a set of procedures applicable in education and other professions through which the views of the professionals (in this case teachers) about standards are sought.

2   For example, mathematics at key stage 3 has papers targeting levels 3-5; 4-6; 5-7 and 6-8 of the national curriculum in mathematics.


© Parliamentary copyright 2002
Prepared 24 September 2002