Select Committee on Children, Schools and Families Written Evidence

Memorandum submitted by Cambridge Assessment

  Cambridge Assessment is Europe's largest assessment agency and plays a leading role in researching, developing and delivering assessment across the globe. It is a department of the University of Cambridge and a not-for-profit organisation with a turnover of around £175 million. The Group employs around 1,400 people and contracts some 15,000 examiners each year.

  Cambridge Assessment's portfolio of activities includes world-leading research, ground-breaking new developments and career enhancement for assessment professionals. Public examinations and tests are delivered around the globe through our three highly respected examining bodies.

  The assessment providers in the Group include:

    University of Cambridge English for Speakers of Other Languages (Cambridge ESOL)

    Tests and qualifications from Cambridge ESOL are taken by over 1.75 million people, in 135 countries. Cambridge ESOL's Teaching Awards provide a route into the English Language Teaching profession for new teachers and first class career development opportunities for experienced teachers. Cambridge ESOL works with a number of governments in the field of language and immigration.

    University of Cambridge International Examinations (CIE)

    CIE is the world's largest provider of international qualifications for 14-19 year-olds. CIE qualifications are available in over 150 countries. CIE works directly with a number of governments to provide qualifications, training and system renewal.


    The Oxford Cambridge and RSA awarding body (OCR)

    OCR provides general academic qualifications and vocational qualifications for learners of all ages through 13,000 schools, colleges and other institutions. It is one of the three main awarding bodies for school qualifications in England.


    The Assessment Research and Development division (ARD)

    ARD supports development and evaluation work across the Cambridge Assessment group and administers a range of admissions tests for entry to Higher Education. The ARD includes the Psychometrics Centre, a provider and developer of psychometric tests.


  1.  At Cambridge Assessment we recognise that it is vital not to approach assessment on a piecemeal basis. The education system is exactly that: a system. Later experiences of learners are conditioned by earlier ones; different elements of the system may be experienced by learners as contrasting and contradictory; discontinuities between elements in the system (eg transition from primary to secondary education) may be very challenging to learners.

  2.  Whilst understanding the system as a system is important, we believe that the current focus on 14-19 developments (particularly the development of the Diplomas and post-Tomlinson strategy) can all too readily take attention away from the serious problems which are present in 5-14 national assessment.

  3.  Our evidence tends to focus on assessment issues. This is central to our organisation's functions and expertise. However, we are most anxious to ensure that assessment serves key functions in terms of supporting effective learning (formative functions) and progression (summative functions). Both should be supported by effective assessment.

  4.  We welcome the framing of the Committee's terms of reference for this Inquiry, which make it clear that it intends to treat these two areas as substantially discrete. Cambridge Assessment's qualifications deliverers (exam boards), OCR and University of Cambridge International Examinations, have tendered evidence separately to this submission. They have looked chiefly at 14-19 qualifications.

  5.  This particular submission falls into two sections. The first presents Cambridge Assessment's views on the national assessment framework (for children aged 5-14). These are informed by, but not necessarily limited to, the work which we carried out throughout 2006 in partnership with the Institute for Public Policy Research (IPPR) and the substantial expertise within the Group of those who have worked on national tests.

  6.  The second section is on University Entrance Tests. Cambridge Assessment has been involved in the development of these for nearly a decade and draws on a research base that stretches back even further. At first their scope was limited to Cambridge University, but over the last four years it has grown to include many other institutions. That they are administered under Cambridge Assessment's auspices (as opposed to those of one of our exam boards) reflects their roots within our research faculty and the non-statutory nature of the tests themselves.



  7.  In this section we have sought to outline the problems that have built up around the national assessment arrangements. We then discuss the changes proposed in our work with the IPPR, as well as the problems that appear to be inherent in the `Making Progress' model that the Government is committed to trialling. Our conclusion is that a window of opportunity stands open at the present time (one of the reasons the Committee's Inquiry is so timely) and that the Government should not close it with the dangerous haste it seems bent on. There is a range of options, and to pursue only one would be a serious mistake.

  8.  We have included two Annexes:

    —  an overview of the evidence on national assessment dealing with questions ranging from "teaching to the test" to "measurement error"; and

    —  a brief discussion of why the sometimes mooted return of the APU might not deliver all the objectives desired of it.


  9.  It is important to note that Cambridge Assessment is highly supportive of the principle of a National Curriculum and related national assessment. The concept of "entitlement" at the heart of the National Curriculum has been vital to raising achievement overall; raising the attainment of specific groups (eg females in respect of maths and science); and ensuring breadth and balance. We recognise that enormous effort has been put in, by officials and developers, to improving the tests in successive years. We support the general sentiment of the Rose Review—that the system has some strong characteristics—but it is clear that deep structural problems have built up over time.

  10.  Whilst being concerned over these problems, Cambridge Assessment is committed to the key functions supported by national assessment: provision of information for formative and diagnostic purposes to pupils, teachers and parents; information on national standards, and accountability at school level. We return to these key functions in more detail below. However, Cambridge Assessment is critical of the way in which national assessment has been progressively and successively elaborated into a system which appears to be yielding too many serious and systemic problems.

Accumulating problems in National Assessment—a vessel full to bursting point?

  11.  There are two particularly significant problems in the highly sensitive area of technical development of national assessment arrangements. Firstly, previous statements by agencies, departments and Government have exaggerated the technical rigour of national assessment. Thus any attempts to describe its technical character more accurately run the risk of undermining both departments and ministers: " . . . if you're saying this now, how is it that you said that, two years ago . . .". This prevents rational debate of problems and scientifically-founded development of arrangements.

Secondly, as each critique has become public, the tendency is to breathe a sigh of relief as the press storm abates; each report is literally or metaphorically placed in a locked cupboard and forgotten.

  12.  In contrast, we have attempted here to take all relevant evidence and integrate it, synthesising it in such a way that underlying problems and tendencies can accurately be appraised, with the intention of ensuring effective evaluation and refinement of systems.

  14.  Put simply, if a minister asks a sensible question: " . . . are attainment standards in English going up or down and by how much? . . ." there is no means of delivering a valid and sound response to that question using current arrangements. This is a serious problem for policy formation and system management. It is not a position which obtains in systems which use independent light sampling methods such as the US NAEP (National Assessment of Educational Progress).


  15.  Current national curriculum assessment arrangements within England have attracted increasing criticism in respect of the extent to which they are carrying too many purposes (Brooks R & Tough S; Bell J et al; Daugherty R et al). Since 1988 a substantial set of overt and tacit functions has been added. The original purposes specified in the TGAT Report (Task Group on Assessment and Testing) comprised:

    1.  formative (diagnostic for pupils; diagnostic for teachers);

    2.  summative (feedback for pupils and parents);

    3.  evaluative (providing information at LEA and school level); and

    4.  informative (providing information on educational standards at system level).

  16.  The following have been added, as increasingly elaborated uses of the flow of detailed data from national assessment:

    —  school accountability;

    —  departmental accountability;

    —  apportionment of funds;

    —  inspection patterns and actions;

    —  upwards pressure on standards/target setting;

    —  structuring of educational markets and school choice;

    —  emphasis of specific curriculum elements and approaches;

    —  detailed tracking of individual attainment, strengths and weaknesses; and

    —  quantification of progress.

  17.  Unsurprisingly, many educationalists have expressed the view that the current tests carry too many functions and that the underlying management processes are too elaborated. To carry this broad range of functions, the system of assessing every child at the end of each Key Stage is dependent on maintaining test standards over time in a way which is in fact not practical.

  18.  If you want to measure change, don't change the measure. But the nation does—and should—change and update the National Curriculum regularly. Whenever there is change (and sometimes radical overhaul) the maintenance of test standards becomes a particularly acute problem. It remains, of course, a constant problem in areas such as English Literature, where one could be pretesting a test on Macbeth which will be taken in 2008 while the pupils sitting the pretest are currently studying As You Like It. There are remedies to some of the problems this creates: switching to different sampling processes, announcing radical recalibration, or switching to low-stakes sampling of children's performance using a NAEP or a modernised APU-style model (Assessment of Performance Unit—see Annexe 2).

  19.  Attempting to use national assessment to measure trends over time has produced some of the most intense tensions amongst the set of functions now attached to national testing. Stability in the instruments is one of the strongest recommendations emerging from projects designed to monitor standards over time. Running counter to this, QCA and the DfES have—in line with commitments to high quality educational provision, the standards agenda and responses from review and evaluation processes—sought to optimise the National Curriculum by successive revision of content, increasing the "accessibility of tests", and ensuring tight linkage of the tests to specific curriculum content.

  20.  These are laudable aims—and the emphasis on the diagnostic function of the data from tests has been increasing in recent innovations in testing arrangements. However, pursuit of these aims has led to repeated revision rather than stability in the tests. The Massey Report suggested that if maintenance of standards over time remained a key operational aim, then stability in the test content was imperative (Massey A et al). In the face of these tensions, a light sampling survey method would enable de-coupling of national assessment from a requirement to deliver robust information on national educational standards. This would enable testing to reflect curriculum change with precision, to optimise the learning-focused functions of testing, and to enable constant innovation in the form of tests to optimise accessibility.

  21.  It is therefore clear that the current functions of national testing arrangements are in acute and chronic tension. Using the pragmatic argument that "every policy should have a policy instrument" we conclude that national arrangements should indeed support school accountability and improvement, report to parents and monitor national standards but that a change of arrangements is required to achieve this. A range of approaches are necessary to deliver these functions and we outline some viable options below.



  22.  There is a need to conceptualise a number of possible models for consideration in an attempt to address the problems of "multipurpose testing". It is vital to note that we present here three alternatives. We do this to show that there are credible alternatives for delivering on the key objectives of national assessment—it is simply not the case that there is only one way of moving forward.

  23.  We believe the aims should be to:

    —  reduce the assessment burden on schools;

    —  provide formative assessment for teaching and learning;

    —  provide information for school accountability; and

    —  provide information on national standards.

  24.  In order to secure widespread support within the education community (including parents) a firm re-statement of educational purpose (values) and a commitment to high degrees of validity is essential. It is not enough to initiate changes merely because of concerns about the defects of existing arrangements. We do not here outline values and validity in detail, but recognise that this is a vital precondition of designing revised arrangements, putting them in place, and monitoring their operation. It is important that a full discussion of these matters precedes any executive decision regarding revised arrangements.

Alternative models for national assessment

Model 1:  Validity in monitoring plus accountability to school level

  25.  The aim of this approach is to collect data using a national monitoring survey and to use this data for monitoring standards over time as well as for moderation of teacher assessment. This would enable school performance to be measured for accountability purposes and would involve a special kind of criterion referencing known as domain referencing.

  26.  Question banks would be created based on the curriculum, with each measure focusing on a defined domain. A sample of questions would be taken from the bank and divided into many small testlets (smaller than the current Key Stage tests). These would then be randomly allocated to the candidates in each school. Every question is therefore attempted by thousands of candidates, so the summary statistics are very accurate, and summary statistics are available for a large sample of questions. This means that for a particular year it is known, for example, that on average candidates can obtain 50% of the marks in domain Y.

  27.  The following year it might be found that they obtain 55% of the marks in that domain. The change is therefore measured directly, and no judgement about relative year-on-year test difficulty is required. Nor is there a need for a complex statistical model for analysing the data; modelling would be required to calculate the standard errors of the statistics reported, but with the correct design these errors would be negligible, making such modelling largely superfluous. It would be possible to use a preliminary survey to link domains to existing levels, and the issue of changing items over time could be solved by chaining, ie making comparisons based on common items between years. Although each testlet would be an unreliable measure in itself, it would be possible to assign levels to marks using a statistical method once an overall analysis had been carried out. The average of the testlet scores would be a good measure of a school's performance, given that there are sufficient candidates in the school. The appropriate number of candidates would need to be investigated.
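  The domain-referencing arithmetic described above can be illustrated with a small simulation. This is a hypothetical sketch, not the proposed system: the item bank, facility values and sample sizes are invented for illustration, and pupils are modelled as answering each item with probability equal to its facility.

```python
import random

random.seed(42)

# Hypothetical item bank for one domain: each item has a "true" facility,
# ie the proportion of pupils nationally expected to answer it correctly.
bank = [random.uniform(0.3, 0.8) for _ in range(200)]

TESTLET_SIZE = 10
N_PUPILS = 5000

def take_testlet():
    """Randomly allocate one small testlet to a pupil and return its score."""
    testlet = random.sample(bank, TESTLET_SIZE)
    correct = sum(1 for facility in testlet if random.random() < facility)
    return correct / TESTLET_SIZE

# Every pupil sits a different random testlet; pooling across thousands of
# pupils estimates the average facility across the whole domain, with no
# judgement about relative test difficulty required.
scores = [take_testlet() for _ in range(N_PUPILS)]
estimated = sum(scores) / N_PUPILS
true_mean = sum(bank) / len(bank)

print(f"true domain facility:    {true_mean:.3f}")
print(f"estimated from testlets: {estimated:.3f}")
```

  Although each ten-item testlet is individually unreliable, the pooled estimate tracks the true domain facility closely, which is the property the model relies on.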

  28.  The survey data could also be used to moderate teacher assessment by asking the teacher to rank order the candidates and to assign a level to each of them. Teacher assessment levels would then be compared with testlet levels and the differences calculated. It would not be expected that the differences should be zero, but rather that the need for moderation should be determined by whether the differences cancel out or not. Work would need to be done to establish the levels of tolerance and the rules for applying this process would need to be agreed. The school could have the option of accepting the statistical moderation or going through a more formal moderation process.
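  The moderation check described above can be sketched as follows. The function name, level scale and tolerance are illustrative assumptions, not part of any agreed process; establishing real tolerances is exactly the work the paragraph says would need to be done.

```python
# Teacher-assessed levels vs testlet-derived levels for the same pupils.
# Moderation is flagged only when the differences fail to cancel out
# within an (illustrative) tolerance.

def needs_moderation(teacher_levels, testlet_levels, tolerance=0.25):
    """Return True if the mean teacher-minus-testlet difference exceeds
    the agreed tolerance, ie the differences do not cancel out."""
    diffs = [t - s for t, s in zip(teacher_levels, testlet_levels)]
    mean_diff = sum(diffs) / len(diffs)
    return abs(mean_diff) > tolerance

# Disagreements in both directions cancel out: no moderation needed.
print(needs_moderation([4, 5, 3, 4], [5, 4, 4, 3]))
# Systematic leniency of a whole level: moderation triggered.
print(needs_moderation([5, 5, 5, 5], [4, 4, 4, 4]))
```

  Note that individual disagreements are expected and tolerated; only a systematic drift in one direction would trigger the more formal moderation process.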

  29.  There would be a number of potential advantages related to this model. Validity would be increased as there would be greater curriculum coverage. The data would be more appropriate for the investigation of standards over time. The test development process would be less expensive as items could be re-used through an item bank, including past items from national curriculum tests. There would also be fewer problems with security related to "whole tests". No awarding meetings would be needed as the outcomes would be automatic and not judgemental. Since candidates would not be able to prepare for a specific paper the negative wash-back and narrowing of the curriculum would be eliminated (ie the potential elimination of "teaching to the test"). There would also be less pressure on the individual student since the tests would be low stakes.

  30.  Given that there are enough students in a school, the differences in question difficulty and pupil question interaction would average out to zero leaving only the mean of the pupil effects. From the data it would be possible to generate a range of reports eg equipercentiles and domain profiles. Reporting of domain profiles would address an issue raised by Tymms (2004) that "the official results deal with whole areas of the curriculum but the data suggests that standards have changed differently in different sub-areas".

  31.  Work would need to be done to overcome a number of potential disadvantages of the model. Transparency and perception would be important and stakeholders would need to be able to understand the model sufficiently to have confidence in the outcomes. This would be a particularly sensitive issue as students could be expected to take tests that prove to be too difficult or too easy for them. Some stratification of the tests according to difficulty and ability would alleviate this problem. There is an assumption that teachers can rank order students (Lamming D) and this would need to be explored. Applying the model to English would need further thought in order to accommodate the variations in task type and skills assessed that arise in that subject area.

  32.  Eventually the model would offer the possibility of reducing the assessment burden, though the burden would remain comparatively greater for the primary phase. Although security problems could be alleviated by using item banking, the impact of item re-use would need to be considered. Having items in the public domain would be novel for almost any important test in the UK (the driving test being a rare exception).

  33.  Discussion and research would be needed in a number of areas:

    —  values and validity;

    —  scale and scope eg number and age of candidates, regularity and timing of tests;

    —  formal development of the statistics model;

    —  simulation of data (based on APU science data initially);

    —  stratification of tests/students; and

    —  pilots and trials of any proposed system.

Model 2:  Validity in monitoring plus a switch to "school-improvement inspection"

  34.  Whilst the processes for equating standards over time have been enhanced since the production of the Massey Report, there remain significant issues relating to:

    —  teacher confidence in test outcomes;

    —  evidence of negative wash-back into learning approaches;

    —  over-interpretation of data at pupil group level; inferences of improvement or deterioration of performance not being robust due to small group size;

    —  ambiguity in policy regarding borderlining;

    —  no provision to implement Massey recommendations regarding keeping tests stable for five years and then "recalibrating" national standards; and

    —  publishing error figures for national tests.

  35.  In the face of these problems, it is attractive to adopt a low-stakes, matrix-based, light sampling survey of schools and pupils in order to offer intelligence to Government on underlying educational standards. With a matrix model underpinning the sampling frame, far wider coverage of the curriculum can be offered than with current national testing arrangements.

  36.  However, if used as a replacement for national testing of every child at the end of KS1, 2 and 3, then key functions of the existing system would not be delivered:

    —  data reporting, to parents, progress for every child at the end of each key stage; and

    —  school accountability measures.

  37.  In a system with a light sampling model for monitoring national standards, the first of these functions could be delivered through (i) moderated teacher assessment, combined with (ii) internal testing, or tests provided by external agencies and/or grouped schools arrangements. The DfES prototype work on assessment for learning could form national guidelines for (i) the overall purpose and framework for school assessment, and (ii) model processes. This framework of assessment policy would be central to the inspection framework used in school inspection.

  38.  The intention would be to give sensitive feedback to learners and parents, with the prime function of highlighting to parents how best to support their child's learning. Moderated teacher assessment has been proven to facilitate staff development and effective pedagogic practice. Arrangements could operate on a local or regional level, allowing transfer of practice from school to school.

  39.  The second of these functions could be delivered through a change in the Ofsted inspection model. A new framework would be required since the current framework is heavily dependent on national test data, with all the attendant problems of the error in the data and instability of standards over time. Inspection could operate through a new balance of regional/area inspection services and national inspection—inspection teams operating on a regional/area basis could be designated as "school improvement teams". To avoid competition between national and regional inspection, national inspections would be joint activities led by the national inspection service.

  40.  These revised arrangements would lead to increased frequency of inspection (including short-notice inspection) for individual schools and increased emphasis on advice and support to schools in respect of development and curriculum innovation. Inspection would continue to focus on creating high expectations, meeting learner needs, and ensuring progression and development.

Model 3:  Adaptive, on-demand testing using IT-based tests

  41.  In 2002, Bennett outlined a new world of adaptive, on-demand tests which could be delivered through machines. He suggested that "the incorporation of technology into assessment is inevitable because, as technology becomes intertwined with what and how students learn, the means we use to document achievement must keep pace". Bennett (2001) identifies a challenge: "to figure out how to design and deliver embedded assessment that provides instructional support and that globally summarises learning accomplishment". He is optimistic that "as we move assessment closer to instruction, we should eventually be able to adapt to the interests of the learner and to the particular strengths and weaknesses evident at any particular juncture . . .". This aligns with the commitments of Government to encourage rates of progression based on individual attainment and pace of learning rather than age-related testing.

  42.  In the Government's five year strategy for education and children's services (DfES, 2004) principles for reform included "personalisation and choice as well as flexibility and independence". The White Paper on 14-19 Education and Skills (2005) stated, "Our intention is to create an education system tailored to the needs of the individual pupil, in which young people are stretched to achieve, are more able to take qualifications as soon as they are ready, rather than at fixed times . . ." and "to provide a tailored programme for each young person and intensive personal guidance and support". These intentions are equally important in the context of national testing systems.

  43.  The process relies on item-banking, combining items in individual test sessions to feed to students a set of questions appropriate to their stage of learning and to their individual level of attainment. Frequent low-stakes assessments could allow coverage of the curriculum over a school year. Partial repetition in tests, whilst they are "homing in" on an appropriate testing level, would be useful as a means of checking the extent to which pupils have really mastered and retained knowledge and understanding.
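  The "homing in" behaviour described above can be sketched as a simple staircase procedure. This is an illustrative toy model only: the deterministic response rule and step sizes are our own assumptions, not a description of any operational adaptive engine.

```python
def adaptive_session(true_ability, n_items=10, start=0.5, step=0.25):
    """Toy staircase: raise the difficulty target after a correct answer,
    lower it after an incorrect one, shrinking the step each time so the
    test "homes in" on the pupil's own level."""
    target, history = start, []
    for _ in range(n_items):
        correct = true_ability >= target    # deterministic toy response rule
        history.append((round(target, 3), correct))
        target += step if correct else -step
        step = max(step / 2, 0.01)
    return target, history

final, history = adaptive_session(true_ability=0.7)
print(f"final difficulty target: {final:.2f}")   # settles near 0.7
```

  After a handful of items the difficulty target sits close to the pupil's level, which is why partial repetition during the homing-in phase doubles usefully as a check on retained mastery.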

  44.  Pupils would be awarded a level at the end of each key stage based on performance on groups of questions to which a level has been assigned. More advantageously, levels could be awarded in the middle of the key stage as in the revised Welsh national assessment arrangements.

  45.  Since tests are individualised, adaptivity helps with security, with manageability, and with reducing the "stakes", moving away from large groups of students taking a test on a single occasion. Cloned items further help security: an item on a topic can include different number values on a set of variables, allowing the same basic question to be systematically changed on different test administrations, thus preventing memorisation of responses. A simple example of cloning is where a calculation using ratio can use a 3:2 ratio in one item version and a 5:3 ratio in another. The calibration of the bank would be crucial, with item parameters carefully set and research undertaken to ensure that cloning does not lead to significant variations in item difficulty.
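  A cloned ratio item of the kind just described can be sketched as follows; the question template, value sets and function name are invented for illustration.

```python
import random

def clone_ratio_item(rng):
    """Generate one variant of a fixed ratio-sharing question template."""
    a, b = rng.choice([(3, 2), (5, 3), (7, 4), (9, 5)])
    total = (a + b) * rng.randint(2, 12)   # multiple of a+b keeps answers whole
    answer = total * a // (a + b)
    question = (f"£{total} is shared in the ratio {a}:{b}. "
                f"How much is the larger share?")
    return question, answer

# Each administration draws fresh values, so the same underlying skill is
# tested while memorised responses are useless.
rng = random.Random(1)
for _ in range(3):
    question, answer = clone_ratio_item(rng)
    print(question, "->", answer)
```

  The calibration concern noted above corresponds here to checking that, for example, a 9:5 variant is not materially harder than a 3:2 variant of the same template.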

  46.  Reporting on national standards for policy purposes could be delivered through periodic reporting on groups of cognate items. As pupils across the country take the tests, and once a critical, nationally representative sample on a test is reached, the aggregated results would be lodged as the national report of standards in a given area. This would involve grouping key items in the bank (eg on understanding 2D representation of 3D objects), accumulating pupils' performance data on an annual basis (or more or less frequently, as deemed appropriate), and reporting on the basis of key elements of maths, English etc.
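  The accumulate-then-report mechanism described above can be sketched as follows; the group names, threshold and response rates are invented for illustration and are not drawn from any operational system.

```python
import random
from collections import defaultdict

SAMPLE_THRESHOLD = 1000            # illustrative "critical sample" size

responses = defaultdict(list)      # cognate group -> list of 0/1 scores

def record(group, correct):
    responses[group].append(1 if correct else 0)

def national_report():
    """Report mean facility only for cognate groups whose accumulated
    sample has reached the representative threshold."""
    return {group: sum(scores) / len(scores)
            for group, scores in responses.items()
            if len(scores) >= SAMPLE_THRESHOLD}

# Simulate responses accumulating over the year.
rng = random.Random(7)
for _ in range(1500):
    record("2D representation of 3D objects", rng.random() < 0.6)
for _ in range(200):
    record("ratio and proportion", rng.random() < 0.5)

report = national_report()
print(report)    # only the group that has reached the threshold appears
```

  Because reporting waits for a large accumulated sample rather than a single test session, no individual sitting carries high stakes, which is the point of the cognate grouping approach.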

  47.  This "cognate grouping" approach would tend to reduce the stakes of national assessment, thus gauging more accurately underlying national standards of attainment. This would alleviate the problem identified by Tymms (2004) that "the test data are used in a very high-stakes fashion and the pressure created makes it hard to interpret that data. Teaching test technique must surely have contributed to some of the rise, as must teaching to the test".

  48.  Data could be linked to other cognate groupings, eg those who are good at X are also good at Y and poor on Z. Also, performance could be linked across subjects.

  49.  There are issues of reductivism in this model: there could be a danger to validity and curriculum coverage as a result of moving to test forms which are "bankable", work on-screen and are machine-markable. Using the Cambridge taxonomy of assessment items is one means of monitoring intended and unintended drift. It is certainly not the case that these testing technologies can only utilise the simplest multiple-choice (MC) items. MC items are used as part of high-level professional assessment, eg in the medical and finance arenas, where well-designed items can be used for assessing how learners integrate knowledge to solve complex problems.

  50.  However, it is certainly true that, at the current stage of development, this type of approach to delivering assessment cannot handle the full range of items which are currently used in national testing and national qualifications. The limitation on the range of item types means that this form of testing is best used as a component in a national assessment model, and not the sole vehicle for all functions in the system.

  51.  School accountability could be delivered through this system using either (i) a school accumulation model, where the school automatically accumulates performance data from the adaptive tests in a school data record which is submitted automatically when the sample level reaches an appropriate level in each or all key subject areas, or (ii) the school improvement model outlined in model 2 above.

  52.  There are significant problems of capacity and readiness in schools, as evidenced by the problems encountered by the KS3 ICT test project, which has successively failed to meet take-up targets. It remains to be seen whether these can be swiftly overcome or whether they are structural problems, eg schools adopting very different IT network solutions and arranging IT in inflexible ways. However, it is very important to note that current arrangements remain based on "test sessions" of large groups of pupils, rather than true on-demand, adaptive tests. On-demand arrangements could greatly relieve the pressures on infrastructure in schools, since sessions would be arranged for individuals or small groups on a "when ready" basis.

  53.  There are technical issues of validity and comparability to be considered. The facility of a test is more than the sum of the facility on the individual items which make up each test. However, this is an area of intense technical development in the assessment community, with new understanding and theorisations of assessment emerging rapidly.

  54.  There are issues of pedagogy. Can schools and teachers actually manage a process where children progress at different rates based on on-demand testing? How do learners and teachers judge when a child is ready? Will the model lead to higher expectations for all students, or self-fulfilling patterns of poor performance amongst some student groups? These—and many more important questions—indicate that the assessment model should be tied to appropriate learning and management strategies, and is thus not neutral technology, independent of learning.


  55.  Each of the models addresses the difficulties of multipurpose testing. However, each model also presents challenges to be considered and overcome. The Statistics Commission (2005) commented that "there is no real alternative at present to using statutory tests for setting targets for aggregate standards". The task is to find such an alternative. The real challenge is to provide school accountability data without contaminating the process of gathering data on national standards and individual student performance. All three models have their advantages and could lead to increased validity and reliability in national assessment arrangements and—crucially—the flow of reliable information on underlying educational standards; something which is seriously compromised in current arrangements.


  56.  As a possible line of development for new arrangements, the DfES has recently announced pilots of new test arrangements, to be trialled in 10 authorities. Cambridge Assessment has reviewed the proposals and, along with many others in the assessment community, considers that the design is seriously flawed. The deficiencies are significant enough to compromise the new model's capacity to deliver the key functions of national assessment, ie information on attainment standards at system level; feedback to parents, pupils and teachers; and provision of school accountability data.

  57.  Cambridge Assessment's response to the DfES consultation document on the progress tests covered the subject in some detail and we reproduce it below for the Select Committee:

    i  We welcome the developing debate on the function and utility of national assessment arrangements. We applaud the focus on development of arrangements which best support the wide range of learning and assessment needs amongst those in compulsory schooling.

    ii  As specialists in assessment, we have focused our comments on the technical issues associated with the proposals on testing. However, it is vital to note that Cambridge Assessment considers fitness for purpose and a beneficial linkage between learning and assessment to be at the heart of sound assessment practice.

    iii  We consider effective piloting, with adequate ethical safeguards for participants, to be essential to the design and implementation of high-quality assessment arrangements. It is essential that the evaluation method, time-frames, and steering and reporting arrangements all enable the outcomes of piloting to be fed into operational systems. There is inadequate detail in the document to determine whether appropriate arrangements are in place.

    iv  We remain concerned over conflicting public statements regarding the possible status of the new tests (TES 30 March), which make it very unclear as to whether existing testing arrangements will co-exist alongside new arrangements, or whether one will be replaced by the other. This level of confusion is not helpful.

    v  We see three functions as being essential to national assessment arrangements:

    Intelligence on national standards—for the policy process.

    Information on individual pupil performance—for the learner, for parents, for teachers.

    Data on school performance—for accountability arrangements.

  We do not feel that the new model will meet these as effectively as other possible models. We would welcome discussions on alternatives.

    vi  We believe that, by themselves, the new test arrangements will not provide robust information on underlying standards in the education system. With entry to single-level tests dependent on teachers' decisions, teachers in different institutions and at different times are likely to deploy different approaches to entry. Entry patterns are likely to be very volatile, and the effects are unlikely always to cancel out. This is likely to contaminate the national data in new ways, compared with existing testing arrangements. There are no obvious remedies to this problem within the proposed arrangements, either in the form of guidance or regulation.

    vii  Teachers are likely to come under peculiar pressures, from institutions wishing to optimise performance-table position, from parents of individual children, etc. This is an entirely different scenario from the "all to be tested and then a level emerges" character of current arrangements. Tiering invokes a similar, though less all-pervasive, effect.

    viii  Although advanced as "on-demand" testing, the regime is not an "on-demand" regime, and it is misleading to promote it as such. It provides one extra test session per year.

    ix  The frequency of testing is likely to increase the extent to which testing dominates teaching time. This is not a problem where the majority of washback effects from testing are demonstrably beneficial; we believe that other features of the tests mean that washback effects are likely to be detrimental. It is not clear what kind of differentiation in teaching will flow back from the tests. Ofsted and other research show differentiation to be one of the least developed areas of teaching practice. We are concerned that the "grade D" problem (neglect of both those not capable of getting a C and those who will certainly gain a C) will emerge in a very complex form in the new arrangements.

    x  The tests may become MORE high stakes for learners. Labelling such as " . . . you're doing Level 2 for the third time! . . ." may emerge and be very pernicious. Jean Rudduck's work shows such labelling to be endemic and problematic.

    xi  We are unclear regarding the impact on those learners who fail a test by a small margin—they will wait six months to be re-tested. Do teachers judge that they should "lose six months of teaching time" to get them up to the required level, or just carry on with no special support? If special support is given, what is the child not doing which they previously would have done? This is a key issue with groups such as less able boys—they will need to take time out of things which they are good at and which can bolster their "learning identities". Those who are a "near miss" will need to know—the document does not make clear whether learners will just "get a level back", will get a mark, or will receive an item-performance breakdown.

    xii  Testing arrangements are likely to become much more complex—security issues, mistakes (such as wrong test for a child) etc are likely to gain in significance.

    xiii  The length of the tests may be an improvement over existing tests, but further investigative work must be done to establish whether this is indeed the case. 45-minute tests may, or may not, sample more from each subject domain at an appropriate level, compared with existing national tests. This is an empirical question which needs to be examined. Lower sampling would reduce the reliability of the tests. Compounding this, the issue of pass marks must be addressed—compensation within the tests raises not only reliability questions but also washback effects into formative assessment. Pupils who pass may still need to address key areas of learning in a key stage, if compensation and pass marks combine disadvantageously. The length of the tests and the need to cover the domain will tend to drive tests towards a limited set of item types, raising validity issues. This in turn affects standards maintenance—if items are clustered around a text and the text is changed (remembering that test frequency is increased by 100%), then none of those items remains usable. This represents a dramatic escalation of the burden of test development. Constructing and maintaining the bank of items will be very demanding.

    xiv  If a high pass mark is set (and the facility of items tuned to this) there will be little evidence of what a child cannot do. Optimising the formative feedback element—including feedback for high attainers—in the face of demand for high domain coverage, reasonable facility, and accessibility (recognisable stimulus material etc) will be very demanding for test designers. Level-setting procedures are not clear. The regime requires a very fast turnaround in results—not least to set in place and deliver learning for a "re-take" in the next test session (as well as keeping up with the pace of progression through the National Curriculum content). This implies objective tests. However, some difficult factors then combine. The entry will be a volatile mix of takers and re-takers.

    xv  While calibration data will exist for the items, random error will increase due to the volatility of entry, feeding into problems in the reliability of the item data in the bank. Put crudely, with no awarding processes (as present in existing national tests) there will be a loss of control over the overall test data—and thus reliability and standards over time will become increasingly problematic. As one possible solution, we recommend the development of parallel tests rather than successively different tests. Pre-tests and anchor tests become absolutely vital—and the purpose and function of these must be explained clearly to the public and the teaching profession. More information on this can be provided.

    xvi  Having the same tests for different key stages (as stated by officials) is problematic. There is different content in different stages (see English in particular). QCA has undertaken previous work on "does a Level 4 mean something different in different key stages"—the conclusion was that it did.

    xvii  The 10-hour training/learning sessions are likely to be narrowly devoted to the tests. This may communicate strong messages in the system regarding the importance of drilling and "surface learning"—exactly the opposite of what the DfES is supporting in other policy documents. Although superficially in line with "personalisation", it may instil dysfunctional learning styles.

    xviii  We applaud the sensitivity of the analysis emerging from the DfES in respect of the different populations of learners who are failing to attain target levels. We also support the original Standards Unit's commitment to a culture of high expectations, combined with high support. However, this level of sensitivity of analysis is not reflected in the blanket expectation that every child should improve by two levels.

    xix  We do not support "payment by results" approaches—in almost any form these have repeatedly been found wanting. Undue pressure is exerted on tests and test administration, and maladministration issues escalate.

    xx  In the face of the considerable challenge of developing a system which meets the demanding criteria which we associate with the operation of robust national assessment, we would welcome an opportunity to contribute to further discussions on the shape of enhanced national arrangements.


  58.  What is needed is a fresh look at options—and at both the technical and political space for manoeuvre. Cambridge Assessment has not only attempted to assemble the evidence but has produced a "3 option" paper which outlines possible approaches to confronting the very real problems outlined above. We commend a thoroughgoing review of the evidence: not a "single person review" like "Dearing" or "Tomlinson", but a more managed appraisal of options and a sober analysis of the benefits and deficits of alternatives. For this, we believe that a set of clear criteria should be used to drive the next phase of development:

    —  technically-robust arrangements should be developed;

    —  the arrangements should be consistent with stated functions;

    —  insights from trialling should be fed into fully operational arrangements;

    —  unintended consequences should be identified and remedied;

    —  full support from all levels of the system should be secured in respect of revised arrangements;

    —  a number of models should be explored at the same time, in carefully designed programmes—in other words there should be parallel rather than serial development, trialling and evaluation; and

    —  appropriate ethical safeguards and experimental protocols should be put in place during development and trialling.

  59.  It is, of course, vital to consider not only the form of revised arrangements which better deliver the purposes of national assessment, but also the methods and time frame for developing those arrangements, as well as the means of securing genuine societal and system support.

  60.  The last two elements listed above are critical to this: currently, there are no plans for trialling more than one revised model for national testing. However, a cursory glance at the education research field shows that there is a range of contrasting approaches to delivering the key functions of national testing, many of which may well be presented to this Inquiry. It would therefore seem important to trial more than one model rather than "put all eggs in one basket" or take forward only modifications of existing arrangements.

  61.  It is unclear whether adequate safeguards have been put in place to protect learners exposed to revised national assessment arrangements. Cambridge Assessment recommends—in line with the standards being developed by the Government's Social Research Unit—that new protocols should be developed, as a matter of urgency, for the trialling of revised arrangements.



© Parliamentary copyright 2008
Prepared 13 May 2008