Select Committee on Children, Schools and Families Written Evidence

National Assessment—Annex 2

The Assessment of Performance Unit—should it be re-instated?

The origins and demise of the APU

  1.  The inability of current arrangements to provide a robust flow of policy intelligence on trends in pupil attainment has emerged as a serious problem. The causes are multifaceted, and include:

    —  instability in standards within the testing system (Massey, Oates, Stats Commission);

    —  acute classification error affecting assignment of pupils to levels (Wiliam, Tymms); and

    —  teaching to the test/"Texas Test Effect" (Wiliam).

  2.  Growing awareness of this issue has prompted increasing calls for " . . . a return to the APU . . ." (the Assessment of Performance Unit)—a separate, "low stakes", light-sampling survey designed to detect reliably the patterns of pupil attainment, and trends in attainment over time. But there are dangers in an unreflective attempt to re-instate arrangements which actually fell short of their aims. The APU processes were innovative and progressive. They mirrored the earlier US NAEP (National Assessment of Educational Progress) and pre-dated the arrangements now in place in New Zealand and Scotland. The APU ran surveys from 1978 to 1988; politicians and civil servants then judged it redundant in the face of the data on each and every child aged 7, 11 and 14 which National Curriculum assessment processes would yield. The APU was hardly problem-free. Significant issues emerged in respect of:

    —  under-developed sampling frames;

    —  tensions between subject-level and component-level analysis and reporting;

    —  differing measurement models at different times in different subjects;

    —  lack of stability in item forms;

    —  escalating sampling burden;

    —  difficulty in developing items for the highest attaining pupils;

    —  the nature of reporting arrangements;

    —  replacement strategy in respect of dated items;

    —  ambiguity in purpose re "process skills" as a principal focus versus curriculum content;

    —  research/monitoring tensions; and

    —  compressed development schedules resulting from pressure from Government.

  3.  There was acute pressure on the APU to deliver. The function of the APU was essential for high-level policy processes and would have needed to persist (NAEP has been in place in the US since 1969). Rather than this being recognized, the intense pressure led to poor refinement of the technical processes which underpinned the operation of the APU, and to high turnover in the staff of the different subject teams (Gipps and Goldstein 1983). Crucially, the compressed piloting phases had a particularly adverse impact; there was no means of undertaking secure evaluation of initial survey work and feeding in "lessons learned":

    " . . . the mathematics Group, in particular, felt that they were continually being rushed: their requests for a delay in the monitoring programme were rejected; their desire for three pilot surveys was realized as only one; they experienced a high turnover of staff and a resulting shortage of personnel. The constant rush meant that there was no time for identifying and remedying problems identified in the first year of testing.

  4.  In fact, all three teams suffered from a rapid turnover of staff, put down to the constant pressure of work combined with a lack of opportunity to `side track' into interesting research issues . . ." (Newton P 2005 p14)

  5.  This is not a trivial failure of a minor survey instrument. The APU was founded in 1974 after publication of a DES White Paper (Educational Disadvantage and the Needs of Immigrants). It was the result of a protracted strategic development process, which led from the DES-funded Working Group on the Measurement of Educational Attainment (commissioned in 1970) and the NFER's DES-funded development work on Tests of Attainment in Mathematics in Schools. If it had successfully attained its objectives, it would have relieved National Curriculum testing of the burden of attempting to measure standards over time—a purpose which has produced some of the most intense tensions amongst the set of functions now attached to national testing. Stability in the instruments is one of the strongest recommendations emerging from projects designed to monitor standards over time. In sharp tension with this, QCA and the State have—in line with commitments to high quality educational provision; the standards agenda; and responses from review and evaluation processes—sought to optimize the National Curriculum by successive revision of content; increasing the "accessibility of tests"; and ensuring tight linkage of the tests to specific curriculum content. These are laudable aims—and the emphasis on the diagnostic function of the data from tests has been increasing in recent innovations in testing arrangements. But pursuit of these aims has led to repeated revision rather than stability in the tests.

  6.  The Massey Report suggested that if maintenance of standards over time remained a key operational aim, then stability in the test content was imperative. In the face of these tensions, retaining an APU-style light sampling survey method would enable de-coupling of national assessment from a requirement to deliver robust information on national educational standards, and enable testing to reflect curriculum change with precision, to optimize the learning-focussed functions of testing, and enable constant innovation in the form of tests (eg to optimize accessibility).

  7.  Thus, the deficits and closure of the APU were, and remain, very serious issues in the operation and structure of national assessment arrangements. Temporal discontinuity played a key role in the methodological and technical problems experienced by the APU developers. As outlined above, rushing the development phases had a variety of effects, but the most serious of these was the failure to establish with precision a clear set of baseline data, accompanied by stable tests with known performance data; " . . . an effective national monitoring system cannot be brought `on stream' in just a couple of years . . ." (Newton P, 2005).

  8.  Our conclusion is not "bring back the APU", but rather to develop a new light-sampling, matrix-based model, using the knowledge from systems used in other nations and insights from the problems of the APU. Models 1 and 2, which we outline as alternatives in the main body of this evidence, rely on the development of new versions of the APU rather than simple re-instatement.
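The matrix-based light-sampling approach referred to above can be illustrated with a short sketch. This is purely illustrative (the function name, block sizes and rotation rule are hypothetical, not the design of any actual survey): the item pool is split into blocks, and booklets of a few blocks are rotated across the sampled pupils, so that no individual pupil sits the whole pool but the sample as a whole covers every item.

```python
def assign_booklets(pupils, item_pool, blocks_per_booklet=2, block_size=5):
    """Split an item pool into fixed-size blocks and rotate them across
    sampled pupils: each pupil answers only a fraction of the pool, but
    the sample as a whole still covers every item."""
    blocks = [item_pool[i:i + block_size]
              for i in range(0, len(item_pool), block_size)]
    assignments = {}
    for idx, pupil in enumerate(pupils):
        # Rotate the starting block so coverage is spread over the sample.
        chosen = [blocks[(idx + k) % len(blocks)]
                  for k in range(blocks_per_booklet)]
        assignments[pupil] = [item for block in chosen for item in block]
    return assignments
```

With a pool of 30 items in blocks of five, each pupil sees only 10 items, yet any six consecutively-indexed pupils between them cover the full pool—this is what keeps the survey "low stakes" and the per-pupil burden light.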



Determining role and function

  1.  Since the publication, in September 2004, of the Schwartz report (Fair Admissions to Higher Education: Recommendations for Good Practice), the issue of the role and function of admissions tests has been a controversial area. Cambridge Assessment has been cautious in its approach to this field. We have based our development programme on carefully-considered criteria. We believe that dedicated admissions tests should:

    —  produce information which does not duplicate information from other assessments and qualifications;

    —  make a unique and useful contribution to the information available to those making admissions decisions; and

    —  predict students' capacity to do well in, and benefit from higher education.

  2.  Since the Cambridge Assessment Group includes the OCR awarding body, we are also heavily involved in refining A levels in the light of the "stretch and challenge" agenda—working to include A* grades in A levels, inclusion of more challenging questions, and furnishing unit and UMS scores (Uniform Mark Scheme scores—a mechanism for equating scores from different modules/units of achievement) as a means of helping universities in the admissions process.

  3.  We recognize that HE institutions have clear interests in identifying, with reasonable precision and economy, those students who are most likely to benefit from specific courses, are likely to do well, and who are unlikely to drop out of the course. We also recognize that there is a strong impetus behind the "widening participation" agenda.

  4.  Even with the proposed refinements in A level and the move to post-qualification applications (PQA), our extensive development work and consultation with HE institutions have identified a continuing need for dedicated assessment instruments which facilitate effective discrimination between high attaining students and are also able to identify those students who possess potential, but who have attained lower qualification grades for a number of reasons.

  5.  We are very concerned not to contribute to any unnecessary proliferation of tests and so have been careful only to develop tests where they make a unique and robust contribution to the admissions process, enhance the admissions process, and do not replicate information from any other source. To these ends, we have developed the BMAT for medical and veterinary admissions. We have developed the TSA (Thinking Skills Assessment), which is being used for admissions to some subjects in Cambridge and Oxford and is being considered by a range of other institutions. The TSA items (questions) also form part of the uniTEST, which was developed in conjunction with ACER (Australian Council for Educational Research). UniTEST is being trialled with a range of institutions, both "selecting" universities and "recruiting" universities.

  6.  This test is designed to help specifically with the widening participation agenda. Preliminary data suggests that this test is useful in helping identify students who are capable of enrolling on courses at more prestigious universities than the ones to which they have applied, as well as those who should consider HE despite low qualification results.

  7.  The TSA should be seen more as a test resource than as a specific test: TSA items are held in an "item bank", and this is used to generate tests for different institutions. Although TSA items were originally developed for admissions processes in Cambridge, where discrimination between very high attaining students is problematic and A level outcomes are inadequate as a basis for admissions decisions, the Cambridge Assessment research team is developing an "adaptive TSA". This utilizes the latest measurement models and test management algorithms to create tests which are useful with a very broad range of abilities.
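The general logic of an adaptive test of this kind can be sketched in a few lines. This is a generic one-parameter (Rasch-style) illustration with hypothetical names and a deliberately crude ability update—not Cambridge Assessment's actual algorithm: each item has a difficulty, the next item served is the unused one closest to the current ability estimate, and the estimate moves up or down depending on the response.

```python
import math

def next_item(theta, items, used):
    """Pick the unused item whose difficulty is closest to the current
    ability estimate theta -- under a one-parameter logistic model this
    is the most informative item to administer next."""
    candidates = [name for name in items if name not in used]
    return min(candidates, key=lambda name: abs(items[name] - theta))

def update_theta(theta, difficulty, correct, step=0.5):
    """Move the ability estimate up after a correct response and down
    after an incorrect one, in proportion to how surprising the
    response was under the model (a crude fixed-step update)."""
    expected = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return theta + step * ((1.0 if correct else 0.0) - expected)
```

Because item selection tracks the examinee's estimated ability, the same item bank can discriminate among very high attainers and still yield useful measurement across a broad ability range.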

  8.  The validation data for the TSA items is building into a large body of evidence and the tests are yielding correlations which suggest that they are both valid and useful in admissions—and do not replicate information from GCSE and AS/A2 qualifications. In other words, they are a useful addition to information from these qualifications and allow more discriminating decisions to be made than when using information from those qualifications alone. In addition, they yield information which is more reliable than the decisions which are made through interviews and will provide a stable measure over the period that there are major changes to AS and A2 qualifications.

The American SAT

  9.  Cambridge Assessment supports the principles which are being promoted by the Sutton Trust and the Government in respect of widening participation. However, we have undertaken evaluation work which suggests that the promotion of the American SAT test as a general admissions test for the UK is ill-founded. The five-year SAT trial in the UK is part-funded by Government (£800,000), with the College Board (the test developers) contributing £400,000 and the Sutton Trust and NFER each contributing £200,000.

  10.  The literature on the SAT trial in the UK states that the SAT1 is an "aptitude" test. It also makes two strong claims that are contested:

    " . . . Other selection tests are used by universities in the United Kingdom, but none of these is as well constructed or established as the SAT©.

    In summary, a review of existing research indicates that the SAT© (or similar reasoning-type aptitude test) adds some predictive power to school / examination grades, but the extent of its value in this respect varies across studies. In the UK, it has been shown that the SAT© is an appropriate test to use and that it is modestly associated with A-level grades whilst assessing a different construct. No recent study of the predictive power of SAT© results for university outcomes has been undertaken in the UK, and this proposal aims to provide such information . . ."

    Source: (

  11.  The claim that "none of these is as well constructed or established as the SAT©" fails to recognise that Cambridge Assessment has assembled comprehensive data on specific tests amongst its suite of admissions tests and ensures that validity is at the heart of the instruments. These are certainly not as old as the SAT but it is entirely inappropriate to conflate quality of construction and duration of use.

  12.  More importantly, the analysis below suggests that the claim that the SAT1 is a curriculum-independent "aptitude" test is deeply flawed. This is not the first time that this claim has been contested (Jencks C and Crouse J; Wolf A and Bakker S), but it is the first time that such a critique has been based on an empirical study of content.

  13.  It is important to note that the SAT is under serious criticism in the US (Cruz R; New York Times) and also, despite many UK commentators' assumptions, the SAT1 is not the sole, or pre-eminent, test used as part of US HE admissions (Wolf A and Bakker S). The SAT2 is increasingly used—this is an avowedly curriculum-based test. Similarly, there has been a substantial increase in the use of the Advanced Placement Scheme—subject-based courses and tests which improve students' grounding in specific subjects, and are broadly equivalent to English Advanced Level subject-specific qualifications.

  14.  It is also important to note that (i) the US does not have standard national examinations—in the absence of national GCSE-type qualifications, a curriculum-linked test such as the SAT1 is a sensible instrument to have in the US, to guarantee that learners have certain fundamental skills and knowledge—but GCSE fulfils this purpose in England; (ii) the USA has a four-year degree structure, with a "levelling" general curriculum for the first year; and (iii) the SAT1 scores are used alongside college grades, personal references, SAT2 scores and Advanced Placement outcomes:

    " . . . One of the misunderstood features of college selection in America is that SATs are only one component, with high school grades and other `portfolio' evidence playing a major role. The evidence is that high school grades are a slightly better predictor of college achievement than SAT scores, particularly for females and minority students. Combining both provides the best, though still limited, prediction of success . . ."

    (Stobart G)

Curriculum mapping—does the SAT mirror current arrangements?

  15.  In the light of research comment on the SAT and emerging serious criticisms of the instrument in its home context, Cambridge Assessment commissioned a curriculum mapping of the SAT in 2006 comparing it with content in the National Curriculum (and, by extension, GCSE) and the uniTEST.

  16.  It is surprising that such a curriculum content mapping has not been completed previously. Prior studies (McDonald et al) have focused on comparison of outcomes data from the SAT and qualifications (eg A level) in order to infer whether the SAT is measuring something similar to, or different from, those qualifications. But the failure to undertake a comparison of the SAT with the content of the English National Curriculum is a serious oversight. The comparison is highly revealing.

  17.  The study consisted of a comparison of published SAT assessment criteria, items included in SAT1 sample papers, the National Curriculum programmes of study, and items within the uniTEST. The SAT assessment criteria and National Curriculum programmes of study were checked for analogous content. The National Curriculum reference of any seemingly relevant content was then noted and checked against appropriate SAT1 specimen items. The full analysis was then verified by researchers outside the admissions team, who were fully acquainted with the content of the National Curriculum and GCSEs designed to assess National Curriculum content. The researchers endorsed the analysis completed by the admissions test developers.
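In essence, the mapping exercise described above cross-references sets of content codes: the National Curriculum references judged analogous to a test's assessment criteria are compared against the curriculum itself. A minimal sketch (the function name is hypothetical and the codes below are purely illustrative examples, not results of the study):

```python
def overlap_report(test_refs, curriculum_refs):
    """Compare the National Curriculum references matched to a test's
    criteria against the curriculum: which references the test covers,
    which of its content falls outside the curriculum, and the share
    of the test's content that the curriculum accounts for."""
    covered = test_refs & curriculum_refs
    outside = test_refs - curriculum_refs
    share = len(covered) / len(test_refs) if test_refs else 0.0
    return {"covered": covered, "outside": outside, "share": share}
```

A high "share" indicates that the test largely replicates curriculum content (as the study found for the SAT1), while a substantial "outside" set indicates measurement not provided by the curriculum-linked qualifications.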

The outcomes of the curriculum mapping study

  18.  The full results are shown in Higher Education admissions tests Annex 3. Column 1 shows the sections and item content of the SAT1. Column 2 gives the reference number of the related National Curriculum content. For example, Ma3 2i refers to the statement:

    Mathematics Key Stage 4 foundation

    Ma3 Shape, space and measures

    Geometrical reasoning 2

    Properties of circles

  recall the definition of a circle and the meaning of related terms, including centre, radius, chord, diameter, circumference, tangent, arc, sector, and segment; understand that inscribed regular polygons can be constructed by equal division of a circle.

  19.  Column 3 in Annex 3 shows the relation between the content of the SAT1, the relevant components of the National Curriculum and the Cambridge/ACER uniTEST admissions test.

  20.  The analysis indicates that:

    —  the SAT1 content is largely pitched at GCSE-level curriculum content in English and Maths, and replicates GCSE assessment of that content; and

    —  the item types and item content in the SAT1 are very similar to those of GCSEs.

  It is therefore not clear exactly what the SAT1 is contributing to assessment information already generated by the national examinations system in England.

  21.  Previous appraisals of the SAT1 have been based on correlations between GCSE, A level and SAT1 outcomes. These have shown less than perfect correlation, which has been interpreted as indicating that the SAT1 assesses something different to GCSE and A level. But GCSE and A level are based on compensation—particularly at lower grades, the same grade can be obtained by two candidates with different profiles of performance. The inferences from the correlation data were previously made in the absence of a curriculum mapping. The mapping suggests that discrepancies between SAT1 and GCSE/A level outcomes may arise where candidates have not succeeded in certain areas of these exams yet have nonetheless gained a reasonable grade—the SAT1 then re-assesses those areas and finds their performance wanting.
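The effect of compensation can be seen in a minimal sketch of total-mark grading (the boundaries, marks and function name below are hypothetical and purely for illustration):

```python
def grade(component_marks, boundaries):
    """Award a grade on the aggregate total alone: under compensation,
    a weak performance on one component can be offset by a strong
    performance on another."""
    total = sum(component_marks)
    for cutoff, g in boundaries:  # boundaries sorted highest first
        if total >= cutoff:
            return g
    return "U"

# Illustrative boundaries: 80+ = A, 60+ = B, 40+ = C.
BOUNDARIES = [(80, "A"), (60, "B"), (40, "C")]
```

Two candidates scoring 50 + 30 and 40 + 40 both total 80 and both receive an A, despite quite different profiles of performance; a separate test that happens to re-assess the weaker area would then distinguish them, which is consistent with the less-than-perfect correlations described above.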

  22.  The existence of such comprehensive overlap suggests that the SAT1 either constitutes an unnecessary replication of GCSE assessment or provides an indication of the problems of compensation in the established grading arrangements for GCSE.

  23.  Identical analysis of uniTEST, currently being piloted and developed by Cambridge Assessment and ACER, suggests that uniTEST does not replicate GCSE assessment to the same extent as the SAT1 but focuses on the underlying thinking skills rather than on formal curriculum content. There is some overlap in the areas of verbal reasoning, problem solving, and quantitative and formal reasoning. There are, however, substantial areas of content which are not covered in the National Curriculum statements of attainment or in the SAT1. These are in critical reasoning and socio-cultural understanding. This suggests that uniTEST is not replicating GCSE and does offer unique measurement. Preliminary data from the pilot suggest that uniTEST is detecting learners who might aspire to universities of higher ranking than the ones to which they have actually applied.

June 2007


© Parliamentary copyright 2008
Prepared 13 May 2008