National Assessment: Annex 2
The Assessment of Performance Unit: should
it be re-instated?
The origins and demise of the APU
1. The inability of current arrangements
to provide a robust flow of policy intelligence on trends in pupil
attainment has emerged as a serious problem. The causes are multifaceted,
and include:
instability in standards within the
testing system (Massey, Oates, Stats Commission);
acute classification error affecting
assignment of pupils to levels (Wiliam, Tymms); and
teaching to the test/"Texas
Test Effect" (Wiliam).
2. Growing awareness of this issue has prompted
increasing calls for " . . . a return to the APU . . ."
(the Assessment of Performance Unit): a separate, "low
stakes", light-sampling survey for the purpose of reliable
detection of patterns of pupil attainment, and of trends in attainment
over time. But there are dangers in an unreflective attempt to
re-instate arrangements which actually fell short of their aims.
The APU processes were innovative and progressive. They mirrored
the fore-running US NAEP (National Assessment of Educational Progress)
and pre-dated the arrangements now in place in New Zealand and
Scotland. The APU ran surveys from 1978 to 1988; politicians and civil
servants then came to see it as redundant in the face of the data on each
and every child aged 7, 11 and 14 which National Curriculum assessment
processes would yield. The APU was hardly problem-free.
Significant issues emerged in respect of:
under-developed sampling frames;
tensions between subject-level and
component-level analysis and reporting;
differing measurement models at different
times in different subjects;
lack of stability in item forms;
escalating sampling burden;
difficulty in developing items for
the highest attaining pupils;
the nature of reporting arrangements;
replacement strategy in respect of
dated items;
ambiguity in purpose re "process
skills" as a principal focus versus curriculum content;
research/monitoring tensions; and
compressed development schedules
resulting from pressure from Government.
3. There was acute pressure on the APU to
deliver. Rather than it being recognized that the function of the APU was
essential for high-level policy processes and would need to persist
(NAEP has been in place in the US since 1969), the intense pressure
led to poor refinement of the technical processes which underpinned
the operation of the APU, and to high turnover in the staff of the
different subject teams (Gipps and Goldstein 1983). Crucially,
the compressed piloting phases had a particularly adverse impact;
there was no means of undertaking secure evaluation of initial
survey work and feeding in "lessons learned":
" . . . the mathematics Group, in particular,
felt that they were continually being rushed: their requests for
a delay in the monitoring programme were rejected; their desire
for three pilot surveys was realized as only one; they experienced
a high turnover of staff and a resulting shortage of personnel.
The constant rush meant that there was no time for identifying
and remedying problems identified in the first year of testing.
4. In fact, all three teams suffered from
a rapid turnover of staff, put down to the constant pressure of
work combined with a lack of opportunity to `side
track' into interesting research issues . . ." (Newton P
2005 p14)
5. This is not a trivial failure of a minor
survey instrument. The APU was founded in 1974 after publication
of a DES White Paper (Educational Disadvantage and the Needs
of Immigrants). It was the result of a protracted strategic
development process, which led from the DES-funded Working Group
on the Measurement of Educational Attainment (commissioned in
1970) and the NFER's DES-funded development work on Tests of Attainment
in Mathematics in Schools. If it had successfully attained its
objectives, it would have relieved National Curriculum testing
of the burden of attempting to measure standards over time, a
purpose which has produced some of the most intense tensions amongst
the set of functions now attached to national testing. Stability
in the instruments is one of the strongest recommendations emerging
from projects designed to monitor standards over time. In sharp
tension with this, QCA and the State have (in line with commitments
to high quality educational provision, the standards agenda, and
responses from review and evaluation processes) sought to
optimize the National Curriculum by successive revision of content,
by increasing the "accessibility of tests", and by ensuring
tight linkage of the tests to specific curriculum content. These
are laudable aims, and the emphasis on the diagnostic function
of the data from tests has been increasing in recent innovations
in testing arrangements. But pursuit of these aims has led to
repeated revision rather than stability in the tests.
6. The Massey Report suggested that if maintenance
of standards over time remained a key operational aim, then stability
in the test content was imperative. In the face of these tensions,
an APU-style light-sampling survey method would enable
de-coupling of national assessment from the requirement to deliver
robust information on national educational standards, allowing
testing to reflect curriculum change with precision, to optimize
the learning-focussed functions of testing, and to support constant
innovation in the form of tests (eg to optimize accessibility).
7. Thus, the deficits and closure of the
APU were, and remain, very serious issues in the operation and
structure of national assessment arrangements. Temporal discontinuity
played a key role in the methodological and technical problems
experienced by the APU developers. As outlined above, rushing
the development phases had a variety of effects, but the most
serious of these was the failure to establish with precision a
clear set of baseline data, accompanied by stable tests with known
performance data; " . . . an effective national monitoring
system cannot be brought `on stream' in just a couple of years
. . ." (Newton P, 2005).
8. Our conclusion is not "bring back
the APU", but rather to develop a new light-sampling, matrix-based
model using the knowledge from systems used in other nations and
insights from the problems of the APU. Models 1 and 2, which we
outline as alternatives in the main body of this evidence, rely
on the development of new versions of the APU rather than simple
re-instatement.
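To make concrete what a light-sampling, matrix-based survey involves, the sketch below (with invented item counts, block sizes and pupil numbers) shows the basic allocation logic: the item pool is partitioned into booklets which are rotated across a modest pupil sample, so that no pupil sits every item but every item is answered by enough pupils to estimate national patterns. It is purely illustrative, not a specification of the APU's design or of any proposed model.

```python
import random

# Illustrative parameters only: a real survey would derive these from a sampling frame.
ITEM_POOL = [f"item_{i:03d}" for i in range(60)]   # full pool of survey items
BLOCK_SIZE = 12                                     # items any one pupil actually sits

def make_blocks(pool, block_size):
    """Partition the item pool into fixed-size blocks (booklets)."""
    random.shuffle(pool)
    return [pool[i:i + block_size] for i in range(0, len(pool), block_size)]

def assign_booklets(pupil_ids, blocks):
    """Rotate booklets across the pupil sample so every block gets roughly equal coverage."""
    return {pupil: blocks[i % len(blocks)] for i, pupil in enumerate(pupil_ids)}

if __name__ == "__main__":
    pupils = [f"pupil_{i:04d}" for i in range(500)]   # a light sample, not a census
    blocks = make_blocks(list(ITEM_POOL), BLOCK_SIZE)
    allocation = assign_booklets(pupils, blocks)
    # Each pupil sees only 12 of the 60 items; aggregated over the sample,
    # every item is answered by around 100 pupils, enough to estimate national
    # patterns without testing every child on everything.
    print(len(allocation["pupil_0000"]), "items per pupil;", len(blocks), "booklets")
```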
SECTION 2
1. HIGHER EDUCATION
ADMISSIONS TESTS
Determining role and function
1. Since the publication, in September 2004,
of the Schwartz report (Fair admissions to higher education: recommendations
for good practice), the issue of the role and function of admissions
tests has been a controversial area. Cambridge Assessment has
been cautious in its approach to this field. We have based our
development programme on carefully-considered criteria. We believe
that dedicated admissions tests should:
produce information which does not
duplicate information from other assessments and qualifications;
make a unique and useful contribution
to the information available to those making admissions decisions;
and
predict students' capacity to do
well in, and benefit from, higher education.
2. Since the Cambridge Assessment Group
includes the OCR awarding body, we are also heavily involved in
refining A levels in the light of the "stretch and challenge"
agenda: working to introduce A* grades at A level, to include
more challenging questions, and to furnish unit and UMS scores
(Uniform Mark Scheme scores, a mechanism for equating scores from
different modules/units of achievement) as a means of helping
universities in the admissions process.
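As a purely illustrative aside, UMS conversion of the kind referred to above works by mapping raw unit marks onto a uniform scale by linear interpolation between grade boundaries. The sketch below uses invented boundary values; real units have boundaries set at each awarding series and additional rules apply.

```python
# Simplified illustration of raw-to-UMS conversion by linear interpolation
# between grade boundaries. Boundary values are invented for illustration;
# real units have their own boundaries set at each awarding series.

# (raw boundary, UMS boundary) pairs, lowest to highest
BOUNDARIES = [(0, 0), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (90, 100)]

def raw_to_ums(raw_mark: float) -> float:
    """Map a raw mark onto the uniform mark scale, interpolating between boundaries."""
    for (raw_lo, ums_lo), (raw_hi, ums_hi) in zip(BOUNDARIES, BOUNDARIES[1:]):
        if raw_lo <= raw_mark <= raw_hi:
            fraction = (raw_mark - raw_lo) / (raw_hi - raw_lo)
            return ums_lo + fraction * (ums_hi - ums_lo)
    return float(BOUNDARIES[-1][1])  # cap at the maximum UMS

print(raw_to_ums(45))  # a raw mark of 45 falls between the 40 and 50 boundaries -> UMS 55.0
```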
3. We recognize that HE institutions have
clear interests in identifying, with reasonable precision and
economy, those students who are most likely to benefit from specific
courses, are likely to do well, and who are unlikely to drop out
of the course. We also recognize that there is a strong impetus
behind the "widening participation" agenda.
4. Even with the proposed refinements in
A level and the move to post-qualification applications (PQA),
our extensive development work and consultation with HE institutions
have identified a continuing need for dedicated assessment instruments
which facilitate effective discrimination between high attaining
students and are also able to identify those students who possess
potential, but who have attained lower qualification grades for
a number of reasons.
5. We are very concerned not to contribute
to any unnecessary proliferation of tests and so have been careful
only to develop tests where they make a unique and robust contribution
to, and enhance, the admissions process, and
do not replicate information from any other source. To these ends,
we have developed the BMAT for medical and veterinary admissions.
We have developed the TSA (Thinking Skills Assessment), which
is being used for admissions to some subjects in Cambridge and
Oxford and is being considered by a range of other institutions.
The TSA items (questions) also form part of the uniTEST, which
was developed in conjunction with ACER (Australian Council for
Educational Research). UniTEST is being trialled with a range
of institutions, both "selecting" universities and "recruiting"
universities.
6. This test is designed to help specifically
with the widening participation agenda. Preliminary data suggest
that this test is useful in helping to identify students who are
capable of enrolling on courses at more prestigious universities
than the ones for which they have applied, as well as those who
should consider HE despite low qualification results.
7. The TSA should be seen as a test
resource rather than a specific test: TSA items are held in an
"item bank", and this is used to generate tests for
different institutions. Although TSA items were originally developed
for admissions processes in Cambridge, where discrimination between
very high attaining students is problematic and A level outcomes are
inadequate as a basis for admissions decisions, the Cambridge Assessment
research team is developing an "adaptive TSA". This
utilizes the latest measurement models and test management algorithms
to create tests which are useful with a very broad range of abilities.
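The adaptive approach referred to above can be illustrated, in outline, by the standard pattern used in computer-adaptive testing: pick the unused item that is most informative at the current ability estimate, then revise the estimate after each response. The sketch below uses a simple Rasch-style model with invented item difficulties and a deliberately crude update rule; it is not the actual TSA algorithm.

```python
import math

# Illustrative adaptive-testing loop using a one-parameter (Rasch) model.
# Item difficulties and the update rule are simplified assumptions, not the
# algorithm actually used for the adaptive TSA.

ITEM_BANK = {f"q{i}": -2.0 + 0.2 * i for i in range(21)}  # difficulties from -2 to +2 logits

def prob_correct(ability: float, difficulty: float) -> float:
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def information(ability: float, difficulty: float) -> float:
    p = prob_correct(ability, difficulty)
    return p * (1 - p)  # Fisher information for the Rasch model

def next_item(ability: float, administered: set) -> str:
    """Pick the unused item that is most informative at the current ability estimate."""
    candidates = {k: v for k, v in ITEM_BANK.items() if k not in administered}
    return max(candidates, key=lambda k: information(ability, candidates[k]))

def update_ability(ability: float, correct: bool, step: float = 0.5) -> float:
    """Crude step-wise update: move up after a correct answer, down after an incorrect one."""
    return ability + step if correct else ability - step

# Example run with simulated responses
ability_estimate, administered = 0.0, set()
for answer in [True, True, False, True, False]:
    item = next_item(ability_estimate, administered)
    administered.add(item)
    ability_estimate = update_ability(ability_estimate, answer)
print(round(ability_estimate, 2), sorted(administered))
```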
8. The validation data for the TSA items
is building into a large body of evidence and the tests are yielding
correlations which suggest that they are both valid and useful
in admissions, and do not replicate information from GCSE
and AS/A2 qualifications. In other words, they are a useful addition
to information from these qualifications and allow more discriminating
decisions to be made than when using information from those qualifications
alone. In addition, they yield information which is more reliable
than that derived from interviews, and they will
provide a stable measure over the period in which there are major
changes to AS and A2 qualifications.
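The claim that a test adds to, rather than replicates, information from GCSE and AS/A2 can be framed as a question of incremental validity: how much does prediction of an outcome improve once the test score is added to prior grades? The sketch below illustrates that check with synthetic data; the figures are invented and this is not the validation analysis itself.

```python
import numpy as np

# Invented illustrative data: prior attainment, an admissions-test score and a
# degree outcome for a small cohort. The figures are synthetic, not real results.
rng = np.random.default_rng(0)
n = 200
prior_grades = rng.normal(size=n)                       # e.g. standardised A level points
test_score = 0.6 * prior_grades + rng.normal(scale=0.8, size=n)
degree_outcome = 0.5 * prior_grades + 0.3 * test_score + rng.normal(scale=0.7, size=n)

def r_squared(X, y):
    """Proportion of variance in y explained by a least-squares fit on X."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1 - residuals.var() / y.var()

r2_grades_only = r_squared(prior_grades.reshape(-1, 1), degree_outcome)
r2_with_test = r_squared(np.column_stack([prior_grades, test_score]), degree_outcome)
# The gap between the two R-squared values is the incremental validity of the
# test: the extra predictive power it offers over prior grades alone.
print(round(r2_grades_only, 3), round(r2_with_test, 3))
```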
The American SAT
9. Cambridge Assessment supports the principles
which are being promoted by the Sutton Trust and the Government
in respect of widening participation. However, we have undertaken
evaluation work which suggests that the promotion of the American
SAT test as a general admissions test for the UK is ill-founded.
The five-year SAT trial in the UK is part-funded by Government
(£800,000), with the College Board (the test developers) contributing
£400,000 and the Sutton Trust and NFER each contributing
£200,000.
10. The literature on the SAT trial in the
UK states that the SAT1 is an "aptitude" test. It also
makes two strong claims that are contested:
" . . . Other selection tests are used by
universities in the United Kingdom, but none of these is as well
constructed or established as the SAT©.
In summary, a review of existing research indicates
that the SAT© (or similar reasoning-type aptitude test) adds
some predictive power to school / examination grades, but the
extent of its value in this respect varies across studies. In
the UK, it has been shown that the SAT© is an appropriate
test to use and that it is modestly associated with A-level grades
whilst assessing a different construct. No recent study of the
predictive power of SAT© results for university outcomes
has been undertaken in the UK, and this proposal aims to provide
such information . . ."
Source: (http://www.nfer.org.uk/research-areas/pims-data/outlines/update-for-students-taking-part-in-this-research/a-validity-study-background.cfm)
11. The claim that "none of these is
as well constructed or established as the SAT©" fails
to recognise that Cambridge Assessment has assembled comprehensive
data on specific tests amongst its suite of admissions tests and
ensures that validity is at the heart of the instruments. These
are certainly not as old as the SAT but it is entirely inappropriate
to conflate quality of construction and duration of use.
12. More importantly, the analysis below
suggests that the claim that the SAT1 is a curriculum-independent
"aptitude" test is deeply flawed. This is not the first
time that this claim has been contested (Jencks, C. and Crouse,
J; Wolf A and Bakker S), but it is the first time that such a
critique has been based on an empirical study of content.
13. It is important to note that the SAT
is under serious criticism in the US (Cruz R; New York Times)
and also, despite many UK commentators' assumptions, the SAT1
is not the sole, or pre-eminent, test used as part of US HE admissions
(Wolf A and Bakker S). The SAT2 is increasingly used; this
is an avowedly curriculum-based test. Similarly, there has been
a substantial increase in the use of the Advanced Placement Scheme: subject-based
courses and tests which improve students' grounding in specific
subjects, and are broadly equivalent to English Advanced Level
subject-specific qualifications.
14. It is also important to note that (i)
the US does not have standard national examinations; in the
absence of national GCSE-type qualifications, a curriculum-linked
test such as the SAT1 is a sensible instrument to have in the
US, to guarantee that learners have certain fundamental skills
and knowledge, but GCSE fulfils this purpose in England;
(ii) the USA has a four-year degree structure, with a "levelling"
general curriculum for the first year; and (iii) the SAT1 scores
are used alongside college grades, personal references, SAT2 scores
and Advanced Placement outcomes:
" . . . One of the misunderstood features
of college selection in America is that SATs are only one component,
with high school grades and other `portfolio' evidence playing
a major role. The evidence is that high school grades are a slightly
better predictor of college achievement than SAT scores, particularly
for females and minority students. Combining both provides the
best, though still limited, prediction of success . . ."
Curriculum mapping: does the SAT mirror current
arrangements?
15. In the light of research comment on
the SAT and emerging serious criticisms of the instrument in its
home context, Cambridge Assessment commissioned a curriculum mapping
of the SAT in 2006 comparing it with content in the National Curriculum
(and, by extension, GCSE) and the uniTEST.
16. It is surprising that such a curriculum
content mapping has not been completed previously. Prior studies
(McDonald et al) have focused on comparison of outcomes
data from the SAT and qualifications (eg A level) in order to
infer whether the SAT is measuring something similar or different
to those qualifications. But the failure to undertake a comparison
of the SAT with the content of the English National Curriculum
is a serious oversight. The comparison is highly revealing.
17. The study consisted of a comparison
of published SAT assessment criteria, items included in SAT1 sample
papers, the National Curriculum programmes of study, and items
within the uniTEST. The SAT assessment criteria and National Curriculum
programmes of study were checked for analogous content. The National
Curriculum reference of any seemingly relevant content was then
noted and checked against appropriate SAT1 specimen items. The
full analysis was then verified by researchers outside the admissions
team, who were fully acquainted with the content of the National
Curriculum and GCSEs designed to assess National Curriculum content.
The researchers endorsed the analysis completed by the admissions
test developers.
The outcomes of the curriculum mapping study
18. The full results are shown in Higher
Education admissions tests Annex 3. Column 1 shows the sections
and item content of the SAT1. Column 2 gives the reference number
of the related National Curriculum content. For example, Ma3 2i
refers to the statement:
Mathematics Key Stage 4 foundation
Ma3 Shape, space and measures
recall the definition of a circle and the meaning
of related terms, including centre, radius, chord, diameter, circumference,
tangent, arc, sector, and segment; understand that inscribed regular
polygons can be constructed by equal division of a circle.
19. Column 3 in Annex 3 shows the relation
between the content of the SAT1, the relevant components of the
National Curriculum and the Cambridge/ACER uniTEST admissions
test.
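For readers without access to Annex 3, the mapping is essentially a three-column cross-reference, which might be represented as sketched below. The rows shown are placeholders (apart from the Ma3 2i reference quoted above), not the actual Annex 3 content.

```python
from collections import namedtuple

# Illustrative representation of the three-column mapping described above.
# The rows are placeholders, not the actual Annex 3 data; only the Ma3 2i
# reference is taken from the example quoted in paragraph 18.
MappingRow = namedtuple("MappingRow", ["sat1_content", "nc_reference", "unitest_coverage"])

mapping = [
    MappingRow("circle terminology and properties", "Ma3 2i", "not directly assessed"),
    MappingRow("sentence completion (vocabulary in context)", "placeholder reference", "verbal reasoning overlap"),
    MappingRow("critical reasoning about an argument", None, "unique to uniTEST"),
]

# Rows with a National Curriculum reference replicate content GCSE already assesses;
# rows with no reference indicate content unique to the admissions test.
replicated = [row for row in mapping if row.nc_reference is not None]
unique = [row for row in mapping if row.nc_reference is None]
print(f"{len(replicated)} rows overlap the National Curriculum; {len(unique)} do not")
```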
20. The analysis indicates that
the SAT1 content is largely pitched
at GCSE-level curriculum content in English and Maths, and replicates
GCSE assessment of that content; and
the item types and item content in
the SAT1 are very similar to those of GCSEs.
It is therefore not clear exactly what the SAT1
is contributing to assessment information already generated by
the national examinations system in England.
21. Previous appraisals of the SAT1 have
been based on correlations between GCSE, A level and SAT1 outcomes.
This has shown less than perfect correlation, which has been interpreted
as indicating that the SAT1 assesses something different to GCSE
and A level. But GCSE and A level are based on compensation: particularly
at lower grades, the same grade can be obtained by two candidates
with different profiles of performance. The inferences from the
data were previously made in the absence of a curriculum mapping.
The mapping suggests that discrepancies between SAT1 and GCSE/A
level outcomes may arise where candidates do not succeed in
certain areas of these exams yet nonetheless gain a reasonable
grade, only for that content to be re-assessed by the SAT1 and their
performance found wanting.
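A small worked example (with invented marks and an invented boundary) illustrates the compensation point: two candidates can reach the same total, and hence the same grade, with very different profiles across components.

```python
# Invented marks and an invented grade boundary, purely to illustrate compensation:
# aggregation over components lets a weakness in one area be offset by strength in another.
candidate_a = {"number": 35, "algebra": 34, "shape": 33, "data handling": 33}   # even profile
candidate_b = {"number": 48, "algebra": 45, "shape": 30, "data handling": 12}   # uneven profile

grade_c_boundary = 130  # hypothetical boundary on a 200-mark paper

for name, marks in [("A", candidate_a), ("B", candidate_b)]:
    total = sum(marks.values())
    grade = "C or above" if total >= grade_c_boundary else "below C"
    print(f"Candidate {name}: total {total} -> {grade}")
# Both candidates total 135 and receive the same grade, yet candidate B has barely
# engaged with data handling - a gap which re-assessment of that content would expose.
```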
22. The existence of such comprehensive
overlap suggests that the SAT presents either an unnecessary replication
of GCSE assessment or an indication of the problems of compensation
in the established grading arrangements for GCSE.
23. Identical analysis of uniTEST, currently
being piloted and developed by Cambridge Assessment and ACER,
suggests that uniTEST does not replicate GCSE assessment to the
same extent as the SAT1 but focuses on the underlying thinking
skills rather than on formal curriculum content. There is some
overlap in the areas of verbal reasoning, problem solving, and
quantitative and formal reasoning. There are, however, substantial
areas of content which are covered neither in the National Curriculum
statements of attainment nor in the SAT1. These are in critical
reasoning and socio-cultural understanding. This suggests that
uniTEST is not replicating GCSE and does offer unique measurement.
Preliminary data from the pilot suggest that uniTEST is detecting
learners who might aspire to universities of higher ranking than
the ones to which they have actually applied.
June 2007