Memorandum
submitted by Cambridge Assessment
Cambridge Assessment is Europe's largest
assessment agency and plays a leading role in researching, developing and
delivering assessment across the globe. It is a department of the University of
Cambridge and a not-for-profit organisation with a turnover of around £175
million. The Group employs around 1,400 people and
contracts some 15,000 examiners each year.
Cambridge Assessment's portfolio of
activities includes world-leading research, ground-breaking new developments
and career enhancement for assessment professionals. Public examinations and
tests are delivered around the globe through our three highly respected
examining bodies.
The assessment providers in the Group
include:
University of Cambridge English for Speakers of Other Languages (Cambridge
ESOL)
Tests and qualifications from Cambridge ESOL are
taken by over 1.75 million people in 135 countries. Cambridge ESOL's Teaching
Awards provide a route into the English Language Teaching profession for new
teachers and first class career development opportunities for experienced
teachers. Cambridge ESOL works with a number of governments in the field of
language and immigration.
University
of Cambridge International Examinations (CIE)
CIE is the world's largest provider of
international qualifications for 14-19 year-olds. CIE qualifications are
available in over 150 countries. CIE works directly with a number of
governments to provide qualifications, training and system renewal.
OCR
The Oxford Cambridge and RSA
awarding body (OCR) provides general academic qualifications and vocational
qualifications for learners of all ages through 13,000 schools, colleges and
other institutions. It is one of the three main awarding bodies for school
qualifications in England.
ARD
The Assessment Research and
Development division (ARD) supports development and evaluation work across the
Cambridge Assessment group and administers a range of admissions tests for
entry to Higher Education. The ARD includes the Psychometrics Centre, a
provider and developer of psychometric tests.
1. A view of the scope of the enquiry
1. At
Cambridge Assessment we recognize that it is vital not to approach assessment
on a piecemeal basis. The education system is exactly that: a system. Later
experiences of learners are conditioned by earlier ones; different elements of
the system may be experienced by learners as contrasting and contradictory;
discontinuities between elements in the system (e.g. transition from primary to
secondary education) may be very challenging to learners.
2. Whilst
understanding the system as a system is important, we believe that the current
focus on 14-19 developments (particularly the development of the Diplomas and
post-Tomlinson strategy) can all too readily take attention away from the
serious problems which are present in 5-14 national assessment.
3. Our
evidence tends to focus on assessment issues. This is central to our
organisation's functions and expertise. However, we are most anxious to ensure
that assessment serves key functions in terms of supporting effective learning
(formative functions) and progression (summative functions). Both should be
supported by effective assessment.
4. We
welcome the framing of the Committee's terms of reference for this Inquiry,
which make it clear that it intends to treat these two areas as substantially
discrete. Cambridge Assessment's qualifications deliverers (exam boards), OCR
and University of Cambridge International Examinations, have tendered evidence
separately to this submission. They have looked chiefly at 14-19
qualifications.
5. This
particular submission falls into two sections: firstly, Cambridge Assessment's
views on the national assessment framework (for children aged 5-14). These are
informed by, but not necessarily limited to, the work which we carried out
throughout 2006 in partnership with the Institute for Public Policy Research
(IPPR) and the substantial expertise in the Group of those who have worked on
national tests.
6. The
second section is on University Entrance Tests. Cambridge Assessment has been
involved in the development of these for nearly a decade and uses a research
base that stretches back even further. At first their scope was limited to
Cambridge University but over the last four years it has grown to include many
other institutions. That they are administered under Cambridge Assessment's
auspices (as opposed to those of one of our exam boards) is a reflection of
their roots within our research faculty and the non-statutory nature of the
tests themselves.
Section 1
2. National assessment arrangements
7. In
this section we have sought to outline the problems that have built up around
the national assessment arrangements. We have then gone on to discuss the
changes proposed in our work with the IPPR.
We also then discuss the problems that appear to be inherent in the
'Making Progress' model that the Government is committed to trialling. Our
conclusion is that a window of opportunity lies before us - one of the reasons
that the Committee's Inquiry is so timely - and that the Government should not
close it with the dangerous haste on which it seems bent. There is a range of
options, and to pursue only one would be a serious mistake.
8. We
have included two Annexes:
· an overview of
the evidence on national assessment dealing with questions ranging from 'teaching to the test' to 'measurement
error'
· a brief
discussion of why the sometimes mooted return of the APU might not deliver all
the objectives desired of it.
3. Diagnosis of the Challenge - critique and revision of national assessment arrangements
9. It
is important to note that Cambridge Assessment is highly supportive of the
principle of a National Curriculum and related national assessment. The concept
of 'entitlement' at the heart of the National Curriculum has been vital to
raising achievement overall; raising the attainment of specific groups (e.g.
females in respect of maths and science); and ensuring breadth and balance. We
recognise that enormous effort has been put in, by officials and developers, to
improving the tests in successive years. We support the general sentiment of
the Rose Review - that the system has some strong characteristics - but it is
clear that deep structural problems have built up over time.
10. Whilst
being concerned over these problems, Cambridge Assessment is committed to the
key functions supported by national assessment: provision of information for
formative and diagnostic purposes to pupils, teachers and parents; information
on national standards, and accountability at school level. We return to these
key functions in more detail below. However, Cambridge Assessment is critical
of the way in which national assessment has been progressively and successively
elaborated into a system which appears to be yielding too many serious and
systemic problems.
Accumulating problems in National Assessment - a vessel full to bursting point?
11. There are two particularly significant
problems in the highly sensitive area of technical development of national
assessment arrangements. Firstly, previous statements by agencies, departments
and Government have exaggerated the technical rigour of national assessment.
Thus any attempts to more accurately describe its technical character run the
risk of undermining both departments and ministers: '...if you're saying this
now, how is it that you said that two years ago...'. This prevents rational
debate of problems and scientifically-founded development of arrangements.
Secondly,
as each critique has become public, the tendency is to breathe a sigh of relief
as the press storm abates; each report is literally or metaphorically placed in
a locked cupboard and forgotten.
12. In contrast, we have attempted here to
take all relevant evidence and integrate it; synthesising it in such a way that
underlying problems and tendencies can accurately be appraised - with the
intention of ensuring effective evaluation and refinement of systems.
14. Put simply, if a minister asks a sensible
question: '...are attainment standards in English going up or down and by how
much?...' there is no means of delivering a valid and sound response to that
question using current arrangements. This is a serious problem for policy
formation and system management. It is not a position which obtains in systems
which use independent light sampling methods such as the US NAEP (National Assessment of Educational
Progress).
Functions
15. Current
national curriculum assessment arrangements within England have attracted
increasing criticism in respect of the extent to which they are carrying too
many purposes (Brooks R & Tough S; Bell J et al; Daugherty R et al). Since
1988 a substantial set of overt and tacit functions has been added. The
original purposes specified in the TGAT Report (Task Group on
Assessment and Testing) comprised:
1 formative (diagnostic for pupils; diagnostic
for teachers)
2 summative (feedback for pupils and parents)
3 evaluative (providing information at LEA and
school level)
4 informative (providing
information on educational standards at system level)
16. The following have been added, as
increasingly elaborated uses of the flow of detailed data from national
assessment:
- school accountability
- departmental accountability
- apportionment of funds
- inspection patterns and actions
- upwards pressure on standards/target setting
- structuring of educational markets and school
choice
- emphasis of specific curriculum elements and
approaches
- detailed tracking of individual attainment,
strengths and weaknesses
- quantification of
progress
17. Unsurprisingly, many educationalists
have expressed the view that the current tests carry too many functions and
that the underlying management processes are too elaborated. To carry this broad
range of functions, the system of assessing every child at the end of each Key
Stage is dependent on maintaining test standards over time in a way which is in
fact not practical.
18. If you want to measure change, don't
change the measure. But the nation does - and should - change/update the
National Curriculum regularly. Whenever there is change (and sometimes radical
overhaul) the maintenance of test standards becomes a particularly acute
problem. It does, of course, remain a constant problem in areas such as English
Literature, where one could be pretesting a test on Macbeth which will be taken
in 2008 while the pupils are currently studying As You Like It when they sit
the pretest. There are remedies to some of the problems this creates - namely,
switching to different sampling processes; announcing radical recalibration; or
switching to low-stakes sampling of children's performance, using a NAEP or a
modernised APU-style model (Assessment of Performance Unit - see Annexe 2).
19. Attempting to use national
assessment to measure trends over time has produced some of the most intense
tensions amongst the set of functions now attached to national testing.
Stability in the instruments is one of the strongest recommendations emerging
from projects designed to monitor standards over time. Running counter to this,
QCA and the DfES have - in line with commitments to high quality educational
provision, the standards agenda and responses from review and evaluation
processes - sought to optimize the National Curriculum by successive revision
of content, increasing the 'accessibility of tests', and ensuring tight linkage
of the tests to specific curriculum content.
20. These
are laudable aims - and the emphasis on the diagnostic function of the data
from tests has been increasing in recent innovations in testing
arrangements. However, pursuit of these
aims has led to repeated revision rather than stability in the tests. The
Massey Report suggested that if maintenance of standards over time remained a
key operational aim, then stability in the test content was imperative (Massey
A et al). In the face of these tensions, a light sampling survey method would
enable de-coupling of national assessment from a requirement to deliver robust
information on national educational standards.
This would enable testing to reflect curriculum change with precision,
to optimize the learning-focussed functions of testing, and enable constant
innovation in the form of tests to optimize accessibility.
21. It is therefore clear that the current
functions of national testing arrangements are in acute and chronic tension.
Using the pragmatic argument that 'every policy should have a policy
instrument' we conclude that national arrangements should indeed support school
accountability and improvement, report to parents and monitor national
standards, but that a change of arrangements is required to achieve this. A range of approaches is necessary to
deliver these functions and we outline some viable options below.
4. Alternative Approaches to National Assessment (KS1, KS2, KS3)
Objectives
22. There
is a need to conceptualise a number of possible models for consideration in an
attempt to address the problems of 'multipurpose testing'. It is vital to note
that we present here three alternatives. We do this to show that there are
credible alternatives for delivering on the key objectives of national
assessment - it is simply not the case that there is only one way of moving
forward.
23. We believe the aims should be to
· reduce the
assessment burden on schools
· provide
formative assessment for teaching and learning
· provide
information for school accountability
· provide
information on national standards.
24. In order to secure widespread support
within the education community (including parents) a firm re-statement of
educational purpose (values) and a commitment to high degrees of validity is
essential. It is not enough to initiate
changes merely because of concerns about the defects of existing arrangements.
We do not here outline values and validity in detail, but recognise that this
is a vital precondition of designing revised arrangements, putting them in
place, and monitoring their operation. It is important that a full discussion
of these matters precedes any executive decision regarding revised
arrangements.
Alternative models for national assessment
Model 1: Validity in monitoring plus accountability to school level
25. The
aim of this approach is to collect data using a national monitoring survey and
to use this data for monitoring standards over time as well as for moderation
of teacher assessment. This would
enable school performance to be measured for accountability purposes and would
involve a special kind of criterion referencing known as domain
referencing.
26. Question
banks would be created based on the curriculum with each measure focusing on a
defined domain. A sample of questions
would be taken from the bank and divided into lots of small testlets (smaller
than the current KS tests). These would
then be randomly allocated to each candidate in a school. Every question is therefore attempted by
thousands of candidates so the summary statistics are very accurate and there
are summary statistics on a large sample of questions. This means that for a particular year it is known,
for example, that on average candidates can obtain 50% of the marks in domain
Y.
27. The
following year it might be found that they obtain 55% of the marks in that
domain. This therefore measures the change and no judgement about relative
year-on-year test difficulty is required. Neither is there a need for a complex
statistical model for analysing the data, although modelling would be required
to calculate the standard errors of the statistics reported. With the correct
design, however, those errors would be negligible and the modelling largely
superfluous. It would be possible to use a preliminary survey to link domains
to existing levels and the issue of changing items over time could be solved by
chaining and making comparisons based on common items between years. Although each testlet would be an unreliable
measure in itself, it would be possible to assign levels to marks using a
statistical method once an overall analysis had been carried out. The average
of the testlet scores would be a good measure of a school's performance given
that there are sufficient candidates in the school. The appropriate number of candidates would need to be
investigated.
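To make the statistical logic concrete, the following is a minimal simulation sketch (in Python, with invented bank sizes, testlet lengths and candidate numbers; nothing here is drawn from an actual pilot). It illustrates how randomly allocated testlets yield a stable domain mean even though each individual testlet is short:

```python
import random
import statistics

random.seed(1)

# Invented parameters, for illustration only.
BANK_SIZE = 200        # questions in the bank for one domain
TESTLET_SIZE = 10      # questions per candidate - far fewer than a KS test
N_CANDIDATES = 50_000  # candidates sitting the survey nationally

# Each question has a 'true' facility: the probability that an average
# candidate answers it correctly. Here we simply invent a spread.
facility = [random.uniform(0.3, 0.7) for _ in range(BANK_SIZE)]

def sit_testlet(ability: float) -> float:
    """Randomly allocate a testlet to one candidate; return score (0-1)."""
    items = random.sample(range(BANK_SIZE), TESTLET_SIZE)
    correct = sum(random.random() < min(1.0, max(0.0, facility[q] + ability))
                  for q in items)
    return correct / TESTLET_SIZE

# A cohort of candidates of varying ability.
scores = [sit_testlet(random.gauss(0.0, 0.1)) for _ in range(N_CANDIDATES)]
print(f"Estimated domain mean: {statistics.mean(scores):.3f}")

# Every question is attempted by thousands of candidates (on average
# N_CANDIDATES * TESTLET_SIZE / BANK_SIZE = 2,500 here), so per-question
# statistics are accurate too, and next year's mean on the same domain is
# directly comparable without any judgement about test difficulty.
```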
28. The
survey data could also be used to moderate teacher assessment by asking the
teacher to rank order the candidates and to assign a level to each of
them. Teacher assessment levels would
then be compared with testlet levels and the differences calculated. It would not be expected that the
differences should be zero, but rather that the need for moderation should be
determined by whether the differences cancel out or not. Work would need to be
done to establish the levels of tolerance and the rules for applying this
process would need to be agreed. The school could have the option of accepting
the statistical moderation or going through a more formal moderation process.
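The moderation check described above can be sketched in the same spirit (hypothetical levels; the tolerance figure is a placeholder, since the text notes that tolerances would need to be established through further work):

```python
def needs_moderation(teacher_levels, testlet_levels, tolerance=0.2):
    """Return True if a school's teacher assessment warrants moderation.

    Individual differences between teacher-assessed and testlet-derived
    levels are expected; moderation is triggered only when the signed
    differences fail to cancel out, i.e. when the school's mean
    difference exceeds the agreed tolerance (a placeholder value here).
    """
    diffs = [ta - tl for ta, tl in zip(teacher_levels, testlet_levels)]
    return abs(sum(diffs) / len(diffs)) > tolerance

# One hypothetical class: individual disagreements exist but cancel out,
# so the teacher assessment would stand without formal moderation.
teacher = [4, 5, 3, 4, 4, 5, 3, 4]
testlet = [4, 4, 3, 5, 4, 5, 4, 4]
print(needs_moderation(teacher, testlet))  # False
```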
29. There
would be a number of potential advantages related to this model. Validity would be increased as there would
be greater curriculum coverage. The data would be more appropriate for the
investigation of standards over time.
The test development process would be less expensive as items could be
re-used through an item bank, including past items from national curriculum
tests. There would also be fewer problems with security related to 'whole
tests'. No awarding meetings would be
needed as the outcomes would be automatic and not judgemental. Since candidates
would not be able to prepare for a specific paper, the negative wash-back and
narrowing of the curriculum would be eliminated (i.e. the potential elimination
of 'teaching to the test').
There would also be less pressure on the individual student since the tests
would be low stakes.
30. Given
that there are enough students in a school, the differences in question
difficulty and pupil question interaction would average out to zero leaving
only the mean of the pupil effects. From the data it would be possible to
generate a range of reports e.g. equipercentiles and domain profiles. Reporting
of domain profiles would address an issue raised by Tymms (2004) that 'the
official results deal with whole areas of the curriculum but the data suggests
that standards have changed differently in different sub-areas'.
31. Work
would need to be done to overcome a number of potential disadvantages of the
model. Transparency and perception would be important and stakeholders would
need to be able to understand the model sufficiently to have confidence in the
outcomes. This would be a particularly
sensitive issue as students could be expected to take tests that prove to be
too difficult or too easy for them.
Some stratification of the tests according to difficulty and ability
would alleviate this problem. There is an assumption that teachers can rank order
students (Lamming D) and this would need to be explored. Applying the model to
English would need further thought in order to accommodate the variations in
task type and skills assessed that arise in that subject area.
32. Eventually
the model would offer the possibility of reducing the assessment burden but the
burden would be comparatively greater for the primary phase. Although security
problems could be alleviated by using item banking, the impact of item re-use
would need to be considered. Having
items in the public domain would be a novel situation: almost no other
important test in the UK operates in this way (the driving test excepted).
33. Discussion and research would be
needed in a number of areas
· values and
validity
· scale and scope
e.g. number and age of candidates, regularity and timing of tests
· formal
development of the statistics model
· simulation of
data (based on APU science data initially)
· stratification
of tests / students
· pilots and
trials of any proposed system
Model 2: Validity in monitoring plus a switch to 'school-improvement inspection'
34. Whilst
the processes for equating standards over time have been enhanced since the
production of the Massey Report, there remain significant issues relating to:
· teacher
confidence in test outcomes
· evidence of
negative wash-back into learning approaches
· over-interpretation
of data at pupil group level; inferences of improvement or deterioration of
performance not being robust due to small group size
· ambiguity in
policy regarding borderlining
· no provision to
implement Massey recommendations regarding keeping tests stable for 5 years and
then 'recalibrating' national standards
· publishing
error figures for national tests
35. In
the face of these problems, it is attractive to adopt a low-stakes,
matrix-based, light sampling survey of schools and pupils in order to offer
intelligence to Government on underlying educational standards. With a matrix
model underpinning the sampling frame, far wider coverage of the curriculum can
be offered than with current national testing arrangements.
36. However,
if used as a replacement for national testing of every child at the end of KS1,
2 and 3, then key functions of the existing system would not be delivered:
· data reporting,
to parents, progress for every child at the end of each key stage
· school
accountability measures
37. In
a system with a light sampling model for monitoring national standards, the
first of these functions could be delivered through (i) moderated teacher
assessment, combined with (ii) internal testing, or tests provided by external
agencies and/or grouped schools arrangements.
The DfES prototype work on assessment for learning could form national
guidelines for (i) the overall purpose and framework for school assessment, and
(ii) model processes. This framework of assessment policy would be central to
the inspection framework used in school inspection.
38. The
intention would be to give sensitive feedback to learners and parents, with the
prime function of highlighting to parents how best to support their child's
learning. Moderated teacher assessment
has been proven to facilitate staff development and effective pedagogic
practice. Arrangements could operate on a local or regional level, allowing
transfer of practice from school to school.
39. The
second of these functions could be delivered through a change in the Ofsted
inspection model. A new framework would be required since the current framework
is heavily dependent on national test data, with all the attendant problems of
the error in the data and instability of standards over time. Inspection could
operate through a new balance of regional/area inspection services and national
inspection - inspection teams operating on a regional/area basis could be
designated as 'school improvement teams'.
To avoid competition between national and regional inspection, national
inspections would be joint activities led by the national inspection service.
40. These
revised arrangements would lead to increased frequency of inspection (including
short-notice inspection) for individual schools and increased emphasis on
advice and support to schools in respect of development and curriculum
innovation. Inspection would continue
to focus on creating high expectations, meeting learner needs, and ensuring
progression and development.
Model 3: Adaptive, on-demand testing using IT-based tests
41. In
2002, Bennett outlined a new world of adaptive, on-demand tests which could be
delivered through machines. He suggests that 'the incorporation of technology
into assessment is inevitable because, as technology becomes intertwined with
what and how students learn, the means we use to document achievement must keep
pace'. Bennett (2001) identifies a challenge, 'to figure out how to design and
deliver embedded assessment that provides instructional support and that
globally summarises learning accomplishment'.
He is optimistic that 'as we move assessment closer to instruction, we
should eventually be able to adapt to the interests of the learner and to the
particular strengths and weaknesses evident at any particular juncture...'. This is aligned to the commitments of
Government to encourage rates of progression based on individual attainment and
pace of learning rather than age-related testing.
42. In
the Government's five year strategy for education and children's services
(DfES, 2004) principles for reform included 'personalisation and choice as well
as flexibility and independence'. The White Paper on 14-19 Education and
Skills (2005) stated, 'Our intention is to create an education system tailored
to the needs of the individual pupil, in which young people are stretched to
achieve, are more able to take qualifications as soon as they are ready, rather
than at fixed times...' and 'to provide a tailored programme for each young
person and intensive personal guidance and support'. These intentions are
equally important in the context of national testing systems.
43. The
process relies on item-banking, combining items in individual test sessions to
feed to students a set of questions appropriate to their stage of learning and
to their individual level of attainment. Frequent low-stakes assessments could
allow coverage of the curriculum over a school year. Partial repetition in
tests, whilst they are 'homing in' on an appropriate testing level, would be
useful as a means of checking the extent to which pupils have really mastered
and retained knowledge and understanding.
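By way of illustration only, a crude sketch of this 'homing in' mechanism follows (in Python). It uses a simple staircase rule rather than the item-response-theory machinery an operational system would need, and all names and values are invented:

```python
import random

def adaptive_session(bank, answer, n_items=10):
    """Minimal sketch of an adaptive, on-demand test session.

    `bank` maps a difficulty level (1-5 assumed here) to a list of item
    ids; `answer(item, level)` returns True/False for the pupil's
    response. The session starts mid-scale and steps up after a correct
    answer, down after an incorrect one - a crude staircase, used only
    to illustrate the idea of 'homing in' on an appropriate level.
    """
    level, history = 3, []
    for _ in range(n_items):
        item = bank[level].pop()          # draw an unused item at this level
        correct = answer(item, level)
        history.append((item, level, correct))
        level = min(5, level + 1) if correct else max(1, level - 1)
    return history

# A hypothetical pupil whose 'true' level is 4: reliable at level 4 and
# below, only occasionally right above it; the session settles around 4-5.
random.seed(0)
bank = {lv: [f"item-{lv}-{i}" for i in range(50)] for lv in range(1, 6)}
pupil = lambda item, level: level <= 4 or random.random() < 0.2
for item, level, correct in adaptive_session(bank, pupil):
    print(item, level, correct)
```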
44. Pupils
would be awarded a level at the end of each key stage based on performance on
groups of questions to which a level has been assigned. More advantageously,
levels could be awarded in the middle of the key stage as in the revised Welsh
national assessment arrangements.
45. Since
tests are individualised, adaptivity helps with security, with manageability,
and with reducing the 'stakes', moving away from large groups of students
taking a test on a single occasion. Cloned items further help security. This is
where an item on a topic can include different number values on a set of
variables, allowing the same basic question to be systematically changed on
different test administrations, thus preventing memorisation of responses. A
simple example of cloning is where a calculation using ratio can use a 3:2
ratio in one item version and a 5:3 ratio in another. The calibration of the
bank would be crucial, with item parameters carefully set and research
undertaken to ensure that cloning does not lead to significant variations in
item difficulty.
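The ratio example can be made concrete with a trivial cloning sketch (hypothetical template and values):

```python
import random

def clone_ratio_item(seed=None):
    """Generate one clone of the ratio item described above.

    The question template is fixed; only the number values vary, so the
    same basic question changes between administrations and answers
    cannot usefully be memorised. Whether clones genuinely share one
    difficulty (is 3:2 as hard as 5:3?) is precisely the calibration
    research question raised in the text.
    """
    rng = random.Random(seed)
    a, b = rng.choice([(3, 2), (5, 3), (7, 4), (9, 5)])
    total = (a + b) * rng.randint(2, 9)           # keeps the answer whole
    question = (f"£{total} is shared between two people in the ratio "
                f"{a}:{b}. How much does the first person receive?")
    return question, total * a // (a + b)

print(clone_ratio_item(seed=1))
print(clone_ratio_item(seed=2))
```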
46. Reporting
on national standards for policy purposes could be delivered through periodic
reporting of groups of cognate items. As pupils nationally take the tests and
when a critical nationally representative sample on a test is reached, this
would be lodged as the national report of standards in a given area. This would
involve grouping key items in the bank e.g. on understanding 2D representation
of 3D objects and accumulating pupils' performance data on an annual basis (or
more or less frequently, as deemed appropriate) and reporting on the basis of
key elements of maths, English etc.
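A minimal sketch of this reporting trigger (the sample threshold is an arbitrary placeholder, and true national representativeness would of course require a stratified sample rather than a simple count):

```python
from collections import defaultdict

REPRESENTATIVE_N = 10_000    # placeholder threshold, not a real figure

results = defaultdict(list)  # cognate group -> list of 1/0 outcomes
reported = set()

def record(group: str, correct: bool) -> None:
    """Accumulate one pupil response for a cognate item group; lodge a
    national standards report once the sample reaches the agreed size."""
    results[group].append(1 if correct else 0)
    if group not in reported and len(results[group]) >= REPRESENTATIVE_N:
        reported.add(group)
        mean = sum(results[group]) / len(results[group])
        print(f"National standards report - '{group}': facility {mean:.1%}")

# e.g. record("2D representation of 3D objects", True) as each pupil's
# on-demand test results flow in during the year.
```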
47. This
'cognate grouping' approach would tend to reduce the stakes of national
assessment, thus gauging more accurately underlying national standards of
attainment. This would alleviate the problem identified by Tymms (2004) that
'the test data are used in a very high-stakes fashion and the pressure created
makes it hard to interpret that data. Teaching test technique must surely have
contributed to some of the rise, as must teaching to the test'.
48. Data
could be linked to other cognate groupings, e.g. those who are good at X are
also good at Y and poor on Z. Also, performance could be linked across
subjects.
49. There
are issues of reductivism in this model as there could be a danger to validity
and curriculum coverage as a result of moving to test forms which are
'bankable', work on-screen and are machine-markable. Using the Cambridge
taxonomy of assessment items is one means of monitoring intended and unintended
drift. It is certainly not the case
that these testing technologies can only utilise the simplest multiple-choice
(MC) items. MC items are used as part
of high-level professional assessment e.g. in the medical and finance arenas,
where well-designed items can be used for assessing how learners integrate
knowledge to solve complex problems.
50. However,
it is certainly true that, at the current stage of development, this type of
approach to delivering assessment cannot handle the full range of items which
are currently used in national testing and national qualifications. The
limitation on the range of item types means that this form of testing is best
used as a component in a national assessment model, and not the sole vehicle
for all functions in the system.
51. School
accountability could be delivered through this system using either (i) a school
accumulation model, where the school automatically accumulates performance data
from the adaptive tests in a school data record which is submitted
automatically when the sample level reaches an appropriate level in each or all
key subject areas, or (ii) the school improvement model outlined in model 2
above.
52. There
are significant problems of capacity and readiness in schools, as evidenced
through the problems being encountered by the KS3 ICT test project which has
successively failed to meet take-up targets. It remains to be seen whether
these can be swiftly overcome or are structural problems e.g. schools adopting
very different IT network solutions and arranging IT in inflexible ways. However, it is very important to note that
current arrangements remain based on 'test sessions' of large groups of pupils,
rather than true on-demand, adaptive tests. These arrangements could relieve
greatly the pressures on infrastructure in schools, since sessions would be
arranged for individuals or small groups on a 'when ready' basis.
53. There
are technical issues of validity and comparability to be considered. The
facility of a test is more than the sum of the facility on the individual items
which make up each test. However, this is an area of intense technical
development in the assessment community, with new understanding and
theorisations of assessment emerging rapidly.
54. There
are issues of pedagogy. Can schools and teachers actually manage a process
where children progress at different rates based on on-demand testing? How do
learners and teachers judge when a child is ready? Will the model lead to
higher expectations for all students, or self-fulfilling patterns of poor
performance amongst some student groups? These - and many more important
questions - indicate that the assessment model should be tied to appropriate
learning and management strategies, and is thus not neutral technology,
independent of learning.
Overall
55. Each of the models addresses the
difficulties of multipurpose testing. However, each model also presents
challenges to be considered and overcome. The Statistics Commission (2005)
commented that 'there is no real alternative at present to using statutory
tests for setting targets for aggregate standards'. The task is to find such an
alternative. The real challenge is to
provide school accountability data without contaminating the process of
gathering data on national standards and individual student performance. All
three models have their advantages and could lead to increased validity and
reliability in national assessment arrangements and - crucially - the flow of
reliable information on underlying educational standards; something which is
seriously compromised in current arrangements.
5. New progress tests - serious technical problems
56. As
a possible line of development for new arrangements, the DfES recently has
announced pilots of new test arrangements, to be trialled in 10 authorities.
Cambridge Assessment has reviewed the proposals and, along with many others in
the assessment community, considers that the design is seriously flawed. The
deficiencies are significant enough to compromise the new model's capacity to
deliver on the key functions of national assessment; i.e. information on
attainment standards at system level; feedback to parents, pupils and teachers;
and provision of school accountability.
57. Cambridge
Assessment's response to the DfES consultation document on the progress tests
covered the subject in some detail and we reproduce it below for the Select
Committee.
i
We welcome the developing debate on the
function and utility of national assessment arrangements. We applaud the focus
on development of arrangements which best support the wide range of learning
and assessment needs amongst those in compulsory schooling.
ii
As specialists in assessment, we have
focused our comments on the technical issues associated with the proposals on
testing. However, it is vital to note that Cambridge Assessment considers
fitness for purpose and a beneficial linkage between learning and assessment to
be at the heart of sound assessment practice.
iii
We consider effective piloting, with
adequate ethical safeguards for participants, to be essential to design and
implementation of high quality assessment arrangements. It is essential that
evaluation method, time-frames, and steering and reporting arrangements all
enable the outcomes of piloting to be fed into operational systems. There is
inadequate detail in the document to determine whether appropriate arrangements
are in place.
iv
We remain concerned over conflicting
public statements regarding the possible status of the new tests (TES, March 30th),
which make it very unclear as to whether existing testing arrangements will
co-exist alongside new arrangements, or whether one will be replaced by the
other. This level of confusion is not helpful.
v
We see three functions as being essential
to national assessment arrangements:
Intelligence on
national standards - for the policy process
Information on
individual pupil performance - for the learner, for parents, for teachers
Data on school
performance - for accountability arrangements
We do not feel that the new model will
meet these as effectively as other possible models. We would welcome
discussions on alternatives.
vi
We believe that, by themselves, the new
test arrangements will not provide robust information on underlying standards
in the education system. With entry to single-level tests dependent on
teachers' decisions, teachers in different institutions and at different times
are likely to deploy different approaches to entry. This is likely to be very
volatile, and effects are unlikely to always cancel out. This is likely to
contaminate the national data in very new ways, compared with existing
testing arrangements. There are no obvious remedies to this problem within the
proposed arrangements, either in the form of guidance or regulation.
vii
Teachers are likely to come under
peculiar pressures, from institutions wishing to optimise performance-table
position, from parents of individual children etc. This is an entirely
different scenario to the 'all to be tested and then a level emerges' character
of current arrangements. Tiering invokes a similar effect, though not one as
all-pervasive as here.
viii
Although advanced as 'on-demand' testing,
the regime is not an 'on-demand' regime, and it is misleading to promote it as
such. It provides one extra test session per year.
ix
The frequency of testing is likely to
increase the extent to which testing dominates teaching time. This is not a
problem where the majority of washback effects from testing are demonstrably
beneficial; we believe that other features of the tests mean that washback
effects are likely to be detrimental. It is not clear what kind of
differentiation in teaching will flow back from the tests. Ofsted and other
research shows differentiation to be one of the least developed areas of
teaching practices. We are concerned that the 'grade D' problem (neglect of
those not capable of getting a C and of those who will certainly gain a C)
will emerge in a very complex form in the new arrangements.
x
The tests may become MORE high stakes for
learners. Labelling such as '...you're doing level 2 for the third time!...' may
emerge and be very pernicious. Jean Rudduck's work shows such labelling to be
endemic and problematic.
xi
We are unclear regarding the impact on
those learners who fail a test by a small margin - they will wait 6 months to
be re-tested. Do teachers judge that they should 'lose 6 months of teaching
time' to get them up to the required level, or just carry on with no special
support? If special support is given, what is the child not doing which they
previously would have done? This is a key issue with groups such as less able
boys - they will need to take time out of things which they are good at and
which can bolster their 'learning identities'. Those who are a 'near miss' will
need to know how close they came - the document does not make clear whether
learners will just 'get a level back', will get a mark, or will get an
item-performance breakdown.
xii
Testing arrangements are likely to become
much more complex - security issues, mistakes (such as wrong test for a child)
etc are likely to gain in significance.
xiii
The length of the tests may be an improvement over existing tests,
but further investigative work must be done to establish whether this is indeed
the case. 45-minute tests may, or may not, sample more from each subject domain at
an appropriate level, compared with existing national tests. This is an
empirical question which needs to be examined. Lower sampling would reduce the
reliability of the tests. Compounding this, the issue of pass marks must be
addressed - compensation within the tests raises not only reliability questions
but also washback effects into formative assessment. People who pass may still
need to address key areas of learning in a key stage, if compensation and pass
marks combine disadvantageously. The length of the tests and the need to cover
the domain will tend to drive tests to a limited set of item types, raising
validity issues. This in turn affects standards maintenance - if items are
clustered around a text and the text is changed (remembering that test
frequency is increased by 100%), then all the items are no longer usable. This represents a
dramatic escalation of burden in test development. Constructing and maintaining
the bank of items will be very demanding.
xiv
If a high pass mark is set (and the
facility of items tuned to this) there will be little evidence of what a child
cannot do. Optimising the formative feedback element - including feedback for
high attainers - in the face of demand for high domain coverage, reasonable
facility, and accessibility (recognisable stimulus material etc) will be very
demanding for test designers. Level-setting procedures are not clear. The
regime requires a very fast turnaround in results - not least to set in place
and deliver learning for a 're-take' in the next test session (as well as
keeping up with the pace of progression through the National Curriculum content).
This implies objective tests. However, some difficult factors then combine. The
entry will be a volatile mix of takers and re-takers.
xv
While calibration data will exist for the
items, random error will increase due to the volatility of entry, feeding into
problems in the reliability of the item data in the bank. Put crudely, with no
awarding processes (as present in existing national tests) there will be a loss
of control over the overall test data - and thus reliability and standards over
time will become increasingly problematic. As one possible solution, we
recommend the development of parallel tests rather than successively different
tests. Pre-tests and anchor tests become absolutely vital - and the purpose and
function of these must be explained clearly to the public and the teaching
profession. More information on this can be provided.
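To illustrate what anchor tests provide, a minimal mean-equating sketch follows (in Python; all facility values are invented, and operational equating would use far more sophisticated IRT or equipercentile methods):

```python
import statistics

# Common 'anchor' items appear in both years' tests. The change in
# performance on the anchors estimates the cohort shift; the remaining
# difference on the whole test is then attributable to the test itself
# being easier or harder. All figures below are invented.
anchor_2007 = [0.62, 0.55, 0.71, 0.48, 0.66]   # anchor facilities, year 1
anchor_2008 = [0.64, 0.57, 0.74, 0.49, 0.69]   # same items, year 2

cohort_shift = statistics.mean(anchor_2008) - statistics.mean(anchor_2007)
print(f"Estimated cohort improvement on anchors: {cohort_shift:+.3f}")

test_mean_2007, test_mean_2008 = 0.58, 0.63    # whole-test facilities
difficulty_shift = (test_mean_2008 - test_mean_2007) - cohort_shift
print(f"Residual attributable to an easier/harder test: {difficulty_shift:+.3f}")
```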
xvi
Having the same tests for different key
stages (as stated by officials) is problematic. There is different content in
different stages (see English in particular). QCA has undertaken previous work
on 'does a level 4 mean something different in different key stages' - the
conclusion was that it did.
xvii
The 10-hour training/learning sessions
are likely to be narrowly devoted to the tests. This may communicate strong
messages in the system regarding the importance of drilling and 'surface
learning' - exactly the opposite of what the DfES is supporting in other policy
documents. Although superficially in line with 'personalisation', it may instil
dysfunctional learning styles.
xviii
We applaud the sensitivity of the
analysis emerging from the DfES in respect of the different populations of
learners who are failing to attain target levels. We also support the original
Standards Unit's commitment to a culture of high expectations, combined with
high support. However, this level of sensitivity of analysis is not reflected
in the blanket expectation that every child should improve by two levels.
xix
We do not support 'payment by results'
approaches - in almost any form these have successively been found wanting.
Undue pressure is exerted on tests and test administration - maladministration
issues escalate.
xx
In the face of the considerable challenge
of developing a system which meets the demanding criteria which we associate
with the operation of robust national assessment, we would welcome an
opportunity to contribute to further discussions on the shape of enhanced
national arrangements.
6. The way forward for National Assessment
58. What is needed is a new look at options -
and both the technical and political space for manoeuvre. Cambridge Assessment
has not only attempted to assemble the evidence but has also produced a
'3 option' paper which outlines possible approaches to confront the very real
problems outlined above. We commend a thoroughgoing review of the evidence: not
a 'single-person review' like 'Dearing' or 'Tomlinson', but a more managed
appraisal of options and a sober analysis of the benefits and deficits of
alternatives. For
this, we believe that a set of clear criteria should be used to drive the next
phase of development:
· technically-robust
arrangements should be developed
· the
arrangements should be consistent with stated functions
· insights
from trialling should be fed into fully operational arrangements
· unintended
consequences should be identified and remedied
· full support from all
levels of the system is secured in respect of revised arrangements
· a number of models
should be explored at the same time, in carefully designed programmes - in
other words there should be parallel rather than serial development, trialling
and evaluation
· appropriate ethical
safeguards and experimental protocols should be put in place during development
and trialling
59. It
is, of course, vital to consider not only the form of revised arrangements
which better deliver the purposes of national assessment but also the methods
and time frame for developing those arrangements, as well as the means
of securing genuine societal and system support.
60. The
last two elements listed above are critical to this: currently, there are no
plans for trialling more than one revised model for national testing. However,
a cursory glance at the education research field shows that there is a range of
contrasting approaches to delivering the key functions of national testing,
many of which may well be presented to this Inquiry. It would therefore seem
important to trial more than one model rather than 'put all eggs in one basket'
or take forward only modifications of existing arrangements.
61. It
is unclear whether adequate safeguards have been put in place to protect
learners exposed to revised national assessment arrangements. Cambridge
Assessment recommends - in line with the standards being developed by the
Government's Social Research Unit -
that new protocols should be developed, as a matter of urgency, for the
trialling of revised arrangements.
National Assessment - Annexe 1
An overview of the evidence
1 Measurement error and the problems with overlaying levels onto marks
This
does not refer to human error or mistakes in the administration of tests but to
the issue of intrinsic measurement error. Contemporary standards in the US lead
to the expectation that error estimates are printed alongside individual
results: such as '...this person has 3592 (on a scale going to 5000 marks) and
the error on this test occasion means that their true score lies between 3582
and 3602...'. Is this too difficult for people to handle (i.e. interpret)? In
the current climate of increasing statistical literacy in schools, it should
not be. Indeed, results could be presented in many innovative ways which better
convey where the 'true score' of someone lies.
Error
data are not provided for national tests in England, and both the Statistics Commission
and commentators (e.g. Wiliam, Newton, Oates, Tymms) have raised questions as
to why this international best practice is not adopted.
Of
course, error can be reduced by changing the assessment processes - which most
often results in a dramatic increase in costs. Note 'reduce' not 'remove' - the
latter is unfeasible in mass systems. For example, double marking might be
adopted and would increase the technical robustness of the assessments.
However, this is impractical in respect of timeframes; it is already difficult
to maintain existing numbers of markers, etc. Error can be reduced by increased
expenditure but is escalating cost appropriate in the current public sector
policy climate?
One
key point to bear in mind is that one must avoid a situation where the error is
as large as, or larger than, the performance gains which one is expecting from
the system, and from schools - and indeed from teachers within the schools.
Unfortunately, 1-2% improvement lies within the bounds of error - get the level
thresholds wrong by two marks either way (and see the section on KS3 Science
below) and the results of 16,000 pupils (i.e. just over 2%) could be moved.
Measurement
error becomes highly significant when national curriculum levels (or any other
grade scale) are overlaid onto the scores. If the error is as above, but a cut
score for a crucial level is 48 (out of 120 total available marks) then getting
47 (error range 45-49) would not qualify that person for the higher level, even
though the error means that their true score could easily be above the level
threshold. In some cases the tests are not long enough to provide
information to justify choosing cut-scores between adjacent marks even though
the difference between adjacent marks can have a significant effect on the
percentages of the cohort achieving particular levels. There are, as a result,
problems of misclassification in the levels applied. Wiliam reports that 'it is likely that the
proportion of students awarded a level higher or lower than they should be
because of the unreliability of the tests is at least 30% at key stage 2 and
may be as high as 40% at key stage 3'.
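A simple simulation (in Python, with an invented score distribution and an error figure echoing the worked example above) shows the mechanism by which measurement error produces misclassification at a cut score; the percentages it prints are illustrative, not estimates:

```python
import random

random.seed(42)

CUT = 48          # the crucial level threshold (out of 120 marks)
ERROR_SD = 2.0    # measurement error, echoing the 45-49 range around 47
N = 100_000       # simulated pupils; the cohort distribution is invented

misclassified = 0
for _ in range(N):
    true_score = random.gauss(55, 15)
    observed = true_score + random.gauss(0, ERROR_SD)
    if (true_score >= CUT) != (observed >= CUT):   # wrong side of the cut
        misclassified += 1

print(f"Misclassified at this one threshold: {misclassified / N:.1%}")
# Each additional threshold on the same short test adds further
# misclassification - which is how overall figures of the size Wiliam
# reports can arise.
```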
Criterion referencing fails to work well
since question difficulty is not solely determined by curriculum content. It
can also be affected by 'process difficulty' and/or 'question or stimulus
difficulty', (Pollitt et al). It is also difficult to allocate curriculum to
levels since questions testing the same content can cover a wide range of
difficulty.
It
is believed that error could be communicated meaningfully to schools, children,
parents and the press, and would enhance both intelligence to ministers and the
educational use of the data from national tests.
The
current practice of overlaying levels onto the scores brings serious problems
and it is clear that the use of levels should be reviewed. One key issue:
consider the following:
Level 5 upper boundary
-------------------
    Child C

    Child B
Level 4 upper boundary
-------------------
    Child A
Level 3 upper boundary
-------------------
Both
Child B and Child C are level 5. But in fact Child A and B are closer in
performance, despite A being level 4 and B being level 5. Further, if Child A
progresses to the position of Child B over a period of learning, they have increased
by one level. However, if Child B progresses to the same position as Child C,
they have progressed further than Child A over the same time, but they do not
move up a level. Introducing sub-levels has helped in some ways (4a, 4b etc)
but the essential problem remains.
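The essential problem can also be put numerically (marks and level boundaries invented for illustration):

```python
# Numerical version of the diagram above, with invented marks and level
# boundaries: level 4 spans 40-59 marks, level 5 spans 60-79.
def level(mark: int) -> int:
    return 3 if mark < 40 else 4 if mark < 60 else 5

child_a, child_b, child_c = 58, 62, 78   # hypothetical marks

print(level(child_a), level(child_b))    # 4 5 - A and B differ by a level
print(level(child_b), level(child_c))    # 5 5 - B and C share a level

# Yet A and B are only 4 marks apart, while B and C are 16 marks apart.
# Measured in levels, A reaching B's mark gains a whole level; B reaching
# C's mark - a larger improvement - gains nothing.
print(level(child_b) - level(child_a))   # 1
print(level(child_c) - level(child_b))   # 0
```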
2 QCA practice in test development
Pretesting
items is extremely helpful; it enables the performance characteristics of each
item to be established (particularly how relatively hard or easy the item is).
This is vital when going into the summer levels setting exercise - it is known
what is being dealt with in setting the mark thresholds at each level. But
subject officers and others involved in the management of the tests have had a
history of continuing to change items after the second pretest, which
compromises the data available to the level setting process, and thus impacts
on maintaining standards over time. In addition, the 'pretest effect' also
remains in evidence - learners are not necessarily as motivated when taking
'non-live' tests; they may not be adequately prepared for the specific content
of the tests; etc. This limits the value of the pretest as a
predictor of the performance of the live test.
3 Borderlining
The
decision was taken early in national assessment to re-mark all candidates who
fall near to a level threshold. QCA publishes the mark range which qualifies
children for a re-mark. However, the procedure has been applied only to those
below the threshold and who might move up, and not to those just above, who
might move down. This has had a very distorting effect on the distributions.
Although done in the name of fairness, the practice is seriously flawed. For
years, arguments around changing the procedure or removing borderlining
completely foundered on the fact that this would effect a major (downward)
shift in the numbers gaining each level, and therefore could not be sanctioned
politically. A poorly-designed and distorting practice therefore continued.
Current practice is unjustifiable and would not be sanctioned in other areas of
public awarding (e.g. GCSE and A/AS).
It
has now been agreed between QCA and the DfES that borderlining will be removed
in 2008, when the marking contract passes from Pearson's to the American ETS
organization. At this point, a recalibration of standards could be effected to
mask the effect of correction and this standard could be carried forward, or a
clear declaration could be made on how removal of borderlining affects the
'fairness' of the test and has resulted in a change in the numbers attaining a
given level. First identified by Quinlan and Scharaskin in 1999, this issue has
been a long-running systemic problem. Again, it is a change in practice
(alongside changes in the accessibility of the tests, in inclusion of mental arithmetic
etc) which compromises the ability of the tests to track change in attainment
standards over time.
4 Fluctuations in Science at KS3
At
levels 6 and above, standards of attainment have moved up and down in an
implausible fashion:
Year    % of children gaining levels 6 and 7
2005    37
2004    35
2003    40
2002    34
2001    33
The
movement over the period 2002 to 2004 has involved a six percentage-point
increase followed by a five-point decrease - a total movement of 11 points
over two years. This is
implausible, and points to problems in the tests and level setting, and not to
a real change in underlying standards or in the cohort taking the tests.
Significantly, when interviewed on causes, officials and officers gave very
different explanations for the effect - in other words, the true cause has not
been established with precision.
5 The Massey Report and Tymms' analysis
The
Massey Report used a highly robust method to triangulate national tests
1996-2001 and yielded solid evidence that attainment standards rose over that
period, though not to the extent, in all subjects and all key stages, that has
been argued by DfES and ministers. Tymms' less robust method and research
synthesis suggests broadly the same. Massey made a series of recommendations,
some of which have been adopted by QCA, such as equating against a number of
years' tests and not just the preceding year. However, no consistent
triangulation method has been adopted, and the Massey recommendation that
standards should be held for five years and then publicly recalibrated has not
been implemented.
6 Ofsted's over-dependence on national test outcomes
The
new Ofsted inspection regime is far more dependent on the use of national
assessment data than previously. This delivers putative economies since Ofsted
feels it can better identify problematic and successful schools, and can use
the data to target areas of schools - e.g. weak maths departments, or poor
science etc. The revised regime is broadly welcomed by schools, and has a sound
emphasis on each school delivering on its stated policies. But the regime fails
to acknowledge the weaknesses of the data which lie at the heart of the
pre-inspection reports, and which guide Ofsted on the performance of schools.
The greatly increased structural dependence on data which is far less accurate
than is implied is problematic. The new regime delivers some valuable functions
- but the misapprehension of the real technical rigour of the assessment data
is a very serious flaw in arrangements.
7 Assessment overload accusations whilst using many other non-statutory tests
This is an interesting phenomenon - the
optional tests are liked, the statutory tests are frequently disliked (QCA).
KS2 scores are mistrusted (ATL). The use of 'commercial' CAT tests and CEM's
tests (MIDYIS etc) is widespread. CAT scores are trusted by teachers because
the results are more stable over time in comparison with national curriculum
tests; this reflects the different purpose of the respective instruments.
Children say 'I did SATs today' when they do a statutory key stage test. They
also frequently say that when they have taken a CAT test. There is widespread
misunderstanding of the purpose of the range of tests which are used. QCA was
lobbied over a five-year period to produce guidance on the function of
different tests - not least to clarify the exact purpose of national testing.
However, no such guidance has been produced. As a result of this, the arguments
regarding 'over-testing' are extremely confused, and adversely muddy the waters
in respect of policy.
8 Is the timing right?
Changing the timing of the tests would
require a change in primary legislation. However, it is an enhancement of
testing which should be considered very seriously. In the final report of the
Assessment Review Group in Wales, Daugherty (2004) recommends that 'serious
consideration should be given to changing the timing of Key Stage 3 statutory
assessment so that it is completed no later than the middle of the second term
of Year 9'. The Group believed the current timing to be unhelpful in relation
to a process that it could, in principle, inform, and that 'one source of
information that would be of use potentially to pupils and their parents is not
available until after the choice of pathway for Year 10 and beyond has been
made'. There are also implications for the potential use of Key Stage 1 and 2
data for transition between phases. 'School ownership' - taking the outcomes
very seriously in managing learning - would be likely to increase with this
re-scheduling of the tests.
9 The reliability of teacher assessment
Particularly in the high stakes context
of performance tables, we feel that relying on teacher assessment, as currently
operated, is not a robust option. Work in 2000 by the QCA Research Team showed a
completely unstable relationship between TA and test scores over time at school
level. This is compelling evidence against an over-dependence on teacher
assessment. There are means of delivering moderated teacher assessment for
reporting to parents, and bolstering accountability not by testing but by
regional inspection based on high expectations and school improvement models
(see the recommendations above). National standards in underlying attainment could
be delivered through a light sampling model (with matrix sampling to cover all
key content of the national curriculum). This would enable a valid answer to
the ministerial question '....nationally, what's happening to standards in
English?'.
10 Teaching to the test
The recent lobbying by Baroness Susan Greenfield and eminent colleagues is
merely the latest critique of the problems of teaching to the test. The 'Texas
Test Effect' (Wiliam, Oates) is well known but poorly presented to Government.
Bill Boyle's work (CFAS) provides the latest empirical study of the adverse
effects of teaching to the test and of its almost universal domination of
educational purposes in the English school system. It is a very serious issue,
and it may be one significant factor (though not the sole one) lying behind the
'plateau effect' associated with the majority of innovations such as the
Primary Literacy and Numeracy Strategies. In other words, a succession of
well-intended and seemingly robust initiatives has repeatedly run out of steam.
National Assessment - Annexe 2
The Assessment of Performance Unit - should it be re-instated?
The origins and demise of the APU
1. The inability of current arrangements to provide a robust flow of policy
intelligence on trends in pupil attainment has emerged as a serious problem.
The causes are multifaceted, and include:
- instability in standards within the testing system (Massey, Oates, Stats Commission)
- acute classification error affecting assignment of pupils to levels (Wiliam, Tymms)
- teaching to the test/'Texas Test Effect' (Wiliam)
2. Growing awareness of this issue has prompted increasing calls for '...a
return to the APU...' (the Assessment of Performance Unit) - a separate, 'low
stakes', light-sampling survey for the purpose of reliable detection of
patterns of pupil attainment, and of trends in attainment over time. But there
are dangers in an unreflective attempt to re-instate arrangements which
actually fell short of their aims. The APU processes were innovative and
progressive. They mirrored the forerunning US NAEP (National Assessment of
Educational Progress) and pre-dated the arrangements now in place in New
Zealand and Scotland. The APU ran surveys from 1978 to 1988; politicians and
civil servants then saw it as redundant in the face of the data on each and
every child aged 7, 11 and 14 which would be yielded by National Curriculum
assessment processes. The APU was hardly problem-free. Significant issues
emerged in respect of:
· under-developed sampling frames
· tensions between subject-level and component-level analysis and reporting
· differing measurement models at different times in different subjects
· lack of stability in item forms
· escalating sampling burden
· difficulty in developing items for the highest attaining pupils
· the nature of reporting arrangements
· replacement strategy in respect of dated items
· ambiguity in purpose re 'process skills' as a principal focus versus curriculum content
· research/monitoring tensions
· compressed development schedules resulting from pressure from Government
3. There was acute pressure on the APU to deliver. Rather than recognize that
the function of the APU was essential for high-level policy processes and
would need to persist (NAEP has been in place in the US since 1969), the
intense pressure led to poor refinement of the technical processes which
underpinned the operation of the APU, and to high turnover in the staff of the
different subject teams (Gipps and Goldstein 1983). Crucially, the compressed
piloting phases had a particularly adverse impact; there was no means of
undertaking secure evaluation of initial survey work and feeding in 'lessons
learned':
'...the mathematics Group, in particular, felt that they were continually
being rushed: their requests for a delay in the monitoring programme were
rejected; their desire for three pilot surveys was realized as only one; they
experienced a high turnover of staff and a resulting shortage of personnel.
The constant rush meant that there was no time for identifying and remedying
problems identified in the first year of testing.'
4. 'In fact, all three teams suffered from a rapid turnover of staff, put down
to the constant pressure of work combined with a lack of opportunity to "side
track" into interesting research issues...' (Newton P 2005, p14)
5. This is not a trivial failure of a minor survey instrument. The APU was
founded in 1974 after publication of a DES White Paper (Educational
Disadvantage and the Needs of Immigrants). It was the result of a protracted
strategic development process, which led from the DES-funded Working Group on
the Measurement of Educational Attainment (commissioned in 1970) and the
NFER's DES-funded development work on Tests of Attainment in Mathematics in
Schools. If it had successfully attained its objectives, it would have
relieved National Curriculum testing of the burden of attempting to measure
standards over time - a purpose which has produced some of the most intense
tensions amongst the set of functions now attached to national testing.
Stability in the instruments is one of the strongest recommendations emerging
from projects designed to monitor standards over time. In sharp tension with
this, QCA and the State have - in line with commitments to high quality
educational provision, the standards agenda, and responses from review and
evaluation processes - sought to optimize the National Curriculum by
successive revision of content, by increasing the 'accessibility of tests',
and by ensuring tight linkage of the tests to specific curriculum content.
These are laudable aims - and the emphasis on the diagnostic function of the
data from tests has been increasing in recent innovations in testing
arrangements. But pursuit of these aims has led to repeated revision rather
than stability in the tests.
6. The Massey Report suggested that if maintenance of standards over time
remained a key operational aim, then stability in test content was imperative.
In the face of these tensions, maintaining an APU-style light sampling survey
method would de-couple national assessment from the requirement to deliver
robust information on national educational standards, and would allow testing
to reflect curriculum change with precision, to optimize the learning-focussed
functions of testing, and to innovate constantly in the form of tests (e.g. to
optimize accessibility).
7. Thus, the deficits and closure of the APU were, and remain, very serious
issues in the operation and structure of national assessment arrangements.
Temporal discontinuity played a key role in the methodological and technical
problems experienced by the APU developers. As outlined above, rushing the
development phases had a variety of effects, but the most serious of these was
the failure to establish with precision a clear set of baseline data,
accompanied by stable tests with known performance data: '...an effective
national monitoring system cannot be brought "on stream" in just a couple of
years...' (Newton P, 2005).
8. Our conclusion is not 'bring back the APU', but rather: develop a new light
sampling, matrix-based model, using knowledge from the systems in use in other
nations and insights from the problems of the APU. Models 1 and 2, which we
outline as alternatives in the main body of this evidence, rely on the
development of new versions of the APU rather than on simple re-instatement.
Section 2
5. Higher education admissions tests
Determining role and function
1. Since the publication, in September 2004, of the Schwartz report (Fair
Admissions to Higher Education: Recommendations for Good Practice), the role
and function of admissions tests has been a controversial area. Cambridge
Assessment has been cautious in its approach to this field. We have based our
development programme on carefully-considered criteria. We believe that
dedicated admissions tests should:
· produce information which does not duplicate information from other assessments and qualifications
· make a unique and useful contribution to the information available to those making admissions decisions
· predict students' capacity to do well in, and benefit from, higher education
2. Since the Cambridge Assessment Group includes the OCR awarding body, we are
also heavily involved in refining A levels in the light of the 'stretch and
challenge' agenda - working to include A* grades in A levels, to include more
challenging questions, and to furnish unit and UMS scores (Uniform Mark Scheme
scores - a mechanism for equating scores from different modules/units of
achievement) as a means of helping universities in the admissions process.
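To illustrate the equating mechanism (a minimal sketch only; the boundary figures below are hypothetical, and actual conversion tables are set series by series): raw unit marks are mapped onto the uniform scale by linear interpolation between matched grade boundaries, so that a given uniform mark carries the same meaning whichever unit or examination session produced it.

    # Hypothetical anchor points for one unit: (raw mark boundary, uniform mark boundary).
    # Max raw mark 80, max uniform mark 100; each grade boundary sits at a fixed UMS point.
    ANCHORS = [(0, 0), (30, 40), (38, 50), (46, 60), (54, 70), (62, 80), (80, 100)]

    def raw_to_ums(raw: int) -> float:
        """Piecewise-linear conversion of a raw unit mark to a uniform mark."""
        for (r0, u0), (r1, u1) in zip(ANCHORS, ANCHORS[1:]):
            if r0 <= raw <= r1:
                return u0 + (u1 - u0) * (raw - r0) / (r1 - r0)
        raise ValueError("raw mark outside the unit's mark range")

    # A harder session would have lower raw boundaries, but a candidate at the
    # same grade boundary always receives the same uniform mark.
    print(raw_to_ums(54))  # -> 70.0

The design choice this illustrates is that difficulty differences between sessions are absorbed in the raw boundaries, leaving the uniform scale stable for aggregation across units.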
3. We recognize that HE institutions have clear interests in identifying, with
reasonable precision and economy, those students who are most likely to
benefit from specific courses, to do well, and not to drop out of the course.
We also recognize that there is a strong impetus behind the 'widening
participation' agenda.
4. Even with the proposed refinements in A level and the move to
post-qualification applications (PQA), our extensive development work and
consultation with HE institutions have identified a continuing need for
dedicated assessment instruments which facilitate effective discrimination
between high attaining students, and which are also able to identify those
students who possess potential but who have attained lower qualification
grades for a number of reasons.
5. We are very concerned not to contribute to any unnecessary proliferation of
tests, and so have been careful to develop tests only where they make a unique
and robust contribution to the admissions process, enhance that process, and
do not replicate information from any other source. To these ends, we have
developed the BMAT for medical and veterinary admissions. We have developed
the TSA (Thinking Skills Assessment), which is being used for admissions to
some subjects in Cambridge and Oxford and is being considered by a range of
other institutions. The TSA items (questions) also form part of uniTEST, which
was developed in conjunction with ACER (Australian Council for Educational
Research). uniTEST is being trialled with a range of institutions, both
'selecting' universities and 'recruiting' universities.
6. uniTEST is designed specifically to help with the widening participation
agenda. Preliminary data suggest that it is useful in helping to identify
students who are capable of enrolling on courses at more prestigious
universities than the ones to which they have applied, as well as those who
should consider HE despite low qualification results.
7. The TSA should be seen as a test resource rather than as a specific test:
TSA items are held in an 'item bank', which is used to generate tests for
different institutions. Although TSA items were originally developed for
admissions processes in Cambridge - where discrimination between very high
attaining students is problematic and A level outcomes are inadequate as a
basis for admissions decisions - the Cambridge Assessment research team is
developing an 'adaptive TSA'. This utilizes the latest measurement models and
test management algorithms to create tests which are useful across a very
broad range of abilities. The essential logic of adaptive testing is sketched
below.
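As an illustration only (this is not the TSA's actual algorithm or measurement model), the following sketch shows the core adaptive-testing loop under a simple Rasch model: each successive item is chosen from the bank to be maximally informative at the current ability estimate, which is then updated in the light of the response.

    import math
    import random

    def p_correct(theta: float, b: float) -> float:
        """Rasch model: probability of a correct response at ability theta, item difficulty b."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def information(theta: float, b: float) -> float:
        """Fisher information of an item at ability theta (largest when theta equals b)."""
        p = p_correct(theta, b)
        return p * (1.0 - p)

    def adaptive_test(bank, respond, n_items=10):
        """bank: list of item difficulties; respond(b) -> bool collects (here, simulates) a response."""
        theta, used = 0.0, set()
        for _ in range(n_items):
            # choose the unused item that is most informative at the current estimate
            i = max((j for j in range(len(bank)) if j not in used),
                    key=lambda j: information(theta, bank[j]))
            used.add(i)
            correct = respond(bank[i])
            # crude stepwise update of the ability estimate towards the evidence
            theta += 0.7 * ((1.0 if correct else 0.0) - p_correct(theta, bank[i]))
        return theta

    # Simulate a candidate of true ability 1.2 against a hypothetical 40-item bank.
    random.seed(2)
    bank = [random.uniform(-3, 3) for _ in range(40)]
    estimate = adaptive_test(bank, lambda b: random.random() < p_correct(1.2, b))
    print(round(estimate, 2))

An operational system would use a proper maximum-likelihood or Bayesian ability update and constraints on content balance and item exposure; the sketch shows only why a bank-based adaptive test can discriminate across a much broader ability range than a single fixed paper.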
8. The validation data for the TSA items are building into a large body of
evidence, and the tests are yielding correlations which suggest that they are
both valid and useful in admissions - and that they do not replicate
information from GCSE and AS/A2 qualifications. In other words, they are a
useful addition to the information from these qualifications and allow more
discriminating decisions to be made than when information from those
qualifications is used alone. In addition, they yield information which is
more reliable than the decisions made through interviews, and they will
provide a stable measure over the period in which major changes are made to AS
and A2 qualifications.
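The kind of check involved can be illustrated with a minimal sketch (Python with numpy; the data below are synthetic and purely hypothetical): if adding the test score to prior attainment raises the variance explained in a course outcome, the test is contributing information not already carried by the qualifications.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    ability = rng.normal(size=n)                        # latent aptitude (hypothetical)
    a_level = ability + rng.normal(scale=0.8, size=n)   # prior attainment, measured with noise
    test    = ability + rng.normal(scale=0.8, size=n)   # admissions test, measured with noise
    outcome = ability + rng.normal(scale=0.6, size=n)   # first-year course result

    def r_squared(X, y):
        """Variance in y explained by a least-squares fit on predictors X."""
        X = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1.0 - resid.var() / y.var()

    print("A level alone:  R^2 =", round(r_squared(a_level, outcome), 3))
    print("A level + test: R^2 =", round(r_squared(np.column_stack([a_level, test]), outcome), 3))

On such synthetic data the second figure exceeds the first because each noisy measure carries some independent signal about the latent aptitude; real validation work, of course, rests on actual cohort data rather than simulation.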
The American SAT
9. Cambridge Assessment supports the principles which are being promoted by
the Sutton Trust and the Government in respect of widening participation.
However, we have undertaken evaluation work which suggests that the promotion
of the American SAT test as a general admissions test for the UK is
ill-founded. The five-year SAT trial in the UK is part-funded by Government
(£800,000), with the College Board (the test's developers) contributing
£400,000 and the Sutton Trust and NFER each contributing £200,000.
10. The literature on the SAT trial in the UK states that the SAT1 is an
'aptitude' test. It also makes two strong claims that are contested:
"...Other selection tests are used by universities in the United Kingdom, but
none of these is as well constructed or established as the SAT®.
In summary, a review of existing research indicates that the SAT® (or similar
reasoning-type aptitude test) adds some predictive power to school/examination
grades, but the extent of its value in this respect varies across studies. In
the UK, it has been shown that the SAT® is an appropriate test to use and that
it is modestly associated with A-level grades whilst assessing a different
construct. No recent study of the predictive power of SAT® results for
university outcomes has been undertaken in the UK, and this proposal aims to
provide such information..."
Source: http://www.nfer.org.uk/research-areas/pims-data/outlines/update-for-students-taking-part-in-this-research/a-validity-study-background.cfm
11. The claim that 'none of these is as well constructed or established as the
SAT®' fails to recognize that Cambridge Assessment has assembled comprehensive
data on the specific tests amongst its suite of admissions tests, and ensures
that validity is at the heart of the instruments. These are certainly not as
old as the SAT, but it is entirely inappropriate to conflate quality of
construction with duration of use.
12. More importantly, the analysis below suggests that the claim that the SAT1
is a curriculum-independent 'aptitude' test is deeply flawed. This is not the
first time that this claim has been contested (Jencks C and Crouse J; Wolf A
and Bakker S), but it is the first time that such a critique has been based on
an empirical study of content.
13. It is important to note that the SAT is under serious criticism in the US
(Cruz R; New York Times) and also that, despite many UK commentators'
assumptions, the SAT1 is not the sole, or pre-eminent, test used as part of US
HE admissions (Wolf A and Bakker S). The SAT2 is increasingly used - this is
an avowedly curriculum-based test. Similarly, there has been a substantial
increase in the use of the Advanced Placement scheme - subject-based courses
and tests which improve students' grounding in specific subjects, and which
are broadly equivalent to English Advanced Level subject-specific
qualifications.
14. It is also important to note that (i) the US does not have
standard national examinations - in the absence of national GCSE-type
qualifications, a curriculum-linked test such as the SAT1 is a sensible
instrument to have in the US, to guarantee that learners have certain
fundamental skills and knowledge - but GCSE fulfils this purpose in England;
(ii) the USA has a four-year degree structure, with a 'levelling' general
curriculum for the first year; and (iii) the SAT1 scores are used alongside
college grades, personal references, SAT2 scores and Advanced Placement
outcomes.
"...One of the
misunderstood features of college selection in America is that SATs are only
one component, with high school grades and other 'portfolio' evidence playing a
major role. The evidence is that high school grades are a slightly better
predictor of college achievement than SAT scores, particularly for females and
minority students. Combining both provides the best, though still limited,
prediction of success..."
(Stobart G)
Curriculum mapping - does the SAT mirror current arrangements?
15. In the light of research comment on the SAT and emerging
serious criticisms of the instrument in its home context, Cambridge Assessment
commissioned a curriculum mapping of the SAT in 2006 comparing it with content
in the National Curriculum (and, by extension, GCSE) and the uniTEST.
16. It is surprising that such a curriculum content mapping has
not been completed previously. Prior studies (McDonald et al) have focused on
comparison of outcomes data from the SAT and qualifications (e.g. A level) in
order to infer whether the SAT is measuring something similar or different to
those qualifications. But the failure to undertake a comparison of the SAT with
the content of the English National Curriculum is a serious oversight. The
comparison is highly revealing.
17. The study consisted of a comparison of published SAT
assessment criteria, items included in SAT1 sample papers, the National
Curriculum programmes of study, and items within the uniTEST. The SAT
assessment criteria and National Curriculum programmes of study were checked
for analogous content. The National Curriculum reference of any seemingly
relevant content was then noted and checked against appropriate SAT1 specimen
items. The full analysis was then
verified by researchers outside the admissions team, who were fully acquainted
with the content of the National Curriculum and GCSEs designed to assess
National Curriculum content. The researchers endorsed the analysis completed by
the admissions test developers.
The outcomes of the curriculum mapping study
18. The full results are shown in Higher education admissions
tests annexe #1. Column 1 shows the sections and item content of the SAT1.
Column 2 gives the reference number of the related National Curriculum content.
For example, Ma3 2i refers to the statement:
Mathematics, Key Stage 4 Foundation
Ma3 Shape, space and measures
Geometrical reasoning 2: Properties of circles
'recall the definition of a circle and the meaning of related terms, including
centre, radius, chord, diameter, circumference, tangent, arc, sector, and
segment; understand that inscribed regular polygons can be constructed by
equal division of a circle'
19. Column 3 in Annexe #1 shows the relation between the content of the SAT1,
the relevant components of the National Curriculum, and the Cambridge/ACER
uniTEST admissions test.
20. The analysis indicates that:
· the SAT1 content is largely pitched at GCSE-level curriculum content in English and Maths, and replicates GCSE assessment of that content;
· the item types and item content in the SAT1 are very similar to those of GCSEs.
It is therefore not clear exactly what the SAT1 is contributing to assessment
information already generated by the national examinations system in England.
21. Previous appraisals of the SAT1 have been based on correlations between
GCSE, A level and SAT1 outcomes. These have shown less than perfect
correlation, which has been interpreted as indicating that the SAT1 assesses
something different from GCSE and A level. But GCSE and A level are based on
compensation - particularly at lower grades, the same grade can be obtained by
two candidates with very different profiles of performance (see the
illustration below). The inferences from the correlational data were
previously made in the absence of a curriculum mapping. The mapping suggests
that discrepancies between SAT1 and GCSE/A level outcomes may arise because
candidates who do not succeed in certain areas of these exams nonetheless gain
a reasonable grade - with the SAT1 then re-assessing those areas and finding
their performance wanting.
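A minimal sketch (hypothetical marks throughout) of how compensation works: two candidates with sharply different profiles reach the same total, and hence the same grade, so a later instrument that re-assesses one weak area can legitimately diverge from the grade without measuring a different construct.

    # Hypothetical marks across four content areas of a compensatory exam (25 marks each).
    candidate_a = {"number": 24, "algebra": 23, "geometry": 8, "statistics": 20}
    candidate_b = {"number": 15, "algebra": 14, "geometry": 23, "statistics": 23}

    for name, marks in (("A", candidate_a), ("B", candidate_b)):
        total = sum(marks.values())  # both total 75/100 -> same grade awarded
        print(f"Candidate {name}: total {total}/100, profile {marks}")

    # A subsequent test weighted towards geometry would separate A and B,
    # producing an imperfect correlation with the grade despite shared content.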
22. The existence of such comprehensive overlap suggests that the SAT1 either
represents an unnecessary replication of GCSE assessment or provides an
indication of the problems of compensation in the established grading
arrangements for GCSE.
23. An identical analysis of uniTEST, currently being piloted and developed by
Cambridge Assessment and ACER, suggests that uniTEST does not replicate GCSE
assessment to the same extent as the SAT1, but focuses on underlying thinking
skills rather than on formal curriculum content. There is some overlap in the
areas of verbal reasoning, problem solving, and quantitative and formal
reasoning. There are, however, substantial areas of content which are covered
neither in the National Curriculum statements of attainment nor in the SAT1:
critical reasoning and socio-cultural understanding. This suggests that
uniTEST is not replicating GCSE and does offer unique measurement. Preliminary
data from the pilot suggest that uniTEST is detecting learners who might
aspire to universities of higher ranking than the ones to which they have
actually applied.
June 2007