EVALUATING ASSESSMENT SYSTEMS
The following report highlights five common
confusions related to the evaluation of educational assessment
1. the difference between validation and
evaluation, where validation (concerning the accuracy of inferences
from results) is merely one component of evaluation;
2. the meaning of "purpose" in
"fitness-for-purpose", which can be interpreted in a
variety of different ways, all of which are (differently) relevant
3. the number of purposes which can be identified,
which is much higher than tends to be appreciated (for example,
national curriculum test results are probably used for at least
14 different purposes);
4. why it matters when results are used for
many different purposes, which is because different uses require
that different kinds of inference be drawn from results, so results
that warrant accurate inferences for one purpose may not warrant
accurate inferences for another; and
5. the many components of a rigorous evaluation,
which include analysis from the perspectives of technical accuracy,
moral defensibility, social defensibility, legal acceptability,
economic manageability and political viability.
The report emphasises the importance of: distinguishing
the different meanings of similar terms; distinguishing logically
separable evaluation questions; and distinguishing the many alternative
perspectives on evaluation.
1.1 This report explores the concept of
evaluation, as it applies to educational assessment systems; and
then presents a framework for evaluating them.ii It is intended
as a tool for helping the Education and Skills Select Committee
to grapple with the many questions which comprise its New Inquiry
into Testing and Assessment, particularly the very broad ones,
is the testing and assessment in
"summative" tests (for example, GCSE, AS, A2) fit for
should the system of national tests
be changed? If so, should the tests be modified or abolished?
1.2 This report does not offer a view on
the legitimacy of our present national assessment systems. Instead,
it offers a way of organising evidence and argument in order to
reach such a view. It helps to identify what makes for a good
evaluation question and what makes for a good evaluation conclusion:
issues which can seem deceptively straightforward at first glance.
In particular, it aims to expose a number of common confusions
which can mislead the unwary inquirer.
1.3 This report offers generic insights,
which apply in the same way across the spectrum of educational
assessment systems (occupational, vocational, general; tests,
examinations, teacher assessments; on paper, on-screen, online;
2. What does evaluation entail?
2.1 The first confusion to confront is the
nature of evaluation itself. An easy mistake to make is to reduce
the big programme of evaluation to the smaller programme of validation.iii
The central question at the heart of validation is this: are the
inferences that we draw from our assessment results sufficiently
accurate (for the uses to which they will be put)? Or, less formally:
are our results accurate or not? Although this is a necessary
and fundamental component of evaluation, it is still only one
component. Evaluation requires the inquirer to consider any question
that might bear upon the legitimacy of the assessment system,
might the way in which test results
are reported have positive or negative impacts (eg, is it better
simply to rank students or to tell them how much of a programme
of study they have "mastered")?
might the fact of testing itself
have positive or negative impacts (eg, does the inevitable "washback"
support or detract from good teaching and learning)?
2.2 Evaluation entails marshalling as much
relevant evidence and argument as possible, to judge whether systems
work as they are intended to and in the best interests of participants,
stakeholders and society. The central question at the heart of
evaluation is this: are our assessment systems fit-for-purpose?
3. What does "fit-for-purpose" mean?
3.1 A second confusion to confront is the
meaning of fitness-for-purpose. Before exploring the concept of
"fitness" we need to work out what we mean by "purpose".
This is not as straightforward as it might sound. Consider the
following three interpretations.
1. The purpose of assessment is to generate
a particular kind of result. For example, students sit an exam
in GCSE science to rank them in terms of their end-of-course level
2. The purpose of assessment is to enable
a particular kind of decision. For example, students sit an exam
in GCSE science so that we can decide whether they have learned
enough of the basic material to allow them to enrol on an A level
3. The purpose of assessment is to bring
about a particular kind of educational or social impact. For example,
students sit an exam in GCSE science to force them to learn the
subject properly, and to force their teachers to align their teaching
of science with the national curriculum.
3.2 Obviously, to judge whether a system
is fit-for-purpose, an evaluator needs to begin by identifying
the purpose, or purposes, for which the system is supposed to
be fit. However, if the evaluator is confused by the different
possible meanings of "purpose", no satisfactory conclusion
will be reached. This is why it is essential to distinguish these
different interpretations; all of which are perfectly reasonable;
and all of which need to be considered in their own right when
mounting an evaluation.
3.3 As it happens, there is yet another
interpretation to be wary of:
4. The purpose of the qualification is to
bring about a particular kind of educational or social impact.
For example, students study GCSE science to support progression
to a higher level of study (for those who wish to), and to equip
all students with sufficient scientific literacy to function adequately
as 21st century citizens.
Again, this is a perfectly reasonable interpretation
of "purpose". However, strictly speaking, it is not
within the scope of an evaluation into the legitimacy of an assessment
system. Instead, it implies a broader evaluation remit, into the
legitimacy of an educational programme. It would be perfectly
possible to have a legitimate educational programme with an illegitimate
assessment system; and vice versa. The two evaluation foci need
to be kept quite separate.
4. How many purposes are there?
4.1 A third confusion concerns the number
of purposes which need to be considered when evaluating an assessment
system. This is best illustrated by considering the uses to which
assessment results are put (interpretation 2 above). In England,
we have become familiar with classification schemes such as that
presented in 1988 by the Task Group on Assessment and Testing:
formative uses (assessment for learning)
summative uses (assessment of learning)
evaluative uses (assessment for accountability)
diagnostic uses (assessment for special
4.2 Although this kind of scheme is useful,
it fails to convey the full complexity of the situation. In fact,
there are many more categories of use to which educational assessment
results might be put. Figure 1 illustrates 22. The categories
presented in Figure 1 are quite looseand occasionally shade
into each otherbut the point isn't to present a definitive
taxonomy, merely to illustrate just how many possible kinds of
use there are. In fact, distinctions can often be made within
categories, between uses which might recommend quite differently
designed assessment systems (eg, long-, medium- and short-term
SOME EXAMPLES OF THE MANY KINDS OF USE TO
WHICH ASSESSMENT RESULTS CAN BE PUT
1. social evaluation (to judge the social
or personal value of students' achievements)
2. formative (to identify students' proximal
learning needs, guiding subsequent teaching)
3. student monitoring (to decide whether
students are making sufficient progress in attainment in relation
to expectations or targets; and, potentially, to allocate rewards
4. diagnosis (to clarify the type and extent
of students' learning difficulties in light of well-established
criteria, for intervention)
5. provision eligibility (to determine whether
students meet eligibility criteria for special educational provision)
6. screening (to identify students who differ
significantly from their peers, for further assessment)
7. segregation (to segregate students into
homogeneous groups, on the basis of aptitudes or attainments,
to make the instructional process more straightforward)
8. guidance (to identify the most suitable
courses, or vocations for students to pursue, given their aptitudes)
9. transfer (to identify the general educational
needs of students who transfer to new schools)
10. placement (to locate students with respect
to their position in a specified learning sequence, to identify
the level of course which most closely reflects it)
11. qualification (to decide whether students
are sufficiently qualified for a job, course or role in lifethat
is, whether they are equipped to succeed in itand whether
to enrol them or to appoint them to it)
12. selection (to predict which studentsall
of whom might, in principle, be sufficiently qualifiedwill
be the most successful in a job, course or role in life, and to
select between them)
13. licensing (to provide legal evidencethe
licenceof minimum competence to practice a specialist activity,
to warrant stakeholder trust in the practitioner)
14. certification (to provide evidencethe
certificateof higher competence to practise a specialist
activity, or subset thereof, to warrant stakeholder trust in the
15. school choice (to identify the most
desirable school for a child to attend)
16. institution monitoring (to decide whether
institutional performancerelating to individual teachers,
classes or schoolsis rising or falling in relation to expectations
or targets; and, potentially, to allocate rewards or sanctions)
17. resource allocation (to identify institutional
needs and, consequently, to allocate resources)
18. organisational intervention (to identify
institutional failure and, consequently, to justify intervention)
19. programme evaluation (to evaluate the
success of educational programmes or initiatives, nationally or
20. system monitoring (to decide whether
system performancerelating to individual regions or the
nationis rising or falling in relation to expectations
or targets; and, potentially, to allocate rewards or sanctions)
21. comparability (to guide decisions on
comparability of examination standards for later assessments on
the basis of cohort performance in earlier ones)
22. national accounting (to "quality
adjust" education output indicators)
5. Why does the large number of purposes matter?
5.1 Confusion number four concerns why it
matters that results can, and often are, used for multiple purposes.
Surely, some would claim, as long as assessment results are accurate,
then we ought to be able to use them for any purpose we like?
Unfortunately, it's not quite as straightforward as that. The
point is best illustrated by considering what it might mean to
explore validity for different uses of results.
5.2 As mentioned earlier, the central question
at the heart of validation is this: are the inferences that we
draw from our assessment results sufficiently accurate (for the
uses to which they will be put)? This has become the standard
technical definition, and the word "inference" is significant
because different kinds of inference may be drawnfrom the
same assessment resultto support different kinds of use.
This is not at all obvious, so it warrants a brief technical detour.
5.3 Assessment instruments are designed
to support specific kinds of inference. So, an end-of-Key-Stage
test will be designed primarily to support an inference concerning
a student's "level of attainment at the time of testing".
Let's call this the primary design-inference. And let's imagine,
for the sake of illustration, that our assessment instrumentour
Key Stage 2 science testsupports perfectly accurate design-inferences.
That is, a student who really is a Level X on the day of the test
will definitely be awarded a Level X as an outcome of testing.
5.4 In fact, when the test result is actually
used, the user is likely to draw a slightly (or even radically)
different kind of inference, tailored to the specific context
of use. Let's call this a use-inference. Consider, by way of example,
some possible use-inferences associated with the following result-based
1. A placement/segregation use. The inference
made by a Key Stage 3 head of sciencewhen allocating a
student to a particular set on the basis of a Key Stage 2 resultmay
concern "level of attainment at the beginning of the autumn
2. A student monitoring use. The inference
made by a Key Stage 3 science teacherwhen setting a personal
achievement target for a student on the basis of a Key Stage 2
resultmay concern "level of attainment at the end
of Key Stage 3".
3. A guidance use. The inference made by
a personal tutorwhen encouraging a student to take three
single sciences at GCSE on the basis of a Key Stage 2 resultmay
concern "general aptitude for science".
4. A school choice use. The inference made
by parentswhen deciding which primary school to send their
child to on the basis of its profile of aggregated results in
English, maths and sciencemay concern "general quality
5. A system monitoring use. The inference
made by a politicianwhen judging the success of educational
policy over a period of time on the basis of national trends in
aggregated results in English, maths and sciencemay concern
"overall quality of education".
5.5 Each of these result-based decisions/actions
is premised on the use of Key Stage 2 test results.iv Yet, in
each case, a slightly different kind of inference is drawn from
them. None of these use-inferences are precisely the same as the
primary design-inference (the inference that the Key Stage 2 test
result was primarily designed to support). Indeed, some of the
use-inferences are radically different in nature from the design-inference.
5.6 So, when it comes to validation (establishing
the accuracy of inferences from results for different purposes)
the implication should be clear: accuracy needs to be established
independently for each different use/inference. Results will inevitably
be less accurate when used as indicators of future attainment
than when used as indicators of attainment at the time of testing.
And results may be less accurate still when used as indicators
of general aptitude rather than as indicators of attainment. When
it comes to using results as indicators of quality of teaching,
or quality of education, we should expect less accuracy still,
since the qualitative difference between the design-inference
and the use-inference is so great.
5.7 This begins to ground the most important
observation of the present report: an assessment system which
is fit for one purpose may be less fit for another and could,
conceivably, be entirely unfit for yet another.v
5.8 Recall that, for the sake of illustration,
this section has focused purely upon the exploration of validity
for different uses of results. The full story of evaluation needs
to be far more embracing.
6. Can we construct a framework for system-level
6.1 The fifth and final confusion concerns
what an overall evaluation ought to look like. This is where we
begin to explore the concept of "fitness" in requisite
detail. There are at least six more-or-less discrete perspectives
from which assessment systems need to be evaluated:
5. economic manageability
Each of these will be considered briefly below.
6.2 Technical accuracy
6.2.1 The first evaluation perspective is
technical accuracy; essentially, the concept of validation. It
poses the question: overall, how accurate can we expect inferences
from results to be? And, as explained previously, this question
needs to be explored independently, for each discrete use of results,
ie, for each discrete use-inference.
6.2.2 Unfortunately, it isn't always obvious
which inference underlies (or ought to underlie) each use, which
complicates the matter greatly. An example from system monitoring
is helpful here. When considering trends in the percentage of
students who attain at or above Level 4 in science at Key Stage
2, are we (or ought we to be) drawing inferences concerning:
the level of attainment of specific
cohorts of students from one year to the next (where attainment
is defined in terms of an explicit programme of study in science)?
the level of proficiency of the national
cohort over time (where proficiency is defined in terms of an
implicit "fuzzy set" of essential core competencies
the level of performance of teachers
of science over time?
the overall effectiveness of policy
and practice related to the teaching of science over time?
6.2.3 The first of the above use-inferences
will be closest to the design-inference (being defined in terms
of an explicit programme of study) and would, therefore, be likely
to facilitate greatest accuracy. However, it's arguably of least
interest as far as system monitoring goes, because it's furthest
away from the ultimate system monitoring ideal of identifying
whether "things are better now than before". For instance,
in the first years of a new curriculum for science: we would expect
average attainment to increase gradually as teachers became better
at delivering the new curriculum (with more practice and training
in teaching the new elements, with an improved selection of curriculum-specific
text books and resources, and so on); and we would expect average
test performance to increase gradually as teachers became better
at preparing students for the specific form of assessment associated
with the new curriculum. Such gradual increases would seem to
be inevitable.vi However, they would not necessarily imply that
teachers were becoming better at teaching, per se; nor even that
they were necessarily becoming better at teaching science; nor
would it necessarily mean that students of the new curriculum
were more accomplished than students of the old curriculum. As
far as system monitoring is concerned, we probably ought to be
validating in terms of more distant use-inferences (eg, inferences
concerning the performance of teachers, or the overall effectiveness
of the system), since these have greater real-world significance.
Unfortunately, these are correspondingly much harder to validate.
6.2.4 In theory, the analysis of accuracy
is largely technical, using established methods for eliciting
evidence of content validity, predictive validity, reliability,
and so on. However, in practice, exactly how the various sources
of evidence are synthesised into an overall judgement of accuracy
is often not clear and, consequently, not that technical after
6.2.5 The logic of this perspective is essentially
that: all other things being equal, more accuracy is better; and
that accuracy must significantly exceed a threshold of chance.
6.3 Moral defensibility
6.3.1 The second evaluation perspective
is moral defensibility. It poses the question: given the likelihood
of inaccurate inferences from results, and the severity of consequences
of error for those assessed inaccurately, is the specified use
of results defensible?
6.3.2 This perspective starts by acknowledging
thatwithin any assessment systemthere will be a
proportion of students who get assessed incorrectly and, consequently,
for whom incorrect decisions will be made (be those selection
decisions, provision eligibility decisions, placement decisions,
and so on). It then proposes thateven if the system is
just as far as most students are concernedif it is sufficiently
unjust for a sufficiently high number of students, then the system
may have to be judged morally indefensible. This is analogous
to why many countries refrain from executing serial murderers.
It's not that execution, per se, is necessarily judged to be morally
indefensible; it's the risk of executing even a small number of
innocent people. So the assessment parallel might be:
when the stakes are low for studentsas
is often true of everyday formative assessmentit would
not matter too much if it were fairly error-prone (such errors
can often be identified quickly through ongoing dialogue)
but when the stakes are high for
studentsas when examination results are used for selectionit
would matter (such errors can often negatively affect life chances
time and time again).
6.3.3 This results in a utilitarian analysis
(emphasising the minimisation of "horror" more than
the maximisation of "utility") for which two kinds of
evidence need to be taken into account:
the amount of inaccuracy that might be expected (stemming from
the analysis of technical accuracy)
the severity of negative consequences for those students who are
6.3.4 This final point raises a fundamental
question for the evaluator: whose value judgements ought to be
taken into account in this analysis, and how? The answer is far
from clear, especially since different stakeholders (eg, politicians,
students, evaluators) are likely to have different values.
6.4 Social defensibility
6.4.1 The third evaluation perspective is
social defensibility. It poses the question: is the trade-off
between the positive and negative impacts from operating the assessment
system sufficiently positive?
6.4.2 On the one hand, there will inevitably
be a range of intended positive outcomes. In particular, the assessment
results ought to empower users to make important educational and
social decisions appropriately (such as selection decisions, placement
decisions, school choices, and so on); to enable society to function
more fairly and effectively than it otherwise would. Although,
in theory, it may be judged entirely possible to draw sufficiently
accurate inferences to support a range of important decisions;
in practice, that doesn't guarantee that users will actually do
so. So this needs to be investigated empirically. In addition,
features of the assessment system itself may well be designed
to facilitate important educational and social impacts (such as
the improved attainment of students when assessed through modular
rather than linear schemes) and these impacts need to be investigated
6.4.3 On the other hand, there will inevitably
also be a range of unintended, and possibly unanticipated, negative
outcomes. In particular, features of the assessment system which
appear to be innocuous may turn out not to be so. Consider, for
example, standards-referenced systems, which employ a single scale
to report absolute level of attainment at key stages of an educational
experience that may span many years (eg, the national curriculum
assessment system). The theory is that this should be motivating
for even the lowest-attaining students; since it enables them
to see that they are making progress as time goes by.vii However,
such systems could conceivably turn out to be demotivating for
precisely this group of students. Not only does the assessment
reveal them to have attained lower than their peers at each key
stage; they also see the gap between themselves and others widen
on each assessment occasion.
6.4.4 The evaluator needs to be careful
to distinguish those impacts which relate to the legitimacy of
the assessment system, per se, and those which relate primarily
to broader evaluation questions; for example, those concerning
the legitimacy of educational or social policies or practices.
School choice, for example, (even if based upon entirely accurate
inferences concerning the general quality of teaching at a school)
could conceivably result in a more socially divided society, which
might be judged to be a bad thing. These are obviously important
issues to be evaluated. However, they are issues for an evaluation
of the policy of school choice, rather than for an evaluation
of the assessment system which enables it. In practice, it is
actually quite complicated to judge which impacts bear primarily
upon the legitimacy of an assessment system and which relate primarily
to broader evaluation questions; but it is useful to recognise
the distinction and to try to work towards separation where possible.
A rough rule-of-thumb might be: would we expect a different kind
of impact if an alternative assessment system was in operation?
If so, then the impact probably ought to be considered. If not,
then the impact is probably attributable primarily to a broader
policy or practice and, therefore, probably ought not to be considered.
In the example above, relating to school choice, the "divided
society" impact might be expected to occur regardless of
the system used to generate results; so this impact might therefore
not be relevant to scrutinise during an evaluation into the legitimacy
of the underlying assessment system.
6.4.5 As with the moral defensibility perspective,
the social defensibility perspective requires that two kinds of
evidence be taken into account:
the nature and prevalence of relevant intended and unintended
the costs of the negative impacts and the benefits of the positive
The synthesis of this evidence is based upon
the utilitarian principle that: if, on balance, there appears
to be too little benefit, for too few individuals, then the system
may have to be judged socially indefensible.
6.5 Legal acceptability
6.5.1 The legal acceptability perspective
asks: can the assessment system be operated without contravening
6.5.2 This is becoming increasingly salient,
both nationally and internationally. In England, the 1995 Disabilities
Discrimination Act introduced legal rights for people with disabilities
covering employment, access to services, education, transport
and housing. The 2005 version of the Act included a new chapter
which specifically covered qualification bodies; a provision which
is intended to be extended to general qualifications from 1 September
6.5.3 The new legislation raises questions
such as whether it is legally acceptable, within high-stakes general
qualifications like GCSE English, to require specific forms of
competence. For example, to be competent in English, is it absolutely
necessary to be able to speak and listen fluently? Might there
be a legal right for speaking- and hearing-impaired students to
access this crucial "gatekeeper" qualification? Nowadays,
we routinely need to consider whether our systems can be designed
to be more inclusive without unduly compromising them.
6.5.4 The analytical bases for evaluation,
from this perspective, are the principles and precedents of law;
the basic premise being that contravention of the law is unacceptable.
Significantly, judgements from the legal acceptability perspective
can, and sometimes will, contradict judgements from the perspective
of technical accuracy. Indeed, it may occasionally be necessary
to make an assessment less valid in order for it to comply with
the law. This is because technical analyses typically elevate
the majority (sometimes at the expense of minorities) while legal
analyses often elevate minorities (sometimes at the expense of
the majority).viii Legal experts and assessment experts do not
always share the same concept of fairness.
6.6 Economic manageability
6.6.1 From the perspective of economic manageability,
the evaluator asks: is the burden of the assessment system upon
6.6.2 The idea of burden does not reduce
simply to financial cost, but also extends to issues such as:
human resources (eg, the availability of skilled examiners); workload
(eg, the time spent by students and teachers in preparing coursework);
processing infrastructure (eg, the demands made of the postal
system when delivering scripts); and even ecological impact (eg,
the "rainforest cost" of the paper which flows through
the system each year).
6.6.3 The analytic basis for answering this
kind of evaluation question is economic, grounded in the principles
that: all other things being equal, less expense and consumption
is better; and that there will be a threshold of expense and consumption
which cannot reasonably be exceeded.
6.7 Political viability
6.7.1 The final perspective is political
viability, which poses the question: is society prepared to buy
into the assessment system?
6.7.2 Clearly, if society is not prepared
to buy into the system thenno matter how good it might
seem to be from the other perspectivesit will remain unviable.
Unfortunately, such failures are not uncommon in the world of
educational assessment. In England, the following might be mooted
as examples: S papers; Records of Achievement; the Certificate
of Pre-Vocational Education; parity of esteem between academic
and vocational qualifications.
6.7.3 Unlike the other perspectives, the
underlying principle here is essentially arational. It is best
illustrated by platitudes of folk psychology such as: the customer
is always right; or, you can lead a horse to water but you can't
make it drink.
7. How should we conduct system-level evaluation?
7.1 Turning the framework for system-level
evaluation into a real-life evaluation is far from straightforward.
As noted, each discrete use of results ought to be evaluated,
independently, from each of the six perspectives. This clearly
implies a very large amount of research; and the more uses to
which results are put, the more research is required. Moreover,
it ought not to be restricted to the intended or "official"
uses either, since the unintended uses and even the proscribed
ones are important too.
7.2 The example of national curriculum testing
is useful here. Certainly, test results are not used for licensing
nor for the certification of higher professional skills. Diagnosis
and provision eligibility probably require results from more specialist
tests; while selection and qualification would typically be based
upon results from exams taken later in an educational career.
Whether test results have (or ought to have) a role in guidance
and national accounting is less clear. What does seem likely,
though, is that results from national curriculum tests are used
for the remaining 14 purposes, whether legitimately or not. Again,
the question of legitimacy would need to be explored independently
for each use.
7.3 At some point, evidence and argument
from the independent analysis of specific uses, and specific impacts,
needs to be brought together into an overall evaluation argument.
This will require judgements concerning the acceptability of compromises
and trade-offs, with reasoning along the lines of: "the system
may not be particularly good for this use, but it's probably better
than nothing; admittedly it does have a big negative impact for
a small number of students, but perhaps not too many; and, ultimately,
the system is quite good for that purpose, and that's the principal
purpose, after all . . ." (obviously, this is simply a caricatured
microcosm of an overall evaluation argument).
7.4 The previous paragraph hints at another
important point: to reach overall evaluation conclusions, it is
necessary somehow to weight the importance of alternative uses
and impacts. There needs to be some indication of which are the
most valued uses of results, and impacts of system operation,
and which are more like fringe benefits. This might be a problem
if there is neither any consensus among stakeholders nor formal
specification from policy makers. Again, whose value judgements
ought to be taken into account in this analysis, and how?
7.5 Ultimately, the legitimacy of the assessment
system cannot be judged in isolation, but only in relation to:
1. a new-improved version of the same assessment
2. an entirely different assessment system;
3. a suite of more tailored systems, operating
in parallel; or
4. no assessment system at all (which, admittedly,
would be unlikely ever to triumph as an evaluation conclusion,
but which is still useful as an anchor point).
7.6 This raises yet another complication:
that each alternative will need to be put through the evaluation
mill in its own right. That is, an overall evaluation argument
will need to be constructed for each of the alternatives, to pit
them against the present state of affairs. Unfortunately, since
these are likely to be largely hypothetical at this stage, the
construction of evidence and argument will inevitably be patchy
7.7 Finally, it is worth emphasising that
the aspiration of system evaluation is not perfection, but legitimacy;
and this is also true for the flip side of evaluation, design.
This legitimacy is a real-world, pragmatic aspiration, which might
be characterised as: "overall, at least satisfactory, and
preferably good, but inevitably not perfect". So, for example,
whereas the principle of maximising validity (from a technical
accuracy perspective) might go so far as to recommend a separate
system for each discrete use of results, the principle of minimising
burden (from an economic manageability perspective) might recommend
just one. The overall evaluation conclusion (bearing in mind all
perspectives) might recommend something in-between; say, two or
three separate systems, operating in parallel, each supporting
a distinct set of three or four different uses of results, and
each with its own particular impacts. Compromise and trade-off
are fundamental to the design and evaluation of assessment systems.
8. Is system-level evaluation feasible?
8.1 Given all of the above, is it humanly
possible to undertake a rational and rigorous system-level evaluation?
Is it possible to reach a straightforward conclusion to a straightforward
question like "should the system of national tests be changed?"
This is a challenging issue. It's probably true to say that no
educational assessment system has ever been evaluated quite as
rigorously as recommended above. Indeed, it's an inevitability
of real life that decisions are generally made in the absence
of complete evidence and argument; and the world of educational
assessment is no different in that respect. Having said that,
the inevitability of falling short of the ideal evaluation does
not detract from the importance of constructing as rigorous an
evaluation as is possible.
8.2 Frameworks like the one presented above
can help inquirers to scaffold useful answers to thorny evaluation
questions. They can be particularly helpful for identifying holes
in the overall evaluation argument: where research still needs
to be undertaken; and where argument still needs to be constructed.
And they can also help stakeholders to reflect upon, to clarify
and to articulate their different priorities for national assessment;
to distinguish between the crucial uses and impacts and those
which are more like fringe benefits.
9. Using the framework to identify common
9.1 Finally, frameworks like the one presented
above can also help inquirers to identify limitations in evaluation
arguments presented to them by others. In this last section, a
few common limitations will be illustrated.
9.2 The conflation of different evaluation questions
9.2.1 In constructing a robust evaluation
argument, it is important to put to one side issues which appear
to be relevant, but which actually fall under a broader evaluation
remit. For example, when evaluating the use of test results for
school choice purposes, it is clearly relevant whether the system
supports sufficiently accurate inferences concerning differences
in the general quality of teaching between institutions. However,
as suggested earlier, the positive and negative impacts arising
from school choice, per se, are probably not directly relevant
and, as such, should not enter into the evaluation argument.ix
Of course, they are crucial to evaluating the policy of school
choice, and this evaluation needs to happen independently.
9.2.2 A particularly common limitation of
many formal and informal evaluation arguments is the failure to
distinguish between the impacts attributable to testing, per se,
and the impacts attributable to the high-stakes uses of results
which the testing is designed to support. So, for example, to
the extent that high stakes can trigger behaviour which corrupts
the validity of test results and the effectiveness of teaching,
high stakes can similarly trigger behaviour which corrupts the
validity of teacher assessment results and the effectiveness of
teaching. In short, it may not be the operation of the assessment
system, per se, which is problematic, but the policies or culture
underlying those high-stakes uses. Having said that, there are
important differences in this situation from the one described
above. First, although the impacts might be primarily attributable
to the high-stakes uses, they directly affect the accuracy of
results from the system; which thereby renders those impacts directly
relevant to the evaluation. Second, the impacts upon teaching
and learning, even if primarily attributable to the high-stakes
uses, are likely to be different across different assessment systems;
which again recommends that they enter into the evaluation.
9.2.3 When there is a range of equally valid,
although logically separable, evaluation questions to ask, then
these ought somehow to be arranged within a meta-framework. For
example, it makes sense to interrogate the purposes of curriculum
and qualification, before interrogating the purposes of assessment.
Or, to put it less formally: the assessment-tail should not wag
the curriculum-dog. At least, not too much; where the rider `too
much' is essential. In fact, the process of meta-evaluation needs
to be iterative and will necessitate inevitable trade-offs and
compromises. By way of extreme example, it would not be legitimate
to promulgate a radically new curriculum for school-leaving examinations,
if the learning outcomes which were elevated could not be assessed
with sufficient accuracy: we need to remember that the examination
results have important functions in their own right, as the basis
for making the kind of qualification and selection decisions that
are necessary to support a fair society.
9.3 The lack of a specified alternative
9.3.1 A common limitation of evaluation
arguments is the lack of a specified alternative system. It is
not foreseeable that society would tolerate the rejection of educational
assessment entirely. So it would seem to be incumbent upon any
critic of present arrangements to explain, in some detail, how
an alternative system would, on balance, be more legitimate. The
key issue, here, is one of detail. For instance, the "test
versus teacher assessment" debate is literally meaningless
unless the detail of the alternative system is spelled out.
9.4 Too incomplete an analysis of uses and impacts
9.4.1 Even when two or more systems are
specified in sufficient detail, and pitted against each other,
it is often the case that the evaluation argument remains incomplete,
through omission of central components. This frequently occurs
when an alternative system is proposed which is particularly effective
in relation to certain uses and impactsperhaps genuinely
more so than the present systembut which leaves crucial
other uses or impacts unaddressed. In England, numerous protagonists
have argued recently for employing moderated teacher assessment
(for certain uses and impacts) alongside a national monitoring
unit (for others); instead of the present system of national curriculum
testing. Few protagonists, though, have also grappled effectively
with how best to support uses which require the comparison of
teachers and schools in a high-stakes context. This particular
debate is very importantbecause the arguments in favour
of certain forms of teacher assessment alongside a national monitoring
unit are persuasive. However, due attention also needs to be paid
to satisfying the demand for trustworthy data on school effectiveness.
9.4.2 Another limitation of many evaluation
arguments is the lack of available evidence, or a reliance upon
evidence which is easy to challenge. A particular example of this
at present is the impact of national curriculum testing upon teaching
and learning, especially at Key Stage 2. Despite the system having
been in operation for over a decade, and despite considerable
anecdotal evidence of negative washback, there is remarkably little
systematically documented evidence. This greatly hinders effective
9.5 The gulf between real and hypothetical
9.5.1 Finally, while extant systems must
inevitably be evaluated in the context of real-world operationmired
in the kind of intricate relationships which give rise to unforeseen
problemsalternative systems will typically be evaluated
as promising-hypothetical. In this context, it is easy to give
the alternative system undue benefit of the doubt, without recognising
that its implementation will inevitably necessitate certain compromises
and will result in its own unforeseen problems. At the very least,
the root cause of the problems which beset the extant system need
to be extrapolated to the promising-hypothetical.
i The term "assessment" is used generically,
to refer to any instrument or process through which student competence
or attainment is evaluated (eg, test, teacher assessment, examination,
etc). The term "system" is used to encapsulate, in a
broader sense, the detail of the structure and mechanism through
which students are assessed. In relation to national curriculum
testing, for instance, this detail would include procedures for
test development, distribution, administration, marking, reporting,
evaluating (and so on), as well as the technical, professional,
managerial and administrative employees required to develop and
operate those procedures.
ii It is based upon insights from the international
literature on validation and evaluation, although references to
specific sources have not been included (further information can
be provided on request).
iii Although there is a huge debate in the technical
literature on the precise extension of the term "validation",
this does not significantly affect the tenor of the argument developed
in this report.
iv In reality, it would be advisable to use more
than one source of evidence to support important decisions (such
as placement, monitoring, guidance and so on). Indeed, assessment
professionals are increasingly preaching this dictum. However,
that does not change the basic principle that, when results are
used to support different purposeswhether alone or in combination
with other sources of evidencedifferent kinds of inference
are drawn from them.
v There are different ways of emphasising the
point that results which are fit for one purpose may not be fit
for another. The approach in the text is to focus upon the different
inferences which need to be drawn from results. Another approach
would be to stress that systems need to be designed differently
for different purposes and different design compromises will be
made. (Compromises are made so as not to over-engineer the system,
because increased accuracy comes at a price; assessment design
aspires to sufficient accuracy, for a specific purpose, rather
than maximum accuracy.) Ultimately, design characteristics and
compromises which are legitimate for one use may be illegitimate
vi Note that this is not to implicate the phenomenon
of "teaching-the-test" (whereby, over time, teachers
reduce the scope of their teaching, excluding those aspects of
the curriculum that the tests tend not to cover). This practice
is neither appropriate nor inevitable. Were it to occur, it would
occur in addition to the impact of practice, training and improved
resources (described in the text).
vii This contrast with norm- or cohort-referenced
systems, in which the lowest-attaining students may be awarded
the same very low rank at every stage of their educational career,
despite making real progress in learning and despite achieving
respectably given their particular situations.
viii Any technical analysis which is based upon
an average (which is frequently the case for large-scale educational
assessments) thereby tends to elevate the majority.
ix Other than when considering the negative impacts
which arise from inappropriate school choices, consequent upon
inaccurate results data (the moral defensibility perspective).
Paul E Newton
Head of Assessment Research, Regulation and Standards