Examination of Witnesses (Questions 100
MONDAY 17 DECEMBER 2007
Q100 Stephen Williams: That is based
on the aggregate of the results for each child in the school,
so would you reject alternatives completely?
Dr Boston: It depends on the purpose.
If your purpose is to find out whether children are writing and
reading as well as they did 10 years ago nationally, the best
test is to give a sample of them virtually the same test as was
given to a sample of them 10 years ago. That will tell you whether
we have gone up or down. If you want to report on the performance
of a school this year in relation to the school next door, the
sample will clearly not do that, but the full cohort test will.
It is again about purpose. Both purposes are legitimate and some
of the difficulties with the testing programme or the examination
programme when looking at whether standards have changed significantly
year on year or over a 20-year period, are that the curriculum,
teaching methods and other things such as class size have changed.
If you really want to know whether people are better at reading
than they were 20 years ago, give them the same test.
Q101 Stephen Williams: I see that
you have had time to look at the note that has been passed to
Dr Boston: This is Dr Horner's
contribution in handwriting. The report of the task group on assessment
and testing in 1989 (while I was still elsewhere, Barry) decided
on a 10-point scale and proposed a graph of progress that identified
Level 4 at age 11. That report was then published, presumably.
We are now on an eight-point scale (aren't we?), so that
10-point scale has been reduced. That does not fully answer your
Q102 Stephen Williams: No, four out
of 10 is 40% and, on the same scale, four out of eight is 50%,
so it seems to be a completely different target. Perhaps we are
getting a bit too technical. I was trying to get at whether the
Government were feeling the QCA's collar in respect of how we
set the standards for children. Therefore, is it right that we
have a separate regulator of standards for the future?
Dr Boston: Barry, I should very
much like to give you a written statement tomorrow in answer to
Q103 Stephen Williams: Before the
note, I was going to mention the difference between how a child's
performance is assessed and how a school's performance is assessed.
You were going into how children's performance was assessed over
time. Is there another way in which you can assess a school's
effectiveness apart from the league table mentality that we have
at the moment? Is there an alternative? After all, they do not
have league tables in Wales or Scotland.
Dr Boston: You can certainly assess
the performance on the basis of teacher reporting, as against
the school reporting its performance against a template of benchmarks,
perhaps, as occurs in some other countries, and reporting its
testing of its students against national averages in literacy,
numeracy and so on. I know of cases where that occurs.
Q104 Stephen Williams: You are a
man of international experience. Do you think that anywhere else
does it better than England, whether a state in Australia
or anywhere else, without this sort of national frenzy every
August, with people wondering whether things are going down the
pan or the Government saying, "No, things have only ever
Dr Boston: I think England is
pretty rare in the way it does this in August, although I would
not say unique.
Q105 Chairman: Is that good or bad?
Dr Boston: The annual debate about
whether too many have passed and whether standards must have fallen
is a very sterile debate and I would be glad to see the back of
it. If it is right that this new regulator will lead to the end
of that, it is a good thing. We are not so sure that it will.
There are other, better, ways of celebrating success and achievement, not
Q106 Stephen Williams: Do you think
that any particular country does it a lot better than England?
Dr Boston: No. I think that in
other countries where the results come out there is less public
criticism of youngsters on the basis that, because they have three
A grades, the result must be worthless. Such criticism is a very
bad thing. From my previous experience, I have a great interest
in Aboriginal education. There was 200 years of Aboriginal education
in Australia with absolutely no impact on the performance of Aboriginal
kids until we introduced full cohort testing and reporting at
school level. Then, suddenly, people took Aboriginal education
seriously and it began to improve.
Q107 Stephen Williams: If nobody
else wants to come in on this section, I should like to ask one
last question, going back to the start, about introducing the
new regulator. We have only had the report, Confidence in Standards,
from the Secretary of State today, and we have not been able to
digest it fully yet. When were you consulted about the split in
the QCA's responsibilities? Was it before, during or after the
August round of exam results that we had a few months ago?
Dr Boston: It was after.
Q108 Fiona Mactaggart: You have been
talking clearly about the difficulty of tests fulfilling 14 different
purposes. The fact is that they fulfil some of those inadequately.
You suggested that the best way to see if children over time are
able to achieve the same standard is through sampled testing.
Do we do very much of that, and if not, why not?
Dr Boston: Those tests will tell
you whether performance on a particular task has improved over
time. We do not do that as a country. We pay a lot of attention
to PIRLS and PISA and the national maths and science study. In
developing its Key Stage tests from year to year, the QCA does
pre-test. Part of those pre-tests in schools, which youngsters
think are just more practice tests, are pre-tests for what we
will use in 18 months' time. In them we often use anchor questions,
which are the same questions that have been asked a few years
before or in consecutive years. They might be only slightly disguised
or might not be changed at all. That is to help develop tests
that maintain standards so that Level 4 is Level 4 year on year.
The boundary between the levels is set by the examiners. It might
be 59 in one year and 61 in another, but they know that in their
judgment that is a Level 4. They draw on those tests. We have
not used the tests systematically enough to say, "We used
these six questions for the past eight years and we know that
students are getting better at reading or worse at writing,"
but that is the basis on which we develop and pre-test those emerging
Q109 Fiona Mactaggart: I am struck
by this. You are saying that we do it a bit to ensure the comparability
of tests over time. We all accept that some of that kind of work
is a necessary function of getting accurate summative tests, but
there is a constant threat in debate in assessment about whether
standards have changed over time. I do not think that I properly
understand why we have not bothered to invest what does not strike
me as a very large amount of resource in producing that kind of
sampling over time to see whether standards are improving or weakening,
and where. We would then have a national formative assessment
about where the strengths and weaknesses of our education system
are over time. Do we have a mechanism that is designed to do that?
If not, why not?
Dr Boston: No, we do not. We use
PIRLS and PISA and in the recent results, they confirmed what
we already know; for example, at PIRLS level, the line I talked
about is not as steep as it was before. It has flattened off,
but has not come to a plateau. The notion of a sampling programme
is something that we have raised with Government. Some years ago,
before I came into this job, there was the Assessment of Performance
Unit, which did some of that work. That is no more. I do not know
the background and the reasons why the work was not pursued, but
it was work of this sort. It would seem to me that we need to
be thinking not of either/or. That is the message that I really
want to get across. We are not thinking of Key Stage tests or
single level tests or sample tests. If we want to serve those
22 legitimate purposes of testing (I am sure there are more), we
need a number of tests that will deliver between them all those
things, but which are designed so that they are very close to
what Paul Newton calls the design inference, where the user inference
and the design inference are very close indeed.
Q110 Fiona Mactaggart: What I do
not understand about the proposed new system is that if we developed
a wider range of tests to separate some of these functions more
precisely so that we get more accurate information rather than
trying to infer information from tests that are designed to do
something else, which is what we do at present, who would take
the lead in developing the sample tests and introducing them?
Would it be the QCA or the new regulatory authority? I have not
had time to read through the document, but I do not understand
whose job is what.
Dr Boston: It would be the QCA,
and it would do its work partly through stimulating the private
sector market and the awarding bodies to work with it. Presumably
the QCA would take the initiative on remit from the Government.
That would be critical: the Government would decide that they
wanted a set of new tests. We did not go out and invent single
level tests. We were remitted to do them. We produced them at
Government request, and with our very strong support. So the initiative
would rest fundamentally with the Government, but the body that
would lead on it would be the QCA, or whatever the QCA might end
up being called some time in the future. The regulatory authority
is to ensure that, once the product (the assessment) is
there, it delivers on standards and maintains standards. The regulator
is not a development authority; it is an authority to regulate
products and ensure their quality once they are there.
Q111 Fiona Mactaggart: When you were
remitted to develop the concept of single level tests, were you
remitted to develop a test that was a one-way street, rather than
a test that could be re-administered? I gather that the National
Foundation for Educational Research is concerned about the fact
that this is just a single pass test and that someone who chooses
when they do it might pass then but might not necessarily pass
it a month later.
Dr Boston: We were remitted to
produce a test which would be taken as a one-off. Further down
the track if we get to a point, as I think we might, where single
level tests are available virtually on line, on demand, we would
need to go to a data bank of test items. What we have at the moment
is a Level 3 test or a Level 4 test. A judgment is then made on
the score you get about whether you are secure in Level 4. That
test is then finished with. The time may come in the future, as
with Key Stage 3 ICT tests, where there is a computer in the corner
on which you can take at any stage your Level 4 or Level 5 reading
test. That would depend on a data bank. In that sense it is constantly
renewable, if I understand the question correctly.
Q112 Fiona Mactaggart: It was not
so much about whether it was renewable. If the teacher of the
child can choose the moment at which the child takes a single
level test and it is a propitious day for that particular child,
the child may do well in the test and succeed, but it might still
be rather a frail attainment. There is anxiety about whether that
is a fully accurate picture of the child's capacity and the general
learning level even though they can do it on a fair day with wind
Dr Boston: I am remiss in that
I have not fully explained the relationship between the Assessment
of Pupil Performance and the tests. The APP programme is designed
essentially to produce greater understanding among teachers about
what is represented by a level: the profile of a Level 4
in reading, the profile of a Level 5 in reading and the difference
between them. It represents the different indicators that show
a child is either at Level 4 or Level 5, and the child is then
entered for the test. The test is meant to be confirmation that
the teacher has made the judgment correctly.
Sitting suspended for fire evacuation.
Chairman: Dr Boston, we are back in business,
although only briefly. I suspect that we will have to call you
or your team back at some stage, because this has been unfortunate.
I will give a question to each member of the team, and you will
answer speedily. I will start with David, followed by Stephen,
then Fiona, and I will finish.
Q113 Mr Chaytor: On maintenance of
standards, will the new A* grade at A-level have the same pass
rate in all subjects across all examining boards?
Dr Boston: No.
Q114 Mr Chaytor: Does the existing
A-level threshold have the same pass rate in all subjects?
Dr Boston: No.
Q115 Mr Chaytor: Does that cause
Dr Boston: No.
Q116 Mr Chaytor: Will there not be
a huge discrepancy between different subjects in different boards?
Dr Boston: The A/B boundary is
set by professional judgment. The reality is that subjects are
different; there is no attempt to say that, for example, 10% must
pass or have an A grade in every subject. No country in the world
achieves precise comparability between subjects in terms of standards.
Australia tries to do so: it takes all the youngsters who get
a certain grade in, for example, English, geography, and art,
and, if they find that a lot of the youngsters who are taking
those three are getting higher grades in geography than in the
other two subjects, then they deflate the mean of geography. Some
pretty hairy assumptions underlie that. Here, an A/B boundary
is set by professional examiners broadly at the level that a hard-working,
well-taught student who has applied himself or herself fully
would achieve on a syllabus or specification.
Q117 Mr Chaytor: Are the thresholds
for subjects on examining boards matters of public record? That
is, is the percentage score that triggers a B, an A or an A* on
the record and available to pupils and parents?
Dr Boston: The answer is no, I
Q118 Mr Chaytor: My next question
is, should it be?
Dr Boston: I would think not.
Q119 Mr Chaytor: Why not?
Dr Boston: The essential point
is that you might have a harder paper one year than another, in
which case the boundaries might change significantly. The point
is not the numerical score where the boundary is drawn. The fundamental
point is the professional judgment of the examiners, who decide
where the A/B boundary is and where the E/U boundary is. They
do that on the basis of their experience and past statistical
evidence using papers of similar demand.
5 Note by witness: In 1988 the Task Group on
Assessment and Testing (TGAT) designed the assessment system for
the national curriculum. This included the development of a then
10-level scale to cover the years of compulsory schooling. Level
4 was pitched as the reasonable expectation for the end of the
primary phase, to ensure pupils could move on with confidence
in their skills to tackle the secondary curriculum.