Select Committee on Children, Schools and Families Written Evidence


Memorandum submitted by National Foundation for Educational Research (NFER)

EXECUTIVE SUMMARY

Purposes of Assessment

  In essence, educational assessment serves two major purposes. The first of these is to provide immediate feedback to teachers and students, in order to facilitate the learning process. This is often termed formative assessment; more recently it has also come to be referred to as Assessment for Learning. The second major purpose is to provide information which summarises a particular phase of learning or education. This information may be for a variety of institutional uses such as monitoring progress, certification or selection, and it may be for a variety of users, such as parents, teachers, governmental bodies and the learners themselves. This type of purpose is termed summative assessment. Both these purposes are important in the educational system as a whole, but they have different requirements and different characteristics.

  Formative assessment is vital to teaching and learning. It is embedded in effective classroom practice and it should be tailored to the individual and their own stage of learning. Such processes are essential for progress, providing a motivational effect for students as well as information on what has been recently achieved and the next steps in order to make progress. We are very supportive of the principles of Assessment for Learning and believe these must be further promoted. There is also a need for more, and more rigorous, research which explores the successful elements of Assessment for Learning.

  However, we do not believe Assessment for Learning can or should provide summative information. Instead, we believe that this function should be a largely separate system with its own priorities, features and requirements.

  Formal summative assessment can serve many purposes. This paper concentrates on one of these: summative assessment within schooling, principally through National Curriculum Assessment. This has had a consistent structure for about a decade, but there is currently renewed discussion on its purposes and methods. This has culminated in the Department for Education and Skills' consultation document Making Good Progress, which proposes shorter, better focused "when ready" tests.

NATIONAL CURRICULUM ASSESSMENT

  In commenting on testing in the National Curriculum, it is apparent that it is now a complex system, which has developed many different purposes over the years and meets each to a greater or lesser extent. It is a tenet of current government policy that accountability is a necessary part of publicly provided systems. We accept that accountability must be available within the education system and that the assessment system should provide it. However, the levels of accountability and the information to be provided are open to considerable variation of opinion. It is often the view taken of these issues which determines the nature of the assessment system advocated, rather than the technical quality of the assessments themselves.

  Our review of National Curriculum Assessment shows that there are many purposes, and any call for change needs to consider which are the most important and which can be downgraded. In the existing system, the current National Curriculum tests are a compromise which attempts to meet all these functions. The accountability role means that they must achieve high levels of reliability and validity. As one of the developers of National Curriculum tests, we are aware of the thorough development process they undergo and the underlying statistical data on their performance. In our view, the current tests achieve the necessary technical and psychometric requirements to a reasonable extent. Any development of the existing system and its tests for which the accountability purposes remain would properly need to demonstrate that it has equivalent or higher reliability. We do not believe it would be defensible to introduce a system in which levels of reliability are not known or cannot be demonstrated.

  We therefore believe that there should not be changes to the existing system without careful consideration of what the purposes of the system are and a statement of this. Any proposals for change should set out carefully which of the purposes they are attempting to meet and which they are not. The level of requirements for validity and reliability could then be elucidated and the balance with manageability and the resources required determined. If accountability is no longer to be required then a different assessment regime could be implemented. However, this should not be done without evidence that any replacement would meet its own purposes validly.

  Our stance in relation to assessment is that there must be a clear statement of the intended purposes of the assessment system and that its processes and instruments should have an appropriate level of validity and reliability to provide sound evidence for those purposes. This implies that there should be a sound development process for instruments, and evaluative research to demonstrate that the judgements being reached on the basis of the system are soundly based.

NATIONAL MONITORING

  One of the current purposes of National Curriculum Assessment is the provision of central information on the education system as a whole, for monitoring standards over time and reporting on the curriculum in detail. There are difficulties in maintaining a constant standard for the award of a level in a high stakes system where tests or questions cannot be repeated. We do, though, believe that the methods currently used for this, which include year-on-year equating and the use of a constant reference point through an unchanging "anchor test", are the best available. A second consideration is that the curriculum coverage each year is limited to the content of that year's tests. In response to these (and also other issues), there has been considerable advocacy of a light sampling model for monitoring the curriculum and changes in performance.

  However, there are some problems with this approach, which should be recognised. The lightly-sampled low stakes assessment would provide one view of standards, but because it is low stakes it may well underestimate what students are really capable of when they are more motivated. The research literature shows that there is a large difference in scores on the same test in high and low stakes situations. This is a validity issue related to the view taken of standards. If we are interested in monitoring what pupils can achieve when they are not motivated to perform, low stakes surveys would be acceptable. We also believe that the practical difficulties of conducting such surveys voluntarily have been underestimated.

  Nevertheless, we would support the introduction of a properly planned regular national monitoring exercise, to examine changes in performance at regular intervals, on a sample basis, and to monitor the curriculum widely. To assess the full curriculum in a valid manner may well require assessment methods other than written tests (eg for speaking and listening, or science experimentation). Such surveys would need to be regarded as proper research exercises, with the collection of background data on pupils and schools, in order to examine educational and social questions. They should also ensure wide agreement on the appropriateness of their methodology and analysis techniques, reducing the possibility of attacks on the results.

MAKING GOOD PROGRESS PROPOSALS

  The Making Good Progress proposals range widely over assessment, personalised learning and target setting. These should properly be regarded as a whole. However, since this paper deals with assessment issues, we will concentrate on that part of the proposals.

  In general, we are supportive of the notions of testing when ready and the close tie to teaching and learning. The concept of testing when ready can be a useful one, particularly if it is used formatively and incorporated into the teaching-learning process as in Assessment for Learning. As such, "progress tests" could provide a useful stimulus to teaching and learning. However, as described in Making Good Progress, we would doubt that they can fulfil that function. As a single level test, awarding a level, the test would generally show what a student could do but it would not be able at the same time to provide diagnostic information about the next steps since these would not be included in the test. Similarly, because it would have to cover the curriculum broadly at that level, and levels represent two years of teaching (on average), it could not identify the small next steps needed for personalised learning.

  The argument advanced in Making Good Progress is that success at one level will stimulate progress toward the next level, acting motivationally. This will need to be evaluated in practice. It may be that the levels are so far apart (they are intended to cover two years of development) that achieving one may actually slow progress, since the next target may be too distant. This is particularly a concern because of the "one way ratchet" proposal. The achievement of a level and the knowledge that it cannot be removed may act to demotivate rather than motivate. We would advise abandoning the "one way ratchet" and allowing re-testing of doubtful cases, so that high levels of certainty are achieved and misclassification is minimised.

  For these reasons, we do not believe that the tests as described could support teaching in any direct way. If this is the desired intention, a different model with a suite of short tests, relating to specific elements of the curriculum, and providing information both on what has been achieved and the next steps would be more appropriate. There would need to be a large bank of such tests available for testing when ready on an individual basis. To be most useful they would be marked by the teacher, immediately, rather than through an external system. Such tests would be low stakes and have little accountability function.

  To summarise, we do believe that some version of Progress Tests may be a useful addition to the system, but believe that their purpose must be carefully defined. That purpose should then lead to a specification and a development process that produces tests which are fit for use in terms of their reliability and validity.

NATIONAL FOUNDATION FOR EDUCATIONAL RESEARCH (NFER)

  The National Foundation for Educational Research (NFER) was founded in 1946, and is Britain's leading independent educational research institution. It is a charitable body undertaking research and development projects on issues of current interest in all sectors of education and training. The Foundation's mission is to gather, analyse and disseminate research-based information with a view to improving education and training. Its membership includes all the local authorities in England and Wales, the main teachers' associations and a large number of other major organisations with educational interests, including examining bodies. It is overseen by a Board of Trustees and Council.

  The NFER's Department for Research in Assessment and Measurement is one of two research departments of the Foundation. It specialises in test development and research into assessment-related questions. The work of the Department involves projects of importance to national educational policy and its implementation through research, the development of assessment instruments and the evaluation of assessment initiatives. It has a consistent track record of developing high quality assessment materials to meet the needs of a variety of sponsors. The Department's experience covers the whole range of tests and other assessments. NFER's work in assessment and surveys stretches back over its entire history, such that the Foundation has a unique experience of test development and the use of tests. In addition to developing assessments, we also carry out major evaluation studies, large scale surveys and international surveys for a number of sponsors including DfES, QCA, SEED and DELLS.

EXPERIENCE IN ASSESSMENT

  The following list of projects illustrates the variety of experience in assessment matters:

NATIONAL ASSESSMENT BY SAMPLING THE COHORT

  NFER was responsible for the greater part of the work of the Assessment of Performance Unit (APU) in the UK. National monitoring of performance in mathematics, English and foreign language, in England, Wales and Northern Ireland, was undertaken by the Foundation from the late 1970s to the late 1980s, when testing all of the cohort replaced a sampling approach. As part of this work, a range of assessments going beyond pencil-and-paper tests was involved.

NATIONAL ASSESSMENT BY TESTING THE WHOLE COHORT

  Since 1989, the Foundation has undertaken much work in producing National Curriculum tests to be used by the whole cohort in England and/or Wales. Such work has encompassed English, mathematics and science at ages 7, 11 and 14, and has been undertaken under contract to QCA or its predecessors. This work has provided many insights into the relationships between reliability, validity and manageability. Each of these tests is taken by 600,000 students, and the results have high stakes for schools since they are published as part of the accountability of the education system.

UK ASSESSMENT IN THE INTERNATIONAL CONTEXT

  The Foundation has had a long involvement with international assessment, and was a founder member of the International Association for the Evaluation of Educational Achievement (IEA), which was set up in the 1960s and organises international comparative studies of educational achievement. NFER has been responsible for managing the testing for all of the IEA surveys in which England has participated, including both TIMSS (Trends in International Mathematics and Science Study) and PIRLS (Progress in International Reading Literacy Study).

  In 2005, NFER became responsible for the OECD PISA (Programme for International Student Assessment) surveys in England, Wales and Northern Ireland for 2006, which will report this year; it will also undertake the 2009 surveys in all four UK countries.

NFER POSITION PAPER ON ASSESSMENT (2007)

Introduction

  This paper has been produced in order to inform some of the current debates on National Curriculum Assessment in England. The Education and Skills Committee of the House of Commons has announced an inquiry into Testing and Assessment. In part this will examine testing and assessment in primary and secondary education as a key issue. Currently, pupils in England take Key Stage tests at ages seven, 11 and 14 in English, mathematics and science. This system has developed and evolved since its introduction in 1991. In January 2007 the Government announced that it would pilot several measures at Key Stages 2 and 3, including allowing pupils to sit National Curriculum assessments as soon as they are ready instead of waiting until the end of the key stage.

  Our paper sets out the background to these debates, concentrating on the purposes of assessment, and the desirable characteristics which flow from these purposes. This leads to a statement of NFER's stance in relation to assessment. Finally, a commentary is given on two specific proposals for change currently under discussion: a national monitoring system based on a sampling approach; and the "Progress Tests" proposed in the DfES discussion paper Making Good Progress.

Dimensions of Assessment

  In essence, educational assessment serves two major purposes. The first of these is to provide immediate feedback to teachers and students, in order to facilitate the learning process. This is often termed formative assessment, but may also be referred to as diagnostic assessment; more recently it has also come to be referred to as Assessment for Learning. The second major purpose is to provide information which summarises a particular phase of learning or education. This information may be for a variety of institutional uses such as monitoring progress, certification or selection, and it may be for a variety of users, such as parents, teachers, governmental bodies and the learners themselves. This type of purpose is termed summative assessment. Both these purposes are important in the educational system as a whole, but they have different requirements and different characteristics.

  This dimension of purposes is only one categorisation. A different categorisation, which cuts across the assessment purpose, is between formal and informal processes of assessment. The distinction here is between, on the one hand, formal processes such as exams, tests and other assessments in which students encounter the same tasks in controlled and regulated conditions and, on the other hand, those less formal activities that form part of on-going teaching and learning. This second group would encompass question and answer, teacher observations, group discussion and practical activities together with classroom and homework writing and assignments.

  Using this two-fold classification (as shown in the table below), it can be seen that formal and informal assessments can each be used for both formative and summative purposes. The formal processes are often managed externally to the school, though they need not be, and the informal processes are often internal to the school though they may provide information which is reported externally. The four cells of the table can be used to discuss the role and requirements of assessment systems and instruments.


                                    Purposes
  Processes   Formative                         Summative

  Informal    Questioning                       Essays in uncontrolled conditions
              Feedback                          Portfolios
              Peer assessment                   Coursework
              Self assessment                   NC teacher assessment

  Formal      Analysis of tests, exams, essays  Tests
              Target setting                    Exams
                                                Essays in controlled conditions

FORMATIVE ASSESSMENT

Informal Formative Assessment

  Formative assessment is vital to teaching and learning. It is embedded in effective classroom practice and it should be tailored to the individual and their own stage of learning. Such processes have long been regarded as essential for progress, providing a motivational effect for students as well as information on what has been recently achieved and the next steps in order to make progress.

  Although such practices have always been intrinsic to successful teaching, recent research and policy has characterised them as a particular approach to assessment, leading to the principles that have been set out under the heading of "Assessment for Learning". Its characteristics are as follows:

    —  sharing learning goals with the pupils;

    —  helping pupils know and recognise the standards they must aim for;

    —  providing feedback that helps pupils to know how to improve;

    —  both teachers and pupils reviewing pupils' performance and progress;

    —  pupils learning self-assessment techniques to discover areas they need to improve;

    —  pupils helping each other to learn; and

    —  including both motivation and self-esteem within effective assessment techniques.

  Assessment for Learning is a simple idea which is far-reaching in its implications and quite difficult to put into practice. If teachers obtain information from assessment and use it to identify the next steps in learning, their teaching will be much more effective. Better still, if pupils are "let in on the secret", so that they, too, understand what the next steps are, they will be better motivated and more successful learners. However, putting this into practice well can make formidable demands on teachers in terms of their professional knowledge and skills.

  We are very supportive of the principles of Assessment for Learning and believe these must be further promoted and new and better supportive materials must be produced and supplied to teachers. There is also a need for more, and more rigorous, research which explores the successful elements of Assessment for Learning. Our own work in this area has been concerned with providing helpful materials for teachers and in researching the possibilities of e-assessment in supporting Assessment for Learning.

  There is a scarcity of good quality formative assessment materials to be used by teachers and students in classrooms. It seems to have been assumed that the very openness of formative assessment, and its devolving of responsibility to the student, renders such materials undesirable. In contrast, we believe that well-designed support materials can encourage the spread of formative assessment, and have undertaken projects to develop such materials, with the specific intention of fostering peer assessment in literacy for pupils. (See Twist, L. & Sainsbury, M. (2006) Assessment for Learning: Literacy 10-11. (London: nferNelson)).

  It is often asserted that Assessment for Learning leads to greater gains in pupils' knowledge and understanding, and these claims are impressive. We do, though, believe that there remains a need for more research evidence demonstrating what leads to such gains. There are limitations to Assessment for Learning, which arise from its classroom role. Because of its immediacy and the focus on what has just been learned and what is about to be learned, it is difficult to give information on the overall level of attainment or on the curriculum as a whole. The involvement of the teacher (and also the student as a self-assessor and other students as peer-assessors) introduces problems of reliability (and also bias), so that Assessment for Learning data is not necessarily good for comparing pupils.

  A further difficulty is its detail. If it is to be used for summative purposes, then the essentially atomised data needed for Assessment for Learning alone needs collating in a systematic manner to allow an overall judgement which is reliable and comparable. This can be a time consuming task.

  For all these reasons, we do not believe Assessment for Learning can or should provide summative information. Instead, we believe that this function should be a largely separate system with its own priorities, features and requirements.

Formal Formative Assessment

  Formal assessments can also be used for formative purposes, whenever an analysis is made of performance in a formal test and the results used to plan teaching for classes, groups or individuals.

  The national Key Stage tests over the last few years have been systematically analysed in order to draw out implications for teaching and learning, which have then been published on the QCA website; a large part of this work has been carried out by NFER teams. An investigation of patterns of performance over a large sample of pupils can provide indications for teachers of typical patterns of errors. This can aid overall curriculum planning, but does not, in itself, give formative information for particular individuals or groups.

  An additional problem with this approach is its timing. Formative information of this kind is of most use when the teacher is at the beginning of a programme of study, whereas the national tests are taken at the end of the key stage. A change in the timing of the national tests would in itself introduce greater potential for formative value.

  A major focus of NFER's current work is the formative use of assessment information gained by more formal means. We are researching the potential of e-assessment in low-stakes contexts and to support assessment for learning. It is clear that teachers are required to focus on the understanding and attainment of individual pupils in order to develop effective plans for personalised learning. This will involve the management of a great deal of assessment evidence for planning teaching, in the form of test data and information on progress through the ongoing curriculum. E-assessment can occupy a central role, first in gathering detailed information about the nature of individual pupils' understanding and attainment, and then in collating and analysing this data. Rather than supplanting the teacher's role in relation to the child, it could supplement it, reducing the marking and recording workload while increasing and easing the flow of genuinely useful information.

  In order to explore this opportunity an NFER research project is currently testing some of these principles. Experimental prototype questions are being trialled with samples of pupils and a variety of exploratory statistical analyses are being undertaken. This work may give rise to a clearer understanding of how e-assessment can provide a sensitive and unobtrusive evidence base for classroom activities and informative progress records.

SUMMATIVE ASSESSMENT

Informal Summative Assessment

  Since teachers have, by the very process of teaching, a wealth of informal assessment information on each pupil, there is a strong incentive to find ways of summarising that information so that it serves a summative purpose. Ongoing informal assessment information covers pupils' performance across a range of contexts, and is thus potentially both more valid and more reliable than a single test result.

  The National Curriculum assessment system recognises this by requiring teacher assessment judgements alongside test results. Although this is, and has always been, an intrinsic element of the system, it has tended to be given less prominence than the test results. In the early 1990s, there were indications that the structured attainment targets and level descriptions were introducing a useful element of standardisation and a common language to teachers' informal assessments. This "community of practice" tended to decline around the beginning of the new century, however, because of the introduction of the national strategies, which had a strong focus on pedagogy but very little on informal assessment. However, over the last few years the balance has begun to change. The ideas of Assessment for Learning have been integrated into policy and with this has come a renewed interest in making use of informal assessment in more systematic and summative ways. Currently, the QCA initiatives on Assessing Pupils' Progress in secondary schools and Monitoring Children's Progress in the primary sector have reintroduced some of the original ideas and methods of the early National Curriculum, restructured in accordance with later thinking, technology and strategies. In Wales, the Key Stage tests have been replaced by a system of teacher assessment only, supported by publications in which standards are exemplified. NFER staff members have worked with DELLS[10] to develop optional assessment materials and exemplification to support summative teacher judgement.

  In order to be used summatively, teachers' assessment information needs to be related to the standards which are provided by the National Curriculum level descriptions. However, the descriptions are broad and general, including many imprecise judgemental terms, so there is work to be done in reaching a shared interpretation of their meaning and application. This would involve a process of moderation between teachers, both within a school and between schools, which would require local leadership, possibly by a local authority adviser. Typically, the moderation process would involve discussion of specific pieces of pupils' work, chosen to represent the characteristics of a level, leading to agreement on the criteria to be applied. It would aim to result in an agreed, shared portfolio of exemplars. This process is professionally valuable but costly and extremely time-consuming.

  A further time-consuming and potentially unmanageable aspect of informal summative assessment is the collection of evidence to support a judgement for each pupil. The system can collapse under an avalanche of paperwork if this is not managed carefully. The provision of an e-portfolio for each pupil could help greatly in managing the storage of examples of work and access to these, but will do nothing to reduce the time necessary to select, store, label and annotate the examples.

  There is currently a debate about how far this kind of summative information can be used instead of test results, as in Wales and like coursework in public examinations. On the one hand, it has strong advantages in terms of scope and teacher involvement. On the other, its manageability is in question and its reliability has not been demonstrated.

  Our view is that such a system in England is conceivable, but distant. There are three conditions that must be fulfilled before it could be introduced successfully. Firstly, a major investment—comparable to the introduction of the national strategies—has to be made in professional development in order to bring about a shared understanding of criteria. This would be supported by published exemplification materials and could include the use of some formal tests (as is currently the case at Key Stage 1). Secondly, a part of this professional development would need to address teachers' and advisers' understanding of the nature and purposes of the four quadrants of assessment, as described in this paper. It is necessary to reach a point where teachers perceive high-stakes summative assessment as professionally useful and complementary to formative approaches before a system of sufficient robustness could be introduced. Rigorous piloting and evaluation would be necessary in order to demonstrate appropriate levels of reliability. Finally, the system would need an element of external monitoring and accountability that commanded public and professional confidence.

Formal Summative Assessment

  Formal summative assessment can serve many purposes. Among these are certification of schooling (as with GCSE) and selection (as with A-levels for university entrance). We will not consider these purposes here but concentrate on summative assessment within schooling, principally through National Curriculum Assessment. This has had a consistent structure for about a decade, but there is currently renewed discussion on its purposes and methods. This has culminated in the Department for Education and Skills' consultation document Making Good Progress, which proposes shorter, better focused "when ready" tests. This paper will give general observations on National Curriculum testing (for summative purposes) and specific comments on Making Good Progress.

  In commenting on testing in the National Curriculum, the purposes of summative information need to be set out. Here, we take them to be as follows:

    A.  The provision of comparable reliable information for children and their parents on their current levels of attainment.

    B.  The provision of comparable reliable information for children and their parents on the progress being made.

    C.  The provision of individual and grouped information for teachers to inform them of national standards and expectations in their subjects and to assist them generally with teaching pupils in the future.

    D.  The provision of grouped information for school managers and governors to inform them of the quality of learning of their students (and by inference the quality of teaching within the school) through the study of progress of their classes.

    E.  The provision of grouped school information for the public, providing an accountability function and contributing to choice for parents.

    F.  The provision of grouped school information to accountability agencies, such as LAs and Ofsted, to contribute to their judgements and measure improvement and decline.

    G.  The provision of central information to government and others on the education system as a whole, for monitoring standards over time and reporting on the curriculum in detail.

  These seven purposes move from individual information to grouped information. They also move from levels of personal accountability to system accountability. It is a tenet of current government policy that accountability is a necessary part of publicly provided systems. There is a broad consensus on this and we accept that accountability must be available within the education system and that the assessment system should provide it. However, the levels of accountability and the information to be provided are open to considerable variation of view. It is often the view taken of these issues which determines the nature of the assessment system advocated, rather than the technical quality of the assessments themselves.

  It is worth remarking that in addition to the purposes set out above, National Curriculum tests have served other indirect but nevertheless important functions within the system.

    H.  For professional development of teachers, informing them of the nature of the National Curriculum and its interpretation. (This was particularly true of the early years of implementation, but continues to have a role. In some subjects, notably English, this has brought about a community of practice among teachers such that their judgements are much more aligned and standardised than they were before at Key Stages 1, 2 and 3. This is not necessarily the case in mathematics, where many teachers continue to prefer test outcomes.)

    I.  To introduce positive change into the emphasis of the curriculum as taught (the delivered curriculum)—sometimes called a "backwash" effect.[11] Examples of this have included mental mathematics, spelling at Key Stages 1 and 2, and science processes at Key Stage 2 and 3.

    J.  The accountability functions themselves contribute to a further indirect purpose for the assessment system, which has a political motivation: that of putting pressure on schools and teachers to maximise the attainment of pupils and students. The testing regime is intended to motivate students to perform to high standards, teachers to teach better, and parents and school governors to raise the quality of schools. The underlying reason is a perceived stagnation in standards from the 1950s to the 1980s, at a time when educationalists alone were responsible for the curriculum and schooling. The rise of economic globalisation, and the widespread belief that raising educational standards was vital to future economic survival, led to the accountability and pressure models of the current system. (Education, of course, is not alone among public services in being subject to this sort of pressure.)

  To these can be added some additional purposes which have arisen almost accidentally, but now have a useful function.

    K.  In recent years there has been an acknowledgement of the importance of using national test data for school self-evaluation and improvement, often in partnership with other agencies such as the School Improvement Partner. The provision of sophisticated indicators based on national testing data, such as DfES/Ofsted's Contextualised Value Added (CVA) measures or those provided by the Fischer Family Trust (FFT), has led to a significant improvement in schools' ability to evaluate their own performance. These indicators rely crucially on the current national testing system, and any replacement system proposed would need to offer equivalent or better measures if the progress which has been made in this area is not to be lost.

    L.  The availability of comprehensive national data with detailed pupil information attached to it has provided a powerful tool for the evaluation of the impact of educational initiatives on attainment and performance. Examples include NFER's work on evaluating Excellence in Cities, the National Healthy School Standard, Playing for Success, and the Young Apprentices Programme. Such data provides an important instrument for informing educational policy.

  This account of the purposes of National Curriculum Assessment shows that there are many of these, and any call for change needs to consider which are the most important and which can be downgraded. In the existing system, the current National Curriculum tests are a compromise which attempts to meet all these purposes. The accountability functions mean that they must achieve high levels of reliability. This means that the results must be subject to only a limited amount of error and misclassification. (It is important to recognise that all tests, indeed all judgement processes, have some component of error—this includes examinations, interviews, teacher judgement, and legal processes.) Any development of the existing system and its tests for which the accountability purposes remain would properly need to demonstrate that it has equivalent or higher reliability. We do not believe it would be defensible to have a system in which levels of reliability are not known or cannot be demonstrated.

  As one of the developers of National Curriculum tests, we are aware of the thorough development process they undergo and the underlying statistical data on their performance. In our view, the current tests achieve the necessary technical and psychometric requirements to a reasonable extent. They have good to high levels of internal consistency (a measure of reliability) and parallel form reliability (the correlation between two tests). Some aspects are less reliable, such as the marking of writing, where there are many appeals/reviews. However, even here the levels of marker reliability are as high as those achieved in any other written tests where extended writing is judged by human (or computer) graders. The reliability of the writing tests could be increased, but only by reducing their validity. This type of trade-off is common in assessment systems, with validity, reliability and manageability all in tension.
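  By way of illustration, the short Python sketch below shows how the two reliability statistics mentioned above are conventionally computed: internal consistency as Cronbach's alpha over item-level scores, and parallel form reliability as the correlation between total scores on two forms. The data are invented for the purpose and are not drawn from any National Curriculum test.

    # Illustrative only: conventional reliability statistics on invented data,
    # not figures from any National Curriculum test.
    import numpy as np

    def cronbach_alpha(items):
        """Internal consistency; pupils in rows, items in columns."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]                           # number of items
        item_vars = items.var(axis=0, ddof=1)        # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(0)
    ability = rng.normal(size=500)                   # hypothetical pupil abilities
    # 30 dichotomous items whose outcome depends on ability plus noise.
    items = (ability[:, None] + rng.normal(size=(500, 30)) > 0).astype(int)
    print("internal consistency (alpha):", round(cronbach_alpha(items), 2))

    # Parallel form reliability: correlation between totals on two "forms"
    # made from alternate items of the same pool.
    form_a, form_b = items[:, ::2].sum(axis=1), items[:, 1::2].sum(axis=1)
    print("parallel form correlation:",
          round(float(np.corrcoef(form_a, form_b)[0, 1]), 2))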

  The present tests do provide as reliable a measurement of individuals as is possible in a limited amount of testing time. When results are aggregated over larger groups such as (reasonably large) classes or schools, the level of reliability is extremely high.
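  The effect of aggregation on precision is a matter of simple arithmetic: the standard error of a group average shrinks with the square root of the group size. The sketch below uses an invented individual-level error figure purely to illustrate the scale of the effect.

    # Illustrative only: why aggregated results are far more reliable than
    # individual ones. The individual error figure is invented.
    import math

    individual_sem = 3.0      # assumed measurement error for one pupil, in marks
    for n_pupils in (1, 30, 60, 240):
        group_sem = individual_sem / math.sqrt(n_pupils)
        print(f"average over {n_pupils:>3} pupils: standard error = {group_sem:.2f} marks")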

  A second requirement of the National Curriculum tests (and all assessments) is that they should be valid for their purpose. According to current thinking[12], the validation of a test consists of a systematic investigation of the claims that are being made for it. In the case of National Curriculum tests, the claims are that the tests give an accurate and useful indication of students' English, science or mathematical attainment in terms of National Curriculum levels. The tests do have limited coverage of the total curriculum: the English tests omit Speaking and Listening, the science tests formally omit the attainment target dealing with scientific enquiry (though questions utilising aspects of this are included) and mathematics formally omits using and applying mathematics. Outside of these the coverage of content is good. The fact that the tests change each year means that the content is varied and differing aspects occur each year. In general, the content validity of the tests can be regarded as reasonably good in relation to this coverage of the National Curriculum. However, a full validation has other aspects and these are seldom considered in relation to the National Curriculum tests, principally because of their numerous purposes. In general, the current tests adequately serve the accountability requirements, listed above as A to F. They may not meet the monitoring requirement (Purpose G) so well and we address that below.

  We therefore believe that there should not be changes to the existing system without careful consideration of what the purposes of the system are and a statement of this. Any proposals for change should set out carefully which of the above purposes they are attempting to meet and which they are not. The level of requirements for validity and reliability could then be elucidated and the balance with manageability and the resources required determined. If accountability is no longer to be required then a different assessment regime could be implemented. However, this should not be done without evidence that any replacement would meet its own purposes validly.

NFER Assessment Philosophy

  The NFER view of assessment is to acknowledge and embrace the variety of assessment purposes and processes that the discussion above has set out. Both broad purposes and both types of process have their place in the overall assessment enterprise. It is meaningless and unhelpful to dismiss summative assessment because it is not formative, or to dismiss informal assessment because it is not formal. Our work encompasses all four quadrants and it is important to recognise the distinctive features and requirements of each.

  Correspondingly, the need is for education professionals and policymakers to develop the same kind of understanding. The classroom teacher, like the assessment researcher, is required to deal with all four quadrants. The best approach to this is to understand and accept the distinctions and relationships between them, and to give appropriate attention to each one. Similarly, policymakers, officials and teacher educators must recognise that teachers have this variety of assessment responsibilities and opportunities and give attention and respect to all of them.

  Our stance in relation to assessment is that there must be a clear statement of the intended purposes of the assessment system and that its processes and instruments should have an appropriate level of validity and reliability to provide sound evidence for those purposes. This implies that there should be a sound development process for instruments, and evaluative research to demonstrate that the judgements being reached on the basis of the system are soundly based.

Specific Proposals for Change

NATIONAL MONITORING

  One of the current purposes of National Curriculum Assessment is the provision of central information on the education system as a whole, for monitoring standards over time and reporting on the curriculum in detail (purpose G). It is here that the present system may be less valid. First, there are difficulties in maintaining a constant standard for the award of a level in a high stakes system where tests or questions cannot be repeated. We do, though, believe that the methods currently used for this, which include year-on-year equating and the use of a constant reference point through an unchanging "anchor test", are the best available and lead to the application of a consistent standard. A second consideration is that the curriculum coverage each year is limited to the content of that year's tests. In response to these (and also other issues), there has been considerable advocacy of a light sampling model for monitoring the curriculum and changes in performance.
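  The anchor test principle can be illustrated with a deliberately simplified sketch (the operational equating of the National Curriculum tests is considerably more sophisticated than this). Because the anchor never changes, any shift in performance on it estimates the change in cohort attainment, and that estimate can then inform where the level thresholds on the new main test are set. All figures below are invented.

    # Illustrative only: a much-simplified view of the anchor test principle.
    # The operational National Curriculum equating is more sophisticated.
    import numpy as np

    rng = np.random.default_rng(1)
    # Invented data: two successive cohorts sit the same unchanging anchor test.
    anchor_last_year = rng.normal(50.0, 10.0, size=1000)
    anchor_this_year = rng.normal(52.0, 10.0, size=1000)   # a stronger cohort

    # The anchor shift estimates how much of any rise on this year's (new)
    # main test reflects a stronger cohort rather than an easier test.
    cohort_shift = anchor_this_year.mean() - anchor_last_year.mean()
    print(f"estimated cohort shift on the anchor: {cohort_shift:+.1f} marks")

    # If this year's main test mean rose by 5 marks while the cohort shifted
    # by about 2, roughly 3 marks of the rise is attributable to an easier
    # test, and the level thresholds would be adjusted accordingly.
    main_test_rise = 5.0                                   # invented figure
    print(f"rise attributable to the test itself: "
          f"{main_test_rise - cohort_shift:+.1f} marks")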

  National Curriculum Assessment currently has monitoring national performance as only one of its many purposes, and is probably not optimal for this, as is the case for most assessment systems which attempt to meet many purposes. NFER conducted a review of educational statistics across the UK for the Statistics Commission, which was included in their report on the subject.[13] They concluded that the current national monitoring system in England was sufficiently fit for purpose that an additional survey would not be cost-effective.

  We believe that, in principle, if the sole goal of an assessment system is to derive comparable measures of national attainment at different time points, then a low-stakes, lightly-sampled survey is probably the best way of meeting this one aim. Low-stakes testing has the advantage that there is no incentive to "teach to the test", reducing its distorting effects in schools. (Though, as we have seen, a positive backwash effect is one of the current uses of National Curriculum tests.) Because of reduced or negligible security issues it is possible to repeat substantial numbers of items from survey to survey, thus enabling relatively reliable measures of change over time to be adduced. It may not be necessary to monitor national performance on a yearly basis, and in this case less frequent surveys would be possible. A well-stratified national sample should enable good estimates of the uncertainty in the national performance measures to be made. A matrix sampling design, in which different pupils take different combinations of test items, would enable a wide coverage of curriculum areas to be maintained while minimising the burden on individual pupils.
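  A minimal sketch of the matrix sampling idea follows, with hypothetical numbers of blocks and booklets. Each sampled pupil sits a short booklet containing only a few blocks of items, yet every block, and hence the whole curriculum, is covered across the sample, and overlapping blocks allow the booklets to be linked onto a common scale.

    # Illustrative only: a matrix sampling design with hypothetical numbers.
    import random
    from collections import Counter
    from itertools import chain

    N_BLOCKS = 10             # item blocks that together cover the curriculum
    BLOCKS_PER_BOOKLET = 3    # each pupil sits only three of the ten blocks

    # Rotate blocks through booklets so that every block appears equally often
    # and adjacent blocks co-occur, supporting linking onto a common scale.
    booklets = [tuple((start + i) % N_BLOCKS for i in range(BLOCKS_PER_BOOKLET))
                for start in range(N_BLOCKS)]

    random.seed(0)
    sample = [f"pupil_{i:03d}" for i in range(200)]        # hypothetical sample
    assignment = {pupil: random.choice(booklets) for pupil in sample}

    # Coverage check: each block is taken by a healthy share of the sample,
    # while no pupil faces more than three blocks.
    coverage = Counter(chain.from_iterable(assignment.values()))
    for block in sorted(coverage):
        print(f"block {block}: {coverage[block]} of {len(sample)} pupils")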

  However, there are some problems with this approach, which should be recognised. The lightly-sampled low stakes assessment would provide one view of standards, but because it is low stakes it may well underestimate what students are really capable of when they are more motivated. Our experience and the research literature show that there is a large difference in scores on the same test in high and low stakes situations. This is a validity issue related to the difference between performance in motivated and unmotivated conditions. If we are interested in monitoring what pupils can achieve when they are not motivated to perform, low stakes surveys would be acceptable. If we are interested in performance when the results matter, this approach would not give it. It would also mean that such survey results would not align with any high stakes measures that continue, eg GCSE.

  There is considerable opposition in schools, particularly secondary schools, to taking part in optional assessment exercises. Anything other than a very high school response rate would cast serious doubt on the results, due to non-response bias, and it would be hard to find suitable incentives for schools to take part. Problems with response rates in international studies such as PISA and TIMSS illustrate this: considerable efforts have been put into persuading enough schools to take part to meet the sample response rate constraints. It would probably be necessary, in the modern climate, to make participation in the survey compulsory for the selected schools in order to assure properly representative national samples.

  Nevertheless, we would support the introduction of a properly planned regular national monitoring exercise, to examine changes in performance at regular intervals, on a sample basis, and to monitor the curriculum widely. To assess the full curriculum in a valid manner may well require assessment methods other than written tests (eg for speaking and listening, or science experimentation). Such methods were attempted in the Assessment of Performance Unit (APU) surveys in the 1980s, conducted by the NFER and others, but proved difficult and expensive to implement. The lessons of that experience also need to be learned. The exercise would need to be regarded as a proper research exercise, with the collection of background data on pupils and schools, in order to examine educational and social questions. It should also ensure wide agreement on the appropriateness of its methodology and analysis techniques, reducing the possibility of attacks on its results.

MAKING GOOD PROGRESS PROPOSALS

  The Making Good Progress proposals range widely over assessment, personalised learning and target setting. These should properly be regarded as a whole. However, since this paper deals with assessment issues, we will concentrate on that part of the proposals. Within the Making Good Progress document, it is proposed that there should be a new type of test. The features of these tests are described briefly, and appear to be as follows:

    —  Single level tests.

    —  Testing when ready—shorter more focused and more appropriate tests.

    —  Externally set and marked.

    —  "One way Ratchet"—never going back, only forward.

  In general, we are supportive of the notions of testing when ready and the close tie to teaching and learning. This fits within the context of Personalised Learning/ Assessment for Learning. As such, "progress tests" could provide a useful stimulus to teaching and learning. However, as described in Making Good Progress, we would doubt that they can fulfil that function. As a single level test, awarding a level, the test would generally show what a student could do but it would not be able at the same time to provide diagnostic information about the next steps since these would not be included in the test. Similarly, because it would have to cover the curriculum broadly at that level, and levels represent two years of teaching (on average), it could not identify the small next steps needed for personalised learning.

  For these reasons, we do not believe that the tests as described could support teaching in any direct way. If this is the desired intention, a different model with a suite of short tests, relating to specific elements of the curriculum, and providing information both on what has been achieved and the next steps would be more appropriate. There would need to be a large bank of such tests available for testing when ready on an individual basis. To be most useful they would be marked by the teacher, immediately, rather than through an external system. Such tests would be low stakes and have little accountability function.

  In fact, Making Good Progress makes clear that the proposed progress tests would be used for accountability purposes, with the levels awarded being retained and reported. This means that the tests will need to have the characteristics of tests for accountability: high levels of reliability and validity. The following sections examine the proposals from this viewpoint.

  The meaning of the phrase "single level tests" will need some exploration. In a sense, the existing tests (or tests in the same style) could be utilised as tests which simply give a pass/fail at a single level. Given their length and their coverage of the curriculum, this would lead to results with a reasonable (and measurable) level of reliability. However, Making Good Progress states that the tests will be "shorter and more focused". There is a strong relationship between reliability and test length, so an unfortunate implication of this is that the tests will have lower levels of reliability and reduced curriculum coverage.
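  The relationship between test length and reliability is conventionally described by the Spearman-Brown formula. The sketch below assumes, purely for illustration, a full-length test with reliability 0.90, and shows how shortening the test depresses that figure.

    # Illustrative only: the Spearman-Brown formula relating test length to
    # reliability. The starting reliability of 0.90 is an assumed figure.
    def spearman_brown(reliability: float, length_factor: float) -> float:
        """Predicted reliability when a test's length is scaled by a factor."""
        return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

    full_test = 0.90
    for factor in (1.0, 0.5, 0.25):
        print(f"test at {factor:4.0%} of original length: "
              f"reliability about {spearman_brown(full_test, factor):.2f}")
    # Halving the test drops 0.90 to about 0.82; quartering it, to about 0.69.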

  In this context, the important aspect of reliability is the consistency of the decisions made. If there were two progress tests at the same level, what would be the percentage of students classified the same way on both occasions? For the tests to be shown to be useful, this needs to be considerably above chance levels. In the current reading and writing tests at Key Stage 2, the degree of decision consistency for each level is at least 80% and for some levels is as high as 98%. The progress tests would have to match these levels of consistency. This would need careful examination during development, as reducing the length of a test inevitably leads to lower levels of reliability.
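  The decision consistency calculation can be sketched as below, on invented classifications: the raw agreement between two parallel occasions is compared with the agreement that would arise by chance, in the manner of Cohen's kappa.

    # Illustrative only: decision consistency between two parallel test
    # occasions, compared with chance agreement, on invented classifications.
    import numpy as np

    rng = np.random.default_rng(2)
    true_level = rng.integers(3, 6, size=1000)        # hypothetical levels 3-5

    def classify(true, error_rate):
        """Award the true level, but misclassify a share of pupils by one level."""
        noise = rng.choice([-1, 0, 1], size=true.shape,
                           p=[error_rate / 2, 1 - error_rate, error_rate / 2])
        return true + noise

    occasion_1 = classify(true_level, error_rate=0.15)
    occasion_2 = classify(true_level, error_rate=0.15)

    observed = float(np.mean(occasion_1 == occasion_2))  # raw decision consistency
    # Chance agreement: probability that two independent draws from the
    # marginal level distributions happen to coincide.
    levels = np.union1d(occasion_1, occasion_2)
    p1 = np.array([np.mean(occasion_1 == lvl) for lvl in levels])
    p2 = np.array([np.mean(occasion_2 == lvl) for lvl in levels])
    chance = float(p1 @ p2)
    kappa = (observed - chance) / (1 - chance)

    print(f"observed consistency: {observed:.0%}; "
          f"chance agreement: {chance:.0%}; kappa: {kappa:.2f}")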

  A second aspect of the "shorter more focused" approach is curriculum coverage. In the current National Curriculum tests, considerable efforts are made to include as wide a representation of the curriculum as is practically possible in a written test. This is essential for demonstrations of validity. Moreover, the annually changing tests mean that, over time, the tests have even wider coverage. In writing, for example, different text types/genres are sought from children each year and, within the test each year, two different tasks are required. Hence, reducing the length of the tests could also reduce their validity.

  The concept of testing when ready can be a useful one, particularly if it is used formatively and incorporated into the teaching-learning process as in Assessment for Learning. However, its utility within a summative system may not be as apparent. The provision of information from the "progress tests" (which are aimed at making judgements about a single level) is unlikely to have the diagnostic element useful for Assessment for Learning. The argument advanced in Making Good Progress is that success at one level will stimulate progress toward the next level, acting motivationally. This will need to be evaluated in practice. It may be that the levels are so far apart (they are intended to cover two years of development) that achieving one may actually slow progress, since the next target may be too distant. This is particularly a concern because of the "one way ratchet" proposal. The achievement of a level and the knowledge that it cannot be removed may act to demotivate rather than motivate.

  We have further concerns about the "one way ratchet". Its underlying assumption seems to be that children's learning is an ordered progression and that movement is always forward. This is not in fact the case, and children can decline in terms of skills or knowledge. It is therefore useful to have later checks that a level previously achieved has been maintained. Without such checks, we do not believe the "one way ratchet" should be implemented.

  This issue may interact with that of the reliability of the test. If the decision consistency of the tests at a given level is low, then a large proportion of candidates could be misclassified as achieving the level when they should not be. If this is coupled with the "one way ratchet", the misclassification would become enshrined, possibly harming such children's progress, as they would be treated (and taught) as if they were at a higher level than was actually the case.

  It is not the case that the levels of the National Curriculum are, in practice, as even and well ordered as the underlying model would suggest. In a given strand of a subject, the difficulty of the content may not increase in regular steps. Similarly, in different strands, the difficulty of the processes or skills at a given level may not be the same. It was this type of difficulty that led to the abandonment of the strong criterion referencing model of the early National Curriculum assessment in the 1990s. This was replaced by a weak criterion referencing model, in which content from various levels and across a broad range has been included in the National Curriculum tests, leading to the setting of an overall subject level (or, within English, reading and writing levels). It also marked a return to traditional psychometric principles and a mark-based scoring system.

  There is a naïve view that questions can be written at a single level, derived from the level descriptors, and that these will have comparable difficulty. Taken to its extreme, it is sometimes thought that a single level test could be constructed by having material drawn from the level descriptor at that level. Candidates would then be expected to answer a set proportion of this correctly: perhaps 50%, more usually 80%, or sometimes all of it. There have been examples of systems constructed on these principles in which the consequence has been very low pass rates. We would therefore advise that, although the Progress Tests may award a single level, they should have the following characteristics:

    —  Sufficiently long (in terms of numbers of questions and marks awarded) to have a good curriculum coverage, leading to good evidence of validity.

    —  Sufficiently long (in terms of numbers of questions and marks awarded) to have high levels of reliability, so that decision consistency is good and the number of misclassifications (particularly false positives) is small.

    —  Include content from the level below as well as the target level in order to elicit a range of outcomes, and also to allow some simple questions to give pupils confidence and to motivate them.

    —  Include content from the level above as well as the target level in order to elicit a range of outcomes, and to allow some formative information to be provided for next steps.

    —  For writing, continue to allow a range of levels to be demonstrated, through differentiation by outcome.

    —  Set the criterion for achieving the level through soundly based equating or judgemental processes, not through the application of strict algorithms which assume equal difficulty in questions and in tests.

  In addition, we would advise abandoning the "one way ratchet" and allowing re-testing of doubtful cases, so that high levels of certainty are achieved and misclassification is minimised. A useful refinement would be a system with three levels of outcome: level X awarded; level X not awarded; and a band of uncertainty in which a retest is advised in the following test round. Teachers could then report only success which is assured to a high probability, requiring pupils with scores in a defined range of uncertainty to be retested.
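  A minimal sketch of such a three-outcome rule follows; the cut score and the width of the uncertainty band (derived here from an assumed standard error of measurement) are invented figures for illustration only.

    # Illustrative only: a three-outcome award rule with a band of uncertainty
    # around the cut score. The cut score (30) and the band width (from an
    # assumed standard error of measurement of 2 marks) are invented numbers.
    CUT_SCORE = 30
    SEM = 2                   # assumed standard error of measurement, in marks
    BAND = 2 * SEM            # scores within ~2 SEMs of the cut are uncertain

    def award(score, cut=CUT_SCORE, band=BAND):
        """Award, withhold, or refer a pupil for retest in the next round."""
        if score >= cut + band:
            return "level awarded"
        if score <= cut - band:
            return "level not awarded"
        return "uncertain: retest in the following round"

    for score in (22, 28, 30, 33, 36):
        print(f"score {score}: {award(score)}")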

  To summarise, we do believe that some version of Progress Tests may be a useful addition to the system, but believe that their purpose must be carefully defined. That purpose should then lead to a specification and a development process that produces tests which are fit for use in terms of their reliability and validity.

June 2007






10   The Department for Education, Lifelong Learning and Skills (DELLS) within the Welsh Assembly Government.

11   The term "backwash" is often used of the negative consequences of testing on the curriculum. The effects can, though, be either positive or negative.

12   See, for example: American Educational Research Association, American Psychological Association and National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing. Washington, DC: AERA.

13   See: http://www.nfer.ac.uk/publications/other-publications/downloadable-reports/pdf_docs/serfinal.PDF


 