Memorandum submitted by Professor Stephen Gorard

 

 

Keywords

 

School effectiveness, measurement error, PLASC, league tables

 

 

Abstract

 

This paper considers the model of school effectiveness (SE) currently dominant in research, policy and practice in England (although the concerns it raises are international). It shows, principally through consideration of initial and propagated error, that SE results cannot be relied upon. By considering gain scores as the difference between assessment results at different stages of schooling, and then the residual difference between the predicted and obtained score at each stage, SE calculations leave the results to be disproportionately made up of relative error terms. Adding contextual information confuses but does not help this situation. Having shown and illustrated the sensitivity of SE to this propagation of initial errors, the paper considers some of the reasons why SE has become dominant, outlines the damage this dominant model causes, and begins to shape alternative ways of considering what schools do.

 

 

A case against school effectiveness

 

 

Numbers are like people; torture them enough and they will tell you anything.

 

 

 

 

The dominance of the school effectiveness model

 

There are a number of valid possible reasons for wanting to be able to judge school performance. In most countries, the majority of schools are publicly funded, and so the custodians of public money may want to assess how well that money is being used, for example. Policy-makers will be interested in how well this public service is working, and what the impact has been of any recent reforms. Parents and students might want to use a measure of school quality when making educational choices. Heads and teachers might want feedback on what is working well and what is in need of improvement at their own schools. There are also, of course, a number of differing ways of judging school performance. Schools could be evaluated in terms of financial efficiency, student attendance, student enjoyment of education, future student participation in education, student aspiration, preparation for citizenship and so on. Another perfectly proper indicator of school success can be based on student scores in assessments intended to discover how much or how well students have learnt what is taught in the school. What is interesting is how dominant this last version of school effectiveness has become over the last 50 years in the UK and elsewhere. This paper looks at the dominant approach to evaluating school performance, presents fatal flaws in its logic, and argues that it is time to stop relying on this now traditional but limited view of what schools are for.

 

For any set of schools, if we rank them by their student scores in assessments of learning (the comparability and validity of such assessments is discussed in later sections), then we would tend to find that schools at the high and low ends differed in more than their student assessments. Schools in areas with more expensive housing (or higher local income in the US), schools that select their student intake by ability, aptitude or even religion, and schools requiring parents to pay for their child's attendance, will be more prevalent among the high scores. Schools with high student mobility, in inner cities, taking high proportions of children living in poverty or with a different home language to the language of instruction, may be more prevalent among the low scores. This is well known, and means that raw-score indicators are not a fair test of school performance. Some early studies of school effectiveness famously found very little or no difference at all in the outcomes of schools once these kinds of student intake differences had been taken into account (Coleman et al. 1966). Such studies, using either or both of student prior attainment and student family background variables, have continued since then (Coleman et al. 1982, Gorard 2000a), and continue today (Lubienski and Lubienski 2006). The differences in student outcomes between individual schools, and types and sectors of schools, can be largely explained by the differences in their student intakes. The larger the sample, the better the study, and the more reliable the measures involved, the higher the percentage of raw-score difference between schools that can be explained (Shipman 1997, Tymms 2003). Looked at in this way, it seems that which school a student attends makes little difference to their learning (as assessed by these means).

 

However, over the past 30 years a different series of studies have come to an almost opposite conclusion, based on pretty much the same evidence. Perhaps starting with Rutter et al. (1979) in the UK, school effectiveness researchers have accepted that much or most of the variation in school outcomes is due to school intake characteristics. But they have claimed that the residual variation (any difference in raw-scores unexplained by student intake) is, or can be, evidence of differential school effectiveness (e.g. Nuttall et al. 1989, Gray and Wilcox 1995, Kyriakides 2008). Like the first set of studies, these have tended to become more sophisticated and more technical over time. But the fundamental difference in view remains. Is the variation in school outcomes unexplained by student background just the messy stuff left over by the process of analysis? Or is it large enough, robust and invariant enough over time, to be accounted a school 'effect'? Almost by default the answer to the second question has been assumed by most research users to be 'yes'. There has been generally weak opposition to the dominant technical model of school effectiveness, perhaps stemming from inability to understand the technicalities (e.g. Slee et al. 1998).[1]

 

Governments, such as that in the UK at the time of writing, generally assume that there is a school effect (almost before seeing any evidence either way). In England, DCSF (2007) rightly report that in comparing the performance of schools we must recognise that pupils have different starting points when arriving at any school, that schools have different proportions of pupils at any starting point, and that other external factors will affect the progress made by pupils.[2] They conclude from this that their Contextual Value Added analysis (CVA) 'gives a much fairer statistical measure of the effectiveness of a school, and provides a solid basis for comparisons' (p.2, emphasis added). How does CVA work?

 

CVA is based on a value-added (VA) score for each pupil, calculated as the difference between their own outcome point score and the median outcome score for all pupils with the same prior (input) score. For example, in Key Stage 2 to Key Stage 4 CVA, the average points score at KS2 is calculated for all KS4 pupils in all maintained schools (and non-maintained special schools) in England.[3] The average is of the scores ('fine grades') for each pupil in three core subjects (English, maths, and science). Then the 'best 8' (capped GCSE equivalent) KS4 score is calculated for each pupil. These figures yield the median KS4 score for each KS2 score. The difference between the median and the actual KS4 score for each pupil is their individual VA score. This difference is adjusted for the individual pupil characteristics, including sex, special needs, ethnicity, eligibility for free school meals (FSM), first language, mobility, precise age, whether in care, and an areal measure of proportion of households on low income (IDACI). The result is further adjusted for the school-level prior attainment of each pupil's school (mean and standard deviation), where the results are at the extremes (threshold effects), and by a 'shrinkage factor' determined by the number of pupils in each school cohort.

 

More formally and precisely, the KS4 prediction for any pupil in 2007 is given as:[4]

 

162.1
+ 0.3807 * (school average KS2 score)²
- 5.944 * school average KS2 score
+ 1.396 * (KS2 English points - school average KS2 score)
- 0.109 * (KS2 maths points - school average KS2 score)
- 27.1 (if in care)
- 59.51 * IDACI score
- 34.37 (if School Action SEN)
- 65.76 (if Action Plus or statement of SEN)
- 73.55 (if joined after September of year 10)
- 23.43 (if joined not in July/August/September of years 7-9)
+ 14.52 (if female)
- 12.94 * (age within year, where 31st August is 0 and 1st September is 1)
+ for English as an additional language pupils only (-8.328 - 0.1428 * (school average KS2 score)² + 4.93 * school average KS2 score)
+ ethnicity coefficient, from a pre-defined table
+ for FSM pupils only (-22.9 + FSM/ethnicity interaction, from a pre-defined table)
+ 1.962 * cohort average KS2 score
- 4.815 * standard deviation of cohort average KS2 score
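For concreteness, the terms above can be collected into a single function. The Python sketch below simply transcribes the published coefficients as listed; the variable names are my own, and the ethnicity and FSM/ethnicity coefficients (which come from DCSF lookup tables not reproduced here) are passed in as plain numbers. It is an illustrative reading of the formula, not an official implementation.

```python
# A minimal sketch (mine, not DCSF code) of the 2007 KS2-to-KS4 prediction listed above.

def predicted_ks4_points(
    school_avg_ks2,                  # school average KS2 fine-grade points score
    ks2_english,                     # pupil's KS2 English points
    ks2_maths,                       # pupil's KS2 maths points
    idaci,                           # IDACI (area income deprivation) score
    age_within_year,                 # 0 if born 31st August, 1 if born 1st September
    cohort_avg_ks2,                  # cohort average KS2 score
    cohort_sd_ks2,                   # standard deviation of cohort KS2 scores
    in_care=False,
    school_action_sen=False,
    action_plus_or_statement_sen=False,
    joined_after_sept_year_10=False,
    joined_mid_years_7_to_9=False,   # joined other than in July/August/September of years 7-9
    female=False,
    eal=False,                       # English as an additional language
    fsm=False,                       # eligible for free school meals
    ethnicity_coefficient=0.0,       # from the DCSF ethnicity table (not reproduced here)
    fsm_ethnicity_interaction=0.0,   # from the DCSF FSM/ethnicity table (not reproduced here)
):
    score = 162.1
    score += 0.3807 * school_avg_ks2 ** 2
    score -= 5.944 * school_avg_ks2
    score += 1.396 * (ks2_english - school_avg_ks2)
    score -= 0.109 * (ks2_maths - school_avg_ks2)
    if in_care:
        score -= 27.1
    score -= 59.51 * idaci
    if school_action_sen:
        score -= 34.37
    if action_plus_or_statement_sen:
        score -= 65.76
    if joined_after_sept_year_10:
        score -= 73.55
    if joined_mid_years_7_to_9:
        score -= 23.43
    if female:
        score += 14.52
    score -= 12.94 * age_within_year
    if eal:
        score += -8.328 - 0.1428 * school_avg_ks2 ** 2 + 4.93 * school_avg_ks2
    score += ethnicity_coefficient
    if fsm:
        score += -22.9 + fsm_ethnicity_interaction
    score += 1.962 * cohort_avg_ks2
    score -= 4.815 * cohort_sd_ks2
    return score
```

On the DCSF account quoted later (DCSF 2007, p.7), a pupil's CVA residual is then the difference between their actual 'best 8' KS4 score and a prediction of this kind.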

 

Is the claim that a complex calculation such as this provides a solid basis for comparing school performance actually true?

 

 

Relative errors in measurement

 

For any real measurement that we use for analysis we can assume the possibility of measurement error. Measurement error in this context means a difference between the ideal or perfectly isomorphic representation of something and our achieved measure. If someone actually has three children but our measurement claims that they have two children, then our measurement of the number of children is in error by one. This simple discrepancy is often termed the absolute error. A more useful way of envisaging such an error is as a fraction of the measurement itself - the relative error. In this example, the relative error is 1/3. In trying to measure 3 we are out by 1. If we were out by 1 in attempting to measure the number of children in the entire country, this would not be such a serious measurement error, and the relative error would be much smaller than 1/3. Of course, in almost all conceivable practical situations we will not know the true measure, and will have only our achieved measure to base the calculation on. Thus, we cannot calculate either the absolute or relative errors precisely - but we can often estimate their boundaries (dealt with in a later section). For any real number X which we seek to capture in our measurement, we could denote our achieved measure as x. The absolute error in our achieved measure is |X-x|, and the relative error is |X-x|/|X|.
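These definitions are trivial to express in code. The short Python sketch below (function names mine) simply restates them, using the worked example of the number of children; in practice, of course, the true value is never known.

```python
# Absolute and relative error as defined above. These can only be computed for
# assumed or illustrative values of the true measure.

def absolute_error(true_value, measured_value):
    return abs(true_value - measured_value)

def relative_error(true_value, measured_value):
    return absolute_error(true_value, measured_value) / abs(true_value)

print(absolute_error(3, 2))   # 1: out by one child in a family of three
print(relative_error(3, 2))   # 0.333...: a relative error of 1/3
```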

 

 

Sources of errors in measurement

 

In a typical school effectiveness model, there are prior attainment scores (such as Key Stage 2 in England) and subsequent attainment scores for each pupil (such as Key Stage 4). In addition, there will be contextual variables for each pupil (and perhaps also for their schools) such as special needs, eligibility for free school meals, and sex. If we assume that none of these figures is necessarily perfect, what are some of the potential sources of error?

 

Obviously, there will tend to be missing cases. Any survey will have non-response caused by a number of factors. Potential respondents may refuse to participate, may be absent when the survey takes place, may be in transition between schools, or may not be in, or registered for, a school. In England, all schools are annually required to provide figures for the National Pupil Database (NPD) and the Pupil-level Annual Schools Census (PLASC). Both databases ostensibly have records for all pupils at school in England (but necessarily exclude any pupils not registered). A glimpse of the importance of missing data is revealed by the fact that in some years around 10% of the individual pupil records are un-matched across the two databases (see below). Even for those cases that are in the databases, some have missing values for some variables. So, any analysis involving those variables is faced with additional missing cases, over and above the unmatched cases. NPD/PLASC is a high quality dataset, much better than any analyst would hope to generate through primary data collection, and yet missing data remains a substantial problem.

 

Just as obviously, the information in such databases could be incorrect in the first place. Assessment via examination, project, coursework or teacher's grading is an imperfect process. There are huge and well-documented issues of comparability in assessment scores between years of assessment, curriculum subjects, modes of assessment, examining boards, and types of qualifications (among other issues, see Nuttall 1979, Newton 1997, Gorard 2000b). Public assessment is generally handled well in England, and yet moderation will be imperfect and mistakes will be made. If we take the underlying competence of the pupil as the true measure wanted in an assessment, even a perfect assessment instrument could lead to error in the achieved measure due to differences in the setting for the assessment (a fire alarm going off in one examination hall, for example), time of day, inadvertent (and sometimes deliberate) teacher assistance, the health of the candidate, and so on. Competence is not an easy thing to measure, unlike the length of the exam hall or the number of people in it. However well-constructed the assessment system, we must assume a reasonable level of measurement error. The same applies to any contextual variables. Even in NPD/PLASC with a simple binary code for sex, a few pupils are coded as male in one and female in the other database (more have nothing coded, and one or two have an invalid code, presumably from a data entry error). The error component in variables such as FSM, ethnicity, first language, and perhaps most particularly SEN, is even greater (see next section).

 

Once the measurements have been taken, they must be coded; the real-world data is converted into a standard format. Like all processes, this is subject to a low level of error even when conducted diligently, and not all such errors will be spotted by quality control systems dealing with hundreds of variables relating to millions of pupils every year. Then the data must be entered (transcribed) and low-level errors are liable to creep in again. Data can even be corrupted in storage (magnetic dropout undetected by parity checks and similar) and in sorting and matching of cases (most often caused by incorrect selection of rows or columns). Even a value for a pupil that is present and entered and stored 'correctly' is liable to be in error, due to the change in number base and/or the finite number of binary digits used to store it. The simple decimal fraction 0.1, for example, cannot be exactly represented in the binary numbering system used by computers and calculators. Thus, representing 0.1 in one byte (8 bits) would lead to an error of over 6% even in an otherwise perfect measurement. However many bits are allocated by a computer to storage of a number there will be, by definition, an infinite number of denary values that cannot be stored precisely. Increased accuracy decreases these representational errors, but cannot eliminate them.[5]
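The representational point is easy to demonstrate in any programming language. The Python sketch below shows the standard double-precision case rather than the one-byte case mentioned above, but the principle is the same: 0.1 is stored only as an approximation.

```python
# The decimal fraction 0.1 has no exact binary representation, so even a 'perfect'
# measurement of 0.1 is stored with a small, entirely predictable, representational error.
from decimal import Decimal

print(Decimal(0.1))            # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.1 + 0.1 == 0.3)  # False: the stored approximations accumulate
```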

 

It is perhaps also worth pointing out at this stage in the argument that any analysis using real data with some combination of these (almost) inevitable measurement errors will be biased, and so will lead to an incorrect result. Of course, the more accurate the measures are the closer to the ideal correct answer we can be. However, we have no reason to believe that any or all of these sources of error lead to random measurement error (of the kind that might come from random sampling variation, for example). Those refusing to take part in a survey, those not registered at school, those unwilling to reveal their family income or benefit (for free school meal eligibility purposes) cannot be imagined as some kind of random sub-set of the school population. Similarly, representational errors in denary/binary conversion are part of the numbering systems involved and entirely predictable (given enough time and care). Like every stage in the error generation process described so far, they are not random in nature, occurrence or source.

 

 

Estimated accuracy of relevant measures

 

Returning to the case study of CVA calculations in England (formula above), consider first the issue of missing data. We know that independent fee-paying schools are not involved. So the PLASC/NPD dataset includes only 92% of the age cohort at best (minus also those educated at home, by other means, and some cases simply not registered at all). In 2007, the dataset for the KS4 (15-year-old) cohort contained records for 673,563 pupils. However, every variable, including both contextual and attainment variables, had a high proportion of missing cases. For example, at least 75,944 pupils were missing a code for FSM eligibility. This represents over 11% of cases. The variable recording whether a pupil was in care had at least 80,278 cases missing. This represents 12% of the total. Even estimating pupil background via the characteristics of households in their postcode area does not help: at least 69,902 (over 10%) of the IDACI scores are missing. Some data is effectively missing even when not coded as such, for example the codes 'Refused' and 'Not obtained', which are additional to the missing data on pupil ethnicity. There is some overlap between these missing cases, but only some. For example, if we delete from the 2007 PLASC/NPD all cases missing FSM, in care, special needs, sex and/or ethnicity data then the database drops in size to 577,115 pupils. This means that the CVA calculation is conducted with only 85% of all records complete just in terms of these five key contextual variables. If we consider missing values in other contextual variables such as first language, and also in attainment scores (there are many of these), it becomes clear that missing cases and missing data are a huge problem for any analysis of PLASC/NPD. Any pupil moving to or from one of the other home countries of the UK, such as Wales, where some statutory testing has been abolished, will have missing scores for one or more Key Stages. And then, of course, there will be an equivalent proportion of additional missing values in the KS2 dataset (for a KS2 to KS4 CVA calculation).

 

In practice, missing cases are simply ignored, and missing values are replaced with a default substitute - usually the mean score or modal category (and male for sex of pupil). So, the DCSF analysts assume that pupils without IDACI scores (usually because they have no post code) live in average income neighbourhoods, and that where we do not know when a pupil joined their present school we should assume that they have been in attendance for a long time. Anyone whose eligibility for FSM is not known is assumed not to be living in poverty, anyone without a KS2 or KS4 score is an average attainer, and so on. These are very questionable assumptions. There is plenty of evidence of differences between pupils with complete and incomplete values in such datasets (Amrein-Beardsley 2008). These kinds of assumptions have to be made in order not to lose the 20% or more of cases with at least one missing value in a critical variable. But making these unjustified assumptions then means that 20% or more of cases are very likely to have an incorrect value in at least one critical variable. I do not believe that users of CVA are aware of the scale of this unresolved problem, and its likely impact on the results.[6]

 

The scores that are present in PLASC/NPD will themselves contain reasonably high levels of error (over and above those errors caused by missing data or assumptions about missing data). Indeed, all pupil records must be assumed to contain variables in error, since some variables are general aggregated scores. For example, the IDACI scores for all pupils are calculated on the basis of scores for all households in England. Since the dataset used for this purpose does not, in fact, contain data for all households (Gorard 2008a), all IDACI scores have an error component due to missing data over and above any errors in measuring household income. There will be errors in the various assessments used to generate the point scores at KS2 and KS4. Then the CVA analyst is faced with issues of aggregation and comparability. For example, the KS4 analysis involves GCSEs handled by different examining boards, sometimes taken in different years, and covering different subjects and tiers of entry. Some GCSEs will be short courses and some full. Even if an analyst is fairly sure about the comparability and reliability of such scores, these will have to be aggregated with results from an increasing number of different qualifications. In 2007, these included GNVQ Intermediate, NVQ, National Certificate in Business, BTEC, Key Skills, Basic Skills, and Asset Language Units. These all have to be converted to the common 'currency' of point scores, despite the fact that their grading structures are completely different. No one should try to claim that this aggregation to 'best 8' points scores does not add further errors.

 

In many ways, the contextual measures in CVA are even more problematic (over and above their missing values). Special educational needs, for example, are represented by a variable having three possible sources (School Action, Action Plus, or a statement). Some of these are the responsibility of the school, and some are sensitive to the actions of parents motivated to gain extra time in examinations for their children. The number of pupils with recorded SEN shows huge variation over years in the same schools, and the proportions recorded differ markedly in different parts of England (Gorard et al. 2003). Ethnic groups (based on 19 categories for CVA) are notoriously difficult to classify (Gorard 2008a). Here they are used in interaction with FSM eligibility (itself an incomplete measure). First language is almost as complex to classify as ethnic group. Is it home language, language of origin, or language of choice? Here it is used in interaction with prior attainment scores, since having a language other than English is deemed a disadvantage for low prior attainers, but not for high attainers.

 

The CVA formula used by DCSF takes points scores for each individual to two decimal places of accuracy (the worked example on their website uses 29.56, i.e. a figure claiming to be correct to 5/1000ths of a point), and multiplies them by coefficients with four decimal places (such as +0.3807). So the first term after the constant in the CVA formula could be 29.56 squared times 0.3807, which is 332.6532235 (correct to 5 parts in 10 million!). This is pseudo-quantification of the worst kind. There is no way that the initial figures in the CVA calculation are accurate enough to sustain this kind of calculation.

 

Each of the two attainment scores in a school effectiveness model (and of course the contextual variables used) will have the kinds of errors illustrated so far. It would be conservative to imagine that the national assessment system was 90% accurate as a whole (or that 90% of pupils achieved the correct mark/grade). It would also be quite conservative to imagine that, overall, only around 10% of the cases or variables used in a school effectiveness calculation were missing (or incorrectly replaced by defaults). This means that each attainment score is liable to be no more than 80% accurate - or put another way the relative error is at least 20% in each set of figures used in a school effectiveness calculation. Of course, the relative error might be much higher than this, it will vary between datasets and variables anyway, and some analysts might object that 20% is too high. In what follows, it does not really matter if we can argue that each score is 70% or 90% accurate. Let us proceed with an estimate of 20% error and see what happens.

 

 

Propagation of relative errors

 

Errors are said to 'propagate' through calculations, meaning that everything we do with our achieved measures we also do with their measurement errors. And, of course, we introduce further small errors when the intermediate and final results of our calculations cannot be represented exactly in the number of bits allocated by a computer/calculator to store them (see above). The relative error changes as a consequence. If we have two numbers X and Y measured imperfectly as x and y with corresponding absolute errors εx and εy then:

 

x = X ± εx

 

and

 

y = Y ± εy

 

When we attempt to calculate X-Y, we actually get (X ± εx) - (Y ± εy). The upper bound for this is (X-Y) + (εx + εy). Put another way, since we do not know whether the errors in either number are positive or negative when we subtract we may be adding the error components (and vice versa of course). I focus on subtraction here for two reasons. First, school effectiveness models are at heart based on a gain score from one attainment period to another and this can be expressed as a subtraction. Second, in school effectiveness both of the attainment scores are positive (or zero). This means that X-Y (or whatever) will be smaller than X (and probably smaller than Y as well). So, finding the difference (gain) between X and Y reduces the number we use as our achieved measure (X-Y) while at the same time increasing the upper error bound by adding together the individual error components of X and Y. Put more starkly, the maximum relative error in the result increases.

 

Imagine that one prior attainment score (perhaps a KS2 point score for one pupil in England) is 70 and that the subsequent attainment score for the same pupil (perhaps a KS3 point score) is 100. This gives a manifest gain of 30 points (from KS2 to KS3). Using the conservative estimates above, we could say that the first score, being only 80% accurate, actually represents a true figure somewhere between 56 and 84. The second true figure is somewhere between 80 and 120. Thus, under these assumptions our achieved estimate of the gain score, 30, really lies between -4 and 64. The maximum relative error has changed from an estimated 20% in each of the original figures to well over 100% in our computed answer. Subtracting two positive numbers of a similar order of magnitude dramatically increases the relative error bounds in the answer, and this applies whatever the relative error was in the original figures. For example, even using an unrealistically high estimate of 90% overall accuracy in our original figures leads to nearly 60% error in the result. And there is no way to avoid it. A measurement with a 20% margin for error is frequently usable in social science, but an answer with a 100% margin is generally useless. It is certainly no basis for making national policy, rewarding heads, informing parents, condemning teachers, or closing schools.
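The interval arithmetic in this example can be checked in a few lines. The sketch below (function name mine) assumes, as above, that each score may be out by a fixed proportion in either direction.

```python
# Bounds on a gain score (later - prior) when each score carries a given relative error.

def gain_bounds(prior, later, rel_err):
    prior_lo, prior_hi = prior * (1 - rel_err), prior * (1 + rel_err)
    later_lo, later_hi = later * (1 - rel_err), later * (1 + rel_err)
    return later_lo - prior_hi, later_hi - prior_lo

print(gain_bounds(70, 100, 0.20))  # approx (-4, 64): the manifest gain of 30 could even be negative
print(gain_bounds(70, 100, 0.10))  # approx (13, 47): still nearly 60% out at worst
```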

 

Of course, the true answer will not be out by 100% for all cases (the figures are bounds), and one might imagine that the errors would not grow but diminish around half of the time (when both initial errors are either positive or negative, in a subtraction). Even so, this does not help. Who would want to base real life decisions about education on evidence that was only around half sensible, especially when we would not know which half? It is also important to recall that these bounds do not represent likelihoods, or have any kind of normal distribution, because the errors are not random in nature. An error of 100% is as likely as one of 50% or zero. And for any one school, region, key stage, subject of assessment, examination board, or socio-economic group all (or most) of the errors could be in the same direction. There is no kind of statistical treatment based on probability theory that can help overcome these limitations. Whether as simple as confidence intervals or as complex as multi-level modelling, these techniques are all irrelevant.

 

In England, the model for contextualised value-added analysis used by the Department for Children, Schools and Families (DCSF) involves finding for all pupils 'the difference (positive or negative) between their predicted and actual attainment' (DCSF 2007, p.7). The predicted attainment for any one pupil is based on the average gain score for all pupils with the same prior attainment (adjusted for contextual information). Averaging the gain scores cannot increase, and will almost inevitably reduce, the maximum relative error in the gain scores discussed above. Of course, there will still be an error component. The average cannot adjust for missing cases or values; nor can it eliminate systematic bias. But averaging can clearly reduce the maximum error. On the other hand, each Key Stage points score has a range that is higher than the previous (Key Stage 4 pupils attain scores in the hundreds). Therefore, the gain scores across Key Stages tend to be substantial (30 in our example above). But the difference between any pupil's predicted and actual attainment in one Key Stage will tend to be insubstantial for two reasons. First, the predicted and actual attainment scores are using the same points system, and so are more closely related even than being of the same order of magnitude. Second, if the predicted and actual attainment scores were not very close for a majority of pupils then the model would not be any good. This means that the manifest figure we use for the pupil value-added score is usually very small, perhaps even negligible, in comparison to the attainment scores from which it is calculated.

 

Imagine that one pupil has the same subsequent attainment score as above (100) and their predicted score was also 100 (i.e. the prediction was perfect). Their individual CVA score is then 0, making the theoretical maximum relative error infinite. As the actual and predicted scores diverge (i.e. as the prediction worsens) the relative error in the results becomes finite and declines somewhat, simply because the manifest answer is becoming larger. If the actual attainment was 100 with only a 10% relative error, and the predicted score was 95, and ignoring the error component in the prediction for the moment, then the manifest CVA score is 5. But the actual attainment lies between 90 and 110, so the true CVA score lies between -5 and +15. The relative error in the CVA score, even under these favourable assumptions and ignoring problems in the predicted score, is 200% or more. Any such figure with a 200% maximum relative error should not be used to make practical decisions about real lives. We do not even know whether the score should be positive (the pupil has done better than expected) or negative (the pupil has done worse).

 

In reality the situation is far worse than this because the relative error in the predicted score, calculated via CVA, is liable to be considerably greater than in the actual attainment score. There will, anyway, be errors in both scores, making the range of possible residuals and the relative error in the residuals much greater. In the preceding example, if the predicted score of 95 has a 20% error component, then the actual score lies between 90 and 110 and the true predicted score lies between 76 and 114. Thus, the computed CVA score is still +5 but the real CVA score could be anything from +34 (110-76) to -24 (90-114). An initial error component of 10 to 20% leads to a result with a maximum relative error of 1,160%.

 

If the predicted score was 99, the CVA score would be 1, and the relative error in that score would be an enormous 5,900% (it would truly lie between -28 and +31).
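The same interval logic reproduces these residual examples (a sketch with my own function name, under the stated assumptions of a 10% error in the actual score and a 20% error in the predicted score).

```python
# Bounds on a residual (actual - predicted) when each score carries its own relative error.

def residual_bounds(actual, predicted, rel_err_actual, rel_err_predicted):
    actual_lo, actual_hi = actual * (1 - rel_err_actual), actual * (1 + rel_err_actual)
    pred_lo, pred_hi = predicted * (1 - rel_err_predicted), predicted * (1 + rel_err_predicted)
    return actual_lo - pred_hi, actual_hi - pred_lo

print(residual_bounds(100, 95, 0.10, 0.20))  # approx (-24, +34), against a manifest residual of +5
print(residual_bounds(100, 99, 0.10, 0.20))  # approx (-28.8, +30.8), against a manifest residual of +1
```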

 

Even if the predicted score was 90 with a CVA score of 10, the relative error in that score would be 560% (it would truly lie between -18 and +38). As predicted and attained scores diverge the relative error declines, but we then have to start questioning the premise of the models. If the predictions are so far out that we can begin to ignore the error components, is this better or worse for the school effectiveness model? In order to retain something like the relative error of 20% in the original predicted score, the prediction would have to be out by a long way. For example, even a predicted score of 50 for an actual score of 100 leads to a CVA score of 50, which really lies between +30 and +70. This result has a maximum relative error of 80%. So the result (at this stage in the model) has a measurement error component still four times greater than in the initial score(s). But the prediction is way out. If we assume that the school effectiveness model is capturing anything sensible at all, this pupil can be deemed to have done very well (or to have done very badly in the prior assessment, or both). This is true even if the maximum error applies. How can we tell whether any CVA score (for pupil, teacher, department, school or area) is of this kind, where we cannot be sure about the precise figure but we can be sure that the result is so far away from that predicted as to dwarf any error component?

 

 

The allure of technical solutions

 

Unfortunately the field of school effectiveness research works on the invalid assumption that errors in the data are random in nature and so can be estimated, and weighted for, by techniques based on random sampling theory. These techniques are fatally flawed, in their own terms, even when used 'correctly' with random samples (Gorard 2010). The conditional probabilities generated by sampling theory tell us, under strict conditions and assumptions, how often random samples would generate a result as extreme as, or more extreme than, the one we might be considering. The p-value in a significance test tells analysts the probability of observing a score as or more extreme, assuming that the score is actually no different from zero (and so that the divergence from zero is the result of random sampling variation alone). Of course, this conditional probability of the data given the nil null hypothesis is not what the analysts want. In a school effectiveness context such as the ones outlined above, the analyst wants to know whether the CVA score (the residual, whether for individual or school) is large enough to take note of (to dwarf its relative error). They actually want the probability of the null hypothesis given the data they observed. They could convert the former to the latter using Bayes' Theorem, as long as they already knew the underlying and unconditional probability of the null hypothesis anyway. But they cannot know the latter. So they pretend that the probability of the data given the null hypothesis is the same as, or closely related to, the probability of the null hypothesis given the data, and use the p-value from significance tests to 'reject' the null hypothesis on which the p-value is predicated. This modus tollens kind of argument does not work with likelihoods, for a number of reasons, including Jeffreys' so-called paradox that a low probability for the data can be associated with a high probability for the null hypothesis, or a low one, or a mid-range value, and vice versa. It depends on the underlying probability of the null hypothesis - which we do not know.
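The gap between the two conditional probabilities can be made concrete with a toy Bayes' Theorem calculation. The numbers below are my own illustration, not taken from any real analysis; the point is only that the posterior probability of the null hypothesis depends on its prior probability, which is unknown in practice.

```python
# P(null | data) via Bayes' Theorem for a simple two-hypothesis setup.

def posterior_null(p_data_given_null, p_data_given_alt, prior_null):
    joint_null = p_data_given_null * prior_null
    joint_alt = p_data_given_alt * (1 - prior_null)
    return joint_null / (joint_null + joint_alt)

# The same data, 'unlikely under the null', combined with three different priors:
for prior in (0.1, 0.5, 0.9):
    print(prior, posterior_null(p_data_given_null=0.04,
                                p_data_given_alt=0.20,
                                prior_null=prior))
# prior 0.1 -> posterior ~0.02; prior 0.5 -> ~0.17; prior 0.9 -> ~0.64
```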

 

So, even used as intended, p-values cannot help most analysts in the SE field. The same applies to standard errors, and confidence intervals and their variants. But the situation is worse than this because in the field of school effectiveness, these statistical techniques based on sampling theory are hardly ever used as intended. Most commonly, the sampling techniques are used with population figures such as NPD/PLASC. In this context, the techniques mean nothing.[7] There is no sampling variation to estimate when working with population data (whether for a nation, region, education authority, school, year, class, or social group). There are missing cases and values, and there is measurement error. But these are not generated by random sampling, and so sampling theory cannot estimate them, adjust for them, or help us decide how substantial they are in relation to our manifest data. Despite all this, DCSF use and attempt to defend the use of confidence intervals with their population CVA data.

 

A confidence interval, remember, is an estimate of the range of values that would be generated by repeated random sampling. If equivalent random samples were drawn repeatedly from a specified population, and a 95% confidence interval was calculated for each one, then 95% of these intervals would probably include the population mean for any measure. This is of no real use to an analyst, even when calculated with appropriate data, for the same reasons as for p-values. The analyst wants a probable range for the true value of the estimate, but to get this they would have to have access to underlying data that is never available to them. And as with p-values, it does not even make sense to calculate a confidence interval for population data of any kind. Confidence intervals are therefore of no use in standard school effectiveness research.[8]
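What a confidence interval does license can be shown with a short simulation under idealised conditions: genuinely repeated random sampling from a known population, which is precisely what analyses of population data such as NPD/PLASC do not have. This is my own sketch, not part of any official procedure.

```python
# Repeatedly draw random samples from a known population and check how often a
# nominal 95% interval around the sample mean covers the true population mean.
import random
import statistics

random.seed(1)
population_mean, population_sd, n, trials = 100.0, 15.0, 50, 2000
covered = 0
for _ in range(trials):
    sample = [random.gauss(population_mean, population_sd) for _ in range(n)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    if mean - 1.96 * se <= population_mean <= mean + 1.96 * se:
        covered += 1
print(covered / trials)   # close to 0.95 - a statement about the long run, not about any one interval
```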

 

However, the field as a whole simply ignores these quite elementary logical problems, while devising more and more complex models comprehended by fewer and fewer people. Perhaps the most common inappropriate complex technique used in this field is multi-level (hierarchical linear) modelling. This technique was devised as one of many equivalent ways of overcoming the correlation between cases in cluster-randomised samples (Gorard 2009a). This, like all other techniques based on sampling theory, is of no consequence for school effectiveness work based on population figures. Advocates now claim that such models have other purposes - such as allowing analysts to partition variation in scores between levels such as individuals, schools and districts. But such partitioning can, like overcoming the inter-correlation in clusters, be done in other and generally simpler ways. Anyway, the technique is still pointless. Most such models do not use districts or areas as a level, and those that do tend to find little or no variation there once other levels have been accounted for (Smith and Street 2006, Tymms et al. 2008). We know that pupil-level variables, including prior attainment and contextual values, are key in driving school outcomes. The question remains, therefore, whether there is a school effect. If our pupil-level predictions of subsequent attainment are less than perfect, we could attribute much of the residual unexplained variation to the initial and propagated measurement error in our data. To use multi-level modelling to allocate most of this residual variation to a 'school effect' instead is to assume from the outset that which the modelling is supposed to be seeking or testing.

 

 

So why does school effectiveness seem to work?

 

Why, if the foregoing is true, do so many analysts, policy-makers, users and practitioners seem to believe that school effectiveness yields useful and practical information? It is tempting to say that perhaps many of them have not really thought about the process and have simply bought into what appears to be a scientific and technical solution to judging school performance. I use the term 'bought' advisedly here because part of the answer might also lie in the money to be made. In England, school effectiveness has become an industry, employing civil servants at DCSF and elsewhere, producing incentives for teachers deemed CVA experts in schools, creating companies and consultants to provide data analysis, paying royalties to software authors, and funding academics through research and funding council money from the taxpayer. A cynical view would be that most people in England do not understand CVA, and many that do (or might) understand it stand to gain from its use in some way.

 

There is sometimes no consistent adherence to school effectiveness as a model, even among individual policy-makers and departments. Some of the schools required by DCSF in 2008 to take part in the National Challenge, because their (raw-score) results were so poor, were also sent a letter from DCSF congratulating them on their high value-added results and asking them to act as models or mentors for emergent Academies. The 'paradox of the National Challenge Scheme' continues (Maddern 2008, p.23). Again, a cynic might say that users use raw scores when it suits them (traditional fee-paying schools seem uninterested in value-added while often having very high raw scores, for example), and they use value-added when that paints a better picture.

 

However, it is possible that the problem stems chiefly from our lack of ability to calibrate the results of school effectiveness models against anything except themselves. In everyday measurements of time, length, temperature and so on we get a sense of the accuracy of our measuring scales by comparing measurements with the qualities being measured (Gorard 2009b). There is no equivalent for CVA (what Amrein-Beardsley refers to as criterion-related validity). The scores are just like magic figures emerging from a long-winded and quasi-rational calculation. Their advocates claim that these figures represent 'solid' and fair school performance measures, but they can provide nothing except the purported plausibility of the calculation to justify that. Supposing, for the sake of argument, that the calculation did not work for the reasons given in this paper so far. What would we expect to emerge from it? The fact that the data is riddled with initial errors and that these propagate through the calculation does not mean that we should expect the results for all schools to be the same, once contextualised prior attainment is accounted for. The bigger the deviations between predicted and attained results, of the kind that SE researchers claim as evidence of effectiveness, the more this could also be evidence of the error component. In this situation, the bigger the error in the results, the bigger the effect might appear to some. So, we cannot improve our approach to get a bigger effect to outscore the error component. Whatever the residuals are, we simply do not know if they are error or effect. We do know, however, that increasing the quality and scale of the data is associated with a decrease in the apparent school effect (Tymms 2003).

 

If the VA residuals were actually only error, how would the results behave? We would expect CVA results to be volatile and inconsistent over years and between key stages in the same schools. This is what we generally find (Hoyle and Robinson 2003, Tymms and Dean 2004, Kelly and Monczunski 2007). Of course, in any group of schools under consideration, some schools will have apparently consistent positive or negative CVA over a period of time. This, in itself, means nothing. Again imagine what we would expect if the 'effect' were actually all propagated error. Since CVA is zero-sum by design, around half of all schools in any one year would have positive scores and half negative. If the CVA were truly meaningless, then we might expect around one quarter of all schools to have successive positive CVA scores over two years (and one quarter negative). Again, this is what we find. Post hoc, we cannot use a run of similar scores to suggest consistency without consideration of what we would expect if the scores meant nothing. Thomas, Peng and Gray (2007) looked at successive years of positive VA in one English district from 1993 to 2002. They seemed perplexed that 'it appears that only one in 16 schools managed to improve continuously for more than four years at some point over the decade in terms of value-added' (p.261). Yet 1 in 16 schools with four successive positive scores is exactly how many would be predicted assuming that the scores mean nothing at all (since 2⁻⁴ equals 1/16). Leckie and Goldstein (2009) explain that VA scores for the same schools do not correlate highly over time. A number of studies have found VA correlations of around 0.5 and 0.6 over two to five years for the same schools. Whatever it is that is producing VA measures for schools, it is ephemeral. A correlation of 0.5 over 3 years means that only 25% of the variation in VA is common to all years. Is this any more than we would expect by chance? What is particularly interesting about this variability is that it does not appear in the raw scores. Raw scores for any school tend to be very similar from year to year, but the 'underlying' VA is not. Is this then evidence, as Leckie and Goldstein (2009) would have it, that VA really changes that much, or does it just illustrate again the central point in this paper that VA is very sensitive to the propagation of relative error?
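The 'what if the scores meant nothing' benchmark can itself be checked by simulation. The sketch below (my own) treats the sign of each school's VA in each year as an independent coin flip, which is the null situation described above.

```python
# Proportion of 'schools' showing two, and four, successive years of positive VA
# when the sign each year is just a coin flip.
import random

random.seed(1)
schools, years = 100_000, 4
runs_of_two = runs_of_four = 0
for _ in range(schools):
    positive = [random.random() < 0.5 for _ in range(years)]
    if positive[0] and positive[1]:
        runs_of_two += 1
    if all(positive):
        runs_of_four += 1
print(runs_of_two / schools)    # about 0.25: one quarter with two successive positive years
print(runs_of_four / schools)   # about 0.0625: roughly 1 school in 16 with four in a row
```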

 

The coefficients in the CVA model, fitted post hoc via multi-level regression, mean nothing in themselves. Even a table of completely random numbers can generate regression results as coherent (and convincing to some) as SE models (Gorard 2008b). With enough variables, combinations of variables and categories within variables (remember the 19 ethnic groups in interaction with FSM in CVA, for example) it is possible to create a perfect fit (R² = 1.00) from completely nonsensical data. In this context, it is intriguing to note the observation by Glass (2004) that one school directly on a county line was attributed to both counties in the Tennessee Value Added Assessment System and two VA measures were calculated. The measures were completely different - almost as though they did not really mean anything at all. Even advocates and pioneers of school effectiveness admit that the data and models we have do not allow us to differentiate, in reality, between school performances. 'Importantly, when we account for prediction uncertainty, the comparison of schools becomes so imprecise that, at best, only a handful of schools can be significantly separated from the national average, or separated from any other school' (Leckie and Goldstein 2009, p.16).

 

Of course, the key calculation underlying CVA is the creation of the residual between actual and predicted pupil scores. Since this is based on two raw scores (the prior and current attainment of each pupil), it should not be surprising to discover that VA results are highly correlated with both of these raw scores (Gorard 2006, 2008c). The scale of this correlation is now routinely disguised by the contextual figures used in CVA, but it is still there. In fact, the correlation between prior and current attainment is the same as the correlation between prior attainment and VA scores. Put more simply, VA calculations are flawed from the outset by not being independent enough of the raw scores from which they are generated. They are no more a fair test of school performance than raw scores are.

 

 

Damage caused by school effectiveness

 

Does any of this matter? I would argue that it does. Schools, heads and teachers are being routinely rewarded or punished on the basis of this kind of evidence. Teachers are spending their time looking at things like departmental value-added figures and distorting their attention to focus on particular areas or types of pupils. School effectiveness results have been used to determine funding allocations and to threaten schools with closure (Bald 2006, Mansell 2006). The national school inspection system in England, run by OFSTED, starts with a CVA, and the results of that analysis partly pre-determine the results of the inspection (Gorard 2008c). Schools are paying public funds to external bodies for value-added analyses and breakdowns of their effectiveness data. Parents and pupils are being encouraged to use school effectiveness evidence (in league tables, for example) to judge their schools and potential schools. If, as I would argue, the results are largely spurious this means a lot of time and money is wasted and, more importantly, pupils' education is being needlessly damaged.

 

However, the dangers of school effectiveness are even greater than this. School effectiveness is associated with a narrow understanding of what education is for. It encourages, unwittingly, an emphasis on assessment and test scores - and teaching to the test - because over time we tend to get the system we measure for and so privilege. Further, rather than opening information about schools to a wider public, the complexity of CVA and similar models excludes and so disempowers most people. These are the people who pay tax for, work in, or send their children to schools. Even academics are largely excluded from understanding and so criticising school effectiveness work (Normand 2008). Relevant academic work is often peer-reviewed and 'quality' checked by a relatively small clique. School effectiveness then tends to monopolise political expertise on schools and public discussion of education, even though most policy-makers, official bodies like OFSTED, and the public simply have to take the results on trust.

 

The widespread use of CVA for league tables, official DCSF performance data, and in models of school effectiveness also has the inadvertent impact of making it harder to examine how well schools are doing with different groups of pupils. One of the main reasons for initially setting up a free (taxpayer-funded), universal and compulsory system of schools was to try and minimise the influence of pupil family background. The achievement gaps between rich and poor, or between ethnic and language groups, give schools and society some idea of how well that equitable objective is being met. What CVA does is to recognise that these gaps exist but then makes them invisible by factoring them into the VA prediction. It no longer makes sense to ask whether the CVA is any different in a school or a school system for rich and poor, or different ethnic and language groups. DCSF (2007) appear to recognise this danger when they say (in bold, p.2) 'CVA should not be used to set lower expectations for any pupil or group of pupils'. This means - bizarrely - that a school with a high level of poverty will be correctly predicted to have equivalently lower outcomes, but at the same time must not 'expect' lower outcomes.

 

Finally, for the present section, it is important to recall that VA, CVA and the rest are all zero-sum calculations. The CVA for a pupil, teacher, department, school, or district is calculated relative to all others. Thus, around half of all non-zero scores will be positive and half negative. Whether intentionally or not, this creates a system clearly based on competition. A school could improve its results and still have negative CVA if everyone else improved as well. A school could even improve its results and get a worse CVA than before. The whole system could improve and half of the schools would still get negative CVA. Or all schools could get worse and half would still get positive CVA scores. And so on. It is not enough to do well. Others have to fail for any school to obtain a positive result. Or more accurately, it is not even necessary to do well at all; it is only necessary to do not as badly as others. This is a ridiculous way of calculating school performance sui generis, as shown in this paper so far. But why, in particular, design the monitoring system like that at the same time as asking schools in England to form partnerships and federations, and to co-operate more and more in the delivery of KS3 and the 14-19 Reform Programme?

 

 

What does it all mean?

 

In my opinion the whole school effectiveness model, as currently imagined, should be abandoned. It clearly does not and could not work as intended, so it causes all of the damage and danger described above for no good reason. It continues partly as a kind of Voodoo science (Park 2000), wherein adherents prefer to claim they are dealing with random events, making it easier to explain away the uncertainty and unpredictability of their results. But it also continues for the same reasons as it was created in the first place. We want to be able to measure school performance, and we know that mere raw-score figures tell us largely about the school intake. However, we must not continue with school effectiveness once aware of its flaws simply because we cannot imagine what to do instead. The purpose of this paper is to try and end the domination of school policy and much of research by the standard school effectiveness model. That is a big task for one paper. It is not my intention here to provide a fully worked-out alternative.[9]

 

We perhaps need to re-think what we mean by a school effect. In traditional models, a school effect refers to the difference going to one school makes in comparison to going to another school. There are many other possible meanings we could operationalise, including what difference it makes going to one school as opposed to not going to school at all. We need to decide whether we are happy for a school effect to be zero-sum, whether a school is really a proper unit of analysis, and how we will estimate the maximum propagation of errors. I would like to see much greater care in the design of research and the collection of data, rather than effort expended on creating increasingly complex and unrealistic methods to analyse the poorer data from existing poor designs and models. We need more active designs for research, such as randomised controlled trials to find out what really works in school improvement, rather than post hoc data dredging. We need more mixed methods studies, and more care and humility in our proposed research claims. Education research is, rightly, a publicly-funded enterprise with potential impacts for all citizens. If our research is as poorly crafted as school effectiveness seems to be then we face two possible dangers. If it has no real-life impact then the research funding has been wasted. There is a significant opportunity cost. Worse, if it has the kind of widespread impact that school effectiveness has had for thirty years, then in addition to the waste of money, incorrect policy and practice decisions will be made, and pupils and families will suffer the consequences.

 

One clear finding that is now largely unremarked by academics and unused by policy-makers is that pupil prior attainment and background explain the vast majority of variation in school outcomes. This finding is clear because its scale and consistency over time and place dwarfs the error component in the calculation (largely because the error does not have a chance to propagate in the same way as for CVA analysis). Why is this not more clearly understood and disseminated by politicians? In England, we have built a system of maintained schools that remains loosely comprehensive, and is funded quite equitably (more so than the USA, for example), on a per-pupil basis adjusted for special circumstances. The curriculum is largely similar (the National Curriculum) for ages 5 to 14 at least, taught by nationally-recognised teachers with Qualified Teacher Status, inspected by a national system (OFSTED), and assessed by standardised tests up to Key Stage 3. Education is compulsory for all, and free at the point of delivery. In a very real sense it sounds as though it would not matter much which school a pupil attends, in terms of qualifications as an outcome. And indeed, that is what decades of research have shown is true.

 

Are parents and pupils being misled into thinking that which school they use does make a substantial difference? Perhaps, or perhaps qualifications at age 16 are not what parents and pupils are looking for when they think of a new school for a child aged 4 or even 10.[10] School choice research suggests that what families are really looking for is safety and happiness for their children (Gorard 1997). When thinking about moving a 10-year-old from a small primary school in which they are the oldest to a much larger, more distant secondary school with students up to the age of 19, security is often the major concern. This is why proximity can be seen as a rational choice. It is also possible that parents know perfectly well that raw-scores are not an indication of the quality of the school attended but of the other pupils attending. Using raw scores might be a rational way for a lay person to identify a school in which learning was an important part of everyday school life. Raw scores, like bus stop behaviour, are used as a proxy indication of school intake.

 

If so, several conclusions might follow. Politicians could disseminate the truth that, in terms of traditional school outcomes, it makes little difference which school a pupil attends. This might reduce the allure of specialisms, selection by aptitude or attainment, faith-based schools, and other needlessly divisive elements of a national school system. It could reduce the so-called premium on housing near what are currently considered good schools, and reduce journey times to school (since the nearest school would be as good as the furthest). All of this would be associated with a decline in socio-economic and educational segregation between schools. SES segregation between schools has been a growing problem in England since 1997 (Gorard 2009b). Reduced segregation by attainment and by student background has many advantages for schools and for wider society, and would set up a repeating cycle, making schools genuinely comprehensive in intake as well as structure and so giving families even less reason to look beyond their nearest schools. It would also mean that, on current figures, no schools would be part of the National Challenge. Schools are earmarked for the National Challenge if less than 30% of their pupils attain the KS4 raw-score benchmark of the equivalent of five good GCSEs. Since the overall national figure is considerably higher than 30%, the National Challenge is less an indication of poor schools and more an indictment of the level of academic segregation in the system. Redistributing school intakes would solve the problem at a stroke.

 

Perhaps even more importantly, once policy-makers understand how CVA works and that they cannot legitimately use it to differentiate school performance, they may begin to question the dominance of the school effectiveness model more generally. We might see a resurgence of political and research interest in school processes and outcomes other than pencil-and-paper test results. Schools are mini-societies in which pupils may learn how to interact, what to expect from wider society, and how to judge fairness (Gorard and Smith 2009). Schools seem to be a key influence on pupils' desire to take part in future learning opportunities, and on their occupational aspirations (Gorard and Rees 2002). All of these outcomes have been largely ignored in three decades of school effectiveness research. It is time to move on.

 

February 2009

 

 

References

 

Amrein-Beardsley, A. (2008) Methodological concerns about the education value-added assessment system, Educational Researcher, 37, 2, 65-75

Bald, J. (2006) Inspection is now just a numbers game, Times Educational Supplement, 26/5/06, p.21

Brown, M. (1998) The tyranny of the international horse race, pp.33-47 in Slee, R., Weiner, G. and Tomlinson, S. (Eds.) School Effectiveness for Whom? Challenges to the school effectiveness and school improvement movements, London: Falmer Press

Coleman, J., Campbell, E., Hobson, C., McPartland, J., Mood, A., Weinfield, F. and York, R. (1966) Equality of educational opportunity, Washington: US Government Printing Office

Coleman, J., Hoffer, T. and Kilgore, S. (1982) Cognitive outcomes in public and private schools, Sociology of Education, 55, 2/3, 65-76

DCSF (2007) A technical guide to the contextual value added 2007 model, http://www.dcsf.gov.uk/performancetables/primary_07/2007GuidetoCVA.pdf, accessed 16/12/08

Glass, G. (2004) Teacher evaluation: Policy brief, Tempe, Arizona: Education Policy Research Unit

Goldstein, H. (2008) Evidence and education policy - some reflections and allegations, Cambridge Journal of Education, 38, 3, 393-400

Gorard, S. (1997) School Choice in an Established Market, Aldershot: Ashgate

Gorard, S. (2000a) 'Underachievement' is still an ugly word: reconsidering the relative effectiveness of schools in England and Wales, Journal of Education Policy, 15, 5, 559-573

Gorard, S. (2000b) Education and Social Justice, Cardiff: University of Wales Press

Gorard, S. (2006) Value-added is of little value, Journal of Educational Policy, 21, 2, 233-241

Gorard, S. (2008a) Who is missing from higher education?, Cambridge Journal of Education, 38, 3, 421-437

Gorard, S. (2008b) Quantitative research in education, London: Sage

Gorard, S. (2008c) The value-added of primary schools: what is it really measuring?, Educational Review, 60, 2, 179-185

Gorard, S. (2009a) Misunderstanding and misrepresentation: a reply to Schagen and Hutchison, International Journal of Research and Method in Education, 32, 1

Gorard, S. (2009b) Measuring is more than assigning numbers, in Walford, G., Tucker, E. and Viswanathan, M. (Eds.) Handbook of Measurement, Sage (submitted)

Gorard, S. (2009c) Does the index of segregation matter? The composition of secondary schools in England since 1996, British Educational Research Journal, (forthcoming)

Gorard, S. (2010) All evidence is equal: the flaw in statistical reasoning, Oxford Review of Education, (forthcoming)

Gorard, S. and Rees, G. (2002) Creating a learning society?, Bristol: Policy Press

Gorard, S. and Smith, E. (2009) The impact of school experiences on students' sense of justice: an international study of student voice, Orbis Scholae, 2, 2, 87-105

Gorard, S., Taylor, C. and Fitz, J. (2003) Schools, Markets and Choice Policies, London: RoutledgeFalmer

Gray, J. and Wilcox, B. (1995) 'Good school, bad school' Evaluating performance and encouraging improvement, Buckingham: Open University Press

Hammond, P. and Yeshanew, T. (2007) The impact of feedback on school performance, Educational Studies, 33, 2, 99-113

Hoyle, R. and Robinson, J. (2003) League tables and school effectiveness: a mathematical model, Proceedings of the Royal Society of London B, 270, 113-119

Kelly, S. and Monczunski, L. (2007) Overcoming the volatility in school-level gain scores: a new approach to identifying value-added with cross-sectional data, Educational Researcher, 36, 5, 279-287

Kyriakides, L. (2008) Testing the validity of the comprehensive model of educational effectiveness: a step towards the development of a dynamic model of effectiveness, School Effectiveness and School Improvement, 19, 4, 429-446

Leckie, G. and Goldstein, H. (2009) The limitations of using school league tables to inform school choice, Working Paper 09/208, Bristol: Centre for Market and Public Organisation

Lubienski, S. and Lubienski, C. (2006) School sector and academic achievement: a multi-level analysis of NAEP Mathematics data, American Educational Research Journal, 43, 4, 651-698

Luyten, H. (2006) An empirical assessment of the absolute effect of schooling: regression-discontinuity applied to TIMSS-95, Oxford Review of Education, 32, 3, 397-429

Maddern, K. (2009) Adding value, but still a challenge, Times Educational Supplement, 16/1/09, p.23

Mansell, W. (2006) Shock of low score drives heads to resign, Times Educational Supplement, 9/6/06, p.6

Newton, P. (1997) Measuring comparability of standards across subjects: why our statistical techniques do not make the grade, British Educational Research Journal, 23, 4, 433-449

Normand, R. (2008) School effectiveness or the horizon of the world as a laboratory, British Journal of Sociology of Education, 29, 6, 665-676

Nuttall, D. (1979) The myth of comparability, Journal of the National Association of Inspectors and Advisers, 11, 16-18

Nuttall, D., Goldstein, H., Prosser, R. and Rasbash, J. (1989) Differential school effectiveness, International Journal of Educational Research, 13, 7, 769-776

Park, R. (2000) Voodoo science, Oxford: OUP

Rutter, M., Maughan, B., Mortimore, P. and Ouston, J. (1979) Fifteen thousand hours: Secondary schools and their effects on children, London: Open Books

Shipman, M. (1997) The limitations of social research, Harlow: Longman

Slee, R., Weiner, G. and Tomlinson, S. (1998) School Effectiveness for Whom? Challenges to the school effectiveness and school improvement movements, London: Falmer Press

Smith, P. and Street, A. (2006) Analysis of secondary school efficiency: final report, DfES Research Report 788

Thomas, S., Peng, W.J. and Gray, J. (2007) Modelling patterns of improvement over time: value-added trends in English secondary school performance across ten cohorts, Oxford Review of Education, 33, 3, 261-295

Tymms, P. (2003) School composition effects, School of Education, Durham University, January 2003

Tymms, P. and Dean, C. (2004) Value-added in the primary school league tables: a report for the National Association of Head Teachers, Durham: CEM Centre

Tymms, P., Merrell, C., Heron, T., Jones, P., Alborne, S. and Henderson, B. (2008) The importance of districts, School Effectiveness and School Improvement, 19, 3, 261-274

 



[1] I exclude some chapters from this statement of inability to comprehend, most especially the chapter by Brown (1998), which I urge everyone to read.

 

[2] The Department for Children, Schools and Families is responsible for the organisation of schools and children's services in England.

 

[3] Key Stage 2 leads to statutory testing at the end of primary education, usually for pupils aged 10 or 11. Key Stage 4 leads to assessment at age 16, currently the legal age at which a pupil can leave school.

 

[4] Whereas both the ethnicity coefficient and the FSM/ethnicity interaction are 0 for White pupils, they are 29.190 and 20.4600 respectively for Black African pupils, for example.

 

[5] Readers unused to thinking about these issues should note that numbers are generally stored in floating-point form, involving a fractional mantissa and a binary exponent. Representational errors can therefore affect apparently unremarkable figures: most decimal fractions, for example, have no exact binary equivalent, so a value need not look 'awkward' in denary to be stored inexactly.
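
The point can be seen directly in any language that uses standard double-precision storage; the lines below are a minimal Python illustration of representational error in general, not of the CVA model itself.

from decimal import Decimal

print(0.1 + 0.2 == 0.3)      # False: neither side is stored exactly
print(Decimal(0.1))          # the value actually stored for 0.1:
                             # 0.1000000000000000055511151231257827021181583404541015625
print(f"{0.1 + 0.2:.20f}")   # 0.30000000000000004441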

 

[6] Of course, it is the use to which the figures are put that makes this scale of missing data intolerable. If these assumptions were made in order to demonstrate the clear, strong relationship between prior and present attainment, or between student background and attainment, then the scale of error would still be large but perhaps tolerable. Put another way, the strength of such a finding is not called into question by the error component. But what school effectiveness researchers do is strip out all of these clear relationships and then claim that what is left over is a school effect. It is this claim that the paper addresses: the error component must be so large a part of what is left over that the claim is not sustainable.

 

[7] This does not prevent the widespread abuse of random sampling techniques with population data. A couple of recent examples will have to suffice. Hammond and Yeshanew (2007, p.102) base their analysis on a national dataset, but say 'Although no actual samples have been drawn... Statistical checks were carried out and no significant difference between the groups was found'. They then present a table of standard errors for this population data (p.102). They have learnt to use multi-level modelling but clearly forgotten what significance means and what a standard error is. Similarly, Thomas, Peng and Gray (2007) examined data from one school district in England (and so a population, in statistical terms). Yet they report (p.271) that 'the pupil intake and time trend explanatory variables included in the fixed part of the value-added model (Model A) were statistically significant (at 0.05 level)'.

 

[8] Despite this, some purported authorities on school effectiveness routinely propose the use of confidence intervals with school effectiveness scores based on population figures (e.g. Goldstein 2008).

 

[9] One promising avenue is based on regression discontinuity (e.g. Luyten 2006). This has the major advantage over CVA of not being zero-sum in nature. All schools could improve and be recognised for this (and vice versa), and groups of schools or whole districts can be assessed as co-operative units in which the success of any unit adds to the success of any other. Surely something like this is better for now and for the future.
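
The general idea can be sketched in a few lines of Python. The data below are simulated under assumed values (a cutoff age, a five-point effect of an extra year of schooling, and some noise); this is an illustration of the regression-discontinuity logic, not a reproduction of Luyten's TIMSS-95 analysis.

import numpy as np

rng = np.random.default_rng(2)
n = 5_000

age = rng.uniform(9.0, 11.0, n)              # assumed age range spanning two year groups
cutoff = 10.0                                # assumed school-entry cutoff
extra_year = (age >= cutoff).astype(float)   # pupils past the cutoff have had an extra year of schooling

# Assumed data-generating process: scores rise smoothly with age,
# plus a 5-point jump for the extra year of schooling, plus noise.
score = 40 + 3 * (age - cutoff) + 5 * extra_year + rng.normal(0, 4, n)

# Fit score on (age - cutoff) and the extra-year indicator; the coefficient
# on the indicator recovers the discontinuity (the absolute schooling effect)
# at the cutoff, without comparing one school against another.
X = np.column_stack([np.ones(n), age - cutoff, extra_year])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
print(round(beta[2], 2))   # close to the assumed true effect of 5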

 

[10] Or perhaps parents are smarter than policy-makers, realising that current VA scores for any school or phase are historical and tell them only what might have happened if their child had started at that school five years ago.