Session 2010-11
Peer review

Written evidence submitted by Dr Ralph Kenna and Professor Bertrand Berche

Summary: The UK's Research Assessment Exercise (RAE) and its replacement, the Research Excellence Framework (REF), are based on peer review. They are designed to scrutinise academic areas to determine the quality of groups of researchers at publicly funded institutes. A conspicuous flaw of such exercises is that there is no mechanism to compare peer review across academic disciplines, and calls have been made to find a remedy. We have answered that call and propose a simple, systematic solution. An academic paper on this theme has been accepted for publication and will appear in the June issue of the journal Research Evaluation. Here we (i) summarise the background to our work, (ii) summarise a quantitative comparison of peer-review stringency levels in different disciplines in the UK, (iii) compare the RAE to its French equivalent and (iv) outline ways in which this work can be taken forward.

1. Introduction

1.1. In this submission, we present quantitative evidence that peer review is not consistent across academic disciplines in the UK. Using the results from RAE 2008, we show, for example, that levels of stringency can vary by 50% or more. This has serious implications for quality-related (QR) funding. Having quantified the problem, we then offer a solution: a simple method to normalise peer-review scores across disciplines. No other normalisation procedure currently exists which can deal with peer-review evaluations of research excellence.

1.2. We also compare RAE 2008 with its French counterpart. We show that they are consistent, but that the British system is finer-grained than the French one.

2. A new mathematical model for research quality

2.1. Our analysis is based on a mathematical model which relates the quality of a research group [1] to the quantity of researchers it contains. Our model shows that research quality increases linearly with quantity up to a certain group size, which we call the upper critical mass. The upper critical mass is interpreted as the average maximum number of colleagues with whom an individual can meaningfully interact. Beyond the upper critical mass, communication problems begin to set in and the group starts to fragment into subgroups. Adding still more staff does not improve research quality because the new researchers attach to one of the subgroups and do not therefore significantly enhance the strength of the entire group.

2.2. This notion of upper critical mass is very different from the traditional, intuitive notion of critical mass which is frequently mentioned in the policy literature. That older notion is of a threshold group size beyond which research quality noticeably improves. No evidence has ever been presented in the literature to support such a notion, and no proper, quantitative definition of it has ever been given.

2.3. However, our model shows that a lower critical mass also exists. This is the average number of researchers required for a group to be stable. The lower critical mass is about half the value of the upper critical mass. Both upper and lower critical masses are discipline dependent. For a solitary research discipline such as pure mathematics the lower critical mass is about 2. For a collaborative area like experimental physics, it is about 13.

2.4. We have established and tested our mathematical model in Ref.[1]. Using results from RAE 2008, we have determined the critical masses for a multitude of different disciplines in Ref.[2].
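As an illustrative sketch only, the behaviour described above can be written as a piecewise-linear dependence of group quality on group size with a breakpoint at the upper critical mass; the precise fitted forms and parameter values are those established in Refs.[1,2], and the symbols below are introduced here purely for exposition:

\[
  s(N) \;\approx\;
  \begin{cases}
    a_1 + b_1 N, & N \le N_c,\\
    a_2 + b_2 N, & N \ge N_c,
  \end{cases}
  \qquad b_2 < b_1, \qquad N_k \approx \tfrac{1}{2} N_c ,
\]

with continuity at the breakpoint. Here s(N) denotes the measured quality of a group of N researchers, N_c is the upper critical mass (the shallow slope b_2 reflects the near-saturation described in paragraph 2.1), and N_k is the lower critical mass of paragraph 2.3. The slopes, intercepts and critical masses are discipline-dependent fitted quantities.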
Next we discuss the implications for peer review.

3. Comparison of stringency in peer review across disciplines

3.1. The UK's RAE and its replacement, the REF, are based on peer review. They are designed to scrutinise academic research groups to determine the proportion of research which falls into various quality levels. On this basis, it is possible to compare different groups of researchers based at different universities within given disciplines.

3.2. A conspicuous flaw of the RAE is that it does not employ a robust, formal mechanism to normalise peer review across different academic disciplines. In an interview in Times Higher Education (December 2009), Professor Dame Julia Higgins lamented this problem and called for "serious thinking" in advance of the REF. For example, a literal interpretation of the results of RAE 2008 would signal that UK institutions perform better in media studies than in physics. Although this interpretation is rejected in academic circles, it has to be taken seriously at funding level because there is no current alternative. Thus, the absence of a method to normalise peer-review stringency levels across disciplines has serious implications for the manner in which funding is distributed.

3.3. Certainly a straightforward normalisation on the basis of averaging over all research teams scrutinised is not reliable, as different disciplines may have different strengths in a given country. Also, while methods exist to normalise bibliometric measures (citation counts etc.) across disciplines, these remain controversial and are considered inappropriate for use at RAE/REF, which (a) focuses on research quality rather than research impact and (b) is peer-review rather than metric based.

3.4. Therefore, in order to avoid meaningless comparisons between disciplines in future peer-review research evaluation frameworks and to ensure fairer distributions of resources, the British funding bodies were called upon to tackle this issue seriously in advance of Britain's next evaluation exercise by finding a way to normalise results across disciplines. Reference [3] (and this submission) is an answer to that call.

4. Normalisation of peer-review quality measures across research disciplines

4.1. According to our mathematical model, plots of the dependency of group quality on group quantity are expected to exhibit saturation to the right of a breakpoint, where research performance is maximal. Further increase in group size much beyond the breakpoint (or upper critical mass) does not significantly increase group quality. This observation forms the crux of our normalisation scheme: it is sensible to peg maximally performing groups in different disciplines at similar levels.

4.2. In Fig.1(a) the research quality scores from RAE 2008 are plotted against team sizes for three different areas, namely computer science and informatics, physics, and biology. The curves are best fits or trend-lines to the data. The disparity between computer science and the other two disciplines is evident. The figure appears to indicate that computer-science teams are better than both physics and biology teams, which are comparable with each other. In fact, if taken literally (which it is for QR funding purposes), Fig.1(a) indicates that the biggest and best computer science teams in the UK are performing at levels about 50% above the biggest and best physics and biology teams. On the other hand, international comparisons indicate that the UK is particularly strong in biology.
So a more likely explanation for the misalignment apparent in Fig.1(a) is a higher degree of stringency in the RAE peer evaluation panels for physics and biology than for computer science.

4.3. This conclusion is reinforced in Fig.2, where the trend-lines are compared for a variety of different research areas. Again, panel (a) shows the results of peer review in the different disciplines and indicates varying degrees of stringency. These variances are compensated for in panel (b), using the normalisation process detailed in Reference [3].

Figure 1: (a) Peer-review quality measures plotted against group size for computer science (+), physics (×), and biology (∗) as measured at RAE 2008. The three curves are the corresponding fits or trend-lines. Computer science tends to be rated above the other two disciplines due to less stringent peer review. (b) The same data after normalisation, the effect of which is to bring the computer science results into line with those of physics and biology.

4.4. Our model offers a simple way to normalise the data to remove the effects of variable stringency of peer review in different disciplines. This process is described in Ref.[3] and converts Fig.1(a) into Fig.1(b). In Fig.1(b), the computer science results are now better aligned with those from the other two disciplines, and there is a greater degree of overlap between the quality scores for the three areas. Moreover, the reasonable alignment between physics and biology evident in Fig.1(a) is not adversely affected by the normalisation process (in fact it is improved), and the strength of the best biology teams remains high after normalisation.

Fig.2: (a) The three fits of Fig.1(a) together with those for English language & literature; philosophy & theology; history; archaeology; architecture & planning; law; politics & international studies; geography; Earth & environmental studies; medical sciences; education; and art & design. (b) The same fits after normalisation.

4.5. Applying the normalisation procedure to the data fitted in Fig.2(a) converts the trend-lines to those in Fig.2(b). This normalises peer-reviewed RAE results for physics, biology, computer science, geography, Earth and environmental studies, medical sciences, archaeology, architecture and planning, law, politics and international studies, education, English, philosophy, theology, history, and art & design. Normalisations for further disciplines are presented in Ref.[3].

4.6. The results of Refs.[2,3] are summarised in Table 1. The table lists the upper and lower critical masses for a variety of research areas. It also lists the normalisation factors for these disciplines at RAE 2008. These factors are the amounts by which the peer-review quality estimates should be rescaled to compensate for varying degrees of stringency.

Table 1: The lower and upper critical mass estimates for various research disciplines from Ref.[2], along with the normalisation factor which indicates the amount by which the peer-review quality scores should be rescaled to compensate for varying degrees of stringency at RAE 2008.
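To make the pegging idea of paragraph 4.1 concrete, the following short sketch (in Python) illustrates how a piecewise-linear trend with a breakpoint could be fitted to quality-versus-size data for each discipline, and how a normalisation factor could then be read off as the ratio of a common reference level to each discipline's saturated quality. This is a minimal illustration under assumed data, hypothetical variable names and an arbitrary reference level; it is not the exact procedure, nor the fitted values, of Ref.[3].

import numpy as np
from scipy import optimize

def piecewise_linear(N, a1, b1, Nc, b2):
    # Linear rise up to the breakpoint Nc, then a second (near-flat) branch;
    # the second intercept a2 is fixed by continuity at N = Nc.
    a2 = a1 + (b1 - b2) * Nc
    return np.where(N <= Nc, a1 + b1 * N, a2 + b2 * N)

def fit_discipline(sizes, scores):
    # Rough initial guess: intercept at the smallest score, unit slope,
    # breakpoint at the median group size, shallow slope beyond it.
    p0 = [min(scores), 1.0, float(np.median(sizes)), 0.1]
    params, _ = optimize.curve_fit(piecewise_linear, sizes, scores, p0=p0, maxfev=10000)
    return params

def normalisation_factor(params, reference_level):
    # The saturated quality is the fitted value at the breakpoint; the factor
    # rescales the discipline so that this level matches the chosen reference.
    a1, b1, Nc, b2 = params
    saturated = a1 + b1 * Nc
    return reference_level / saturated

# Hypothetical usage, assuming 'data' maps a discipline name to arrays of
# group sizes and quality scores (e.g. extracted from RAE 2008 results):
# factors = {d: normalisation_factor(fit_discipline(N, s), reference_level=60.0)
#            for d, (N, s) in data.items()}
# normalised = {d: s * factors[d] for d, (N, s) in data.items()}

In this sketch the choice of reference level is arbitrary; only the relative factors between disciplines matter for compensating differing stringencies.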
5. Comparison between RAE 2008 and the equivalent French system

5.1. In Ref.[1], we also compare the RAE to the equivalent French system, which is performed by l'Agence d'Évaluation de la Recherche et de l'Enseignement Supérieur (AERES). In the 2008 evaluation, AERES used a method which is considered more precise than previously, and this facilitates comparison with the British approach. However, since only 10 traditional universities were evaluated, the amount of data available for the French system is lower than for the UK one. Furthermore, only a global mark is attributed to cumulated research groupings at the level of faculties, so a fine-grained analysis at the level of departments is not possible. Nonetheless, we can translate the AERES grades A+, A, B and C into RAE grades 4*, 3*, 2* and 1* and analyse the French system for the hard sciences and the life sciences to compare with the British equivalent. In Fig.3, standardised quality scores are plotted against standardised group sizes for both systems. A convincing degree of overlap is evident. Thus the systems yield comparable results, although the RAE is far more detailed than the AERES system.

Fig.3: Comparisons between France's AERES evaluation system and the UK's RAE for (a) the hard sciences and (b) the life sciences. The French data are represented by circles and the integrated British data by crosses.

6. Conclusions

6.1. Assessment exercises such as the UK's RAE/REF and France's AERES system use peer evaluation to perform comparisons between research teams within given academic disciplines. However, the absence of a method to compare across disciplines has been a fundamental flaw of such exercises, and calls have been issued to remedy this flaw. This has important implications for the public funding of academic research. Paper [3] and this submission are a response to such calls. Ref.[3] contains the only method in existence to normalise peer-review measures of research quality across disciplines. (The notion of critical mass as defined in Refs.[1,2] is integral to this approach.) Such normalisation is required to compensate for different degrees of stringency in expert evaluation within disciplines and is essential for meaningful comparison between research groups in different areas. Since the British QR funding system is completely reliant on the results of the RAE/REF, an uncomplicated, robust normalisation system is essential in order to ensure fair allocation of financial support for research.

6.2. Further work is required to capture the critical masses and normalisation factors of some disciplines which were not included in Refs.[1-3]. These include the engineering disciplines, where we believe strong links with industry contaminate the data. These data have to be cleaned to bring them under control. Notwithstanding this, an approach along the lines summarised here offers a general basis for inter-disciplinary normalisation, while intra-disciplinary peer-review measurements of research quality may continue to be performed by subject experts.

7. References

[1] Ralph Kenna and Bertrand Berche, The extensive nature of group quality, Europhysics Letters 90 (2010) 58002 [http://iopscience.iop.org/0295-5075/90/5/58002].

[2] Ralph Kenna and Bertrand Berche, Critical mass and the dependency of research quality on group size, Scientometrics 86 (2011) 527-540 [http://www.springerlink.com/content/q233208g91124135].
[3] Ralph Kenna and Bertrand Berche, Normalization of peer-evaluation measures of group research quality across academic disciplines, to appear in Research Evaluation, DOI 10.3152/095820211X12941371876463 [http://de.arxiv.org/abs/1006.3863].

8. Declaration of Interests

8.1. Dr Ralph Kenna is Deputy Director of the Applied Mathematics Research Centre, Coventry University, Coventry, CV1 5FB, England. He has authored over 50 papers in theoretical physics, about 40 of which were peer reviewed. He is a member of the Editorial Board of the journal Condensed Matter Physics and is a peer reviewer for 20 different journals, including Physical Review and Physical Review Letters, Nuclear Physics, and the Journal of Physics. He also reviews grant proposals for the EPSRC, the Royal Society and the Chilean Research Fund Council (FONDECYT).

8.2. Professor Bertrand Berche is Head of the Department of Physics, Institut Jean Lamour (Laboratoire associé au CNRS UMR 7198), CNRS – Nancy Université – UPVM, B.P. 70239, F-54506 Vandoeuvre lès Nancy Cedex, France. He has produced over 80 theoretical physics papers, of which 60 were peer reviewed. He is also on the Editorial Board of Condensed Matter Physics and reviews submissions to 20 different journals, mainly for the American Physical Society, the Institute of Physics, the European Physical Society and Elsevier. He is Head of the Condensed Matter Division of the French University National Council (CNU), is an expert for AERES, has participated in the evaluation of 6 different institutes over the last three years, and has served on many professor-position committees for different universities in France.

Dr Ralph Kenna and Professor Bertrand Berche

4 March 2011

[1] Throughout this document we use the word "group" to denote a collection of researchers at a given university active in a common discipline. In the UK, this may be taken to mean those who were submitted to RAE 2008 in a given unit of assessment. A group is not necessarily synonymous with a department, because not all department members may be research active (not all may have been submitted to the RAE) or a submission may draw from researchers interacting across different departments. The word "group" in this sense is also not synonymous with a research centre, as such entities may involve more than one department (at the RAE, research centres may have been involved in submissions across more than one area).
© Parliamentary copyright | Prepared 17th March 2011