Peer review in scientific publications - Science and Technology Committee Contents

4 Data management

178.  In paragraphs 21-22 we discussed the need for reviewers to assess manuscripts to ensure that they are technically sound. One question that arose in the course of this inquiry was how far reviewers should be expected to go in assessing technical soundness. In this chapter we discuss the feasibility of reviewing the underlying data behind research and how those data should be managed.

The need to review data

179.  Sense About Science told us that:

The ultimate test of scientific data [...] comes through its independent replication by others; peer review is the system which allows publication of data so that it can be both criticised and replicated. It is a system which encourages people to ask questions about scientific data.[316]

180.  Replication does not usually take place during the peer-review process, although, "in exceptional circumstances, referees will undertake considerable work on their own initiative to replicate an aspect of a paper".[317] Professor Sir Adrian Smith, Director General of Knowledge and Innovation in the Department for Business, Innovation and Skills (BIS), acknowledged that reviewing the underlying data is "rather difficult" where data have come out of laboratories or field studies.[318] He added, however, that replication of "somebody's derivation of a mathematical formula", for example, was possible.[319]

181.  Replication of reported results is only possible if the submitted manuscript contains sufficient information to allow others to reproduce the experiments. Dr Mark Patterson, from the Public Library of Science (PLoS), told us that reproducibility is a "gold standard" that publishers should be aiming for.[320] Dr Philip Campbell, from Nature, explained that "it is part of the editor's and peer-reviewer's responsibilities to ensure that data and materials required for other researchers to replicate or otherwise verify and build on the work are subsequently available to those who need it".[321] Dr Rebecca Lawrence, from Faculty of 1000 Ltd, added that:

within the kind of time frames of peer review, [...] you aren't going to be able to repeat the experiment yourself. All you can do is say that it seems okay; it looks like it makes sense; the analysis looks right; the way they have conducted it makes sense and the conclusions make sense. I think the issue of reproducibility must come after publication [...] That is when people say, "I couldn't reproduce it", or, "I could".[322]

Professor Sir John Beddington, the Government Chief Scientific Adviser, explained that this was indeed the way in which science progresses:

We see all the time in the journals that are published this week that there will be people who have challenged peer-reviewed papers that were published some years ago and pointed out fundamental flaws in them or new evidence that undermines the conclusions of those papers.[323]

182.  However, Dr Fiona Godlee, from BMJ Group, explained that there can be problems with inadequate reporting of data:

We have to acknowledge that peer review is extremely limited in what it can do. We are sent an article, effectively, sometimes with datasheets attached. [...] A vast amount of data do not get through to journals. We know that there is under-reporting, misreporting and a whole host of problems, and journals are not adequate to the task that they are being given to deal with at the moment.[324]

183.  Dr Mark Patterson explained what PLoS did when problems of under-reporting arose:

in general, we have a requirement that, in the interests of reproducibility, you must make the data available. We have had cases where readers have reported to us a problem with getting hold of data from an author published in a PLoS journal. We follow that up. We talk to the author and ask what the issues are. In the majority of cases the author will deposit their data and it is a misunderstanding, almost, that they haven't deposited their data in the appropriate repository, or whatever it is that is done in that particular community.[325]

184.  We conclude that reproducibility should be the gold standard that all peer reviewers and editors aim for when assessing whether a manuscript has supplied sufficient information about the underlying data and other materials to allow others to repeat and build on the experiments.

Depositing data during the peer-review process

185.  The body of data under review can often be large and/or complex. An increasing challenge is how to make these large or complex datasets available for reviewers to assess confidentially.[326] Dr Andrew Sugden, from Science, told us that "currently no databases allow secure posting for the purposes of peer-review, and some authors are unwilling to release data prior to publication".[327]

186.  PLoS explained that:

In some fields—for example, genetics and molecular biology—there are well-established curated databases where data can be deposited and linked to particular research articles. Examples of such databases include those available at the European Bioinformatics Institute in Hinxton, UK. The curators who run the databases perform critical quality control checks analogous to the technical assessment of research articles.[328]

These quality control checks are independent of the peer-review process involved in assessing the related research article.

187.  The issue of quality control is an important one. Dr Andrew Sugden explained that reviewing data "that is many times the size of the submitted text is a burden to reviewers" and that "standards for reporting and presenting large data sets that allow common analysis tools could help greatly".[329] BioMed Central agreed, adding that:

Capturing the vast amount of data that is continuously generated and ensuring consistent data deposition according to agreed formats and nomenclatures will be crucial to enabling smooth meta-analyses of datasets from different databases.[330]

188.  The area of data deposition is evolving quickly. Dr Mark Patterson, from PLoS, highlighted a new project called Dryad.[331] This is an international repository of data underlying peer-reviewed articles in the basic and applied biosciences, governed by a consortium of journals.[332] Dr Patterson explained how Dryad works:

The idea is that this is a place where you can deposit your data set [...] and where you can give privileged access to reviewers, for example, during the peer review process and then make the data available once the article is published.[333]

Editors and journals are aiming to "facilitate their authors' data archiving by setting up automatic notifications to Dryad of accepted manuscripts", and thereby streamlining the process for depositing data after publication.[334] Dr Patterson told us that Dryad "is developing a kind of generic database for data sets [...] particularly in the fields of ecology and evolution [...] but they are already talking of expanding into other areas".[335] There is also an ongoing project, DryadUK, funded by the Joint Information Systems Committee (JISC), to develop a mirror site in the UK.[336]

189.  If reviewers and editors are to assess whether authors of manuscripts are providing sufficient accompanying data, it is essential that they are given confidential access to relevant data associated with the work during the peer-review process. This can be problematic in the case of the large and complex datasets which are becoming increasingly common. The Dryad project is an initiative seeking to address this. If it proves successful, funding should be sought to expand it to other disciplines. Alternatively, we recommend that funders of research and publishers work together to develop similar repositories for other disciplines.

Technical and economic challenges of data storage

190.  Dr Malcolm Read, from JISC, cautioned that "there are technical and economic problems" associated with making data available in the long term.[337] He told us that "keeping [data] available, possibly in perpetuity, could end up as a cost that the sector simply could not afford",[338] and explained that different approaches would be required depending on the type of data:

Keeping available all the outputs of the experiments on the Large Hadron Collider is just infeasible. Other data, such as environmental data, must be kept permanently available. I think that should be made more open. Of course, you can't repeat an earthquake and that data must never be lost. A lot of social data in terms of longitudinal studies make sense only if the entire length of the study is available. In some areas of science the data is produced by computers and programs. In that case, if the data is very large, an option might be simply to re-run the program.[339]

191.  Sir Mark Walport, from the Wellcome Trust, agreed that there are "major costs" involved.[340] He added that the "costs of storing the data may in the future exceed the costs of generating it" and that this was an issue for research funders because they fund the research and so have to help with the storage.[341] He added that "our funding is a partnership between the charity sector and the Government and [data storage] is a shared expenditure".[342] Professor Sir Adrian Smith acknowledged that cost was "a real problem".[343] However, given how cheap data storage itself has become, we consider that these costs result from the sheer growth in the quantity of data being generated.[344]

192.  Dr Philip Campbell, from Nature, provided an example of the potential costs involved in making data, software and codes available:

I was talking to a researcher the other day and he had been asked to make his code accessible. He had had to go to the Department of Energy for a grant to make it so. He was asking for $300,000, which was the cost of making that code completely accessible and usable by others. In that particular case the grant was not given. It is a big challenge in computer software and we need to do better than we are doing.[345]

He added, however, that this should not prevent others from validating the research by attempting to reproduce the work; for example, "you can allow people to come into your laboratory and use the computer system and test it".[346]

193.  Dr Malcolm Read explained in more detail why making software code available can be difficult:

if you are talking about stuff running on so-called super-computers, you have to know quite a lot about the machine and the environment it is running on. It is very difficult to run some of those top-end computer applications, even if, of course, they are prepared to make their code available.[347]

He added that the way to get around this problem was to ensure that authors "make clear the nature of the program they are running and the algorithms".[348] Dr Read explained that:

A computer will not have any value beyond the way it is programmed. As long as they define the input conditions, as it were, and what the program is designed to do, you should be able to trust the outputs. That would be no different from any statistical test that is run on a data set, so long as you say what the test is. You then start to get down to the accuracy of the data itself, which is perhaps a more fundamental issue than the software or statistical test that is being run on it. I would say that the availability of the research data is a more important issue because then, of course, other researchers could run different types of algorithms on different types of computer on that data. I think access to the data is more fundamental.[349]

A culture of openness

194.  Access to data is fundamental if researchers are to reproduce and thereby verify results that are reported in the literature. Professor John P. A. Ioannidis, from the University of Ioannina School of Medicine, stated in a recent Scientific American article that:

The best way to ensure that test results are verified would be for scientists to register their detailed experimental protocols before starting their research and disclose full results and data when the research is done. At the moment, results are often selectively reported, emphasizing the most exciting among them, and outsiders frequently do not have access to what they need to replicate studies. Journals and funding agencies should strongly encourage full public availability of all data and analytical methods for each published paper.[350]

195.  In response, Professor Rick Rylance, from Research Councils UK (RCUK), stated that:

I endorse the broad principles of that. The one slight reservation I would have is that, quite often, research is a process of discovery and you don't quite know at the beginning what the protocols and procedures are that you are going to use, particularly in my domain. I would have a slight reservation about that, but the principles are right.[351]

196.  Many of the individuals we heard from were broadly in favour of the principle of openness with regard to data availability post-publication.[352] We were told that:

the principles of openness in science, of making data available and open, are something that the Wellcome Trust and other funders of biomedical research around the world are fully behind and completely supportive of.[353]

197.  Professor Sir Adrian Smith, from BIS, explained the current situation and the Government's position on data availability:

There is a great movement now and a recognition of openness and transparency, which has always been implicit as a fundamental element of the scientific process. But the more we collect large datasets, you have to give other people, as part of the challenge process, the ability to revisit that data and see what they make of it with openness and transparency. There is general support these days for the presumption that the research, the associated data and if you have written a computer code to assess it, should all be available and up for challenge and testing validation. In fact, explicitly the Research Councils encourage that, as Government Departments do. However, there can be complex and legitimate reasons for not necessarily, at least in the short term, being that transparent. An awful lot of policy in recent years has meant that we have been trying to lever more out of public investment by joint working with business and industry and levering additional funding. Once you get into that territory, you do have commercial and intellectual property constraints on a temporary basis at least, for openness and transparency. The presumption is that, unless there is a strong reason otherwise, everything should be out there and available.[354]

198.  Sir Adrian added that "there will always be issues of personal data protection, commercial interests and intellectual property and national security, so the situation is quite complex".[355] Indeed, Dr Malcolm Read, from JISC, explained that "a blanket mandate on open data might not be feasible but the predisposition should be to make data openly available".[356] David Sweeney, from HEFCE, agreed that consideration needed to be given to "the particular circumstances and the sensitivity".[357]

199.  Sir Adrian explained that "different communities, different cultures and different forms of data pose different issues".[358] One example where making data available could be challenging is where confidential patient data are involved in biomedical research. The BMJ Group stated that:

The Wellcome Trust and other major international funders have called for public health researchers to make studies' raw data available. Annals of Internal Medicine, the BMJ, BMJ Open, the PLoS journals and several BMC journals—among others—actively encourage authors to share data in online repositories with necessary safeguards to protect patient confidentiality.[359]

However, "if you are dealing with clinical material then the confidentiality of participants is paramount. You have to manage data so that they are appropriately anonymised and people cannot be revealed".[360] Dr Fiona Godlee did not see confidentiality as a problem:

when one is talking about large datasets, confidentiality has already been dealt with, and we should not use that as an excuse for not looking at [data deposition]. There are no doubt practical issues, but […] nationally, we ought to have systems for data depositioning. The practical problems will be resolved, as with trial registration, which seemed impossible five or 10 years ago, and it is now routine.[361]

200.  Dr Michaela Torkar explained in more detail the challenges faced by publishers and how these might be overcome:

It is only if the standards are well established and agreed on by the community that you can really enforce [data deposition] and insist on it as a publisher. It becomes more difficult when, say, databases are not quite ready to accept all of the submissions or formats. That becomes a real barrier for authors. They cannot publish because the publisher insists on it. I think there is a lot of responsibility on the publishers to interact with different communities to establish the right databases and standards and where the limitations are and to make it mandatory in some cases and in others encourage submission and deposition, in particular. I think it depends very much on the communities.[362]

201.  If mandatory data deposition is problematic, the question becomes how to encourage, rather than enforce, it. Dr Mark Patterson, from PLoS, told us that:

First, it would be really helpful for publishers to include some kind of statement about data availability so that it is clear. How do you get hold of this data? Are there any restrictions in terms of accessing it because of the size of the data in some fields or whatever? Secondly, there is an opportunity to incentivise the sharing of data by giving greater credit and finding mechanisms to reward researchers who do that to assess the impact of that sharing as well. Rather than focusing everything on what they have published in whatever journal, to start thinking about different kinds of outputs and their value.[363]

Dr Malcolm Read, from JISC, agreed that researchers "would deserve credit and recognition for that".[364]

202.  We note that the Royal Society launched its Science as a public enterprise project in May 2011.[365] This will look at how scientific data should best be managed and may explore some of the issues highlighted in this chapter.

203.  Access to data is fundamental if researchers are to reproduce, verify and build on results that are reported in the literature. We welcome the Government's recognition of the importance of openness and transparency. The presumption must be that, unless there is a strong reason otherwise, data should be fully disclosed and made publicly available. In line with this principle, where possible, data associated with all publicly funded research should be made widely and freely available. Funders of research must coordinate with publishers to ensure that researchers disclose their data in a timely manner. The work of researchers who expend time and effort adding value to their data, to make it usable by others, should be acknowledged as a valuable part of their role. Research funders and publishers should explore how researchers could be encouraged to add this value.

316   Ev 75, para 11

317   Ev 87, para 13 [Philip Campbell]

318   Q 297

319   As above

320   Q 203 [Dr Mark Patterson]

321   Ev 87, para 16

322   Q 206

323   Q 296

324   Q 106

325   Q 203

326   Ev 93, para 23; and Ev 141, para 27 [Dr Andrew Sugden, Science]

327   Ev 141, para 27

328   Ev 80, para 29

329   Ev 141, para 27

330   Ev 108

331   Q 206

332   "About Dryad", Dryad

333   Q 206

334   "About Dryad", Dryad

335   Q 206

336   "About Dryad", Dryad

337   Q 208

338   As above

339   As above

340   Q 280

341   As above

342   As above

343   Q 300

344   For example, "Taking a Hard Look At Storage Costs", 8 August 2008

345   Q 136

346   Q 137

347   Q 202

348   Q 203

349   As above

350   J. P. A. Ioannidis, "An epidemic of false claims", Scientific American, June 2011

351   Q 277

352   For example: Q 136 [Dr Philip Campbell]; Q 207 [Dr Mark Patterson]; and Q 278 [David Sweeney]

353   Q 277 [Sir Mark Walport]

354   Q 298

355   Q 299

356   Q 208

357   Q 279

358   Q 300

359   Ev 73, para 19

360   Q 278 [Sir Mark Walport]

361   Q 157

362   Q 207

363   Q 208

364   As above

365   "Royal Society launches study on openness in science", Royal Society Press Notices, 13 May 2011


© Parliamentary copyright 2011
Prepared 28 July 2011