4 Data management
178. In paragraphs 21-22 we discussed the need
for reviewers to assess manuscripts to ensure that they are technically
sound. One of the questions that arose in the course of this inquiry
was: how far should reviewers be expected to go to assess technical
soundness? In this chapter we discuss the feasibility of reviewing
the underlying data behind research and how those data should
be managed.
The need to review data
179. Sense About Science told us that:
The ultimate test of scientific data [...] comes
through its independent replication by others; peer review is
the system which allows publication of data so that it can be
both criticised and replicated. It is a system which encourages
people to ask questions about scientific data.[316]
180. Replication does not usually take place
during the peer-review process, although, "in exceptional
circumstances, referees will undertake considerable work on their
own initiative to replicate an aspect of a paper".[317]
Professor Sir Adrian Smith, Director General of Knowledge and
Innovation in the Department for Business, Innovation and Skills
(BIS), acknowledged that reviewing the underlying data is "rather
difficult" where data have come out of laboratories or field
studies.[318] He added,
however, that replication of "somebody's derivation of a
mathematical formula", for example, was possible.[319]
181. Replication of reported results is only
possible if the submitted manuscript contains sufficient information
to allow others to reproduce the experiments. Dr Mark Patterson,
from the Public Library of Science (PLoS), told us that reproducibility
is a "gold standard" that publishers should be aiming
for.[320] Dr Philip
Campbell, from Nature, explained that "it is part
of the editor's and peer-reviewer's responsibilities to ensure
that data and materials required for other researchers to replicate
or otherwise verify and build on the work are subsequently available
to those who need it".[321]
Dr Rebecca Lawrence, from Faculty of 1000 Ltd, added that:
within the kind of time frames of peer review, [...]
you aren't going to be able to repeat the experiment yourself.
All you can do is say that it seems okay; it looks like it makes
sense; the analysis looks right; the way they have conducted it
makes sense and the conclusions make sense. I think the issue
of reproducibility must come after publication [...] That is when
people say, "I couldn't reproduce it", or, "I could".[322]
Professor Sir John Beddington, the Government Chief
Scientific Adviser, explained that this was indeed the way in
which science progresses:
We see all the time in the journals that are published
this week that there will be people who have challenged peer-reviewed
papers that were published some years ago and pointed out fundamental
flaws in them or new evidence that undermines the conclusions
of those papers.[323]
182. However, Dr Fiona Godlee, from BMJ Group,
explained that there can be problems with inadequate reporting
of data:
We have to acknowledge that peer review is extremely
limited in what it can do. We are sent an article, effectively,
sometimes with datasheets attached. [...] A vast amount of data
do not get through to journals. We know that there is under-reporting,
misreporting and a whole host of problems, and journals are not
adequate to the task that they are being given to deal with at
the moment.[324]
183. Dr Mark Patterson explained what PLoS did
when problems of under-reporting arose:
in general, we have a requirement that, in the interests
of reproducibility, you must make the data available. We have
had cases where readers have reported to us a problem with getting
hold of data from an author published in a PLoS journal. We follow
that up. We talk to the author and ask what the issues are. In
the majority of cases the author will deposit their data and it
is a misunderstanding, almost, that they haven't deposited their
data in the appropriate repository, or whatever it is that is
done in that particular community.[325]
184. We conclude that reproducibility
should be the gold standard that all peer reviewers and editors
aim for when assessing whether a manuscript has supplied sufficient
information about the underlying data and other materials to
allow others to repeat and build on the experiments.
Depositing data during the peer-review
process
185. The body of data under review is often large,
complex, or both. An increasing challenge is how to
make these large or complex datasets available for reviewers to
assess confidentially.[326]
Dr Andrew Sugden, from Science, told us that "currently
no databases allow secure posting for the purposes of peer-review,
and some authors are unwilling to release data prior to publication".[327]
186. PLoS explained that:
In some fields (for example, genetics and molecular
biology) there are well-established curated databases where
data can be deposited and linked to particular research articles.
Examples of such databases include those available at the European
Bioinformatics Institute in Hinxton, UK.
The curators who run the
databases perform critical quality control checks analogous to
the technical assessment of research articles.[328]
These quality control checks are independent of the
peer-review process involved in assessing the related research
article.
187. The issue of quality control is an important
one. Dr Andrew Sugden explained that reviewing data "that
is many times the size of the submitted text is a burden to reviewers"
and that "standards for reporting and presenting large data
sets that allow common analysis tools could help greatly".[329]
BioMed Central agreed, adding that:
Capturing the vast amount of data that is continuously
generated and ensuring consistent data deposition according to
agreed formats and nomenclatures will be crucial to enabling smooth
meta-analyses of datasets from different databases.[330]
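To illustrate what agreed formats make possible in practice, a deposition pipeline can mechanically check each dataset against the community schema before accepting it, so that conforming datasets can later be pooled. The following sketch is ours rather than BioMed Central's: the column names and rules are hypothetical stand-ins for a real community standard.

```python
import csv

# Hypothetical community-agreed format: every deposited dataset must
# carry these columns, with counts recorded as integers.
REQUIRED_COLUMNS = {"site_id", "sample_date", "species", "count"}

def validate_deposit(path):
    """Return a list of problems; an empty list means the dataset
    conforms and can be pooled with others in a meta-analysis."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        return [
            f"line {lineno}: count {row['count']!r} is not an integer"
            for lineno, row in enumerate(reader, start=2)
            if not row["count"].strip().isdigit()
        ]

issues = validate_deposit("survey_data.csv")  # hypothetical submission
print("\n".join(issues) if issues else "dataset conforms to the agreed format")
```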
188. The area of data deposition is evolving
quickly. Dr Mark Patterson, from PLoS, highlighted a new project
called Dryad.[331]
This is an international repository of data underlying peer-reviewed
articles in the basic and applied biosciences, governed by a consortium
of journals.[332] Dr
Patterson explained how Dryad works:
The idea is that this is a place where you can deposit
your data set [...] and where you can give privileged access to
reviewers, for example, during the peer review process and then
make the data available once the article is published.[333]
Editors and journals are aiming to "facilitate
their authors' data archiving by setting up automatic notifications
to Dryad of accepted manuscripts", thereby streamlining
the process for depositing data after publication.[334]
Dr Patterson told us that Dryad "is developing a kind of
generic database for data sets [...] particularly in the fields
of ecology and evolution [...] but they are already talking of
expanding into other areas".[335]
There is also an ongoing project, DryadUK, funded by the Joint
Information Systems Committee (JISC), to develop a mirror site
in the UK.[336]
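The workflow Dr Patterson describes (deposit during review, privileged access for referees, public release on publication) maps naturally onto a simple repository API. The Python sketch below is purely illustrative: the endpoint, field names and token mechanism are our assumptions and do not describe Dryad's actual interface.

```python
import requests

REPO = "https://repository.example.org/api"  # hypothetical endpoint, not Dryad's

def deposit_for_review(data_path, manuscript_id, api_key):
    """Deposit a dataset embargoed until publication and obtain a
    time-limited token giving referees confidential access."""
    with open(data_path, "rb") as f:
        resp = requests.post(
            f"{REPO}/datasets",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={"manuscript": manuscript_id, "visibility": "embargoed"},
        )
    resp.raise_for_status()
    dataset_id = resp.json()["id"]
    token = requests.post(
        f"{REPO}/datasets/{dataset_id}/review-tokens",
        headers={"Authorization": f"Bearer {api_key}"},
    ).json()["token"]
    return dataset_id, token  # the token is shared with reviewers only

def release_on_publication(dataset_id, api_key):
    """Lift the embargo once the associated article is published."""
    requests.patch(
        f"{REPO}/datasets/{dataset_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"visibility": "public"},
    ).raise_for_status()
```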
189. If reviewers and editors
are to assess whether authors of manuscripts are providing sufficient
accompanying data, it is essential that they are given confidential
access to relevant data associated with the work during the peer-review
process. This can be problematical in the case of the large and
complex datasets which are becoming increasingly common. The Dryad
project is an initiative seeking to address this. If it proves
successful, funding should be sought to expand it to other disciplines.
Alternatively, we recommend that funders of research and publishers
work together to develop similar repositories for other disciplines.
Technical and economic challenges
of data storage
190. Dr Malcolm Read, from JISC, cautioned that
"there are technical and economic problems" associated
with making data available in the long term.[337]
He told us that "keeping [data] available, possibly in perpetuity,
could end up as a cost that the sector simply could not afford",[338]
and explained that different approaches would be required depending
on the type of data:
Keeping available all the outputs of the experiments
on the Large Hadron Collider is just infeasible. Other data, such
as environmental data, must be kept permanently available. I think
that should be made more open. Of course, you can't repeat an
earthquake and that data must never be lost. A lot of social data
in terms of longitudinal studies make sense only if the entire
length of the study is available. In some areas of science the
data is produced by computers and programs. In that case, if the
data is very large, an option might be simply to re-run the program.[339]
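Dr Read's final point, that very large computed datasets can sometimes be regenerated rather than stored, relies on the computation being deterministic. A minimal sketch of the idea, with a toy generator standing in for a real simulation: archive only the small provenance record and re-run on demand.

```python
import json
import random

def simulate(seed, n):
    """Stand-in for an expensive computation: with a fixed seed the
    output is reproducible, so the data itself need not be archived."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(n)]

# Archive this small record instead of the (possibly huge) output.
provenance = {"program": "simulate", "version": "1.0", "seed": 42, "n": 1_000_000}
with open("provenance.json", "w") as f:
    json.dump(provenance, f)

# Anyone can later regenerate an identical dataset from the record.
with open("provenance.json") as f:
    p = json.load(f)
data = simulate(p["seed"], p["n"])
```

In practice this also requires pinning the software environment, since numerical output can vary across library versions and hardware.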
191. Sir Mark Walport, from the Wellcome Trust,
agreed that there are "major costs" involved.[340]
He added that the "costs of storing the data may in the future
exceed the costs of generating it" and that this was an issue
for research funders because they fund the research and so have
to help with the storage.[341]
He added that "our funding is a partnership between the charity
sector and the Government and [data storage] is a shared expenditure".[342]
Professor Sir Adrian Smith acknowledged that cost was "a
real problem".[343]
However, given how cheap data storage has become, we consider
that this cost is a result of the sheer growth in quantities of
data.[344]
192. Dr Philip Campbell, from Nature,
provided an example of the potential costs involved in making
data, software and codes available:
I was talking to a researcher the other day and he
had been asked to make his code accessible. He had had to go to
the Department of Energy for a grant to make it so. He was asking
for $300,000, which was the cost of making that code completely
accessible and usable by others. In that particular case the grant
was not given. It is a big challenge in computer software and
we need to do better than we are doing.[345]
He added, however, that this should not prevent others
from validating the research by attempting to reproduce the work,
for example, "you can allow people to come into your laboratory
and use the computer system and test it".[346]
193. Dr Malcolm Read explained in more detail
why making software code available can be difficult:
if you are talking about stuff running on so-called
super-computers, you have to know quite a lot about the machine
and the environment it is running on. It is very difficult to
run some of those top-end computer applications, even if, of course,
they are prepared to make their code available.[347]
He added that the way to get around this problem
was to ensure that authors "make clear the nature of the
program they are running and the algorithms".[348]
Dr Read explained that:
A computer will not have any value beyond the way
it is programmed. As long as they define the input conditions,
as it were, and what the program is designed to do, you should
be able to trust the outputs. That would be no different from
any statistical test that is run on a data set, so long as you
say what the test is. You then start to get down to the accuracy
of the data itself, which is perhaps a more fundamental issue
than the software or statistical test that is being run on it.
I would say that the availability of the research data is a more
important issue because then, of course, other researchers could
run different types of algorithms on different types of computer
on that data. I think access to the data is more fundamental.[349]
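One concrete way to "make clear the nature of the program ... and the algorithms", and to pin down exactly which data the outputs refer to, is to publish a small provenance record alongside the results: a checksum of the input file, the parameters used, and a plain description of the test. The field names below are our own illustration, not an established standard.

```python
import hashlib
import json
import platform

def sha256_of(path, chunk=1 << 20):
    """Checksum the input data so that others can confirm they are
    running their own analyses on exactly the same file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def provenance_record(data_path, algorithm, params):
    """Bundle what a reader needs to rerun or reimplement the analysis."""
    return {
        "input_sha256": sha256_of(data_path),
        "algorithm": algorithm,   # what the program is designed to do
        "parameters": params,     # the input conditions
        "python_version": platform.python_version(),
    }

record = provenance_record("observations.csv",  # hypothetical dataset
                           algorithm="ordinary least squares, y ~ x",
                           params={"confidence_level": 0.95})
print(json.dumps(record, indent=2))
```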
A culture of openness
194. Access to data is fundamental if researchers
are to reproduce and thereby verify results that are reported
in the literature. Professor John P. A. Ioannidis, from
the University of Ioannina School of Medicine, stated in a recent
Scientific American article that:
The best way to ensure that test results are verified
would be for scientists to register their detailed experimental
protocols before starting their research and disclose full results
and data when the research is done. At the moment, results are
often selectively reported, emphasizing the most exciting among
them, and outsiders frequently do not have access to what they
need to replicate studies. Journals and funding agencies should
strongly encourage full public availability of all data and analytical
methods for each published paper.[350]
195. In response, Professor Rick Rylance, from
Research Councils UK (RCUK), stated that:
I endorse the broad principles of that. The one slight
reservation I would have is that, quite often, research is a process
of discovery and you don't quite know at the beginning what the
protocols and procedures are that you are going to use, particularly
in my domain. I would have a slight reservation about that, but
the principles are right.[351]
196. Many of the individuals we heard from were
broadly in favour of the principle of openness with regard to
data availability post-publication.[352]
We were told that:
the principles of openness in science, of making
data available and open, are something that the Wellcome Trust
and other funders of biomedical research around the world are
fully behind and completely supportive of.[353]
197. Professor Sir Adrian Smith, from BIS, explained
the current situation and the Government's position on data availability:
There is a great movement now and a recognition of
openness and transparency, which has always been implicit as a
fundamental element of the scientific process. But the more we
collect large datasets, you have to give other people, as part
of the challenge process, the ability to revisit that data and
see what they make of it with openness and transparency. There
is general support these days for the presumption that the research,
the associated data and if you have written a computer code to
assess it, should all be available and up for challenge and testing
validation. In fact, explicitly the Research Councils encourage
that, as Government Departments do. However, there can be complex
and legitimate reasons for not necessarily, at least in the short
term, being that transparent. An awful lot of policy in recent
years has meant that we have been trying to lever more out of
public investment by joint working with business and industry
and levering additional funding. Once you get into that territory,
you do have commercial and intellectual property constraints on
a temporary basis at least, for openness and transparency. The
presumption is that, unless there is a strong reason otherwise,
everything should be out there and available.[354]
198. Sir Adrian added that "there will always
be issues of personal data protection, commercial interests and
intellectual property and national security, so the situation
is quite complex".[355]
Indeed, Dr Malcolm Read, from JISC, explained that "a blanket
mandate on open data might not be feasible but the predisposition
should be to make data openly available".[356]
David Sweeney, from HEFCE, agreed that consideration needed to
be given to "the particular circumstances and the sensitivity".[357]
199. Sir Adrian explained that "different
communities, different cultures and different forms of data pose
different issues".[358]
One example where making data available could be challenging is
where confidential patient data are involved in biomedical research.
The BMJ Group stated that:
The Wellcome Trust and other major international
funders have called for public health researchers to make studies'
raw data available. Annals of Internal Medicine, the BMJ,
BMJ Open, the PLoS journals and several BMC journals (among
others) actively encourage authors to share data in online
repositories with necessary safeguards to protect patient confidentiality.[359]
However, "if you are dealing with clinical material
then the confidentiality of participants is paramount. You have
to manage data so that they are appropriately anonymised and people
cannot be revealed".[360]
Dr Fiona Godlee did not see confidentiality as a problem:
when one is talking about large datasets, confidentiality
has already been dealt with, and we should not use that as an
excuse for not looking at [data deposition]. There are no doubt
practical issues, but [...] nationally, we ought to have systems
for data depositioning. The practical problems will be resolved,
as with trial registration, which seemed impossible five or 10
years ago, and it is now routine.[361]
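Sir Mark Walport's point about appropriate anonymisation can be made concrete. As a rough illustration of a first step only, a preparation script might drop direct identifiers and replace patient identifiers with salted one-way pseudonyms, as sketched below; the column names are hypothetical, and genuine de-identification requires much more than this, since rare combinations of indirect attributes can still identify individuals.

```python
import csv
import hashlib

SALT = "project-specific secret"  # held offline so pseudonyms cannot be reversed
DIRECT_IDENTIFIERS = {"name", "address", "nhs_number", "date_of_birth"}  # hypothetical

def pseudonym(patient_id):
    """Stable one-way code: the same patient always maps to the same
    pseudonym, but it cannot be traced back without the salt."""
    return hashlib.sha256((SALT + patient_id).encode()).hexdigest()[:12]

with open("trial_data.csv", newline="") as src, \
     open("trial_data_anonymised.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    kept = [c for c in reader.fieldnames if c not in DIRECT_IDENTIFIERS]
    writer = csv.DictWriter(dst, fieldnames=kept)
    writer.writeheader()
    for row in reader:
        row["patient_id"] = pseudonym(row["patient_id"])
        writer.writerow({c: row[c] for c in kept})
```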
200. Dr Michaela Torkar explained in more detail
the challenges faced by publishers and how these might be overcome:
It is only if the standards are well established
and agreed on by the community that you can really enforce [data
deposition] and insist on it as a publisher. It becomes more difficult
when, say, databases are not quite ready to accept all of the
submissions or formats. That becomes a real barrier for authors.
They cannot publish because the publisher insists on it. I think
there is a lot of responsibility on the publishers to interact
with different communities to establish the right databases and
standards and where the limitations are and to make it mandatory
in some cases and in others encourage submission and deposition,
in particular. I think it depends very much on the communities.[362]
201. If mandatory data deposition is problematic,
the question becomes how to encourage, rather than enforce,
it. Dr Mark Patterson, from PLoS, told us that:
First, it would be really helpful for publishers
to include some kind of statement about data availability so that
it is clear. How do you get hold of this data? Are there any restrictions
in terms of accessing it because of the size of the data in some
fields or whatever? Secondly, there is an opportunity to incentivise
the sharing of data by giving greater credit and finding mechanisms
to reward researchers who do that to assess the impact of that
sharing as well. Rather than focusing everything on what they
have published in whatever journal, to start thinking about different
kinds of outputs and their value.[363]
Dr Malcolm Read, from JISC, agreed that researchers
"would deserve credit and recognition for that".[364]
202. We note that the Royal Society launched
its Science as a public enterprise project in May 2011.[365]
This will look at how scientific data should best be managed and
may explore some of the issues highlighted in this chapter.
203. Access to data is fundamental
if researchers are to reproduce, verify and build on results that
are reported in the literature. We welcome the Government's recognition
of the importance of openness and transparency. The presumption
must be that, unless there is a strong reason otherwise, data
should be fully disclosed and made publicly available. In line
with this principle, where possible, data associated with all
publicly funded research should be made widely and freely available.
Funders of research must coordinate with publishers to ensure
that researchers disclose their data in a timely manner. The work
of researchers who expend time and effort adding value to their
data, to make it usable by others, should be acknowledged as a
valuable part of their role. Research funders and publishers should
explore how researchers could be encouraged to add this value.
316 Ev 75, para 11
317 Ev 87, para 13 [Philip Campbell]
318 Q 297
319 As above
320 Q 203 [Dr Mark Patterson]
321 Ev 87, para 16
322 Q 206
323 Q 296
324 Q 106
325 Q 203
326 Ev 93, para 23; and Ev 141, para 27 [Dr Andrew Sugden, Science]
327 Ev 141, para 27
328 Ev 80, para 29
329 Ev 141, para 27
330 Ev 108
331 Q 206
332 "About Dryad", Dryad, http://datadryad.org/
333 Q 206
334 "About Dryad", Dryad, http://datadryad.org/
335 Q 206
336 "About Dryad", Dryad, http://datadryad.org/
337 Q 208
338 As above
339 As above
340 Q 280
341 As above
342 As above
343 Q 300
344 For example, "Taking a Hard Look At Storage Costs", enterprisestorageforum.com, 8 August 2008
345 Q 136
346 Q 137
347 Q 202
348 Q 203
349 As above
350 J. P. A. Ioannidis, "An epidemic of false claims", Scientific American, June 2011
351 Q 277
352 For example: Q 136 [Dr Philip Campbell]; Q 207 [Dr Mark Patterson]; and Q 278 [David Sweeney]
353 Q 277 [Sir Mark Walport]
354 Q 298
355 Q 299
356 Q 208
357 Q 279
358 Q 300
359 Ev 73, para 19
360 Q 278 [Sir Mark Walport]
361 Q 157
362 Q 207
363 Q 208
364 As above
365 "Royal Society launches study on openness in science", Royal Society Press Notices, http://royalsociety.org, 13 May 2011