APPENDIX 4
Seminar held at Imperial College, London
1. To enable the Sub-Committee to get a good understanding
of the disease outcome and data-handling aspects of medical databases,
with particular reference to human genetic information, a seminar
was arranged at Imperial College, London on 24 January 2001.
2. Members of the Sub-Committee present were Lord
Oxburgh (Chairman), Lord Flowers, Lord Haskel, Lord Jenkin of
Roding, Lord Turnberg, Lord Walton of Detchant and Baroness Wilcox.
They were supported by the Sub-Committee's Specialist Adviser
(Professor Paul Elliott) and Clerk (Mr Roger Morgan), and the
Select Committee's Specialist Assistant (Dr Adam Heathfield).
Dr Richard Pitts of the HGC Secretariat also attended to assist
in the discussions.
PRESENTATIONS
3. The day began with a series of presentations,
summarised below.
The Small Area Health Statistics Unit (SAHSU)
4. Dr Paul Aylin, of Imperial College, outlined
the origins of SAHSU in the inquiry into leukaemia near Sellafield,
and described its current national responsibilities to highlight
unusual clusters of disease, particularly in the neighbourhood
of industrial installations. SAHSU was required to assess the
health risks to the general population from environmental factors,
and so relied heavily on the interpretation of routine health
statistics.
5. Dr Aylin showed the types of data sets that SAHSU
worked with, which included hospital admissions, census data,
environmental information, and registries of deaths and cancers. Linking
these databases (for example, by extracting the information in
each relating to particular postcode areas) allowed variations
and anomalies in particular diseases to be highlighted and correlated
with demographic, social or environmental factors. The value of
this type of investigation was dependent on the accuracy and completeness
of the records, as well as their coverage of all the various factors
relevant to the disease. Other important aspects of data sets
were whether they covered the time period of interest (and in
particular how up-to-date they were) and whether the various headings
under which the data were stored included fields such as the postcode
to facilitate cross-linkage of different databases.
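The kind of postcode-area linkage described in paragraph 5 can be
illustrated with a minimal sketch in Python. It is illustrative only:
the records, field names, postcode areas and population figures are
invented for the example and do not represent SAHSU's data or methods.

```python
from collections import Counter

# Hypothetical records: every postcode, field name and value is invented.
hospital_admissions = [
    {"postcode": "AB1 2CD", "diagnosis": "asthma"},
    {"postcode": "AB1 3EF", "diagnosis": "asthma"},
    {"postcode": "XY9 8ZW", "diagnosis": "asthma"},
]
census_population = {"AB1": 2000, "XY9": 5000}
environmental_data = {
    "AB1": {"near_industrial_site": True},
    "XY9": {"near_industrial_site": False},
}


def postcode_area(postcode: str) -> str:
    """Return the part of a postcode before the space, upper-cased."""
    return postcode.strip().upper().split()[0]


def admission_rates_per_1000(admissions, population):
    """Count admissions per postcode area and divide by the census population."""
    counts = Counter(postcode_area(r["postcode"]) for r in admissions)
    return {
        area: 1000 * counts.get(area, 0) / pop
        for area, pop in population.items()
    }


def link_by_area(admissions, population, environment):
    """Combine the three data sets into one record per postcode area."""
    rates = admission_rates_per_1000(admissions, population)
    return {
        area: {"rate_per_1000": rate, **environment.get(area, {})}
        for area, rate in rates.items()
    }


linked = link_by_area(hospital_admissions, census_population, environmental_data)
for area, record in sorted(linked.items()):
    print(area, record)
```

The sketch only works because every data set carries a postcode field,
which is the point made above about the headings under which data are
stored.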
6. Dr Aylin stressed the importance of confidentiality
when working with personal data contained in census and health
records, and discussed SAHSU's security arrangements. These included:
encrypted fields in the data; a private computing network physically
unconnected with any other Imperial College systems or the internet;
and diskless workstations to prevent data being copied from the
private network.
7. One of the developments that Dr Aylin hoped for
in the near future was the routine inclusion of patients' NHS
numbers in hospital episode statistics. This would greatly assist
SAHSU in linking data.
Data linkage in Scotland
8. Dr Mary Smalls of the Information and Statistics
Division (ISD), Scottish Health Service, described the methods
of collecting health information in Scotland. For over 30 years,
the ISD had overseen the collection, storage and interpretation
of all Scottish NHS data. Having a single body in charge of all
the data meant that the databases were normally more readily cross-linked
than their equivalents in the rest of the United Kingdom. This
resulted partly from the greater consistency of the personal identifying
information collected (surname, forename, date of birth and postcode),
and partly from the absence of boundaries to complicate the sharing
of data.
9. A Privacy Advisory Committee had been set up
at the same time as the ISD to prevent misuse of the ISD data.
It advised the Registrar General and the Director of the ISD.
As an example of the sorts of problems that could occur if insufficient
care were taken with the use of ISD data, Dr Smalls cited the
possibility of breaking the terms of the Adoption Act by linking
birth data to other data sets.
10. Dr Smalls described the use of probability matching,
as opposed to exact matching, which allowed the linking of different
data sets without incurring errors from small discrepancies in
particular data entries. Probability matching produced 99 per
cent accuracy - even though there might be errors in the records
that would produce 10-15 per cent discrepancies using exact matching.
The advantages of the linkable, integrated databases included
better tracking of individual patients' medical treatment, and
a ready-made evidence base for epidemiological studies and for
assessments of clinical practice.
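The contrast between exact and probability matching can be illustrated
with a short sketch in Python. The seminar did not specify the method
used by the ISD; the example assumes a simple agreement-weight scheme
in the style of Fellegi and Sunter, applied to the four identifiers
listed in paragraph 8. The m and u probabilities, the threshold and the
sample records are all hypothetical.

```python
import math

# For each field: (m, u), where
#   m = assumed probability the field agrees when both records are the same person
#   u = assumed probability the field agrees by chance for different people
# These values are illustrative assumptions, not figures from the seminar.
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "forename": (0.95, 0.02),
    "date_of_birth": (0.98, 0.001),
    "postcode": (0.90, 0.005),
}
THRESHOLD = 10.0  # hypothetical cut-off for declaring a match


def normalise(value: str) -> str:
    """Ignore case and surrounding whitespace when comparing fields."""
    return value.strip().lower()


def exact_match(a: dict, b: dict) -> bool:
    """Exact matching: every identifier must agree character for character."""
    return all(a[field] == b[field] for field in FIELD_PROBS)


def match_weight(a: dict, b: dict) -> float:
    """Sum log2 likelihood ratios: positive for agreement, negative otherwise."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if normalise(a[field]) == normalise(b[field]):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def probable_match(a: dict, b: dict) -> bool:
    """Probability matching: accept the pair if the total weight is high enough."""
    return match_weight(a, b) >= THRESHOLD


# Invented records: the second differs from the first by one spelling error.
record_a = {"surname": "Smith", "forename": "Mary",
            "date_of_birth": "1950-03-14", "postcode": "EH1 1AA"}
record_b = {"surname": "Smyth", "forename": "Mary",
            "date_of_birth": "1950-03-14", "postcode": "EH1 1AA"}
record_c = {"surname": "Jones", "forename": "Alan",
            "date_of_birth": "1972-11-02", "postcode": "G12 8QQ"}

print(exact_match(record_a, record_b), probable_match(record_a, record_b))
print(exact_match(record_a, record_c), probable_match(record_a, record_c))
```

Under these assumptions, a single misspelt surname defeats exact
matching but leaves the total weight well above the threshold, while
two unrelated records score strongly negative.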
11. Dr Smalls felt that informed consent would be
a big issue for the future collection of data; requiring patients
to give explicit consent might limit the types of data that could
be collected. However, the planned uses of data usually involved
the sorts of studies that patients expected the NHS to be conducting
already, so the problem might not be as significant as some feared.
The use of genetic databases and data linkage in Denmark
12. Professor Jørgen Olsen of the Danish Institute
of Cancer Epidemiology described Denmark's regulations governing
medical databases. The systematic collection of health care data
was a long-standing tradition in Nordic countries: Denmark's Cancer
Registry had been running since 1942. The great potential value
of such data was commented on by Denmark's Minister for Research,
Birte Weiss, who last year described her country's health databases
as "a resource that could be used more optimally", and
said that they should be "a scientific flagship".
13. Denmark maintained a Central Population Register
(CPR) which, for each individual, contained a 10-digit personal
identification number, current and former addresses, a list of
immediate family, and other information. The CPR was used in all
aspects of administration in Denmark, and a wide variety of databases
could, in principle, easily be linked using CPR numbers.
14. However, the use of personal information was
controlled under Denmark's Law on Personal Data, which had been
updated in summer 2000. Proposals to conduct studies that involved
secondary use of pre-existing data (e.g. registries of genetic
diseases) required the approval of the Data Inspection Agency.
This Agency would typically focus on the IT security aspects of
a proposal.
15. Under different regulations, studies that involved
assembling new biomedical data on individuals (e.g. by taking
tissue samples for analysis or collecting responses from face-to-face
interviews) required approval by Denmark's Scientific Ethical
Committee System. Such Committees took into account factors such
as the risk to the participants in the study, and the informed
consent sought. Additional projects that would involve analysis
of previously-collected samples required further approval by a
Scientific Ethical Committee. Routinely collected tissue specimens
for diagnosis represented something of a grey area under the Danish
system, but Professor Olsen felt that they probably would be regarded
as a pre-existing data resource.
16. The rationale for the two-tier regulatory system
was the difference in risk to which an individual was exposed.
Secondary use of pre-existing data was regarded as posing no physical
risk to the individuals who supplied the data, and adequate protection
of personal identifiers could ensure that no psychological hazard
was present either. On the other hand, conducting a biomedical
study where an individual was actively and knowingly involved
created a separate, more serious, set of risks - personal and
psychological - hence the involvement of the Scientific Ethical
Committees.
17. Professor Olsen stressed that the system of requiring
informed consent had been developed to protect people from the
health hazards of taking part in clinical research. It was hard
to extend this to cover the possible risks of untoward disclosure
of confidential data. He described the general atmosphere in Denmark
as supportive of data collection, if for a good idea. The main
public concerns and legal restrictions were on how such data were
used. However, data collection could break down if informed consent
were required.
GP databases
18. Professor Ian Purves, Professor of Health Informatics,
University of Newcastle, discussed systems of generating linkable
databases from computerised GP records. He noted that although
the NHS IT strategy was still at a formative stage, 96 per cent
of GPs had proceeded without it and computerised at least part
of their practices, most often for repeat prescriptions.
19. Advances in the use of computer systems in GP
practices were essential to help primary care professionals cope
with the consequences for their work of the advances in genetics.
Many GPs were not well-informed about genetics, and networked
resources could share available knowledge. Furthermore, large-scale
projects to interpret the impact of genetics on disease would
need good quality primary health care information about a large
number of patients: computerised records were the obvious way
of producing such data.
20. The main challenge was how to create a system
under which the narrative of diagnosis could be recorded in a
standard form. If this were achieved, then the process of linking
records created by individual GPs into large databases could be
automated, saving time and making the resulting databases easily
searchable.
21. Professor Purves described a number of different
developments in medical nomenclature and disease classification
systems. He also showed the variation in GPs' use of terminology
and in the precision with which they claimed to identify diseases
in their consultations. Secondary analyses of GP records needed
to take account of these variations, which formed part of the
complexity of health information. Although there would always
be some element of interpretation needed to put each GP's records
in context, with good planning and the use of appropriate systems
this problem could be made more tractable by computer.
22. Professor Purves also discussed security of GP
records, saying that the main problems occurred with people who
had legitimate access to the records, but illegitimate aims for
their use. He claimed that private investigators could normally
get hold of individual patient records for about £120, and
that this lack of security had led to complaints from patients.
Cancer registration and linkage to genetic information in England
23. Professor David Forman, Director of the Northern
and Yorkshire Regional Cancer Registry (and Professor of Cancer
Epidemiology, University of Leeds), described the system of registering
cancer incidence in the United Kingdom, a system in place since
the 1960s. There were nine regional cancer registries in England
together with national registries in Scotland, Wales and Northern
Ireland; several of these covered populations larger than entire
Scandinavian countries.
24. The registries collected a systematic population-based
data set on all those diagnosed with cancer, including personal
details of the patient; the site, type, and stage of the cancer;
the management of the disease and treatment; and outcome indicators.
These data classes were to be greatly increased in the next three
years as a result of co-ordinated initiatives under the National
Cancer Plan, including the development of a national cancer data
set.
25. The data could be used to study trends in incidence
and survival. Other applications were found in epidemiology, health
services research, audit, and linkage with the National Screening
Programme.
26. Links between the cancer registries and genetic
information were created for a number of purposes: for the study
of genetic causes of cancer - using both individuals and families
with histories of cancer; as a resource for the counselling services
who give support to families with high cancer incidence; and to
trace individuals who had genetic markers for increased cancer
risk.
27. Professor Forman considered that the current
emphasis on gaining individual patient consent before any aspect
of his or her health information could be stored and analysed
put at risk the continued collection of several valuable data
sets including that held by the cancer registries. He cited the
recent GMC guidelines on releasing patient information and some
interpretations of the Data Protection Act 1998 and the
common law on privacy as representing specific areas of recent
concern. If certain patients or groups of patients had to opt
into the cancer registry system explicitly, or required that their
data be fully anonymised before registration, then much of the
utility of the data sets would be lost[72].
A technical IT perspective on linking genotypic and phenotypic data
28. Professor Carole Goble, Department of Computer
Science, University of Manchester, described the challenges created
by attempting to link (or fuse) the extremely large databases
that were increasingly common in genetics research. These databases
could be of very different natures: annotated DNA sequences; text
and images describing protein form and function; clinical trial
data; and many others.
29. Bio-medical and genetic databases were rarely
stand-alone. They often incorporated information created by interpreting
other databases (e.g. annotations in a DNA sequence database might
be based on the interpretations of gene expression databases and
protein function databases). This created two problems: there
was a large potential for propagation of errors between different
databases; and the data in a number of different databases needed
to be updated as research in any one single area advanced. Professor
Goble stressed how important it was to design databases with a
high flux of information in mind. The sources of interpreted information
needed to be clearly identified, so that databases could benefit
from being inter-dependent without becoming strewn with erroneous
or out-of-date information.
30. Professor Goble pointed out that, since so little
of modern biology could be interpreted a priori from raw data,
most of the valuable information in databases was contained in
plain text entries. This meant that creating links between databases
was not a job which could be done by computer alone - skilled
people were needed to ensure that the linkage of data was intelligently
done.
31. The language used in different databases also
affected the degree to which they could be linked automatically.
Many important databases had been developed within particular
research communities, with no guarantee of standard terminology.
Even if the same terminology were used, the words did not necessarily
have the same history, context or meaning in different research
areas or countries. This inherent "semantic heterogeneity"
meant that even the adoption of a standard vocabulary would not
completely remove the need for intelligent input in the process
of linking databases.
32. Professor Goble closed by discussing the problems
of training, recruiting and retaining suitable staff to meet the
considerable challenges of linking large biological databases.
She said that Europe had a better recent record than the USA of
providing the particular skills such projects required. However,
suitably qualified people could earn far more in e-commerce than
in bioinformatics research.
DISCUSSION
33. In a wide-ranging discussion involving all the
presenters, together with
- Dr John Fox, Director of Statistics, Department
of Health;
- Dr Peter Goldblatt, Chief Medical Statistician,
Office for National Statistics; and
- Dr Lars Järup of Imperial College,
the following main points were then noted.
- The ability to link information from different
databases was a very powerful tool for epidemiological and resource
planning purposes, even when some of the data were anonymised.
- However, the relevant databases had generally
been established in isolation. Great care had to be taken in fusing
the data to ensure that like was linked with like. The typical
highly iterative methodologies meant that any error could be propagated
widely and prove difficult to unpick.
- Moves were afoot to get greater uniformity in
classifying and recording medical factors, although it was unrealistic
to think that there would ever be standardisation in the way individuals
used such codes. Managing this heterogeneity of data would continue
to require substantial skilled human input.
- Mitigating those procedural difficulties and
developing databases that could accommodate inevitable change
would require substantial work by skilled software engineers who
were in short supply. Although Europe was probably ahead of the
United States in training these skilled people, such skills were
much in demand in better paid parts of the economy.
- It would probably always be the case that some
of the more meaningful information would be in normal language,
impossible for a machine to understand. There would be a continuing
demand for skilled human input.
- The IT burden should not be underestimated. The
NHS targets for improvements in computing and interconnectivity
over the next few years would take the United Kingdom only part
of the way. Substantially useful outcomes were likely to be a
decade or more away.
- Both Scotland and Denmark had the benefit of
more integrated medical databases, in large measure flowing from
their relatively small populations (of about 5 million). Record
linkage in Denmark was greatly simplified by the use of citizens'
ID numbers. Dr Lars Järup also briefly described the Swedish
experience where large collections of biological material, which
could be linked to health and lifestyle information of the donors,
had been assembled over the past 15-30 years.
- All systems had substantial arrangements for
maintaining security and confidentiality. Subject to those, Danish
citizens were content for secondary use of pre-existing data to
be overseen by the Data Inspection Agency.
- 'Informed consent' had been developed to protect
people's physical safety when participating in medical trials
or undergoing medical procedures. The threats from misuse of personal
data were not the same, and it could be argued that different
mechanisms were required.
- There would be difficulties in the United Kingdom
if moves to develop explicit "informed consent" (for
example, under the recent GMC guidelines) led to information about
sections of the patient population being excluded from disease
registers.
- The development of people's rights in respect
of personal data needed to be accompanied not only by education
in their responsibilities for facilitating studies from which
the population as a whole would greatly benefit, but also by robust
systems to reassure them that sensitive data could not be used
to their detriment. At the same time, much better information
should be provided to patients about how their information is
being used.
34. Members endorsed the Chairman's thanks to all
the visitors for their contribution to a most stimulating and
informative seminar, and to the Specialist Adviser for his work
in setting up the event. The visitors said that they too had found
great value in the sharing of knowledge throughout the day.
72 In connection with these points, Professor Forman
subsequently drew our attention to a February 2001 Briefing Statement
from the UK Association of Cancer Registries, Health and Social
Care Bill: Clause 59, and to Brewster D, Coleman P, Forman
D, Roche M. "Cancer information under threat: the case for
legislation", Annals of Oncology 2001; 12: 147-149.