Select Committee on Science and Technology Fourth Report


Seminar held at Imperial College, London

1. To enable the Sub-Committee to get a good understanding of the disease outcome and data-handling aspects of medical databases, with particular reference to human genetic information, a seminar was arranged at Imperial College, London on 24 January 2001.

2. Members of the Sub-Committee present were Lord Oxburgh (Chairman), Lord Flowers, Lord Haskel, Lord Jenkin of Roding, Lord Turnberg, Lord Walton of Detchant and Baroness Wilcox. They were supported by the Sub-Committee's Specialist Adviser (Professor Paul Elliott) and Clerk (Mr Roger Morgan), and the Select Committee's Specialist Assistant (Dr Adam Heathfield). Dr Richard Pitts of the HGC Secretariat also attended to assist in the discussions.


3. The day began with a series of presentations, summarised below.

The Small Area Health Statistics Unit (SAHSU)

4. Dr Paul Aylin, of Imperial College, outlined the origins of SAHSU in the inquiry into leukaemia near Sellafield, and described its current national responsibilities to highlight unusual clusters of disease, particularly in the neighbourhood of industrial installations. SAHSU was required to assess the health risks to the general population from environmental factors, and so relied heavily on the interpretation of routine health statistics.

5. Dr Aylin showed the types of data sets that SAHSU worked with, which included hospital admissions, census data, environmental information, and registries of deaths and cancers. Linking these databases (for example, by extracting the information in each relating to particular postcode areas) allowed variations and anomalies in particular diseases to be highlighted and correlated with demographic, social or environmental factors. The value of this type of investigation was dependent on the accuracy and completeness of the records, as well as their coverage of all the various factors relevant to the disease. Other important aspects of data sets were whether they covered the time period of interest (and in particular how up-to-date they were) and whether the various headings under which the data were stored included fields such as the postcode to facilitate cross-linkage of different databases.
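The postcode-based linkage described in this paragraph can be illustrated with a minimal sketch. It is not SAHSU's actual method: the field names, area codes and figures are invented for illustration, and real analyses would adjust for age, sex and other factors.

```python
from collections import defaultdict

# Hypothetical records; field names, area codes and values are invented.
admissions = [
    {"postcode_area": "AA1", "diagnosis": "asthma"},
    {"postcode_area": "AA1", "diagnosis": "asthma"},
    {"postcode_area": "AA2", "diagnosis": "asthma"},
]
census = {"AA1": 1000, "AA2": 4000}  # resident population per area (invented)


def rate_per_1000(admissions, census, diagnosis):
    """Count admissions per postcode area and divide by the census population."""
    counts = defaultdict(int)
    for record in admissions:
        if record["diagnosis"] == diagnosis:
            counts[record["postcode_area"]] += 1
    # Only areas present in the census data can be given a rate.
    return {
        area: 1000 * counts[area] / population
        for area, population in census.items()
        if population > 0
    }


rates = rate_per_1000(admissions, census, "asthma")
```

The postcode area acts as the shared key between the two data sets, which is why its presence as a field in each database matters for cross-linkage.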

6. Dr Aylin stressed the importance of confidentiality when working with personal data contained in census and health records, and discussed SAHSU's security arrangements. These included: encrypted fields in the data; a private computing network physically unconnected with any other Imperial College systems or the internet; and diskless workstations to prevent data being copied from the private network.

7. Among the developments that Dr Aylin hoped for in the near future was the routine inclusion of patients' NHS numbers in hospital episode statistics. This would greatly assist SAHSU in linking data.

Data linkage in Scotland

8. Dr Mary Smalls of the Information and Statistics Division (ISD), Scottish Health Service, described the methods of collecting health information in Scotland. For over 30 years, the ISD had overseen the collection, storage and interpretation of all Scottish NHS data. Having a single body in charge of all the data meant that the databases were normally more readily cross-linked than their equivalents in the rest of the United Kingdom. This resulted partly from the greater consistency of the personal identifying information collected (surname, forename, date of birth and postcode), and partly from the absence of boundaries to complicate the sharing of data.

9. A Privacy Advisory Committee had been set up at the same time as the ISD to prevent misuse of the ISD data. It advised the Registrar General and the Director of the ISD. As an example of the sorts of problems that could occur if insufficient care were taken with the use of ISD data, Dr Smalls cited the possibility of breaking the terms of the Adoption Act by linking birth data to other data sets.

10. Dr Smalls described the use of probability matching, as opposed to exact matching, which allowed the linking of different data sets without incurring errors from small discrepancies in particular data entries. Probability matching produced 99 per cent accuracy - even though there might be errors in the records that would produce 10-15 per cent discrepancies using exact matching. The advantages of the linkable, integrated databases included better tracking of individual patients' medical treatment, and a ready-made evidence base for epidemiological studies and for assessments of clinical practice.
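The contrast between exact and probability matching can be shown with a short sketch. The seminar note does not say which algorithm the ISD used; the code below assumes a Fellegi-Sunter style weighting, and the per-field probabilities, the threshold and the sample records are all invented for illustration.

```python
import math

# Illustrative probabilities per field (invented, not ISD values):
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "forename": (0.90, 0.05),
    "date_of_birth": (0.97, 0.003),
    "postcode": (0.85, 0.001),
}


def match_weight(a, b):
    """Sum log2 likelihood ratios over the identifying fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if a[field] == b[field]:
            total += math.log2(m / u)  # agreement adds evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total


def is_probable_match(a, b, threshold=10.0):
    """Treat a pair as linked when its weight reaches an assumed threshold."""
    return match_weight(a, b) >= threshold


rec_a = {"surname": "Macdonald", "forename": "Ann",
         "date_of_birth": "1950-03-01", "postcode": "AA1 1AA"}
rec_b = dict(rec_a, surname="McDonald")  # same person, spelling discrepancy
rec_c = {"surname": "Smith", "forename": "John",
         "date_of_birth": "1961-07-12", "postcode": "BB2 2BB"}
```

Here `rec_a` and `rec_b` would fail an exact comparison because of the surname spelling, yet the agreement on the other three fields outweighs that single discrepancy.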

11. Dr Smalls felt that informed consent would be a big issue for the future collection of data; requiring patients to give explicit consent might limit the types of data that could be collected. However, the planned uses of data usually involved the sorts of studies that patients expected the NHS to be conducting already, so the problem might not be as significant as some feared.

The use of genetic databases and data linkage in Denmark

12. Professor Jørgen Olsen of the Danish Institute of Cancer Epidemiology described Denmark's regulations governing medical databases. The systematic collection of health care data was a long-standing tradition in Nordic countries: Denmark's Cancer Registry had been running since 1942. The great potential value of such data was commented on by Denmark's Minister for Research, Birte Weiss, who last year described her country's health databases as "a resource that could be used more optimally", and said that they should be "a scientific flagship".

13. Denmark maintained a Central Population Register (CPR) which, for each individual, contained a 10-digit personal identification number, current and former addresses, a list of immediate family, and other information. The CPR was used in all aspects of administration in Denmark, and a wide variety of databases could, in principle, easily be linked using CPR numbers.

14. However, the use of personal information was controlled under Denmark's Law on Personal Data, which had been updated in Summer 2000. Proposals to conduct studies that involve secondary use of pre-existing data (e.g. registries of genetic diseases) required the approval of the Data Inspection Agency. This Agency would typically focus on the IT security aspects of a proposal.

15. Under different regulations, studies that involved assembling new biomedical data on individuals (e.g. by taking tissue samples for analysis or collecting responses from face-to-face interviews) required approval by Denmark's Scientific Ethical Committee System. Such Committees took into account factors such as the risk to the participants in the study, and the informed consent sought. Additional projects that would involve analysis of previously-collected samples required further approval by a Scientific Ethical Committee. Routinely collected tissue specimens for diagnosis represented something of a grey area under the Danish system, but Professor Olsen felt that they probably would be regarded as a pre-existing data resource.

16. The rationale for the two-tier regulatory system was the difference in risk to which an individual was exposed. Secondary use of pre-existing data was regarded as posing no physical risk to the individuals who supplied the data, and adequate protection of personal identifiers could ensure that no psychological hazard was present either. On the other hand, conducting a biomedical study where an individual was actively and knowingly involved created a separate, more serious, set of risks - personal and psychological - hence the involvement of the Scientific Ethical Committees.

17. Professor Olsen stressed that the system of requiring informed consent had been developed to protect people from the health hazards of taking part in clinical research. It was hard to extend this to cover the possible risks of untoward disclosure of confidential data. He described the general atmosphere in Denmark as supportive of data collection, provided it was in support of a good idea. The main public concerns and legal restrictions were on how such data were used. However, data collection could break down if informed consent were required.

GP databases

18. Professor Ian Purves, Professor of Health Informatics, University of Newcastle, discussed systems of generating linkable databases from computerised GP records. He noted that although the NHS IT strategy was still at a formative stage, 96 per cent of GPs had proceeded without it and computerised at least part of their practices, most often for repeat prescriptions.

19. Advances in the use of computer systems in GP practices were essential to help primary care professionals cope with the consequences for their work of the advances in genetics. Many GPs were not well-informed about genetics, and networked resources could share available knowledge. Furthermore, large-scale projects to interpret the impact of genetics on disease would need good quality primary health care information about a large number of patients: computerised records were the obvious way of producing such data.

20. The main challenge was how to create a system under which the narrative of diagnosis could be recorded in a standard form. If this were achieved, then the process of linking records created by individual GPs into large databases could be automated, saving time and making the resulting databases easily searchable.

21. Professor Purves described a number of different developments in medical nomenclature and disease classification systems. He also showed the variation in GPs' use of terminology and in the precision with which they claimed to identify diseases in their consultations. Secondary analyses of GP records needed to take account of these variations, which formed part of the complexity of health information. Although there would always be some element of interpretation needed to put each GP's records in context, with good planning and the use of appropriate systems this problem could be made more tractable by computer.

22. Professor Purves also discussed security of GP records, saying that the main problems occurred with people who had legitimate access to the records, but illegitimate aims for their use. He claimed that private investigators could normally get hold of individual patient records for about £120, and that this lack of security had led to complaints from patients.

Cancer registration and linkage to genetic information in England

23. Professor David Forman, Director of the Northern and Yorkshire Regional Cancer Registry (and Professor of Cancer Epidemiology, University of Leeds) described the system of registering cancer incidence in the United Kingdom, a system in place since the 1960s. There were nine regional cancer registries in England together with national registries in Scotland, Wales and Northern Ireland; several of these covered populations larger than entire Scandinavian countries.

24. The registries collected a systematic population-based data set on all those diagnosed with cancer, including personal details of the patient; the site, type, and stage of the cancer; the management of the disease and treatment; and outcome indicators. These data classes were to be greatly increased in the next three years as a result of co-ordinated initiatives under the National Cancer Plan, including the development of a national cancer data set.

25. The data could be used to study trends in incidence and survival. Other applications were found in epidemiology, health services research, audit, and linkage with the National Screening Programme.

26. Links between the cancer registries and genetic information were created for a number of purposes: for the study of genetic causes of cancer - using both individuals and families with histories of cancer; as a resource for the counselling services who give support to families with high cancer incidence; and to trace individuals who had genetic markers for increased cancer risk.

27. Professor Forman considered that the current emphasis on gaining individual patient consent before any aspect of his or her health information could be stored and analysed put at risk the continued collection of several valuable data sets including that held by the cancer registries. He cited the recent GMC guidelines on releasing patient information and some interpretations of the Data Protection Act 1998 and the common law on privacy as representing specific areas of recent concern. If certain patients or groups of patients had to opt into the cancer registry system explicitly, or required that their data be fully anonymised before registration, then much of the utility of the data sets would be lost[72].

A technical IT perspective on linking genotypic and phenotypic data

28. Professor Carole Goble, Department of Computer Science, University of Manchester, described the challenges created by attempting to link (or fuse) the extremely large databases that were increasingly common in genetics research. These databases could be of very different natures: annotated DNA sequences; text and images describing protein form and function; clinical trial data; and many others.

29. Bio-medical and genetic databases were rarely stand-alone. They often incorporated information created by interpreting other databases (e.g. annotations in a DNA sequence database might be based on the interpretations of gene expression databases and protein function databases). This created two problems: there was a large potential for propagation of errors between different databases; and the data in a number of different databases needed to be updated as research in any one single area advanced. Professor Goble stressed how important it was to design databases with a high flux of information in mind. The sources of interpreted information needed to be clearly identified, so that databases could benefit from being inter-dependent without becoming strewn with erroneous or out-of-date information.
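One way to picture the point about identifying sources is a provenance map that records which entries each interpreted entry was derived from. The sketch below is an illustration only, not a system described at the seminar: the entry names and the `derived_from` structure are hypothetical. Given a source entry that has been corrected or updated, it finds every entry that depends on it, directly or indirectly, and so needs review.

```python
from collections import deque

# Hypothetical provenance map: each entry lists the entries it was
# interpreted from. All names are invented for illustration.
derived_from = {
    "sequence_annotation_1": ["expression_result_7", "protein_function_3"],
    "pathway_note_4": ["sequence_annotation_1"],
    "protein_function_3": [],
    "expression_result_7": [],
}


def entries_to_review(changed, derived_from):
    """Return every entry that depends, directly or indirectly, on `changed`."""
    # Invert the map so that each source points at the entries built on it.
    dependants = {}
    for entry, sources in derived_from.items():
        for source in sources:
            dependants.setdefault(source, []).append(entry)

    stale, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for entry in dependants.get(current, []):
            if entry not in stale:  # visit each dependant once
                stale.add(entry)
                queue.append(entry)
    return stale


stale = entries_to_review("protein_function_3", derived_from)
```

Without recorded sources there is nothing to traverse, which is the sense in which unidentified interpretations allow errors to spread unnoticed.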

30. Professor Goble pointed out that, since so little of modern biology could be interpreted a priori from raw data, most of the valuable information in databases was contained in plain text entries. This meant that creating links between databases was not a job which could be done by computer alone - skilled people were needed to ensure that the linkage of data was intelligently done.

31. The language used in different databases also affected the degree to which they could be linked automatically. Many important databases had been developed within particular research communities, with no guarantee of standard terminology. Even if the same terminology were used, the words did not necessarily have the same history, context or meaning in different research areas or countries. This inherent "semantic heterogeneity" meant that even the adoption of a standard vocabulary would not completely remove the need for intelligent input in the process of linking databases.
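The effect of semantic heterogeneity on automatic linkage can be sketched as follows. The database names, the term table and the `harmonise` helper are hypothetical, chosen only to show that the same term may need a different standard meaning depending on its source, and that anything unmapped has to be passed to a person.

```python
# Hypothetical mapping from (source database, local term) to a shared
# vocabulary. The source is part of the key because the same word may carry
# different meanings in different communities.
STANDARD_TERMS = {
    ("clinic_db", "MI"): "myocardial infarction",
    ("clinic_db", "heart attack"): "myocardial infarction",
    ("cardiology_db", "MI"): "mitral insufficiency",
}


def harmonise(source, terms):
    """Map local terms to standard ones; collect unmapped terms for review."""
    mapped, needs_review = {}, []
    for term in terms:
        key = (source, term)
        if key in STANDARD_TERMS:
            mapped[term] = STANDARD_TERMS[key]
        else:
            needs_review.append(term)  # left for skilled human judgement
    return mapped, needs_review


mapped, needs_review = harmonise("clinic_db", ["MI", "angina"])
```

Even with a complete table, deciding which context a given record belongs to remains a matter of interpretation, so a lookup of this kind reduces the manual work without removing it.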

32. Professor Goble closed by discussing the problems of training, recruiting and retaining suitable staff to meet the considerable challenges of linking large biological databases. She said that Europe had a better recent record than the USA of providing the particular skills such projects required. However, suitably qualified people could earn far more in e-commerce than in bioinformatics research.


33. In a wide-ranging discussion involving all the presenters, together with

  1. Dr John Fox, Director of Statistics, Department of Health;
  2. Dr Peter Goldblatt, Chief Medical Statistician, Office for National Statistics; and
  3. Dr Lars Järup of Imperial College,

the following main points were then noted.

  1. The ability to link information from different databases was a very powerful tool for epidemiological and resource planning purposes, even when some of the data were anonymised.
  2. However, the relevant databases had generally been established in isolation. Great care had to be taken in fusing the data to ensure that like was linked with like. The typical highly iterative methodologies meant that any error could be propagated widely and prove difficult to unpick.
  3. Moves were afoot to get greater uniformity in classifying and recording medical factors, although it was unrealistic to think that there would ever be standardisation in the way individuals used such codes. Managing this heterogeneity of data would continue to require substantial skilled human input.
  4. Mitigating those procedural difficulties and developing databases that could accommodate inevitable change would require substantial work by skilled software engineers who were in short supply. Although Europe was probably ahead of the United States in training these skilled people, such skills were much in demand in better paid parts of the economy.
  5. It would probably always be the case that some of the more meaningful information would be in normal language, impossible for a machine to understand. There would be a continuing demand for skilled human input.
  6. The IT burden should not be underestimated. The NHS targets for improvements in computing and interconnectivity over the next few years would take the United Kingdom only part of the way. Substantially useful outcomes were likely to be a decade or more away.
  7. Both Scotland and Denmark had the benefit of more integrated medical databases, in large measure flowing from their relatively small populations (of about 5 million). Record linkage in Denmark was greatly simplified by the use of citizens' ID numbers. Dr Lars Järup also briefly described the Swedish experience, where large collections of biological material, which could be linked to health and lifestyle information of the donors, had been assembled over the past 15-30 years.
  8. All systems had substantial arrangements for maintaining security and confidentiality. Subject to those, Danish citizens were content for secondary use of pre-existing data to be overseen by the Data Inspection Agency.
  9. 'Informed consent' had been developed to protect people's physical safety when participating in medical trials or undergoing medical procedures. The threats from misuse of personal data were not the same, and it could be argued that different mechanisms were required.
  10. There would be difficulties in the United Kingdom if moves to develop explicit "informed consent" (for example, under the recent GMC guidelines) led to information about sections of the patient population being excluded from disease registers.
  11. The development of people's rights in respect of personal data needed to be accompanied not only by education in their responsibilities for facilitating studies from which the population as a whole would greatly benefit, but also by robust systems to reassure them that sensitive data could not be used to their detriment. At the same time, much better information should be provided to patients about how their information was being used.

34. Members endorsed the Chairman's thanks to all the visitors for their contribution to a most stimulating and informative seminar, and to the Specialist Adviser for his work in setting up the event. The visitors said that they too had found great value in the sharing of knowledge throughout the day.

72   In connection with these points, Professor Forman subsequently drew our attention to a February 2001 Briefing Statement from the UK Association of Cancer Registries, Health and Social Care Bill: Clause 59, and to Brewster D, Coleman P, Forman D, Roche M. "Cancer information under threat: the case for legislation", Annals of Oncology 2001; 12: 147-149.


© Parliamentary copyright 2001