Select Committee on Science and Technology Fourth Report


CHAPTER 5: Information technology and data linkage

5.1 This Chapter considers the handling, data linkage and analysis of human genetic databases. Data and information technology (IT) issues specific to the NHS are dealt with in Chapter 6. Related to both of these Chapters are some substantial questions of consent and confidentiality which we address in Chapter 7.

Drowning in data

5.2 The human genome project alone has generated huge amounts of data. In addition, the advances discussed in Chapter 4 will bring with them massive further increases in data, including extensive genetic information (e.g. SNP data), data on proteins, variations in individual characteristics and lifestyle and environmental factors. These will need to be collected from individuals who are perhaps followed up over many years. Current computer technology is being pushed to its limits.

5.3 The purpose of all these large and diverse data sets is to facilitate the search for potentially very subtle correlations between health outcomes, genetics and environmental factors. However, problems of data incompatibility, lack of standardisation of terminology and differences in data structure will make it very difficult to link information together. Even when those problems are resolved, the analysis of such data will require new and computationally intensive methods to extract useful information. The sheer volume of genetic data, and the computer-intensive demands of analysing them, may outstrip expected advances in computer hardware and software.

Information technology

GENETIC SEQUENCE DATA

5.4 In written evidence, the Sanger Centre stressed that the demands on data management were set to grow dramatically in the field of human genetics. The amount of basic genomic data was increasing faster than the capabilities of computing. The situation for data derived from and building on those basic data was even more acute since this was increasing at a yet faster rate (p 88). On our visit (see Appendix 5), we heard that the Sanger Centre currently had storage for 22 terabytes of data[32]. Backing up these data every 3 days was a major task in itself.

5.5 Also on that visit, we learnt that the Sanger Centre published information on about 50 million base pairs of genome sequence data every 24 hours[33]. Applying the Centre's current full resources to sequencing one person's genes (about 3 per cent of the genome) would take around one week. Alternatively, a targeted programme that sequenced only sites of possible key variations (in, say, around 400 genes) could, with current capacity, deal with around 100 individuals a week (see Appendix 5). Those are still very large amounts of sequence data. If Professor Bell's forecast (as noted in paragraph 4.26) of being able, within 10 years, to sequence an individual's entire genome very efficiently came true (Q 341), the amount of data would increase massively.

UK POPULATION BIOMEDICAL COLLECTION

5.6 Dr Dexter (Q 83) considered that the 500,000 person human genetic database proposed by the MRC and The Wellcome Trust (see paragraph 4.16 ff) should be much less demanding in terms of information management than handling the whole human genome sequence (and its annotation). He felt that the technology to deal with the information was already available, although it would require a large resource and excellent management.

5.7 The size of data handling and computing resource required for this large cohort study would clearly depend on the amount of primary data collected, and the amount of genetic, lifestyle, questionnaire and other information that would be obtained on each of the 500,000 individuals. Our discussions at the Sanger Centre included consideration of the computing resources that would be required in order to handle and analyse such data, assuming the use of modern computer-intensive techniques to identify gene-environment-disease associations. The computing demands seem likely to push at the limits of currently foreseeable technology (see Appendix 5). Dr Kevin Cheeseman of AstraZeneca felt that, while computing hardware was unlikely to be a constraint, the big logistical problems needing careful consideration were how to deal with the huge quantity of data and how to use the computing resource efficiently (Q 214).

Data linkage

5.8 At the risk of being repetitive, we stress that large-scale epidemiological studies linking genetic information to information on environment, lifestyles and disease outcome are needed to take maximum advantage of advances in genetics. Only by looking for correlations between these collections of information can the complex interactions between genetics, environment and lifestyle in disease pathogenesis and causation be understood. Such linkage of data was a vital component of the proposed UK Population Biomedical Collection (see paragraph 4.16 ff).

EXAMPLES OF DATA LINKAGE IN DIFFERENT SETTINGS

5.9 The seminar we commissioned at Imperial College gave several examples (see Appendix 4) of the successful linkage of health data to other data on individuals - not necessarily including genetic information - that had provided vital information for understanding disease causation; for use in clinical audit and for clinical governance purposes; for tracking of individual patients' medical treatment; and for health service management.

5.10 Dr Paul Aylin, a medical epidemiologist in the UK Small Area Health Statistics Unit, outlined how the Unit used geographical information (the postcode) to link data across a number of different data sets in order to look for the effects of environmental pollution on the health of the population. He stressed that the value and interpretation of this type of analysis was dependent on the accuracy, completeness and timeliness of the data. In England and Wales, the ability to link across different data sets for medical research would be greatly enhanced by the use of a common identifier such as the NHS number on all health records.
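
By way of illustration only, linkage on a shared geographical key of the kind described in paragraph 5.10 might be sketched as follows. This is a minimal sketch and not the Unit's actual method: the function names, the record fields ("id", "postcode", "exposure") and the sample values are all hypothetical. It shows why the completeness and accuracy of the postcode matter, since any record whose postcode is missing or unmatched simply cannot be linked.

```python
def normalise_postcode(raw):
    """Upper-case and remove all whitespace so that 'ab1 2cd' matches 'AB1 2CD'."""
    return "".join(raw.split()).upper()


def link_by_postcode(health_records, exposure_by_postcode):
    """Attach an exposure value to each health record via its postcode.

    Returns (linked, unlinked). Records with a missing or unmatched
    postcode are returned separately rather than silently dropped.
    """
    lookup = {normalise_postcode(pc): value
              for pc, value in exposure_by_postcode.items()}
    linked, unlinked = [], []
    for record in health_records:
        key = normalise_postcode(record.get("postcode") or "")
        if key in lookup:
            linked.append({**record, "exposure": lookup[key]})
        else:
            unlinked.append(record)
    return linked, unlinked


# Hypothetical sample data: one clean match, one blank, one unknown postcode.
health = [
    {"id": 1, "postcode": "ab1 2cd"},
    {"id": 2, "postcode": ""},
    {"id": 3, "postcode": "ZZ9 9ZZ"},
]
exposure = {"AB1 2CD": "high"}
linked, unlinked = link_by_postcode(health, exposure)
```

In this sketch only the first record links; the other two fall into the unlinked list, which is the practical consequence of incomplete or inaccurate postcodes.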

5.11 Dr Mary Smalls from the Information and Statistics Division of the NHS in Scotland told us that arrangements had been in place for many years to link various data sets held by the Service, with a view to facilitating epidemiological surveillance of the health of the Scottish population. This surveillance had proved very valuable. The data sets included mortality data, cancer incidence, births, still births, maternity statistics and hospital admissions data.

5.12 Professor David Forman, Director of the Northern and Yorkshire Regional Cancer Registry, described the system of registering cancer incidence in the United Kingdom, in place since the 1960s. Data were collected systematically by the registries on people diagnosed with cancer, including personal details of the patient; the site, type, and stage of the cancer; the management of the disease and treatment; and the outcome. These data had proved invaluable in epidemiological studies[34].

5.13 Professor Forman told us that links between the cancer registries and genetic information were already being created for a variety of purposes. These included study of the genetic causes of cancer; acting as a resource for counselling services that gave support to families with high cancer incidence; and a means of tracing individuals who had genetic markers for increased cancer risk.

5.14 In the NHS, the General Practitioner (GP) is the point of entry into the health care system[35]. As GPs are key players in the management and delivery of health care, their data are vital in gaining an overall picture of health needs and provision. However, we had noted suggestions that substantial parts of the information in some GP records, including simple things such as the postcode, were either wrong or incomplete[36]. Sir John Pattison indicated that a system of holding each person's data in electronic health records in the NHS (thus facilitating linkage across different data sets) would not be fully implemented nationally until 2005 at the earliest (QQ 19 & 27).

5.15 Sir George Radda and Dr Dexter indicated that the present system of holding, maintaining and identifying an individual's health records in the NHS was inadequate for the purposes of the proposed UK Population Biomedical Collection. For this new study, consistent and reliable linkage to genetic, lifestyle and certain other data was required. Accordingly, special follow-up procedures, independent of the NHS's routine collection of data, would need to be put into place (Q 77).

5.16 Representatives of the pharmaceutical industry told us that it was common practice for genetic data obtained in clinical trials to be linked to subsequent health information of the patient, although this was usually done through the treating physician (QQ 237-240).

PROBLEMS OF DATA LINKAGE

5.17 Also at our Imperial College seminar (see Appendix 4), Professor Carole Goble, Department of Computer Science, University of Manchester, outlined the challenges created by attempts to link (or fuse) the extremely large and heterogeneous databases that were increasingly common in genetic research. These databases were highly varied, containing, among many other types of information, annotated DNA sequences, text and images describing protein form and function, and clinical trial data. They were rarely stand-alone, but often incorporated information created by interpreting other databases (for example, annotations in a DNA sequence database might be based on interpretations of databases of gene expression and protein function). This gave a large potential for propagation of errors between different databases. Moreover, information in the secondary databases needed continual updating, as research refined the primary sources from which they were derived.

5.18 Professor Goble pointed out that the most valuable information in databases was often contained in plain text entries. This meant that creating links between databases could not be done by computer alone - skilled people were needed to ensure that the linkage of data was intelligently done. The language used in different databases also affected the degree to which they could be linked automatically. Many important databases had been developed within particular research communities, with no standardised terminology. Even if the same terminology were used, the words did not necessarily have the same history, context or meaning in different research areas or countries. This inherent "semantic heterogeneity" meant that even the adoption of a standard vocabulary would not completely remove the need for intelligent input in the process of linking databases.
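
The point made in paragraph 5.18 can be sketched in code, purely as an illustration. The example below assumes a hypothetical lookup table from free-text terms to standard concept codes; the function name, the terms and the codes are invented for the purpose and do not come from any real vocabulary. Terms that map to exactly one concept are linked automatically, while unknown or ambiguous terms are set aside for a skilled person to resolve.

```python
def map_terms(terms, vocabulary):
    """Map free-text terms to standard concept codes where this is unambiguous.

    vocabulary: dict from a lower-case term to a list of candidate codes.
    Returns (mapped, needs_review): terms with exactly one candidate are
    mapped automatically; unknown or ambiguous terms are listed for
    human review.
    """
    mapped, needs_review = {}, []
    for term in terms:
        candidates = vocabulary.get(term.strip().lower(), [])
        if len(candidates) == 1:
            mapped[term] = candidates[0]
        else:
            needs_review.append(term)
    return mapped, needs_review


# Hypothetical vocabulary: "expression" means different things in
# different research communities, so it has two candidate codes.
vocabulary = {
    "tumour": ["C001"],
    "expression": ["G010", "P020"],
}
mapped, needs_review = map_terms(["Tumour", "expression", "locus"], vocabulary)
```

Here "Tumour" is mapped automatically, whereas "expression" (ambiguous) and "locus" (not in the vocabulary) are left for review, which is the sense in which a standard vocabulary reduces but does not remove the need for intelligent input.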

5.19 Professor Ian Purves, Professor of Health Informatics, University of Newcastle, discussed how databases might be linked together from computerised GP records. He noted that the NHS IT strategy was still at a formative stage[37], but that almost all GPs had proceeded without it and computerised at least part of their practices, most often for repeat prescriptions. Professor Purves felt that the main challenge was the creation of a system under which diagnostic information and relevant exchanges with patients could be recorded in a standard form. If this were achieved, then the process of linking records created by individual GPs into large databases could be automated, saving time and making the resulting databases easily searchable.

5.20 In his evidence, Dr Aidan Power of Pfizer noted from the pharmaceutical company perspective that creating and building databases that combined different kinds of data from different data sets was an enormously complex problem that was not easily solved (Q 243).

Essential skills

5.21 It was common ground among our witnesses that there was a world-wide shortage of skills in the key areas of bioinformatics and statistical genetics (the application of IT and statistical techniques respectively to genetic and other related problems in biomedicine). The MRC and The Wellcome Trust had programmes in place which aimed to improve the position in the United Kingdom (QQ 86 & 87), as did the pharmaceutical industry (QQ 205, 212-213 & 215).

5.22 At our Imperial College seminar, Professor Goble noted that Europe had a better recent tradition than the United States in some of the specialised skills required for research into the linking of large biological databases. She pointed out, however, that the skills were transferable and suitably qualified people could earn far more in the finance sector, e-commerce or the pharmaceutical industry than in bioinformatics research in higher education institutions (see Appendix 4).

5.23 Dr Roses noted that it was important to develop the necessary basic computing skills at school. He felt that the United Kingdom had a particular advantage in that, as part of their degree courses, a large number of students took internships with companies that made practical use of bioinformatics. Often, they were subsequently recruited back into the company (Q 215).

Conclusions

5.24 Computing and data handling requirements for genetics research seem bound to continue to increase, and quite possibly at a faster rate than developments in computing hardware and software.

5.25 While the computing requirements for handling the epidemiological follow-up of large cohorts, such as is proposed for the UK Population Biomedical Collection, are likely to be manageable with present technology, the analysis of the resulting data will be highly computer-intensive and may exceed foreseeable developments in computing. As noted in paragraph 4.33, we fully support this study, but we remain unconvinced that the many critical data-handling issues have yet been fully thought through. Recognising that the UK Population Biomedical Collection project will stand or fall on its ability to manage the data, we recommend that the MRC and The Wellcome Trust should give high priority to ensuring that all aspects of the data handling and computing requirements for this important project have been fully addressed, and make appropriate plans to meet its needs.

5.26 There are skills shortages in bioinformatics and statistical genetics, which are being only partially met by current training initiatives in the public and private sectors. The basic skills underpinning these specialist areas need to be gained at school and, while we recognise the advances that have been made in this area, there is still more to be done. Furthermore, such training offers excellent prospects for employment.

5.27 Employers in the public sector (universities, health service) are often priced out of the market, as these skills are widely sought after in the private sector. We commend moves within the pharmaceutical industry to train school leavers and students on industrial placements in these shortage skills and the support already given by the MRC and The Wellcome Trust. However, we believe that more needs to be done - including in further and higher education. We recommend that the Government (and the various education funding councils), the Medical and other relevant Research Councils, The Wellcome Trust and other research charities, and the pharmaceutical companies should give high priority to funding training and supporting research in the areas of bioinformatics, statistical genetics and the computing science underlying database management.

5.28 Although the observation is also relevant to the following Chapter, it is important to note here that, notwithstanding the acknowledged difficulties in capturing diagnostic information in computer readable form, the quality of GP and other health databases in the United Kingdom needs urgent improvement. GP databases need to be made compatible with one another and held in a way that allows the computer retrieval of the wealth of clinical information they contain. Accordingly, we recommend that the Government should ensure that the necessary financial and other resources are made available for this purpose. The aim must be to have such systems operational nationally within five years. Achieving this will require an NHS-wide standard protocol for data capture and retrieval, and that will need to be in place much sooner.


32   22 × 10^12 bytes or some 20,000 times more data storage than current top of the range personal computers.

33   We also heard that there were persistent but, to date, unsuccessful attempts to hack into the Sanger Centre computers. This is clearly an issue when considering questions of data security and confidentiality, especially since substantial amounts of the data are deposited in the public domain with free access from the outside.

34   As discussed further in paragraph 6.21, Professor Forman was concerned about a possible reduction in data flowing to the registries making them much less effective.

35   The NHS Plan, Cm 4818-I.

36   In "Death Watch", New Scientist, 19 February 2000, a member of the Association for Geographical Information noted that some information from GPs' active records was very poor - in some cases, with errors in a quarter of the postcodes alone.

37   See also the evidence from Sir John Pattison and Department of Health colleagues (QQ 19-35).


 

© Parliamentary copyright 2001