60.Large quantities of data have been an essential component of most contemporary advances in AI. While some AI algorithms require smaller quantities of data, by and large the evidence we received anticipated no let-up in the demand for data in the foreseeable future. The issues surrounding data, both in relation to digital technology generally and to AI specifically, are complex and intertwined. In this chapter we consider issues concerning the design and development of artificial intelligence systems, including the use and monopolisation of data, the transparency and intelligibility of decisions made by AI systems, and the risk of prejudice in those decisions.
61.Two points must be explained in advance. Firstly, there is an important distinction to be made between data more generally, and personal data. While ‘data’ could refer to almost any information (such as temperature readings) on a computer, or which is intended to be held on a computer, ‘personal data’ has a specific meaning under the Data Protection Act 1998, and generally covers any information relating to an identified or identifiable individual. While the balance of power and commercial opportunity afforded to companies and organisations will generally be determined by the quantity and quality of the data they can access, questions of privacy and personal agency relate more specifically to personal data.
62.Secondly, while we initially considered this area in terms of data ownership, after taking evidence from a number of experts in the field we came to believe that data control was the more appropriate concept. It was pointed out that custody and control of data were far more established legal concepts than ownership of data, and Olivier Thereaux, Head of Technology at the Open Data Institute, also spelled out the conceptual difficulties inherent in data ownership:
“Data has a few qualities that makes it incompatible with notions of ownership. I can hold it, you can hold it, and my holding of it does not impact the value you derive from it … Take the data about a phone call. I make a phone call to a friend. The data about that call has me as the data subject, but it could not easily be owned just by me because there are other data subjects. The person to whom I made that phone call is a data subject, the companies through which I made that phone call are another kind of data subject, a secondary data subject, and so on”.
We have accordingly decided to refer to data control, rather than data ownership, in this report.
63.Our witnesses painted a picture in which large technology corporations, headquartered in the United States but with a global reach, are accruing huge quantities of data, giving them an increasingly significant advantage over smaller competitors, and over the public sector across the world, in the development and application of AI systems.
64.Professor Richard Susskind OBE told us of the “unprecedented concentration of wealth and power in a small number of corporations” such as Alibaba, Alphabet, Amazon, Apple, Facebook, Microsoft and Tencent, a view widely held among a variety of witnesses. Innovate UK noted that these “vast volumes of data” were increasingly allowing these companies to unlock the potential value in the AI systems they are developing. Alongside personal data, this also included many other disparate forms of data, which “might relate to the physical parameters of processes or systems, such as the functioning of an engine, the condition of a piece of factory machinery, or the state of the weather”, and it was the combination of various different kinds of data, often “largely unnoticed”, which was creating “significant commercial opportunities” for these corporations. In some cases, as with DeepMind’s controversial deal with the Royal Free London NHS Foundation Trust (see Box 8), these companies are also making bespoke arrangements with public institutions to augment their already impressive access to data.
65.Several witnesses pointed to the network effects at work in information-based industries, which tend towards ‘winner-takes-all’ markets, and which have contributed to the growing dominance of these large technology companies. As Dr Mike Lynch told us:
“Data is everything in machine learning, which means whoever gets access to data can have a big advantage. As they gain a more consolidated position in the market, in turn they get access to more data, and so they can easily create an advanced competitively defensive position”.
66.Some witnesses noted that the Government and particular public institutions also had access to a wide range of datasets. The NHS was frequently cited as holding some of the world’s largest and most comprehensive collections of health records, but there were also less commonly mentioned examples, such as Ordnance Survey, who told us of their “abundance of labelled data”, which included “rich data describing land parcels and geospatial features such as buildings, roads and railways”. However, these often came with their own problems: many NHS records are still in paper form and require expensive and time-consuming digitisation before AI can make use of them, while Ordnance Survey observed that a lack of processing power is currently one of their major challenges in developing AI.
67.Meanwhile, many universities, charities, start-ups and small and medium sized enterprises (SMEs) complained that they could not easily access large, good quality datasets, and were struggling to compete with larger players as a result. Dr Mercedes Bunz, Senior Lecturer, Communications and Media Research Institute at the University of Westminster, emphasised how expensive the creation of datasets can be. She explained how ImageNet, one of the largest visual databases in the world, “employed at its peak 48,940 people in 167 countries”, who sorted around 14 million images.
68.The Charities Aid Foundation informed us that “the data on social and environmental needs that charities could use to refine and target their interventions is often locked up in siloes within Government and the private sector, and where it is available it is not presented in a consistent, usable format”. Our evidence also suggests that many start-ups struggle to gain access to data. In some cases, this is because they need to develop a service before they can attract the customers who would in turn provide data, while in others it is because small start-ups cannot demonstrate their credibility to public institutions and organisations. With respect to small businesses, Kriti Sharma, Vice President of Artificial Intelligence and Bots, Sage, told us that “some 55% of small businesses are still using pen and paper, Excel spreadsheets, fragmented datasets”, which prevented them from making any substantive use of AI, although Sage also believed that developments in cloud technology were partly offsetting this by opening up new sources of data for SMEs. We were also told that, in many cases, even larger companies are unable or unwilling to make use of the data they have, with many datasets scattered in separate ‘siloes’ across the company.
69.All indications suggest that the present status quo will be disrupted to some extent by the General Data Protection Regulation (GDPR), which comes into force across the EU on 25 May 2018 and which the UK plans to adhere to regardless of the outcome of Brexit, together with the proposed new ePrivacy Regulation intended to accompany it. With respect to data access, the GDPR’s introduction of a right to data portability is probably its most significant feature. While subject to some restrictions, in many cases this will require companies and organisations to provide a user with their personal data in a structured, commonly used and machine-readable form, free of charge. The intention is that consumers will be able to take their personal data from one service and, relatively seamlessly, transfer it to another, helping to prevent the ‘lock-in’ effect which can dissuade customers from switching between service providers. A number of witnesses welcomed the new right to data portability. Dr Sandra Wachter, Postdoctoral Researcher in Data Ethics and Algorithms, Oxford Internet Institute, argued that it could “enhance competition in a very healthy way”, and facilitate access to data for new start-ups looking to compete with established giants. However, Dr Bunz, while broadly welcoming the initiative, warned that on its own, “individual portability will not be sufficient to collect a dataset that allows the creation of knowledge and businesses”, and that the UK was still in need of “a strategy to actively create big data, especially in areas of government interest such as healthcare, transport, science and education”.
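The mechanics of portability can be sketched in a few lines. The following Python fragment is illustrative only, with hypothetical field names, of what providing personal data in a structured, commonly used and machine-readable form (here, JSON), and re-importing it at a second service, might look like in practice:

```python
import json

def export_personal_data(records, subject_id):
    """Gather every record about one data subject and serialise it in a
    structured, commonly used, machine-readable form (here, JSON)."""
    portable = [r for r in records if r["subject_id"] == subject_id]
    return json.dumps({"subject_id": subject_id, "records": portable}, indent=2)

def import_personal_data(payload):
    """A receiving service can parse the same structure back in."""
    return json.loads(payload)["records"]

# A hypothetical customer database held by the first provider:
crm = [
    {"subject_id": "u42", "field": "email", "value": "a@example.com"},
    {"subject_id": "u99", "field": "email", "value": "b@example.com"},
]
payload = export_personal_data(crm, "u42")
```

Only the data relating to the requesting subject is exported; the second service need know nothing about the first service's internal systems, provided both agree on the interchange format.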
70.Where the sharing of large quantities of personal data is concerned, another issue to be considered is how to make maximum use of this data with the minimum possible infringement of the privacy of individuals. In practice, much of this comes down to an area of data science known as ‘anonymisation’ or ‘de-identification’, whereby datasets are processed in order to remove as much data relating to individuals as possible, while retaining the usefulness of the dataset for the desired purpose. For example, a dataset of x-rays might be stripped of names, addresses and other identifying features, while still remaining useful for understanding trends in the development of particular diseases. This is now a routine process in the handling of many different kinds of datasets containing personal data.
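In its simplest form, de-identification amounts to stripping direct identifiers from each record. The sketch below (Python, with hypothetical field names; real de-identification is considerably more involved) shows names, addresses and patient numbers being removed from an x-ray record while the analytically useful fields are kept:

```python
# Hypothetical direct identifiers for this dataset.
DIRECT_IDENTIFIERS = {"name", "address", "nhs_number"}

def de_identify(record):
    """Strip direct identifiers while keeping fields useful for analysis."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

xray_record = {
    "name": "J. Smith",
    "address": "1 High Street",
    "nhs_number": "000 000 0000",
    "age_band": "60-69",
    "diagnosis": "pneumonia",
}
cleaned = de_identify(xray_record)
```

The cleaned record retains the age band and diagnosis needed to study disease trends, but no longer states who the patient is.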
71.However, some of our witnesses argued that de-identifying datasets is far less effective than often supposed. Olivier Thereaux of the Open Data Institute said “AI is particularly problematic, because it can be used extremely efficiently to re-identify people”, even in the face of “pretty good de-identification methods”. De-identification often means removing identifying features, and AI systems can be effective at adding these back in, by, for example, cross-referencing other available datasets. It should also be noted that in at least some cases, there is a necessary trade-off—if more information is stripped out of a dataset in a bid to preserve privacy, this may also limit the potential usefulness and flexibility of the dataset in the process.
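The cross-referencing our witnesses described belongs to a well-known class of ‘linkage’ attacks. The following sketch (illustrative only; the data and field names are hypothetical) shows how a dataset stripped of names can be matched against an auxiliary dataset on the quasi-identifiers left in place, restoring the identities. In practice, machine learning can perform such matching at scale and with far noisier data:

```python
def re_identify(anonymised, auxiliary):
    """Cross-reference a 'de-identified' dataset with an auxiliary one,
    matching records on the quasi-identifiers that were left in place."""
    quasi = ("postcode", "birth_year", "sex")
    lookup = {tuple(p[q] for q in quasi): p["name"] for p in auxiliary}
    return [dict(a, name=lookup.get(tuple(a[q] for q in quasi)))
            for a in anonymised]

anonymised = [{"postcode": "SW1A 1AA", "birth_year": 1953, "sex": "F",
               "diagnosis": "pneumonia"}]
auxiliary = [{"postcode": "SW1A 1AA", "birth_year": 1953, "sex": "F",
              "name": "J. Smith"}]
matched = re_identify(anonymised, auxiliary)
```

A unique combination of postcode, birth year and sex is enough to put a name back on the ‘anonymised’ diagnosis, which is why stripping direct identifiers alone is rarely sufficient.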
72.Elizabeth Denham, the Information Commissioner, told us that this scepticism was unrealistic, stating: “I get frustrated when people say there is no such thing as perfect anonymisation; there is anonymisation that is sufficient”. She also noted provisions in the Data Protection Bill which will criminalise re-identification unless carried out for particular legitimate purposes. The Office of the National Data Guardian took a similar view, noting that the public are broadly supportive of the use of anonymised data in the health sector if it is for clear health and social care purposes. While we accept that de-identification methods have their flaws, and will always have to be kept up to date in the face of advances in AI, we are satisfied that for most purposes they offer an acceptable level of security and privacy. The ICO is closely monitoring this issue and has demonstrated a willingness to intervene as, and when, necessary. We welcome the ICO’s engagement in this area.
73.Given the current data landscape, and the forthcoming new data protection regulations, it is perhaps unsurprising that we received a wide variety of opinions on how the situation might be improved.
74.Some of our witnesses, mostly larger technology companies, appeared to be broadly satisfied with the current mix of large repositories of privately-held data, bespoke agreements between particular companies and public organisations, and a patchwork of open data sources. A number of our witnesses argued that this situation, while not recompensing individuals directly (or, in many cases, even indirectly) for their data, still delivered significant benefits to the wider population through the development of innovations which might not otherwise be possible.
75.Those who held this position were also keen to emphasise, as the Confederation of British Industry (CBI) put it, that data “is not a finite resource”, and that unlike conventional assets, the use of a dataset for commercial purposes “does not inhibit the use of the same data for social or non-commercial purposes”. They contested the very concept of ‘data monopolies’, arguing that “data has no intrinsic value in its raw sense”, and that in reality it is “the expertise in processing and application of data that creates value for organisations”. They were resistant to any idea of interfering with contractual freedoms, and told us that businesses which invest capital in building, maintaining and protecting datasets have a right to seek a return on that investment if they carry the reinvestment costs. A number of other witnesses cautioned against undue interference, especially through regulation, and held that current data protection and competition law was more than adequate.
76.A number of witnesses advocated for more free and open access to public datasets, as part of the open data agenda, in order to help address the risks of data monopolies. This movement, supported by the Open Data Institute in the UK, believes that as much data as possible should be made freely available for use and reuse. UCL drew attention to data.gov.uk, which provides free access to data from Government departments and public bodies. Several witnesses also applauded Transport for London’s efforts to make their data, which includes real-time transport information such as tube and bus departures, freely available, which has in turn led to the development of apps such as Citymapper. Research Councils UK argued that, subject to legal, ethical and commercial restraints, all publicly funded research data is “a public good and should be made available with as few restrictions as possible”. A number of witnesses felt that this was a clear area of data policy in which Government could, and should, intervene in order to encourage both the release of more public datasets and the further development of open data standards, potentially encouraging others to do the same.
77.While many witnesses believed this would help to facilitate innovation in AI and develop a more level playing field, even open data was not without its critics. While several witnesses simply believed it was not sufficient on its own to counteract larger privately-held datasets, Dr Lynch argued that the open data movement was a “rather academic debate”, which does not account for the economic or strategic value of publicly-held datasets. He highlighted the perversity of unique forms of publicly-held data, particularly those held by the NHS, being given away for free to AI companies which may then hold those same organisations “to ransom” when selling these systems back to the public sector. In his view, the Government should take proper consideration of the ‘strategic data’ it holds, and, for example, insist on “favoured nation pricing” in the contractual arrangements for supplying such data. Indeed, a number of witnesses were highly critical of the current tendency for bespoke public-private data deals, which they believed were allowing data to flow from the public sector to the private sector without securing proper value for the taxpayer.
78.Finally, there were those who believed that far more emphasis should be placed on individuals owning or controlling their own personal data, and retaining the ability to do with it as they please. In the light of the GDPR’s requirement for data portability, there are signs that this could become a reality, with a range of initiatives aimed at enabling individuals to control what happens to their data. For example, Sir Tim Berners-Lee is currently working on ‘Solid’, a proposed set of standards and tools, based on the idea of linked data, which would allow individuals on the internet to choose where to keep their personal data and how it should be used. The Hub of All Things (HAT) project, supported by the Engineering and Physical Sciences Research Council (EPSRC) and involving researchers from six British universities, seeks to help users take back control of their personal data by creating decentralised personal databases, controlled by individuals, allowing them to see all of their personal data in one place, see how it is being used, and sell or exchange it in return for money or benefits in kind. The Open Banking initiative, launched in January 2018, is also demonstrating how individual control of personal data, albeit only financial data for the time being, can work in practice.
79.Given the varied positions on data outlined above, we believe there is an urgent need for conceptual clarity on the subject of data if we are to prevent confusion in this area from hindering the development of AI. Datasets have properties which make them quite unlike most physical assets or resources in the world today. For example, datasets can be duplicated at near-zero cost, used in multiple ways by different people without diminishing their value, and their value often increases as they are combined with other datasets. Indeed, the question of how data can be accurately valued, and whether it can be treated as a form of property or type of asset, is an ongoing area of debate among economists, accountants, statisticians and other experts.
Open Banking refers to a series of reforms relating to the handling of financial information by banks. From 13 January 2018, UK-regulated banks have had to let their customers share their financial data with third parties (such as budgeting apps, or other banks). Banks are sharing customer data in the form of open APIs (application programming interfaces) which are used to provide integrated digital services. The intent of these reforms is to encourage competition and innovation, and to lead to more, and better, products for money management. Importantly, personal information can only be shared if the data subject (the person whose information it is) gives their express permission.
The Competition and Markets Authority said “the principles underlying Open Banking are similar to the new portability principle in GDPR—and there is a lot of potential in the portability principle to help get data working for consumers.”
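The consent requirement at the heart of Open Banking can be sketched in a few lines. The fragment below is purely illustrative (the names and data are hypothetical, and it does not reflect the actual Open Banking API specifications): the service releases a customer's data to a third party only where that customer has given express permission.

```python
# Express permissions granted by data subjects: (subject, third party) pairs.
consents = {("alice", "budget-app")}

# Hypothetical account data held by the bank.
accounts = {"alice": [{"account": "current", "balance_gbp": 210.40}]}

def share_account_data(subject, third_party):
    """Release data over the API only with the subject's express permission."""
    if (subject, third_party) not in consents:
        raise PermissionError("data subject has not granted consent")
    return accounts[subject]
```

A consented third party (here, the hypothetical "budget-app") receives the data; any other caller is refused, however legitimate its commercial interest.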
80.In its Industrial Strategy White Paper and its evidence to us, the Government announced the creation of a national Centre for Data Ethics and Innovation, the establishment of ‘data trusts’ to facilitate the sharing of datasets between organisations and companies, and the updating of its 2016 data ethics framework. The first policy reflects the Government’s manifesto commitment to “institute an expert Data Use and Ethics Commission to advise regulators and parliament on the nature of data use and how best to prevent its abuse”, as well as the Royal Society’s recommendation for a national data stewardship body. The second stems from the Hall-Pesenti Review, which emphasised that these trusts would “not be a legal entity or institution, but rather a set of relationships underpinned by a repeatable framework, compliant with parties’ obligations, to share data in a fair, safe and equitable way”.
81.However, it is important to note that the Review’s conception of a ‘data trust’ appears to differ from the understanding of the term expressed in our written evidence, which imagined data trusts as more co-operative, less top-down organisations, whereby individuals could opt in to particular trusts, and have a say in their governance and in how the personal data they provide is used. It was also unclear from our session with the Rt Hon Matt Hancock MP (then Minister of State for Digital) and Lord Henley, Parliamentary Under-Secretary of State at the Department for Business, Energy and Industrial Strategy, whether the Government intends to address the central questions over the value of public and personal data, as they did not answer our questions on this subject.
82.The Government plans to adopt the Hall-Pesenti Review recommendation that ‘data trusts’ be established to facilitate the ethical sharing of data between organisations. However, under the current proposals, individuals who have their personal data contained within these trusts would have no means by which they could make their views heard, or shape the decisions of these trusts. We therefore recommend that as data trusts are developed under the guidance of the Centre for Data Ethics and Innovation, provision should be made for the representation of people whose data is stored, whether this be via processes of regular consultation, personal data representatives, or other means.
83.Access to data is essential to the present surge in AI technology, and there are many arguments to be made for opening up data sources, especially in the public sector, in a fair and ethical way. Although a ‘one-size-fits-all’ approach to the handling of public sector data is not appropriate, many SMEs in particular are struggling to gain access to large, high-quality datasets, making it extremely difficult for them to compete with the large, mostly US-owned technology companies, who can purchase data more easily and are also large enough to generate their own. In many cases, public datasets, such as those held by the NHS, are more likely to contain data on more diverse populations than their private sector equivalents, and more control can be exercised before they are released.
84.We recommend that wherever possible and appropriate, and with regard to its potential commercial value, publicly-held data be made available to AI researchers and developers. In many cases, this will require Government departments and public organisations to make a concerted effort to digitise their records in unified and compatible formats. Data trusts will play an important role in releasing this data, subject to appropriate anonymisation measures where necessary.
85.We support the approach taken by Transport for London, who have released their data through a single point of access, where the data is available subject to appropriate terms and conditions and with controls on privacy. The Centre for Data Ethics and Innovation should produce guidance on similar approaches. The Government Office for AI and GovTech Catalyst should work together to ensure that the data for which there is demand is made available in a responsible manner.
86.We acknowledge that open data cannot be the last word in making data more widely available and usable, and can often be too blunt an instrument for facilitating the sharing of more sensitive or valuable data. Legal and technical mechanisms for strengthening personal control over data, and preserving privacy, will become increasingly important as AI becomes more widespread throughout society. Mechanisms for enabling individual data portability, such as the Open Banking initiative, and data sharing concepts such as data trusts, will spur the creation of other innovative and context-appropriate tools, eventually forming a broad spectrum of options between total data openness and total data privacy.
87.We recommend that the Centre for Data Ethics and Innovation investigate the Open Banking model, and other data portability initiatives, as a matter of urgency, with a view to establishing similar standardised frameworks for the secure sharing of personal data beyond finance. They should also work to create, and incentivise the creation of, alternative tools and frameworks for data sharing, control and privacy for use in a wide variety of situations and contexts.
88.Increasingly, public sector data has value. It is important that public organisations are aware of the commercial potential of such data. We recommend that the Information Commissioner’s Office work closely with the Centre for Data Ethics and Innovation in the establishment of data trusts, and help to prepare advice and guidance for data controllers in the public sector to enable them to estimate the value of the data they hold, in order to make best use of it and negotiate fair and evidence-based agreements with private-sector partners. The values contained in this guidance could be based on precedents where public data has been made available and subsequently generated commercial value for public good. The Information Commissioner’s Office should have powers to review the terms of significant data supply agreements being contemplated by public bodies.
89.Alongside consumer awareness, many witnesses highlighted the importance of making AI understandable to developers, users and regulators.
90.Several witnesses told us that while many AI systems are no more difficult to understand than conventional software, the newer generation of deep learning systems did present problems. As discussed earlier, these systems rely on the feeding of information through many different ‘layers’ of processing in order to come to an answer or decision. The number and complexity of stages involved in these deep learning systems is often such that even their developers cannot always be sure which factors led a system to decide one thing over another. We received a great deal of evidence regarding the extent and nature of these so-called ‘black box’ systems.
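The layered processing described above can be made concrete with a deliberately tiny sketch (the weights and inputs are arbitrary, and a real deep learning system would contain millions of weights learned from data rather than a handful written by hand):

```python
def layer(inputs, weights):
    """One processing 'layer': weighted sums passed through a nonlinearity."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)))
            for row in weights]

x = [0.5, -1.2, 3.0]                                      # input features
hidden = layer(x, [[0.2, -0.4, 0.1], [1.1, 0.3, -0.7]])   # first layer
output = layer(hidden, [[0.9, -1.5]])                     # second layer
# Even in this two-layer toy, the output depends on every weight in every
# layer at once; at realistic scale, no single factor 'explains' a decision.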
91.The terminology used by our witnesses varied widely. Many used the term transparency, while others used interpretability or ‘explainability’, sometimes interchangeably. For simplicity, we will use ‘intelligibility’ to refer to the broader issue. Within this, from our evidence, we have identified two broad approaches to intelligibility—technical transparency and explainability—which we address in more detail below.
92.The extent to which an AI system was considered to need to be intelligible at all depended very much on the witness in question, and on the purposes to which the AI system was to be put. A small minority of witnesses appeared to believe that the necessity for intelligible AI systems was low, and that current systems, even those which make use of deep learning, more than meet these demands. Nvidia, for example, made the point that machine learning algorithms are often much shorter and simpler than conventional software code, and are therefore in some respects easier to understand and inspect. Dr Timothy Lanfear, Director of the EMEA Solution Architecture and Engineering team at Nvidia, told us:
“We are using systems every day that are of a level of complexity that we cannot absorb. Artificial intelligence is no different from that. It is also at a level of complexity that cannot be grasped as a whole. Nevertheless, what you can do is to break this down into pieces, find ways of testing it to check that it is doing the things you expect it to do and, if it is not, take some action”.
93.Other witnesses thought it unrealistic to hold AI to a higher standard than human decision-making, which itself can often seem illogical or impenetrable. Within AI development today, many different techniques are in use, each with its own advantages and disadvantages, and trade-offs will be inevitable. For example, decision tree learning can be faster, easier to understand and usually less data hungry, whereas deep neural networks are often more data hungry, slower and much harder to understand, but can be more accurate, given the right data. Some witnesses argued that if we restricted ourselves only to techniques which we could fully understand, we would also limit what could be accomplished with AI.
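The intelligibility of decision tree learning can be illustrated with the simplest possible case, a single-threshold ‘stump’ fitted to toy data (the data and labels below are invented for illustration). Unlike the layered network sketched earlier, the entire fitted model can be read out as one human-legible rule:

```python
def fit_stump(xs, ys):
    """Fit the simplest decision tree: one threshold on one feature.
    Returns (threshold, label_below, label_above), chosen to maximise
    the number of training examples classified correctly."""
    best = None
    for t in sorted(set(xs)):
        for below, above in ((0, 1), (1, 0)):
            correct = sum(y == (below if x < t else above)
                          for x, y in zip(xs, ys))
            if best is None or correct > best[0]:
                best = (correct, t, below, above)
    _, t, below, above = best
    return t, below, above

xs = [1, 2, 3, 10, 11, 12]   # e.g. readings from some sensor
ys = [0, 0, 0, 1, 1, 1]      # the outcome to be predicted
t, below, above = fit_stump(xs, ys)
rule = f"if x < {t}: predict {below}, else predict {above}"
```

The learned model *is* the rule string; a deep network trained on the same data would reach the same predictions through weights that admit no such reading.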
94.The idea of restricting the use of unintelligible systems in certain important or safety-critical domains was also mentioned by many witnesses. Experts from the University of Edinburgh emphasised the human element, arguing that given the “completely unintelligible” nature of decisions made via deep learning, it would be more feasible to focus on outcomes, and “only license critical AI systems that satisfy a set of standardised tests, irrespective of the mechanism used by the AI component”. Examples of contexts in which a high degree of intelligibility was thought to be necessary included:
95.One solution to the question of intelligibility is to try to increase the technical transparency of the system, so that experts can understand how an AI system has been put together. This might, for example, entail being able to access the source code of an AI system. However, this will not necessarily reveal why a particular system made a particular decision in a given situation. Significantly, the source code does not include the data that was fed into the system in a particular scenario, nor show how that data was processed in order to arrive at a decision, and a system may behave very differently with different datasets.
96.We were also told that what transparency means can vary widely depending on the underlying purpose. Professor Chris Reed, Professor of Electronic Commerce Law, Queen Mary University of London, argued that:
“There is an important distinction to be made between ex ante transparency, where the decision-making process can be explained in advance of the AI being used, and ex post transparency, where the decision making process is not known in advance but can be discovered by testing the AI’s performance in the same circumstances. Any law mandating transparency needs to make it clear which kind of transparency is required”.
97.He argued that there are situations where a lack of transparency in advance may be acceptable, on the basis that the overall societal benefits are significant and any loss can be compensated for. Indeed, requiring explanations for all decisions in advance could limit innovation, as it would limit the capacity for a system to evolve and learn through its mistakes. On the other hand, he believed that transparency should be required where fundamental rights are put at risk.
98.What people need to know about how an AI system is operating, and why it is making particular decisions, will often be very different depending on whether they are developers, users, or regulators, auditors or investigators, for example. In many cases, a full technical breakdown of a system will not be useful. Professor Spiegelhalter explained to us that in many cases, telling “everybody everything” could actively be unhelpful; instead, “the information [a user is receiving] has to be accessible; they have to get it, understand it to some extent and be able to critique it”.
99.Based on the evidence we have received, we believe that achieving full technical transparency is difficult, and possibly even impossible, for certain kinds of AI systems in use today, and would in any case not be appropriate or helpful in many cases. However, there will be particular safety-critical scenarios where technical transparency is imperative, and regulators in those domains must have the power to mandate the use of more transparent forms of AI, even at the potential expense of power and accuracy.
100.An alternative approach is explainability, whereby AI systems are developed in such a way that they can explain the information and logic used to arrive at their decisions. We learned of a variety of technical solutions which are currently in development, which could help explain machine learning systems and their decisions. A variety of companies and organisations are currently working on explanation systems, which will help to consolidate and translate the processes and decisions made by machine learning algorithms into forms that are comprehensible to human operators. A number of large technology companies, including Google, IBM and Microsoft, told us of their commitment to developing interpretable machine learning systems, which included Google’s ‘Glassbox’ framework for interpretable machine learning, and Microsoft’s development of best practices for intelligible AI systems.
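For deep learning systems, such explanation tools typically approximate which inputs drove a decision; for a transparent model the idea can be shown directly. The sketch below (illustrative only; the weights and feature names are invented) explains a linear scoring decision by attributing the score to each input:

```python
def explain_decision(weights, features):
    """For a transparent linear scoring model, attribute the decision to its
    inputs: each feature's contribution is simply weight * value."""
    contributions = {k: weights[k] * v for k, v in features.items()}
    score = sum(contributions.values())
    # Rank features by the magnitude of their influence on this decision.
    ranked = sorted(contributions, key=lambda k: -abs(contributions[k]))
    return score, ranked

weights = {"income": 0.4, "missed_payments": -2.0, "years_at_address": 0.1}
applicant = {"income": 3.0, "missed_payments": 2.0, "years_at_address": 5.0}
score, ranked = explain_decision(weights, applicant)
# ranked[0] names the feature that mattered most to this particular decision
```

Explanation systems for opaque models aim to produce a ranking like this one without having direct access to legible weights, for instance by observing how the output changes as inputs are perturbed.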
101.In the evidence we received, it was frequently observed that Article 22 of the GDPR contains a 'right to an explanation' provision. This provision gives an individual, when they have been subject to fully automated decision-making (and where the outcome has a significant impact on them), the right to ask for an explanation as to how that decision was reached, or to ask for a human to make the decision. This provision was considered relatively vague and contains a number of limitations: it does not apply if a decision is based on explicit consent, if it does not produce a legal or similarly significant effect on the data subject, or if the process is only semi-automated. Nevertheless, our witnesses generally thought that it provided added impetus to solve the problem of explainable AI systems. It was also pointed out that with regards to some AI healthcare systems (which most witnesses believed should be explainable), there were likely to be additional legal requirements as a result of the EU In Vitro Diagnostic Devices Regulation (2017), which comes into force in May 2022.
102.Indeed, the GDPR already appears to have prompted action in the UK, with the Data Protection Bill going further still towards enshrining a ‘right to an explanation’ in UK law. When we asked Dr Wachter, one of Europe’s leading experts on the subject, what she thought of the developing Bill, she told us:
“ … it has been proposed that after a decision has been made, the individual has to be informed about the outcome, which is new and better than what the GDPR currently offers. It also states that data subjects should have the right to ask that the decision be reconsidered, or that the decision not be made by an algorithm. Both those things meet and exceed what is currently envisaged in the GDPR, and that is excellent”.
103.However, we were also concerned to hear that the automated decision safeguards in the Data Protection Bill still only apply if a decision is deemed to be "significant" and based "solely on automated processing". As we discuss elsewhere in this report, many AI systems, both now and in the future, aim to augment, rather than fully replace, human labour, and as Michael Veale, a researcher at UCL, told us, "few highly significant decisions are fully automated—often, they are used as decision support, for example in detecting child abuse. Additionally few fully automated decisions are individually significant, even though they might be over time". Veale argued that the law should also cover systems where AI is only part of the final decision, as is the case under France's Digital Republic Act 2016, and that there was a need to incentivise the development of explainable AI systems, without having to rely solely on regulation.
104.The style and complexity of explanations will need to vary based on the audience addressed and the context in which they are needed. For example, the owner of a self-driving car will want certain kinds of information when asking their vehicle to explain why it chose a particular route, while an accident investigator will want very different kinds of information, probably at far more granular levels of detail, when assessing why the same vehicle crashed. Nevertheless, we should insist that such explanation capabilities are built into AI products before they are deployed, and that certain standards are met.
105.We believe that the development of intelligible AI systems is a fundamental necessity if AI is to become an integral and trusted tool in our society. Whether this takes the form of technical transparency, explainability, or indeed both, will depend on the context and the stakes involved, but in most cases we believe explainability will be a more useful approach for the citizen and the consumer. This approach is also reflected in new EU and UK legislation. We believe it is not acceptable to deploy any artificial intelligence system which could have a substantial impact on an individual’s life, unless it can generate a full and satisfactory explanation for the decisions it will take. In cases such as deep neural networks, where it is not yet possible to generate thorough explanations for the decisions that are made, this may mean delaying their deployment for particular uses until alternative solutions are found.
106.The Centre for Data Ethics and Innovation, in consultation with the Alan Turing Institute, the Institute of Electrical and Electronics Engineers, the British Standards Institute and other expert bodies, should produce guidance on the requirement for AI systems to be intelligible. The AI development sector should seek to adopt such guidance and to agree upon standards relevant to the sectors within which they work, under the auspices of the AI Council.
107.The current generation of AI systems, which have machine learning at their core, need to be taught how to spot patterns in data, and this is normally done by feeding them large bodies of data, commonly known as training datasets. If that data is unrepresentative, or its patterns reflect historical patterns of prejudice, then the decisions these systems make may be unrepresentative or discriminatory as well. This can present problems when such systems are relied upon to make real-world decisions. Within the AI community, this is commonly known as 'bias'.
108.While the term ‘bias’ might at first glance appear straightforward, there are in fact a variety of subtle ways in which bias can creep into a system. Much of the data we deem to be useful is about human beings, and is collected by human beings, with all of the subjectivity that entails. As LexisNexis UK put it, “biases may originate in the data used to train the system, in data that the system processes during its period of operation, or in the person or organisation that created it. There are additional risks that the system may produce unexpected results when based on inaccurate or incomplete data, or due to any errors in the algorithm itself”. It is also important to be aware that bias can emerge when datasets inaccurately reflect society, but it can also emerge when datasets accurately reflect unfair aspects of society.
109.For example, an AI system trained to screen job applications will typically use datasets of previously successful and unsuccessful candidates, and will then attempt to identify the characteristics that distinguish the two groups, in order to determine who should be selected in future job searches. While the intention may be to ascertain those who will be capable of doing the job well, and fit within the company culture, past interviewers may have consciously or unconsciously weeded out candidates based on protected characteristics (such as age, sexual orientation, gender, or ethnicity) and socio-economic background, in a way which would be deeply unacceptable today.
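The mechanism is worth making concrete. In the hypothetical sketch below (all data and field names are invented for illustration), a screening rule learned from past decisions reproduces their bias through a correlated proxy, even though the protected attribute is never shown to the model:

```python
# Hypothetical illustration: even with the protected attribute removed,
# a rule learned from past hiring decisions can reproduce their bias
# through a correlated proxy (here, a postcode area).

# Each past candidate: (postcode_area, group, hired). The historical
# decisions favoured group "A", and group strongly correlates with area.
history = [
    ("north", "A", 1), ("north", "A", 1), ("north", "A", 1), ("north", "B", 0),
    ("south", "B", 0), ("south", "B", 0), ("south", "B", 0), ("south", "A", 1),
]

def hire_rate(area):
    # "Training": the historical hire rate per area, ignoring group entirely.
    outcomes = [hired for a, _, hired in history if a == area]
    return sum(outcomes) / len(outcomes)

# The learned rule shortlists areas with a high historical hire rate.
rule = {area: hire_rate(area) >= 0.5 for area in ("north", "south")}

# The rule never sees the group, yet it favours the area where group A
# predominates, so the historical disparity persists.
print(rule["north"], rule["south"])  # → True False
```

This is the sense in which removing a protected characteristic from the data does not, on its own, remove the prejudice embedded in past decisions.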
110.While some witnesses referred to the issue of bias as something which simply needed to be removed from training data and the AI systems developed using them, others pointed out that this was more complicated. Dr Ing. Konstantinos Karachalios, Managing Director, IEEE-Standards Association, told us:
“You can never be neutral; it is us. This is projected in what we do. It is projected in our engineering systems and algorithms and the data that we are producing. The question is how these preferences can become explicit, because if it can become explicit it is accountable and you can deal with it. If it is presented as a fact, it is dangerous; it is a bias and it is hidden under the table and you do not see it. It is the difficulty of making implicit things explicit”.
111.Dr Ansgar Koene made a similar point when he distinguished between 'operationally-justified bias', which "prioritizes certain items/people as part of performing the desired task of the algorithm, e.g. identifying frail individuals when assigning medical prioritisation", and 'non-operationally-justified bias', which is "not integral to being able to do the task, and is often unintended and its presence is unknown unless explicitly looked for". Fundamentally, the issues arise when we are unaware of the hidden prejudices and assumptions which underpin the data we use. Olivier Thereaux noted Maciej Cegłowski's description of machine learning as "like money laundering for bias." Thereaux said:
“We take bias, which in certain forms is what we call ‘culture’, put it in a black box and crystallise it for ever. That is where we have a problem. We have even more of a problem when we think that that black box has the truth and we follow it blindly, and we say, ‘The computer says no’. That is the problem”.
112.Several witnesses pointed out that the issue could not be easily confined to the datasets themselves, and in some cases "bias and discrimination may emerge only when an algorithm processes particular data". The Centre for the Future of Intelligence emphasised that "identifying and correcting such biases poses significant technical challenges that involve not only the data itself, but also what the algorithms are doing with it (for example, they might exacerbate certain biases, or hide them, or even create them)". The difficulty of fixing these issues was illustrated recently when it emerged that Google has still not fixed its visual identification algorithms, which could not distinguish between gorillas and black people, nearly three years after the problem was first identified. Instead, Google has simply disabled the ability to search for gorillas in products, such as Google Photos, which use this feature.
113.The consequences of this are already starting to be felt. Several witnesses highlighted the growing use of AI within the US justice system, in particular the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system, developed by Northpointe, and used across several US states to assign risk ratings to defendants, which assist judges in sentencing and parole decisions. Will Crosthwait, co-founder of AI start-up Kensai, highlighted investigations which found that the system commonly overestimated the recidivism risk of black defendants, and underestimated that of white defendants. Big Brother Watch observed that here in the UK, Durham Police have started to investigate the use of similar AI systems for determining whether suspects should be kept in custody, and described this and other developments as a "very worrying trend, particularly when the technology is being trialled when its abilities are far from accurate". Evidence from Sheena Urwin, Head of Criminal Justice at Durham Constabulary, emphasised the considerable lengths to which Durham Constabulary has gone to ensure its use of these tools is open, fair and ethical, in particular the development of its 'ALGO-CARE' framework for the ethical use of algorithms in policing.
114.A number of solutions were offered in the evidence we received. The most immediate was the creation of more diverse datasets, which fairly reflect the societies and communities which AI systems are increasingly affecting. Kriti Sharma told us "we now have the ability to create [diverse datasets]. If data does not exist we need to work hard, we need to work together and focus on open datasets". Several witnesses highlighted the role the Government could play in this regard, by releasing open datasets which are representative of the entire population, and which would address some of the shortcomings of less demographically-diverse privately-held datasets. Witnesses frequently mentioned that a variety of groups in society, such as the financially excluded and ethnic minorities, suffer from 'data poverty', especially compared to more privileged groups, who are likely to generate more data about themselves through a plethora of smart devices and services. Dr Bunz also recommended that the Government should create guidance for the creation of datasets, which could help to minimise bias.
115.The importance of ensuring that AI development is carried out by diverse workforces, who can identify issues with data and algorithm performance, was also emphasised—a point we return to below. Dr Wachter and academics from the Centre for the Future of Intelligence also argued that a greater diversity of academic disciplines needed to be involved in this process. Dr Wachter observed that questions around bias and prejudice in society have long been within the remit of philosophers, social scientists, legal theorists and political scientists, and warned that “if we do not have an interdisciplinary approach to these questions, we are going to leave out very important issues”.
116.Many witnesses told us of the need to actively seek out bias within AI systems, by testing datasets and how they operate within particular AI systems. Dr Wachter believed that in cases where a system is “inherently opaque and not understandable”, as with many deep learning systems, “auditing after the fact, auditing during data processing or inbuilt processes that could detect biases” should be considered, backed up by certification of some form. Dr Koene told us of work already moving in this direction, which aims to establish a ‘Standard on Algorithm Bias Considerations’.
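As a simplified illustration of the kind of after-the-fact audit witnesses described (this sketch is hypothetical and is not drawn from any methodology or standard named above; the 80% threshold is borrowed from the US 'four-fifths' rule of thumb for adverse impact), one basic check compares selection rates across groups:

```python
# Illustrative after-the-fact bias audit: compare selection rates across
# groups and flag any group whose rate falls below 80% of the
# best-treated group's rate (the 'four-fifths' rule of thumb).

decisions = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),   # group A: selected 3 of 4
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),   # group B: selected 1 of 4
]

def selection_rates(records):
    totals, selected = {}, {}
    for group, outcome in records:
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + outcome
    return {g: selected[g] / totals[g] for g in totals}

def audit(records, threshold=0.8):
    rates = selection_rates(records)
    best = max(rates.values())
    # Flag groups whose selection rate is disproportionately low.
    return {g: rate / best < threshold for g, rate in rates.items()}

flags = audit(decisions)
print(flags)  # group B's rate (0.25) is a third of group A's (0.75)
```

Real auditing and certification schemes would of course go much further, examining the data, the algorithm and its operating context together, as the evidence above makes clear.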
117.However, Kriti Sharma noted that “interesting research work has been done, but it has not been commercialised … [the area] needs more funding from government”. She suggested that the Challenge Fund, or other bodies working on AI, robotics and automation needed to divert more attention to the issue.
118.The message we received was not entirely critical, though. A number of witnesses emphasised that, if the right measures are taken, we also have an opportunity to better address long-standing prejudices and inequalities in our societies. Kriti Sharma explained:
“AI can help us fix some of the bias as well. Humans are biased; machines are not, unless we train them to be. AI can do a good job at detecting unconscious bias as well. For example, if feedback is given in performance reviews where different categories of people are treated differently, the machine will say, ‘That looks weird. Would you like to reconsider that?’”
119.We are concerned that many of the datasets currently being used to train AI systems are poorly representative of the wider population, and AI systems which learn from this data may well make unfair decisions which reflect the wider prejudices of societies past and present. While many researchers, organisations and companies developing AI are aware of these issues, and are starting to take measures to address them, more needs to be done to ensure that data is truly representative of diverse populations, and does not further perpetuate societal inequalities.
120.Researchers and developers need a more developed understanding of these issues. In particular, they need to ensure that, wherever possible, data is pre-processed so that it is balanced and representative, that their teams are diverse and representative of wider society, and that the production of data engages all parts of society. Alongside questions of data bias, researchers and developers need to consider biases embedded in the algorithms themselves—human developers set the parameters for machine learning algorithms, and the choices they make will intrinsically reflect the developers' beliefs, assumptions and prejudices. The main ways to address these kinds of biases are to ensure that developers are drawn from diverse gender, ethnic and socio-economic backgrounds, and are aware of, and adhere to, ethical codes of conduct.
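One simple form such pre-processing can take is reweighting, sketched below in a hypothetical, deliberately minimal form (the data is invented; real balancing would also consider intersecting characteristics and data quality):

```python
# Illustrative sketch of one pre-processing step: reweighting records so
# that each group contributes equally to training when one group is
# under-represented in the raw data.

records = [("A",)] * 8 + [("B",)] * 2   # group B under-represented

def group_weights(rows):
    counts = {}
    for (group,) in rows:
        counts[group] = counts.get(group, 0) + 1
    total, n_groups = len(rows), len(counts)
    # Scale each group's records so that the group as a whole carries an
    # equal share of the total training weight.
    return {g: total / (n_groups * c) for g, c in counts.items()}

weights = group_weights(records)
print(weights)  # B's records are weighted more heavily than A's
```

With these weights, the eight group-A records and the two group-B records each sum to half the total weight, so the minority group is no longer drowned out during training.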
121.We recommend that a specific challenge be established within the Industrial Strategy Challenge Fund to stimulate the creation of authoritative tools and systems for auditing and testing training datasets to ensure they are representative of diverse populations, and to ensure that when used to train AI systems they are unlikely to lead to prejudicial decisions. This challenge should be established immediately, and encourage applications by spring 2019. Industry must then be encouraged to deploy the tools which are developed and could, in time, be regulated to do so.
122.As previously discussed, access to, and control of, data is a crucial ingredient in the development of modern AI, and data network effects can quickly lead to certain companies building up proprietary dataset dominance which smaller competitors find difficult to match. These dominant companies—sometimes referred to as the 'tech giants' or increasingly 'Big Tech'—are commonly identified as Alphabet (Google's parent company), Amazon, Apple, Facebook, and Microsoft (as well as occasionally IBM, and beyond the USA, Samsung, Alibaba and Tencent), and have built business models partially, or largely, focused on the "aggregation of data and provision of cloud services". They are large and diverse enough that they can generate their own massive and varied datasets—for example, all of the data provided to Google through web searches and the use of Android smartphones, or data on social connections and interests provided to Facebook—without relying on third parties, and use this to understand consumer behaviour to an unprecedented extent.
123.These datasets, sometimes evocatively described within the sector as 'data moats' to signify the unassailable commercial positions they are meant to create, give these companies a key advantage in the development of AI systems. This form of market dominance is increasingly referred to as a 'data monopoly'.
124.We asked our witnesses if they had concerns about the dominance of these companies. Innovate UK said these organisations have “benefited from access to many disparate, significantly large, data sets, enabling access to a pool of data that is unprecedented and which creates significant commercial opportunities”. The Information Commissioner described the dominance of technology giants as “a vexing problem”.
The Competition and Markets Authority (CMA) is a non-ministerial department which exists to promote competition for the benefit of consumers, both within and outside the UK. Its aim is to make markets work well for consumers, businesses and the economy. The CMA was established in April 2014 as a result of a merger of the Competition Commission and the Office of Fair Trading.
125.Some of our witnesses were less concerned by this dominance. The Center for Data Innovation, an international think tank which focuses on data, technology and public policy, said “there are no ‘data-based monopolies’ and the winner does not take all” as “data is non-rivalrous: customers who give their personal data to one company can provide it again to another”. IBM told us that “any concerns about dominance can be addressed through competition law on an ex-post basis rather than a-priori regulation of AI” and that “competition authorities are well equipped to deal with data dominance issues and can monitor the behaviour of companies that amass large amounts of data for commercial exploitation”. Capco said “there is no natural way to address these monopolies and perhaps they should not be addressed while they serve the consumer and society” but regulators should be vigilant in making sure that these companies do not inhibit competition.
126.Digital Catapult told us these large companies do not “engage in typical monopoly behaviour such as stealing market share by selling goods below the cost of production” and suggested that their success has benefited consumers: “few people would be happy to give up using Amazon’s one day delivery or Google search”. Digital Catapult did also, however, state that “data is becoming even more valuable” and the large companies must not restrict SME access to data or prevent the growth of an AI developer sector.
127.Other witnesses felt the problem was simply insurmountable: "the data monopolies and 'winner takes all' cannot be adequately addressed. That is a simple fact of life. The robber barons of the dark ages are still with us in a different guise". Many of our witnesses, however, believed there to be serious issues with what they believed to be the monopoly-building activities of large, mostly US-headquartered, technology companies, and identified ways to address the issue.
128.The Information Commissioner told us "people are becoming more and more concerned about information monopolies and large platforms that offer many different services and collect and link all kinds of personal data". Other witnesses said that people were not aware enough of how their personal data was being used. Professor Kathleen Richardson and Ms Nika Mahnič, from the Campaign Against Sex Robots, described the giving of personal information for access to digital services as a "Faustian pact", and argued that people need to be more aware of the implications. Concern over how one's personal data is used is a reasonable response to the development and deployment of artificial intelligence. Such concerns are best addressed by individuals having greater control over their data, as we have discussed above.
129.While we welcome the investments made by large overseas technology companies in the UK economy, and the benefits they bring, the increasing consolidation of power and influence by a select few risks damaging the continuation, and development, of the UK’s thriving home-grown AI start-up sector. The monopolisation of data demonstrates the need for strong ethical, data protection and competition frameworks in the UK, and for continued vigilance from the regulators. We urge the Government, and the Competition and Markets Authority, to review proactively the use and potential monopolisation of data by the big technology companies operating in the UK.
69 (Dr Mercedes Bunz, Elizabeth Denham, Dr Sandra Wachter) and (Javier Ruiz Diaz, Olivier Thereaux, Frederike Kaltheuner). Javier Ruiz Diaz noted the precedent established in the Your Response Ltd v Datateam Business Media Ltd EWCA Civ 281 (14 March 2014) case, when the Court of Appeal found that databases do not constitute tangible property of a kind which is capable of possession.
70 (Olivier Thereaux)
71 Written evidence from Professor Richard Susskind (); Doteveryone (); Professor Michael Wooldridge (); Royal Academy of Engineering () and Transport Systems Catapult ()
72 Written evidence from Innovate UK ()
74 A conventional network effect is when a good or service becomes more useful as more people use it. For example, a telephone system becomes more useful as more people buy telephones. A data network effect produces similar benefits, but does so because machine learning allows a product or service to improve as more data is added to the network, which in turn draws more customers in, who in turn provide more data. Written evidence from The Economic Singularity Supper Club (); Research Councils UK (); Digital Catapult () and Dr Toby Walsh ()
75 Written evidence from Dr Mike Lynch ()
76 Braintree (); Bikal (); Charities Aid Foundation (); Dr Mercedes Bunz (); Doteveryone () and The Royal Society ()
77 Written evidence from Ordnance Survey ()
78 (Dr Julian Huppert) and written evidence from Ordnance Survey ()
79 Written evidence from Dr Mercedes Bunz ()
80 Written evidence from Charities Aid Foundation ()
81 Written evidence from UK Computing Research Committee (); Michael Veale () and University College London (UCL) ()
82 (Kriti Sharma)
83 Written evidence from Bikal ()
84 Namely, the right to data portability only applies to personal data that an individual has provided to a controller, where the processing is based on the individual’s consent or for the performance of a contract, and when the processing is carried out by automated means.
85 (Dr Sandra Wachter)
86 Written evidence from Dr Mercedes Bunz ()
87 (Olivier Thereaux)
89 (Elizabeth Denham)
90 Written evidence from the National Data Guardian for Health and Care ()
91 Written evidence from CBI (); The Economic Singularity Supper Club (); The Law Society of England and Wales () and IBM ()
92 (David Kelnar and Eileen Burbidge)
93 Written evidence from CBI ()
95 Written evidence from The Economic Singularity Supper Club (); The Law Society of England and Wales () and IBM ()
96 Written evidence from University College London ()
97 Written evidence from Accenture UK Limited () and University College London ()
98 Written evidence from Research Councils UK ()
99 Written evidence from Dr Jerry Fishenden (); Braintree (); Google (); DeepMind () and Balderton Capital (UK) LLP ()
100 Written evidence from Dr Mike Lynch ()
101 Solid, ‘What is Solid?’: [accessed 5 February 2018]
102 Hub of All Things, ‘Personal Data Economy’: [accessed 5 February 2018]
103 Written evidence from Competition and Markets Authority ()
104 For recent policy-orientated discussions over the treatment of data as a form of intangible asset, and the need for robust methods of valuation, see Royal Academy of Engineering and the Institute of Engineering and Technology, Connecting data: Driving productivity and innovation (November 2015): [accessed 20 February 2018]; Professor Sir Charles Bean, Independent Review of UK Economic Statistics (March 2016): [accessed 20 February 2018]; Royal Society and British Academy, Data management and use: Governance in the 21st century (June 2017): [accessed 20 February 2018]
105 The Conservative and Unionist Party, Manifesto 2017: Forward Together, Our Plan for a Stronger Britain and a Prosperous Future (May 2017), p 79: [accessed 26 March 2018]
106 Professor Dame Wendy Hall and Dr Jérôme Pesenti, Growing the artificial intelligence industry in the UK, (15 October 2017), p 46: [accessed 31 January 2018]
107 For examples, see written evidence from University College London (UCL) () and Toby Phillips and Maciej Kuziemski ()
108 (Matt Hancock MP and Lord Henley) and (Matt Hancock MP and Lord Henley)
109 Written evidence from Medicines and Healthcare products Regulatory Agency (); Braintree () and Dr Mike Lynch ()
110 Written evidence from NVIDIA ()
111 (Dr Timothy Lanfear)
112 Written evidence from Deep Science Ventures (); Dr Dan O’Hara, Professor Shaun Lawson, Dr Ben Kirman and Dr Conor Linehan () and Professor Robert Fisher, Professor Alan Bundy, Professor Simon King, Professor David Robertson, Dr Michael Rovatsos, Professor Austin Tate and Professor Chris Williams ()
113 Professor Robert Fisher, Professor Alan Bundy, Professor Simon King, Professor David Robertson, Dr Michael Rovatsos, Professor Austin Tate and Professor Chris Williams (); Michael Veale (); Five AI Ltd (); University College London (UCL) () and Electronic Frontier Foundation ()
114 Written evidence from Professor Robert Fisher, Professor Alan Bundy, Professor Simon King, Professor David Robertson, Dr Michael Rovatsos, Professor Austin Tate and Professor Chris Williams ()
115 Written evidence from Joanna Goodman, Dr Paresh Kathrani, Dr Steven Cranfield, Chrissie Lightfoot and Michael Butterworth (); Future Intelligence () and Ocado Group plc ()
116 Written evidence from Medicines and Healthcare products Regulatory Agency (); PHG Foundation (); SCAMPI Research Consortium, City, University of London () and Professor John Fox ()
117 Written evidence from Professor Michael Wooldridge ()
118 Written evidence from Five AI Ltd () and UK Computing Research Committee ()
119 Written evidence from Big Brother Watch () and Amnesty International ()
120 (Professor Alan Winfield)
121 Written evidence from Professor Chris Reed ()
123 Written evidence from IBM (); Simul Systems Ltd (); Imperial College London () and Professor Chris Reed ()
124 (Professor Sir David Spiegelhalter)
125 Written evidence from Royal Academy of Engineering () and Michael Veale ()
126 Written evidence from Google (); IBM () and Microsoft ()
127 Written evidence from Future Intelligence (); PricewaterhouseCoopers LLP (); IBM (); CognitionX (); Big Brother Watch (); Will Crosthwait () and Dr Maria Ioannidou ()
128 Written evidence from Article 19 ()
129 Written evidence from Article 19 (); Information Commissioner’s Office (); CBI () and Ocado Group plc ()
130 Written evidence from PHG Foundation ()
131 (Dr Sandra Wachter)
132 , clause 14 [Bill 153 (2017–19)]
133 Written evidence from Michael Veale ()
134 The Digital Republic Act 2016, which covers French public bodies, refers to decisions ‘taken on the basis of algorithmic processing’, rather than ‘solely on automated processing’.
135 Written evidence from LexisNexis UK ()
136 (James Luke)
137 (Dr Ing Konstantinos Karachalios)
138 Written evidence from Dr Ansgar Koene ()
139 (Olivier Thereaux)
140 Written evidence from Research Councils UK ()
141 Written evidence from Leverhulme Centre for the Future of Intelligence ()
142 Tom Simonite, ‘When it comes to gorillas, Google Photos remains blind’, Wired (11 January 2018): [accessed 31 January 2018]
143 Written evidence from Will Crosthwait ()
144 Written evidence from Big Brother Watch ()
145 Written evidence from Marion Oswald and Sheena Urwin ()
146 (Kriti Sharma)
147 Written evidence from Dr Mercedes Bunz (); University College London () and Center for Data Innovation ()
148 Written evidence from Center for Data Innovation () and Weightmans LLP ()
149 Written evidence from Dr Mercedes Bunz ()
150 Written evidence from CognitionX ()
151 (Dr Sandra Wachter)
152 (Dr Sandra Wachter)
153 This will aim to “provide certification oriented methodologies to identify and mitigate non-operationally-justified algorithm biases through: the use of benchmarking procedures, criteria for selecting bias validation data sets and guidelines for the communication of application domain limitations (using the algorithm for purposes beyond this scope invalidates the certification).” Written evidence from Dr Ansgar Koene ()
154 (Kriti Sharma)
156 Written evidence from Royal Academy of Engineering ()
157 Though technically, given the presence of several large companies, it more closely resembles an oligopoly, rather than a monopoly.
158 Written evidence from Innovate UK ()
159 (Elizabeth Denham)
160 Written evidence from the Center for Data Innovation ()
161 Written evidence from IBM ()
162 Written evidence from Capco ()
163 Written evidence from Digital Catapult ()
165 Written evidence from Advanced Marine Innovation Technology Subsea Ltd ()
166 (Elizabeth Denham)
167 See written evidence from Department of Computer Science University of Liverpool (); BBC () and Future Intelligence ()
168 Written evidence from Professor Kathleen Richardson and Ms Nika Mahnič ()