Large language models and generative AI

Chapter 8: Copyright

228.Many contributors to our inquiry contended that LLM developers were acting unethically and unlawfully by using copyrighted data to train models without permission.376 Developers disagreed, citing the societal value of their products and the legal exemptions. We examined the balance of evidence and ways forward.

Background on data mining

229.Text and data mining (TDM) involves accessing and analysing large datasets to identify patterns and trends to train AI. Obtaining permission for this typically involves acquiring a licence or relying on an exception. Non-commercial research is permitted. In 2022 the Intellectual Property Office (IPO) proposed to change this system to allow any form of commercial mining. Our report on the creative industries noted the £108 billion sector relied on copyright protections and criticised the IPO’s plans for undercutting business models.377 The Government’s response confirmed it would no longer pursue a “broad copyright exception” and set up a working group to develop a new code of practice by “summer” 2023.378 A separate creative industries strategy published in June 2023 emphasised the Government’s continued commitment “to promote and reward investment in creativity” and ensure rightsholder content is “appropriately protected” while also supporting AI innovation.379

Using rightsholder data

230.Many LLM developers have used extensive amounts of human-generated content to train their models. We heard that much of this had taken place without permission from, or compensation for, rightsholders. Many felt that allowing such practices was morally unfair and economically short-sighted.380

231.The Society of Authors noted that AI systems “would simply collapse” if they did not have access to creators’ works for training and believed tech firms should reward creators fairly.381 The Copyright Licensing Agency argued that current LLM practices “severely undermine not only the economic value of the creative industries but the UK’s internationally respected ‘gold-standard’ copyright framework”.382

232.The Financial Times said there were “legal routes to access our content which the developers … have chosen not to take”.383 DMG Media said its news content was being used to train models and fact check outputs, and believed the resulting AI tools “could make it impossible to produce independent, commercially funded journalism”.384 The Guardian Media Group said current practices represented a “one sided bargain … without giving any value back” to rightsholders, and warned that openly available high quality news would be “hollow[ed] out” as a result.385

233.We heard further concern that the debate on innovation and copyright was too often presented as a mutually exclusive choice. Richard Mollet, Head of European Government Affairs at the information business RELX, noted that RELX was managing to “innovate while at the same time preserving all the things we want to preserve about copyright”.386

234.OpenAI told us however that it “respect[ed] the rights of content creators and owners” and that its tools helped creative professionals innovate. It noted it had already established “partnership deals with publishers like the Associated Press”, though maintained it was “impossible to train today’s leading AI models without using copyrighted materials” and attempting to do so “would not provide AI systems that meet the needs of today’s citizens”.387 Meta, Stability AI and Microsoft similarly said that limiting access to data risked leading to poorly performing or biased models and less benefit for users.388

Legal compliance

235.We heard further disagreement about the extent to which the methods used by LLM developers to acquire and use data are lawful. Dan Conway, CEO of the Publishers Association, argued that LLMs “are infringing copyrighted content on an absolutely massive scale … when they collect the information, how they store the information and how they handle it.” He said there was clear evidence from model outputs that developers had used pirated content from the Books3 database, and alleged they were “not currently compliant” with UK law.389

236.Microsoft argued in contrast that conducting TDM on “publicly available and legally accessed works should not require a licence” and was “not copyright infringement”.390 It cited international copyright conventions391 suggesting copyright should “not extend to ideas … Everyone should have the right to read, learn and understand these works, and copyright law in the UK includes exceptions that allow for the use of technology as a tool to enable this”.392

237.OpenAI said it complied with “all applicable laws” and that, in its view, “copyright law does not forbid training”.393 Stability AI said its activities were “protected by fair use doctrine in jurisdictions such as the United States”.394 Professor Zoubin Ghahramani of Google DeepMind said that if models were to directly reproduce works then rightsholder concerns would be “very valid … We try to take measures so that does not happen.”395

Technical complexity

238.A large language model may not necessarily ‘hold’ a set of copyrighted works itself. As Dr Andres Guadamuz has noted, the text from books and articles is converted into billions of sequences (called tokens).396 The final model contains only statistical representations of the original training data.397 Jonas Andrulis, CEO of Aleph Alpha, said it was “technically not possible to trace the origin of a certain word or sentence down to one or even a handful of sources”.398
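The distinction between holding a text and holding statistics about it can be illustrated with a toy sketch. It is not how any production LLM works: the whitespace tokenizer, the bigram table and the sample sentence are all assumptions made for illustration, whereas real systems typically use subword tokenizers and store learned numerical weights rather than counts. The sketch only shows that what is retained is a set of frequencies over tokens rather than the source passage as such; it does not bear on the legal questions discussed in this chapter.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Toy tokenizer (assumption): lowercase the text and split on whitespace."""
    return text.lower().split()


def bigram_counts(tokens: list[str]) -> Counter:
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample text standing in for training data.
corpus = "the cat sat on the mat and the cat slept"

tokens = tokenize(corpus)
counts = bigram_counts(tokens)

# The "model" here is just the table of pair frequencies.
for pair, n in counts.most_common(3):
    print(pair, n)
```

In this toy case the table records, for example, how often "the" is followed by "cat", but not which document any given pair came from.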

239.The process for extracting data from websites and transferring it to processing platforms may however involve some form of temporary copy. There is disagreement as to whether such usage is exempt from the Copyright, Designs and Patents Act 1988.399

240.Dr Hayleigh Bosher, Reader in Intellectual Property Law and Associate Dean at Brunel University London, said the Act covered the reproduction or “storing the work in any medium by electronic means”.400 She argued that the exceptions allowing transient or incidental copies were narrow and did not apply to LLMs.401 Dan Conway, CEO of the Publishers Association, agreed.402 This issue may be a focus of future legal action.403

241.Dr Bosher further argued that it was more helpful to consider the underlying purpose and principles of copyright law. She noted that metaphors comparing LLMs to people reading books were misleading, because the intent behind LLM development was clearly commercial whereas reading a book for interest was not. She said the application of copyright law should be future proof and not overly specific to how a particular technology works:

“because it is not the point. It does not matter how you do it; it is why you are doing it.”404

Reviewing the Government’s position

242.We were disappointed that the Government could not articulate its current legal understanding. The Minister said the issues were context dependent and he “worr[ied] about committing … because of the uses and the context in which these potential infringements are occurring”. We heard the Government was “waiting for the courts’ interpretation of these necessarily complex matters”.405

243.We were not convinced that waiting for the courts to provide clarity is practical.406 Rob Sherman of Meta thought it would take “a decade or more for this to work through the court system”,407 and cases may be decided on narrow grounds or settled out of court. In the meantime rightsholders would lose out and contested business practices would become normalised.408

244.We welcomed the Minister’s acknowledgement of the challenges however. He did “not believe that infringing the rights of copyright holders is a necessary precondition for developing successful AI”.409 And he was clear that AI:

“can copy an awful lot of information quickly, inexpensively and in new ways that have not been available to copyright infringers before. So it is the same risk of copyright infringement, but it is happening many millions of times faster, which is why it is more complex. It is quite straightforward for someone who intends to infringe copyright to train their model in a different jurisdiction”.410

245.LLMs may offer immense value to society. But that does not warrant the violation of copyright law or its underpinning principles. We do not believe it is fair for tech firms to use rightsholder data for commercial purposes without permission or compensation, and to gain vast financial rewards in the process. There is compelling evidence that the UK benefits economically, politically and societally from upholding a globally respected copyright regime.

246.The application of the law to LLM processes is complex, but the principles remain clear. The point of copyright is to reward creators for their efforts, prevent others from using works without permission, and incentivise innovation. The current legal framework is failing to ensure these outcomes occur and the Government has a duty to act. It cannot sit on its hands for the next decade until sufficient case law has emerged.

247.In response to this report the Government should publish its view on whether copyright law provides sufficient protections to rightsholders, given recent advances in LLMs. If this identifies major uncertainty the Government should set out options for updating legislation to ensure copyright principles remain future proof and technologically neutral.

Ways forward

248.Viscount Camrose, Minister for AI and Intellectual Property, said he “had hoped” the IPO-convened working group could develop a voluntary code for AI and copyright by the end of 2023. If talks failed he would consider “other means, which may include legislation”.411 Dan Conway said he still supported the IPO’s efforts but believed they would fail without an explicit acknowledgement from the Government and tech firms about the application of copyright and IP law. He said a “legislative handbrake” was needed “if the voluntary conversations fall apart”.412

249.The voluntary IPO-led process is welcome and valuable. But debate cannot continue indefinitely. If the process remains unresolved by Spring 2024 the Government must set out options and prepare to resolve the dispute definitively, including legislative changes if necessary.

250.We heard there were difficult decisions over whether access to and payment for data should be conducted on an ‘opt-in’ or ‘opt-out’ basis. Stability AI said it already operated an ‘opt-out’ system and believed requirements to obtain licences before conducting TDM would “stifle AI development” and encourage activity to shift to more permissive jurisdictions.413 OpenAI, Google DeepMind and Aleph Alpha also supported opt-out approaches.414 Richard Mollet of RELX noted the EU already has an “opt-in/opt-out regime … [which] operates tolerably well”.415

251.Getty Images argued that “ask for forgiveness later” opt-out mechanisms were “contrary to fundamental principles of copyright law, which requires permission to be secured in advance”.416 The Publishers’ Licensing Services said an opt-out approach would also be “impractical” because models could not easily unlearn data they had already been trained on.417 DMG Media noted that opt-outs could also be commercially damaging, as it is not always clear whether web crawlers are being used for internet search services (which contribute significantly to publishers’ revenue) or for AI training. The uncertainty means that publishers have been reluctant to block bots from some large tech firms.418

252.The IPO code must ensure creators are fully empowered to exercise their rights, whether on an opt-in or opt-out basis. Developers should make it clear whether their web crawlers are being used to acquire data for generative AI training or for other purposes. This would help rightsholders make informed decisions, and reduce risks of large firms exploiting adjacent market dominance.
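One way to picture why separately identified crawlers matter is the robots.txt convention, which the evidence above does not name but which is a common, voluntary way for websites to signal which bots may fetch their pages. The sketch below uses Python's standard `urllib.robotparser` module. The crawler names `ExampleSearchBot` and `ExampleAITrainingBot`, and the publisher URL, are hypothetical. It assumes a developer runs distinct, clearly labelled crawlers for search and for AI training, so that a publisher can allow one and refuse the other.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve if crawlers were
# labelled by purpose: search indexing allowed, AI training refused.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleAITrainingBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://publisher.example/news/article-1"  # hypothetical URL

search_allowed = parser.can_fetch("ExampleSearchBot", url)
training_allowed = parser.can_fetch("ExampleAITrainingBot", url)

print("search crawler allowed:", search_allowed)
print("training crawler allowed:", training_allowed)
```

If a single crawler served both purposes, the publisher could not express this choice and blocking it would also remove the site from search. Such a file only records the publisher's preference, and compliance depends on the crawler operator honouring it.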

Better licensing options

253.The Copyright Licensing Agency said that there were already collective licensing mechanisms providing a “practical” system for developers to access data responsibly.419 Work is underway to develop further licensing options specifically for generative AI.420 LLMs require vast amounts of data however. The IP Federation believed that a licensing framework was “not feasible for large scale AI”.421

254.Expanding existing licensing systems and developing new, commercially attractive curated datasets may help address concerns about the viability of licensing agreements and about AI activity shifting to more permissive jurisdictions.422 Reaching the scale required by LLM developers may be challenging, though some content aggregators already run businesses which reportedly offer access to trillions of words.423

255.BT said the Government should boost access to publicly held data and invest in large curated datasets.424 Jisc, an education and technology firm, likewise thought the UK could play a leading role in this space.425 The Copyright Clearance Center suggested the Government should use its leverage over public sector technology use and procurement to restrict the use of “products built upon infringement of UK creators’ rights”.426

256.The Government should encourage good practice by working with licensing agencies and data repository owners to create expanded, high quality data sources at the scales needed for LLM training. The Government should also use its procurement market to encourage good practice.

New powers to assert rights

257.We heard that copyright holders are often unable to exercise their rights because they cannot access the training data to check if their works have been used without permission. The British Copyright Council said the IPO should be “empowered” to oversee and enforce copyright issues relating to AI models.427 RELX called for a transparency mechanism which “requires developers to maintain records, which can be accessed by rightsholders”.428 Dan Conway suggested a searchable repository of citations and metadata would be helpful.429

258.Google DeepMind said such schemes would be technically “challenging”.430 PRS for Music argued however that it was:

“insufficient for AI developers to say that the scale of ingestion prevents licensing, record keeping, good data stewardship and disclosure. They have designed and built the product; the ability to meet these fundamental expectations should be built in from the start.”431

259.The IPO code should include a mechanism for rightsholders to check training data. This would provide assurance about the level of compliance with copyright law.

377 Communications and Digital Committee, At risk: our creative future (2nd Report, Session 2022–23, HL Paper 125), para 53. The £108 billion figure refers to a more recent update from the Government. See Department for Culture, Media and Sport, ‘Ambitious plans to grow the economy and boost creative industries’ (June 2023): [accessed 8 January 2024].

379 Department for Culture, Media and Sport, Creative Industries Sector Vision, CP 863 (June 2023): [accessed 8 January 2024]

380 Written evidence from Publishers’ Licensing Services (LLM0082), British Copyright Council (LLM0043), Authors’ Licensing and Collecting Society (LLM0092), British Equity Collecting Society (LLM0085), British Recorded Music Industry (LLM0084), Creators’ Rights Alliance (LLM0039), PRS for Music (LLM0071), Ivors Academy of Music Creators (LLM0071), Publishers Association (LLM0067), RELX (LLM0064), Getty Images (LLM0054), DACS (LLM0045), Society of Authors (LLM0044), Association of Illustrators (LLM0036), Copyright Licensing Agency (LLM0026), Alliance for Intellectual Property (LLM0022) and Copyright Clearance Center (LLM0018). Note that we refer to ‘rightsholders’ as a shorthand for stakeholders critical of LLM developers’ use of copyrighted works. We recognise that both parties are rightsholders and should not be seen as entirely separate groups.

381 Written evidence from the Society of Authors (LLM0044)

382 Written evidence from the Copyright Licensing Agency (LLM0026)

383 Written evidence from the Financial Times (LLM0034)

384 Written evidence from DMG Media (LLM0068)

385 Written evidence from the Guardian Media Group (LLM0108)

387 Written evidence from OpenAI (LLM0113)

388 Q 4 (Ben Brooks), Q 78 (Rob Sherman) and written evidence from Microsoft (LLM0087)

390 Written evidence from Microsoft (LLM0087)

391 TRIPS is an international agreement among World Trade Organization members, see World Trade Organization, ‘Frequently asked questions about TRIPS [trade-related aspects of intellectual property rights] in the WTO’: [accessed 8 January 2024].

392 Written evidence from Microsoft (LLM0087)

393 Written evidence from OpenAI (LLM0113)

396 Dr Andres Guadamuz, ‘A scanner darkly’ (February 2023): [accessed 8 January 2024] and OpenAI blog, ‘What are tokens and how to count them?’ (2023): [accessed 8 January 2024]

397 Dr Andres Guadamuz, ‘A scanner darkly’ (February 2023): [accessed 8 January 2024]

399 Alec Radford et al, ‘Language Models are Unsupervised Multitask Learners’, OpenAI Research Paper (2018): [accessed 8 January 2024], Dr Andres Guadamuz, ‘A scanner darkly’ (February 2023): [accessed 8 January 2024] and Q 54 (Dan Conway)

400 Written evidence from Dr Hayleigh Bosher (LLM0109)

401 Ibid.

403 Dr Andres Guadamuz, ‘A scanner darkly’ (February 2023): [accessed 8 January 2024]. Some legal action is underway already. See for example BBC, ‘New York Times sues Microsoft and OpenAI for “billions”’ (27 December 2023): [accessed 8 January 2024].

406 Q 61 (Dan Conway)

408 Written evidence from the Authors’ Licensing and Collecting Society (LLM0092)

410 Ibid.

413 Stability AI highlighted the EU’s tiered approach which allowed greater opt-out options, and licensing regimes in the US and Japan. See written evidence from Stability AI (LLM0078).

414 Q 106, 109 and written evidence from OpenAI (LLM0113)

416 Written evidence from Getty Images (LLM0054)

417 Written evidence from Publishers’ Licensing Services (LLM0028)

418 Written evidence from DMG Media (LLM0068)

419 Written evidence from the CLA (LLM0026)

420 CLA, Friend or Foe? Attitudes to Generative Artificial Intelligence Among the Creative Community (4 December 2023): [accessed 8 January 2024]

421 Written evidence from the IP Federation (LLM0057)

422 Written evidence from Human Native AI (LLM0119)

423 See SyndiGate, ‘Global content solutions’: [accessed 21 December 2023].

424 Written evidence from BT (LLM0090)

425 Written evidence from Jisc (LLM025)

426 Written evidence from the Copyright Clearance Center (LLM0018)

427 Written evidence from the British Copyright Council (LLM0043)

428 Written evidence from RELX (LLM0064)

431 Written evidence from PRS for Music (LLM0071)

© Parliamentary copyright 2024