Spotlight on Large Language Models: Copyright Law in the Age of Digital Creativity

2 May 2024 | Law Schools

Bessie O’Dell of the University of Oxford examines the legal issues surrounding the use of copyrighted music to train LLMs. Spotlight on Large Language Models was the overall winner of the 2024 vLex International Writing Competition.

by Bessie O'Dell

Ask most music artists what accolade would mean the world to them, and they will likely mention a Grammy Award. That isn’t because of the cash prize on offer (there isn’t one), nor is it really about possession of a coveted golden gramophone trophy. What the award offers, alongside prestige, is a boost in streaming numbers and album sales, with surges in earnings ranging from 4% to 400% after the ceremony. Just ask music icon Madonna, who experienced a 494% boost in sales for her 1990 album The Immaculate Collection following the 2014 Awards.

So, when a catchy new song with vocals from Drake and The Weeknd amassed more than 15 million online streams, and seemed destined for Grammy glory, you might imagine that it would be well-received. Not quite. Soon after its release in April 2023, streaming platforms were alerted to the fact that the tune was not a collaboration between two global superstars, but instead an Artificial Intelligence (AI)-generated piece titled ‘Heart on My Sleeve’. The maestro behind the melody? None other than TikTok user Ghostwriter977 (whose true identity has never been unveiled). The content was swiftly removed from streaming sites. Then, a week later, it happened again. ‘This is the last straw’, lamented Drake.

Shortly after, reports circulated that Universal Music Group (‘UMG’; the company representing Drake and The Weeknd) was claiming a violation of copyright law. Further, UMG denounced the creation of ‘infringing content’ using generative AI – a nascent technology which can produce creative content ranging from audio and images, to text and video. You might have heard of some of the most popular generative AI tools currently in existence – ChatGPT, DALL-E, Bard, Copilot, Llama. Perhaps you’ve experimented with them. Maybe now, you are pausing to think about how the technology underpinning them works, or you’re considering how you might know if you were infringing the law?

Often (but not exclusively), these types of AI are a form of Large Language Model (LLM). LLMs, like the one behind ‘Heart on My Sleeve’, are advanced AI systems trained (developed) on vast datasets broken down into ‘tokens’ of text. For example, AI music generation works by training an LLM on very large datasets of existing music, so that it learns the patterns present in the data using variables called parameters. After this, the LLM can be used to generate new music.
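For readers curious about what ‘tokens’, ‘parameters’ and ‘generation’ mean in practice, the toy sketch below may help. It is only an illustration under heavy assumptions: the two-line corpus is made up, the function names (`tokenize`, `train`, `generate`) are hypothetical, and a simple table of word-pair counts stands in for the billions of learned parameters in a real LLM. It does not represent how any particular commercial tool works.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase word tokens (real systems use subword tokens)."""
    return text.lower().split()

def train(corpus):
    """Count how often each token follows another.

    The counts play the role of 'parameters' in this toy model.
    """
    counts = defaultdict(Counter)
    for line in corpus:
        tokens = tokenize(line)
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts

def generate(counts, start, length=5):
    """Generate text by repeatedly choosing the most frequent next token."""
    output = [start]
    for _ in range(length - 1):
        followers = counts.get(output[-1])
        if not followers:  # no known continuation, so stop early
            break
        output.append(followers.most_common(1)[0][0])
    return " ".join(output)

# Made-up training data, used purely for illustration.
corpus = [
    "heart on my sleeve",
    "wear my heart on my sleeve",
]
model = train(corpus)
print(generate(model, "heart", length=4))  # prints "heart on my sleeve"
```

Even in this miniature version, everything the model ‘knows’ comes from the material it was trained on, which is why the provenance of training data matters so much to the legal questions that follow.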

Ordinarily, when new content is produced, it will be protected by copyright law, to prevent others from using it without permission. In the UK, for example, copyright is governed by the Copyright, Designs and Patents Act 1988, which applies to original literary, dramatic, musical or artistic works, as well as sound recordings, films and broadcasts. The US equivalent is Title 17 of the United States Code. Content generation is certainly no stranger to copyright law. Indeed, at the same time ‘Heart on My Sleeve’ was released, British singer-songwriter Ed Sheeran was embroiled in a copyright infringement lawsuit over similarities between Marvin Gaye’s 1973 ‘Let’s Get It On’ and his own Grammy-winning hit ‘Thinking Out Loud’. However, applying copyright law to LLMs feels like a whole new ball game.

‘OK’, you might be thinking, ‘isn’t this the wrong way around? Under copyright law, shouldn’t Ghostwriter977’s AI-generated content be protected, not removed?’ After all, the song had brand new lyrics, and the vocals in the final product were not recordings of another artist’s voice (AI was used to produce facsimiles). The answer, though, is no.

Even if the content is original, if it is not generated by a human or a company then it cannot be protected by copyright – at least, in countries like the UK, USA and Australia. Historically, the courts have helped to shape the scope of intellectual property protections, and they did so again in late December 2023, in a landmark case that saw a computer scientist seek to register patents for the inventions created by his AI system – a ‘creativity machine’ called DABUS. Ultimately, the UK’s Supreme Court ruled that AI cannot be considered a patent ‘inventor’ (the US’s Copyright Office issued formal guidance to a similar effect, but this is yet to be tested in court). Although it concerns patents rather than copyright, this ruling is a helpful guide, considering how copyright protections usually subsist for the lifetime of the author plus an additional 50-70 years. If AI were to be considered an ‘author’, would that mean that copyright protection would last in perpetuity? Or until that model (version) of the AI is retired?

A far more contentious question is whether the data used to train AI should be subject to copyright law. What UMG was concerned about was not the originality of ‘Heart on My Sleeve’, but ‘the training of generative AI using our artists’ music’ – and as a result it requested that streaming services block AI companies from accessing songs from its catalogue.

UMG isn’t the only one incensed, nor is this issue limited to the music industry. Following a recent copyright infringement lawsuit filed in the Federal District Court in Manhattan, there have been calls for OpenAI to delete any AI models allegedly trained using ‘millions’ of New York Times newspaper articles. It is likely that this case will have to reach the US Supreme Court before we have any definitive legal answers. However, we can already mull over its second-order effects. Imagine, for example, a binding court ruling that all proprietary data used to train LLMs is subject to copyright law. Could this give Big Tech companies like Apple and Google, who have both vast sums of money and easier access to data, an unfair advantage? Could it mean that ChatGPT would be pulled from the market? Would other states abide by the Berne Convention, or would LLMs become the next legal and jurisdictional ‘no man’s land’?

The introduction of commercially available LLMs is an exciting technological development. However, it has also exposed legislative gaps that need addressing. Ultimately, the poignant lyrics of ‘Heart on My Sleeve’ mirrored the song’s own fate: ‘all I know is you could’ve had the world’. The jury is still out on whether proprietary data can be used to train LLMs, and whether legal decisions will be reflected across jurisdictions. For now, we wait.