How Big Tech Is Exploiting Content Creators, And (Trying To) Get Away With It
Diving into current courtroom dramas around AI, copyright, and the true meaning of “fair use.”
This post originally appeared on my Medium blog, katherineamunro.medium.com
If you’re reading this, you’re part of the content creator ecosystem, whether as a fellow writer, a casual consumer, or a Medium subscriber. You help keep the system running, which means you have a stake in the subject of today’s post: copyright concerns and Generative AI.
I’ve been keeping an interested eye on this topic for a while. As both a writer and someone working with “Gen AI” every day, it’s in my own personal and professional best interests to do so. Of course, copyright isn’t the only legal and ethical issue tied to Gen AI, nor is Gen AI the first technology to raise such flags. However, it has captured public attention on an entirely new scale, which is why I’m diving into it today.
Setting the Scene
Let’s start with key stakeholders and critical questions. So far, the most vocal stakeholders in discourse on Generative AI and copyright law have been:
builders of Generative AI models,
consumers of such models’ outputs,
and content producers, whose IP may wind up in a model’s training data.
Key questions for these stakeholders are:
Can copyrighted works be used as training data?
Can AI-generated works be copyrighted?
… If so, who owns the copyright?
This post will tackle the first question, and the critical concept of “fair use,” a legal doctrine which has been central to discussions on the topic. I’ll use a number of current lawsuits against Stability AI, OpenAI and Meta, among others, to illustrate some criteria and considerations which may be used to evaluate whether or not an activity constitutes fair use.
In a later post or two, I’ll cover the latter two questions in the same way. I’m not a lawyer, but I have been following these issues with great interest, and I’ll be sure to include plenty of links so you can fact check my writing. Feel free to drop me a comment with critique or to start a chat on the topic, if that’s your thing. And now, let’s get into the nitty gritty…
Background: The Problem
Generative AI is big business. OpenAI is one of the world’s fastest-growing tech companies and recently surpassed $2 billion in revenue, thanks in large part to its release of ChatGPT. Unsurprisingly, all the biggest names in tech are racing to catch up.
On the plus side, this has produced an incredible wave of innovation and, to some extent, democratisation of powerful AI technology. On the other hand, it’s created an insatiable thirst for data to train Generative AI models1. For example, Meta was so desperate for text data to train its Large Language Models that it considered buying publishing house Simon & Schuster just to access their copyrighted material, according to a recent investigation by the New York Times. The report found that some Gen AI companies “cut corners, ignored corporate policies and debated bending the law,” including Meta, where there was allegedly even talk of simply using copyrighted text data and dealing with any lawsuits later. The time it would take to license such content correctly was, apparently, simply too long.
In a similarly disturbing example, the report cites insider accounts from OpenAI in which the company, desperate for additional text data for training GPT-4, developed a speech recognition tool to extract transcriptions from YouTube videos. This would be a clear breach of YouTube’s terms of service: a fact that OpenAI employees allegedly discussed at the time, before proceeding anyway.
And speaking of breaking the rules: AWS is investigating Perplexity AI for allegedly scraping websites without consent, after both Forbes and Wired magazines raised the red flag over Perplexity’s model outputs.
This appetite for training data — be it text, images, video, or other modalities — won’t go away any time soon. That means we’re unlikely to see an end to such stories until we figure out a fair way to handle the issue. If you don’t consider yourself an artist or content creator, you may wonder how this applies to you. But remember that any content you generate online — such as a photo on Instagram, a thread on X, or a post on LinkedIn — could potentially land in a Gen AI model’s training dataset, if these companies continue to push the boundaries of what they are allowed to scrape.
The NYT investigation, for example, details how Google recently broadened its terms of service to allow it to use publicly available Google Docs, restaurant reviews, and other online content to develop its AI models. And the larger and more diverse a model’s training data is, the more capable the model can become. As a result, Gen AI models can be used to produce content at an unmatched pace, including imitating specific creators and their styles. This can be a threat to content producers worldwide, disrupting all sorts of content platforms, including those you consume.
So, with the stakes made clear, let’s examine the first, key question for copyright and Gen AI.
Can copyrighted works be used as training data?
Debates on this question have heavily revolved around how to apply the legal doctrine of “fair use” to the creation of generative AI models. The fair use doctrine allows limited use of copyrighted material without obtaining permission for it, provided that the result is “transformative,” meaning that it adds value, gives commentary on the original work, or serves an entirely different purpose. It permits applications like news reporting, teaching, review and critique, and research, which is why many generative AI models are created by universities and non-profit organisations who state their goals as purely academic. For example, text-to-image model Stable Diffusion was developed by the Machine Vision and Learning group at the Ludwig Maximilian University of Munich; they own the technical license for Stable Diffusion, but the compute power to train it was provided by Stability AI.
The challenge arises when such models are commercialised. For instance, while Stable Diffusion was released for free public use, Stability AI built DreamStudio on top of it — a simple interface enabling users to call the model to generate images — and used the fact of their “co-creation” to raise millions in venture capital funding2,3. Stability AI also makes money from DreamStudio by charging users for image credits.
In such cases, courts need to determine whether the models’ outputs are transformative enough to still be allowed. A number of recent, ongoing copyright lawsuits can help us understand just how difficult this can be.
First, we have the suit by three artists against Stability AI, Midjourney and DeviantArt, which claims that these companies violated millions of artists’ copyrights by using the artists’ works to train their Generative AI models without permission. Second, we have multiple lawsuits by various news publications and authors against OpenAI, Microsoft, and Meta4. The allegations in these cases include unfair competition, unjust enrichment, vicarious copyright infringement (that is, profiting from an infringement one has the right and ability to control), and violation of the Digital Millennium Copyright Act by removing copyright management information. In all of these cases, the defendants (that is, the AI companies being sued) have leaned heavily on the defence that their research is “transformative”, and thus “fair use”. So let’s start to unpack the complexity of the fair use doctrine by first summarising the arguments against these companies, followed by those in their defence.
How Gen AI companies might be in trouble:
Let’s start with some facts about Stable Diffusion. It was trained on a dataset of image links and their alt-text descriptions, scraped from the internet without necessarily obtaining consent. The dataset itself could possibly be considered protected under fair use, due to its non-profit, research nature (it was created by German non-profit LAION, short for Large-scale Artificial Intelligence Open Network), and the fact that it does not store the images themselves (rather, it contains web links and associated scraped information, like captions). However, the plaintiffs (that is, the accusers in the case) argue that Stability AI created unauthorised reproductions of copyrighted works by downloading the images for training. In other words, the argument against the company relates to its unauthorised use of a possibly-otherwise-permissible source. We can see a parallel issue in the lawsuits against OpenAI: although AI researchers have been using large datasets of publicly crawled text data for years5, OpenAI are accused of infringement for removing copyright management information, such as authors and titles, from their training data6.
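To make that distinction concrete, here’s a minimal sketch of what a LAION-style record and the disputed download step might look like. The field names and URL are hypothetical (not the actual LAION schema); the point is simply that the dataset holds links and captions, while training requires fetching and decoding the images themselves:

```python
import requests
from io import BytesIO
from PIL import Image

# A LAION-style record stores only a link plus scraped metadata, not the image
# itself. (Field names and the URL are illustrative, not the actual LAION schema.)
record = {
    "url": "https://example.com/artwork.jpg",      # link to the original image
    "caption": "A watercolour painting of a fox",  # scraped alt-text
}

def fetch_training_example(rec: dict) -> tuple:
    """Download and decode the linked image, pairing it with its caption.

    This download step is what the plaintiffs point to: to train on the data,
    the pipeline must create a local copy of each image, even though the
    dataset itself contains only links.
    """
    response = requests.get(rec["url"], timeout=10)
    image = Image.open(BytesIO(response.content)).convert("RGB")
    return image, rec["caption"]

# The resulting (image, caption) pairs would then be fed into model training.
```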
Another problem for Stability AI is that their model can recreate existing expressions and styles with high accuracy, which could constitute so-called “unauthorised derivative works.” This is a huge concern for creators, who fear that such models will be able to out-compete them at their own game. Hence, this case also included a right of publicity claim, alleging that providers of image generation models can profit off being able to reproduce certain artists’ styles (this claim was dismissed due to lack of evidence that the companies were actually doing this). In the case against OpenAI, the company was similarly accused of unfair competition and of harming advertising revenues (for news providers whose work was supposedly being reproduced by ChatGPT, thus drawing clicks away from the original source). In light of such complaints, it is difficult for any of these model providers to defend themselves by claiming their work exists purely for research purposes, given that they allow commercial applications of their models, including Stability AI’s DreamStudio app and OpenAI’s ChatGPT.
One final accusation by content creators is that models which replicate their works can also harm their brand. For example, in the news outlets’ case against OpenAI, the plaintiffs complained that ChatGPT generated misleading and harmful articles — including one proposing smoking for asthma relief and another recommending a baby product linked to child deaths — and falsely attributed them to these newspapers, potentially harming their reputations.
How Gen AI companies could be safe (at least, for now):
Turning now to arguments favouring Stability AI and OpenAI: the former has defended the creation of copies of images for training, saying that this technical requirement is, in principle, no different to humans learning and taking inspiration from existing material. They also argued that their model does not memorise training images, but instead, uses them to learn general features about objects — such as outlines and shapes — and how they relate to one another in the real world.
Stability AI have also claimed that Stable Diffusion does not create derivative works, given that a reasonable person usually cannot tell which images, if any, contributed to a specific output: a condition courts have historically used to determine whether a work is derivative. In the case involving authors against OpenAI, the plaintiffs argued they shouldn’t have to prove derivative use of their works, if they could simply prove their works were in a model’s training data. They argued that if a model is trained on protected data, then all its works are derivative. The judge, however, dismissed that line of argument, insisting that they still needed to demonstrate that some outputs strongly resembled their own works. A final point in Stability AI’s favour is that style itself is not copyrightable — only specific, concrete expressions are.
The cases are only heating up…
The courtroom dramas over Generative AI and copyright don’t end with the cases I’ve discussed above. There are plenty more; I’ll summarise a few here, to reiterate just how messy this topic is… and how much messier it’s going to get7.
Getty Images vs Stability AI: In May 2023, Getty Images sued Stability AI for allegedly breaching its intellectual property rights by using images scraped from the Getty Images and iStock websites to train its models. Getty Images claimed that the two databases had cost nearly US$1 billion to create, including around $150 million in licensing fees to acquire images. According to Getty, the resulting high-quality, metadata-enriched images have become an enticing target for companies wanting training data for their AI models, and some of those images had also been scraped for the non-profit LAION dataset mentioned above.
Getty’s complaint was that training Stable Diffusion involved “a chain of sequential copying of Getty Images from assembling of the training dataset through the diffusion process and onto the outputs in response to individual user prompts.” They say the resulting model outputs are so similar to Getty’s own images that they sometimes even include the company’s watermark. This, they said, not only proves that Stability AI copied their copyrighted material, but also constitutes a breach of trademark, and so-called “passing off”.
Stability AI hit back that Getty had missed the point of Gen AI, which is to create new and novel content, not replicate existing works. Stability AI conceded that Getty images were used during training, but offered two key defences. First, they said the copying was temporary and occurred in the US (perhaps trying to take advantage of the more permissive fair use doctrine there). Second, they argued that even if a model output is similar to an input training sample, it is not a breach of copyright, for several reasons: during the training phase, Stability says there is no intention to memorise input data; during usage of a trained model, an image starts as random noise and thus cannot comprise any copyrighted material; and that random starting point means the same prompt will never generate the same output, so no particular image can be reliably generated with a specific prompt, and the model therefore cannot be used to reproduce a copyrighted work. When faced with evidence from Getty of output images that strongly resembled Getty images, Stability argued that this was down to users deliberately trying to recreate a training sample, not the model itself.
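To illustrate the “random noise” argument in practice, here is a minimal sketch using the open-source diffusers library (the model ID and prompt are illustrative, and this is not Stability AI’s own code): by default each generation starts from freshly sampled noise, and only a user who deliberately fixes the seed gets reproducible outputs.

```python
import torch
from diffusers import StableDiffusionPipeline

# Sketch of the "random noise" argument: by default, every generation starts
# from freshly sampled noise, so the same prompt normally yields different images.
# (Model ID and prompt are illustrative; this is not Stability AI's own code.)
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

prompt = "a lighthouse on a cliff at sunset, oil painting"

# Two calls with no fixed seed: different starting noise, different outputs.
image_a = pipe(prompt).images[0]
image_b = pipe(prompt).images[0]

# Only by deliberately fixing the seed does a user make the output reproducible,
# which is roughly Stability's point: near-copies of training images reflect
# deliberate user effort rather than the model's default behaviour.
seeded = torch.Generator(device="cpu").manual_seed(42)
image_c = pipe(prompt, generator=seeded).images[0]
```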
Music Publishers vs. Anthropic: In October 2023, three major music publishers — Universal Music Publishing Group, Concord Music Group and ABKCO — sued Anthropic for using copyrighted song lyrics to train its Claude language models, and for allowing them to reproduce copyrighted lyrics almost verbatim. They claimed that even without being prompted to recreate existing works, Claude nevertheless produces phrases extremely similar to well-known lyrics. They also claimed that, while other lyric distribution platforms pay to license lyrics (thus providing attribution and compensation to artists where it’s due), Anthropic frequently leaves out critical copyright information.
The music publishers rejected Anthropic’s framing of its use of copyrighted material as “innovation”, accusing the company of outright theft. They acknowledged that Claude sometimes refuses to output copyrighted songs, but treated this as evidence that guardrails had been applied, just not satisfactorily.
In response, Anthropic called its model “transformative”, as it adds “a further purpose or different character” to the original works. They also claimed that song lyrics make up such a small part of Claude’s training data that properly licensing them — not to mention the rest of Claude’s dataset — would be practically and financially infeasible. Anthropic even accused the plaintiffs of “volitional conduct”, a defence which essentially casts Claude as an “innocent” autonomous machine that the music publishers “attacked” in order to force it to recreate copyrighted content. (Similarly, OpenAI claimed that the NYT illegally “hacked” ChatGPT to create misleading evidence to support its case.)
Continuing to pick apart the case, Anthropic asked the court to reject the music publishers’ allegations of “irreparable harm”, because the publishers had not provided evidence (such as a drop in revenues since Claude’s release) to support that claim. In fact, Anthropic argued that the lawsuit’s demand for financial compensation implies that the harm can be quantified, and therefore cannot be irreparable. Finally, Anthropic insisted that any accidental output of copyrighted material could be fixed by applying guardrails.
Programmers vs. Microsoft, GitHub, and OpenAI: In January 2023, programmer and lawyer Matthew Butterick joined with the Joseph Saveri Law Firm — the same firm representing the plaintiffs in two of the aforementioned cases against OpenAI, Meta and Stability AI — to file a class action against Microsoft, OpenAI and GitHub on behalf of two anonymous software developers. The suit accuses the companies of “software piracy on an unprecedented scale.”
The accused companies attempted to have the complaint dismissed, saying that it does not show violation of any recognisable rights and relies on hypothetical events rather than real evidence of injury. Microsoft and GitHub also said that Copilot, their LLM-based code assistance tool, does not extract from any existing, publicly available code, but instead learns from open-source code examples in order to make suggestions. They even accused the plaintiffs of undermining the open-source philosophy by asking for monetary compensation for “software that they willingly share as open source.”
The hearing on whether this case can proceed will take place this May, so I’ll be keeping an eye out and will update this post as the case develops.
So, could these defences work?
That is, will these Gen AI model providers be able to convince courts that their endeavours constitute “fair use”? Perhaps. In a 2015 case of the Authors Guild against Google, “Google was permitted to scan, digitize and catalog books in an online database after arguing that it had reproduced only snippets of the works online and had transformed the originals, which made it fair use.” Meta used this example to argue that training its LLMs on copyrighted data should be similarly allowed.
Meanwhile, Anthropic’s insistence that its accusers actually need to prove financial harm is an example of another kind of argument — so-called “harm-based defences” — which Gen AI companies may try to apply in addition to claiming fair use. For example, in the cases against Microsoft and OpenAI, both companies essentially argued that the NYT had failed to show real harm or prove that LLM-powered chatbots had dented news traffic or subscription revenues8,9. In fact, in addition to downplaying any harm, these companies have stated that such lawsuits “threaten the growth of the potential multi-trillion-dollar AI industry.” They may well be right. And applying both fair-use and harm-based arguments in this way could potentially strengthen their overall case.
If judicial bodies accept these arguments, then they may consider the act of using a copyrighted work in training as sufficiently transformative, since it results in a productive new model. However, new regulations will likely still be required, giving creators ways to have their creations removed from AI training datasets.
Summing up
This post tackled the question: can copyrighted data be used to train AI models? A clear, legally binding answer isn’t likely to arrive any time soon, and many content creators and platforms are choosing not to wait for one. Indeed, they’re recognising this hunger for training data as an opportunity to establish new revenue streams. Many are partnering up with the biggest Gen AI companies in the game, who themselves are working hard to attract such willing partners. A recently leaked OpenAI pitch deck, for example, revealed that in addition to financial compensation, the company is offering its partners priority placement in search results and greater opportunities for brand expression when their works are surfaced during chats.
As a result of these situational and commercial pressures, many organisations are making deals with the Gen AI giants directly, allowing use of their copyrighted material, for a fee:
Newspapers and media companies like The Associated Press, The Financial Times, Vox Media, Time Magazine and Axel Springer, the parent company of Business Insider and Politico, have made such a deal with OpenAI.
There are rumours (not yet confirmed) that Google has agreed to pay over $5 million per year to News Corp, owner of The Wall Street Journal, to fund News Corp’s development of AI-related content and products.
Reddit and Google have made a deal worth approximately $60 million per year, which allows Google to train its AI models on Reddit’s data. It’s not surprising that Google is interested in Reddit’s vast supply of stored text content: Large Language Models need to be robust to all kinds of text, and while news publications have stores of pristinely edited articles in a specific journalistic tone, Reddit’s data contains a diverse mix of high-quality texts alongside texts full of slang, typos, hashtags, emojis, HTML codes, and all kinds of other ‘noise’, making it a valuable learning source.
Is this a good thing? What are your thoughts? Personally, I’m frustrated and anxious about the immense power these companies have over content creators and publications — not only does it give OpenAI and co. the upper hand in such negotiations, but they’ve also proven they’re willing to simply poach other people’s content if no deal is made at all10 — but I’m happy that media producers are at least getting something out of the situation.
But of course, this isn’t the end of the discussion on Gen AI and copyright concerns. There are still two crucial questions to tackle:
Can AI-generated works be copyrighted?
And if so, who owns the copyright?
Look out for a deep dive on those issues in future posts.
Katherine.
The Gen AI race has also caused burnout among AI engineers, rushed rollouts without sufficient testing, and a lack of consideration of safe and responsible AI practices, according to this CNBC report.
TechCrunch, October 17, 2022: “Stability AI, the startup behind Stable Diffusion, raises $101M”
Sifted.eu, April 21, 2023: “Leaked deck raises questions over Stability AI’s Series A pitch to investors”
In the parallel cases of Sarah Silverman, Christopher Golden and Richard Kadrey against OpenAI and Meta, the authors claim that ChatGPT can summarise their books when prompted, but leaves out copyright ownership information. They also accuse Meta of using their works to train its LLaMA models, after Meta listed in a research paper a source database which is actually built off an illegal shadow library website.
Note: the argument that OpenAI deliberately removed copyright information from training data was ultimately not allowed to proceed, due to a lack of evidence.
NYT, March 4, 2024: “Microsoft Seeks to Dismiss Parts of Suit Filed by The New York Times”
NYT, February 27, 2024: “OpenAI Seeks to Dismiss Parts of The New York Times’s Lawsuit”
Some content creators are simply being forced out of the game due to their lack of power in the Gen AI copyright fight: photographers I follow on Instagram, for example, have been closing their accounts to prevent AI from being trained on their works. And while some such artists have found a new home in “AI-sceptical” portfolio apps, there’s no doubt that these artists’ audiences and reach will have been significantly damaged in the process.