OpenAI admits it’s impossible to train generative AI without copyrighted materials

OpenAI and its biggest backer, Microsoft, are facing several lawsuits accusing them of using other people’s copyrighted works without permission to train the former’s large language models (LLMs). And based on what OpenAI told the House of Lords Communications and Digital Select Committee, we might see more lawsuits against the companies in the future. It would be "impossible to train today’s leading AI models without using copyrighted materials," OpenAI wrote in its written evidence (PDF) submission for the committee’s inquiry into LLMs, as first reported by The Guardian.

The company explained that this is because copyright today "covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents." It added that "[l]imiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens." OpenAI also insisted that it complies with copyright laws when it trains its models. In a new post on its blog made in response to The New York Times’ lawsuit, it said the use of publicly available internet materials to train AI falls under the fair use doctrine.

It admitted, however, that there is "still work to be done to support and empower creators." The company talked about the ways it’s allowing publishers to block the GPTBot web crawler from being able to access their websites. It also said that it’s developing additional mechanisms allowing rightsholders to opt out of training and that it’s engaging with them to find mutually beneficial agreements. 
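
For publishers, blocking a crawler of this kind is usually done through a site's robots.txt file. The following is a minimal sketch, assuming the crawler identifies itself with the user-agent token GPTBot and honors the Robots Exclusion Protocol. The example.com URL, the "OtherBot" name, and the is_allowed helper are hypothetical and used only for illustration; the check relies on Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for a publisher that wants to keep
# a crawler identifying itself as "GPTBot" away from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/some-article"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("OtherBot allowed:", is_allowed(ROBOTS_TXT, "OtherBot", page))
```

A rule like this is only a request: it restricts crawlers that choose to respect robots.txt, and it says nothing about content collected before the rule was published.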

In some of the lawsuits filed against OpenAI and Microsoft, the plaintiffs accuse the companies of refusing to pay authors for their work while building a billion-dollar industry and enjoying enormous financial gain from copyrighted materials. The more recent case filed by a couple of non-fiction authors argued that the companies could’ve explored alternative financing options, such as profit sharing, but have "decided to steal" instead.

OpenAI didn’t address those particular lawsuits, but it did provide a direct answer to The New York Times’ complaint, which accuses it of using the newspaper’s published articles without permission. The publication isn’t telling the full story, OpenAI said. The company was already negotiating with The Times regarding a "high-value partnership" that would give it access to the publication’s reporting. The two parties were apparently still in touch until December 19, and OpenAI only found out about the lawsuit in December by reading about it in The Times.

In its complaint, the newspaper cited instances of ChatGPT providing users with "near-verbatim excerpts" from paywalled articles. OpenAI accused the publication of intentionally manipulating prompts, such as by including lengthy excerpts of articles in its interactions with the chatbot to get it to regurgitate content. It also accused The Times of cherry-picking examples from many attempts. OpenAI said the lawsuit filed by The Times has no merit, but it’s still hopeful for a "constructive partnership" with the publication.

This article originally appeared on Engadget.

Source: Engadget
Date: January 9, 2024 at 11:36AM