OpenAI destroyed a trove of books used to train AI models. The employees who collected the data are gone.

Newly unsealed documents in the class action lawsuit brought by the Authors Guild against OpenAI reveal that the startup deleted two huge datasets, named “books1” and “books2,” that had been used to train its GPT-3 AI model.

Lawyers for the Authors Guild said in court filings that the datasets likely contained “more than 100,000 published books” and are central to its allegations that OpenAI used copyrighted materials to train AI models.

For months, the Guild has been seeking information from OpenAI about the datasets. The company initially resisted, citing confidentiality concerns, before ultimately revealing that it had deleted all copies of the data, according to legal filings reviewed by Business Insider.

High-quality training data is an important part of the powerful AI models that are taking the tech world by storm. OpenAI and other companies used data from the internet, including many books, to build these models. Many of the companies that created this information want to be paid for providing intelligence to these new AI products. Tech companies don’t want to be forced to pay. The dispute is now being fought in court through several lawsuits.

In a 2020 white paper, OpenAI described the books1 and books2 datasets as “internet-based books corpora” and said they made up 16% of the training data that went into creating GPT-3. The white paper also said “books1” and “books2” together contained 67 billion tokens of data, or roughly the equivalent of 50 billion words. For comparison, the King James Bible contains 783,137 words.
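To put those figures in proportion, the arithmetic can be sketched in a few lines of Python. This is a rough illustration only: the constant names are made up for this example, and the words-per-token ratio is simply what the two quoted figures imply, not a number taken from the white paper.

```python
# Figures quoted in the article.
TOKENS_TOTAL = 67_000_000_000    # books1 + books2 combined, per the 2020 white paper
WORDS_ESTIMATE = 50_000_000_000  # the rough word equivalent cited above
KJV_WORDS = 783_137              # King James Bible word count cited above

# Assumption: the words-per-token ratio is just the one implied by
# dividing the quoted word estimate by the quoted token count.
words_per_token = WORDS_ESTIMATE / TOKENS_TOTAL

# How many King James Bibles the estimated word count corresponds to.
kjv_equivalents = WORDS_ESTIMATE / KJV_WORDS

print(f"Implied words per token: {words_per_token:.2f}")
print(f"King James Bible equivalents: {kjv_equivalents:,.0f}")
```

By this estimate, the two datasets together held on the order of 64,000 times as many words as the King James Bible.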

The unsealed letter from OpenAI’s lawyers, which is labeled “highly confidential – attorneys’ eyes only,” says that the use of books1 and books2 for model training was discontinued in late 2021 and that the datasets were deleted in mid-2022 due to their non-use. The letter goes on to say that none of the other data used to train GPT-3 has been deleted, and it offers attorneys for the Authors Guild access to those other datasets.

The unsealed documents also reveal that the two researchers who created books1 and books2 are no longer employed by OpenAI. OpenAI initially refused to reveal the identities of the two employees.

The startup has since identified the employees to lawyers for the Authors Guild but has not publicly disclosed their names. OpenAI has petitioned the court to keep the names of the two employees, as well as information about the datasets, under seal. The Authors Guild has opposed this, arguing for the public’s right to know. The dispute is ongoing.

“The models powering ChatGPT and our API today were not developed using these datasets,” OpenAI said in a statement on Tuesday. “These datasets, created by former employees who are no longer with OpenAI, were last used in 2021 and deleted due to non-use in 2022.”
