Zuckerberg knowingly used pirated data to train Meta AI, authors allege
2 days ago Benito Santiago
Mark Zuckerberg approved the use of pirated books to train Meta AI, even after his own team warned that the material was illegally obtained, a group of authors alleged in a recent court filing.
The allegations stem from a copyright infringement lawsuit filed in California federal court in July 2023 by a group of authors including comedian Sarah Silverman, Christopher Golden and Richard Kadrey. The group alleges that Meta misused their books to train its Llama LLM. They're asking for damages and an injunction to prevent Meta from using their work. The judge in the case dismissed most of the authors' claims in November of that year, but these latest allegations could breathe new life into the legal dispute.
Meta CEO Mark Zuckerberg approved Meta's use of the LibGen dataset despite concerns from the Meta AI executive team (and others at Meta) that LibGen is “a dataset we know to be pirated,” the plaintiffs' lawyers said in a filing Wednesday. Despite these red flags, the filing says that after the matter was escalated to him, Zuckerberg gave the Meta AI team the green light to continue using the controversial dataset.
Representatives for Meta did not immediately respond to a request for comment.
LibGen, short for Library Genesis, is an online platform that offers free access to books, academic papers, articles, and other written publications without regard for copyright law. It acts as a “shadow library,” making these materials available without permission from the publishers or copyright holders. It currently hosts over 33 million books and over 85 million articles.
Meta tried to keep this under wraps until the last possible moment, the filing alleges. Two hours before the Dec. 13, 2024, deadline for fact discovery, the company produced what plaintiffs described as some of the most damning internal documents it has turned over so far.
According to court filings, Meta's own engineers appear to have been uncomfortable with the plan. Internal messages cited by the authors' lawyers show that Meta engineers hesitated to download the pirated material, with one remarking that doing so from “a [Meta-owned] corporate laptop doesn't feel right (smiling emoticon).” However, the filing claims that they not only downloaded the books but also systematically stripped out copyright information to prepare them for AI training.
The recent filings paint a picture of a company fully aware of the risks. An internal memo warned: “Media coverage suggesting our use of a dataset like LibGen could weaken our negotiating position with regulators.” Even so, Meta continued to download and distribute (or “seed”) the pirated content in January 2024.
Zuckerberg appeared to distance himself from the decision when asked about these activities, testifying that this type of piracy raises “a lot of red flags” and that it “seems like a bad thing.”
The court documents also suggest that Meta's handling of copyrighted material was driven by model training needs rather than by copyright law. According to the filing, an engineer filtered copyright lines and other information out of LibGen to prepare a stripped version for training Llama. This systematic removal of copyright information may bolster the authors' claim that Meta knowingly tried to conceal its use of pirated material.
The revelations come at a critical time for Meta's AI ambitions. The company has been going all out to compete with OpenAI and Google in the AI space: Llama 3.2 is the most popular open-source LLM, and Meta AI is a strong competitor to ChatGPT, with similar features.
Many of these AI companies are facing legal battles over questionable practices in training their large language models. Meta has already been sued by another group of authors for copyright infringement, OpenAI is currently facing various lawsuits for training LLMs on copyrighted material, and Anthropic is also facing various allegations from authors and songwriters.
More broadly, tech entrepreneurs and creators have been at odds ever since generative AI exploded in popularity. There are currently dozens of different lawsuits against AI companies for knowingly using copyrighted material to train their models. But as with anything at the bleeding edge, we'll have to wait and see what the courts say about it all.