Adobe under legal fire: accused of training AI with pirated books through contaminated data chain

Generative artificial intelligence has opened a legal Pandora’s box for the tech industry. While Adobe was betting on expanding its arsenal of AI-powered tools with products like Firefly, a new class-action lawsuit threatens to dismantle the foundations of how these systems are built. The accusation is straightforward: the software company used pirated literary works to train SlimLM, its series of language models optimized for document tasks on mobile devices.

The Contaminated Path of Training Data

The core of the dispute lies in how Adobe obtained its data. According to the lawsuit filed by Elizabeth Lyon, an Oregon-based author specializing in nonfiction guides, SlimLM was pre-trained on SlimPajama-627B, a dataset released by Cerebras in 2023. Here is the critical problem: SlimPajama is not a clean dataset. It was created by filtering and deduplicating RedPajama, which in turn contains a problematic subset known as Books3, a massive collection of 191,000 volumes.

This chain of derivation is what strengthens the legal case. Lyon's attorneys argue that by using a subset derived from Books3, Adobe indirectly incorporated thousands of copyrighted works without consent or compensation. Books3 has been a source of contamination in numerous AI training efforts, and each new lawsuit exposes how developers perpetuate the cycle.

A Wave of Litigation Defining the Industry

Adobe is not alone in this legal dilemma. In September, Apple faced similar allegations for incorporating copyrighted material into its Apple Intelligence model, again explicitly mentioning RedPajama as a data contamination source. A few weeks later, Salesforce received an identical legal blow, also linked to the use of datasets containing pirated works.

The pattern is hard to ignore: major tech companies have built their AI systems on data pipelines contaminated from the start. The plaintiffs contend this is not accidental negligence but the product of an industry that prioritized development speed over legal diligence.

The Precedent That Changed the Game

The most significant settlement so far came when Anthropic, creator of the chatbot Claude, agreed to pay $1.5 billion to authors who sued it for using pirated versions of their books. The deal is widely regarded as a turning point, a sign that courts are taking copyright protection seriously in the AI era.

With each new case citing Books3, RedPajama, and their derived subsets as evidence of infringement, the industry faces an uncomfortable reality: most current AI models rest on legally questionable bases. What started as a lawsuit against Adobe could end up being a catalyst for a complete rethinking of how AI systems are developed and trained.
