## The Intellectual Property Dilemma in AI Model Training: The Adobe Case and Beyond
The use of massive datasets to train artificial intelligence systems has become standard practice in the tech industry, but it has also sparked an unprecedented legal conflict. The core of the issue lies in how these models acquire their capabilities: by processing enormous collections of data that, in many cases, contain copyrighted works used without the explicit consent of the original creators.
### Adobe Under Scrutiny: SlimLM and the Books3 Legacy
Adobe, which has invested heavily in artificial intelligence since 2023 with products like Firefly, now faces a class-action lawsuit questioning the methods behind its SlimLM technology. Elizabeth Lyon, an Oregon-based author who specializes in nonfiction writing guides, leads the suit, alleging that her works were included without authorization in the model's training data.
The complaint points to a chain of dataset derivations that illustrates the complexity of the problem. SlimLM was pre-trained on SlimPajama-627B, an open-source dataset released by Cerebras. The difficulty is that SlimPajama was created as a processed derivative of RedPajama, which in turn contains Books3: a collection of roughly 191,000 books that has become the source of numerous legal controversies. Each derived dataset potentially inherits the intellectual-property liabilities of its predecessor, creating a diffuse but real chain of responsibility.
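The lineage at issue is easiest to see as a directed graph of "derived from" relationships. The short Python sketch below is a minimal illustration: the node names follow the public descriptions of these datasets, but the graph structure itself is simplified for the example. A transitive walk over such a graph surfaces Books3 as an upstream source of SlimLM's training data even though the model never consumed it directly.

```python
# Hypothetical sketch of the dataset lineage described above, modeled as a
# directed graph. Node names follow public documentation for SlimLM,
# SlimPajama, and RedPajama; the exact edge structure here is illustrative.

DERIVED_FROM = {
    "SlimLM": ["SlimPajama-627B"],        # Adobe's model, per the complaint
    "SlimPajama-627B": ["RedPajama"],     # Cerebras's cleaned, deduplicated derivative
    # RedPajama's published data slices; Books3 is the contested one.
    "RedPajama": ["Books3", "CommonCrawl", "C4", "GitHub",
                  "ArXiv", "Wikipedia", "StackExchange"],
}

def transitive_sources(node: str, graph: dict[str, list[str]]) -> set[str]:
    """Collect every upstream dataset reachable from `node`."""
    seen: set[str] = set()
    stack = [node]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Books3 appears as an ancestor even though SlimLM never touched it directly.
print("Books3" in transitive_sources("SlimLM", DERIVED_FROM))  # -> True
```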
### A Pattern Repeating Across the Industry
What is happening with Adobe is not an isolated incident but part of a broader pattern now buckling under legal challenges. In September, Apple faced similar accusations of using copyrighted material to train Apple Intelligence, again with RedPajama named as a source. Around the same time, Salesforce was sued on nearly identical grounds.
The most significant development came when Anthropic agreed to a $1.5 billion settlement with authors who had sued over the unauthorized use of their works to train Claude. The settlement, reported in September, was widely seen as a turning point in litigation over copyright in AI training data.
### Where Is the Industry Heading?
The accumulation of class-action lawsuits suggests that the current data-acquisition model for AI training is legally unsustainable. Tech companies face a dilemma: training powerful models requires massive volumes of data, yet most jurisdictions still lack an established framework for acquiring such volumes legally and with compensation. Adobe's case, particularly the way SlimLM inherited problematic data from upstream datasets, shows how responsibility can be traced through multiple layers of data processing, even when companies rely on nominally "open-source" datasets.
The industry is at a critical juncture where legal precedents are beginning to define what is acceptable and what is not in AI training.