The “Soaring” ChatGPT Urgently Needs “Compliance Brakes”
Core Tips:
ChatGPT and other chat AIs built on natural language processing technology face three main legal compliance issues that urgently need to be resolved in the short term:
First, regarding the intellectual property in the answers a chat AI provides: do those answers give rise to intellectual property rights of their own, and does producing them require authorization from rights holders?
Second, does a chat AI’s data mining and training on huge volumes of natural language text (commonly known as a corpus) require corresponding intellectual property authorization?
Third, one of the mechanisms by which ChatGPT and other chat AIs answer questions is to build a statistics-based language model by running mathematical statistics over a large body of existing natural language text. As a result, a chat AI is quite likely to “talk serious nonsense”, which in turn creates the legal risk of spreading false information.
In general, China’s artificial intelligence legislation is still at the pre-research stage: there is no formal legislative plan or relevant draft bill, and the relevant authorities remain particularly cautious about regulating the field.
1. ChatGPT is not a “cross-era artificial intelligence technology”
ChatGPT is a product of the development of natural language processing technology, and at its core it is still just a language model.
At the beginning of 2023, a huge investment from global technology giant Microsoft made ChatGPT the “top stream” of the technology world and pushed it into the mainstream. As the ChatGPT concept surged in the capital markets, many domestic technology companies also began to move into the field. While the capital markets are enthusiastic about the ChatGPT concept, as legal professionals we cannot help asking: what legal and security risks might ChatGPT itself bring, and what is its path to compliance?
Before discussing the legal risks and compliance paths of ChatGPT, we should first examine its technical rationale: can ChatGPT really answer any question a questioner puts to it, as the news suggests?
In the eyes of Sister Sa’s team, ChatGPT is far from being the “miracle” some news reports advertise. In a word, it is an integration of natural language processing technologies such as the Transformer and GPT, and at its core it is still a neural-network-based language model, not a “cross-era AI breakthrough”.
As mentioned earlier, ChatGPT is a product of the development of natural language processing technology, which has roughly passed through three stages: grammar-based, statistics-based, and neural-network-based language models. To understand ChatGPT’s working principle and the legal risks that may arise from it, we must first clarify how the statistics-based language model, the predecessor of the neural-network-based model, works.
In the statistics-based stage, AI engineers determine the probability of one word following another by counting a huge volume of natural language text. When someone asks a question, the AI analyzes which words appear with high probability in the linguistic context of the question’s constituent words, then splices those high-probability words together to return a statistically derived answer. This principle has run through the development of natural language processing ever since it emerged; in a sense, the later neural-network-based language models are themselves a refinement of statistics-based models.
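The core idea of a statistics-based language model can be sketched in a few lines. The snippet below is a deliberately minimal, hypothetical bigram model (not ChatGPT’s actual implementation, which is far more sophisticated): it counts how often word pairs co-occur in a toy corpus, then greedily emits the statistically most likely next word.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_words=5):
    """Greedily append the statistically most probable next word."""
    out = [start]
    for _ in range(max_words):
        followers = counts.get(out[-1])
        if not followers:
            break  # no statistics for this word: stop
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# A toy corpus standing in for the huge text collections described above.
corpus = [
    "dalian has zhongshan park",
    "zhongshan park is historic",
    "the park has beautiful gardens",
]
model = train_bigram(corpus)
print(generate(model, "zhongshan"))  # → "zhongshan park is historic"
```

Note that the model can only recombine word sequences it has counted; it has no notion of whether the resulting sentence is true, which is exactly the root of the “serious nonsense” problem discussed below.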
To give an easy-to-understand example, Sister Sa’s team typed the question “What are the tourist attractions in Dalian?” into the ChatGPT chat box, as shown in the figure below:
In the first step, the AI analyzes the basic morphemes in the question: “Dalian”, “which”, “tourist”, “attractions”. It then locates, in its existing corpus, the natural language texts containing these morphemes, finds the collocations that occur with the highest probability in that set, and combines those collocations into the final answer. For example, the AI finds that the phrase “Zhongshan Park” occurs with high probability alongside the words “Dalian”, “tourism”, and “attraction”, so it returns “Zhongshan Park”; and because “park” collocates most often with words such as garden, lake, fountain, and statue, it further returns “This is a historic park with beautiful gardens, lakes, fountains, and statues.”
In other words, the whole process rests on probability statistics over the natural language texts (the corpus) already sitting behind the AI, so the answers it returns are also “statistical results”. This is what leads ChatGPT to “talk serious nonsense” on many questions. Take the answer to “What are the tourist attractions in Dalian?”: although Dalian does have a Zhongshan Park, that park has no lakes, fountains, or statues. Dalian did historically have a “Stalin Square”, but it was never a commercial square, and it had no shopping centers, restaurants, or entertainment venues. Clearly, the information ChatGPT returned is false.
2. The application scenarios currently best suited to ChatGPT as a language model
Although we bluntly laid out the drawbacks of statistics-based language models in the previous section, ChatGPT is a neural-network-based language model that greatly improves on them. Its technical foundations, the Transformer and GPT, belong to the latest generation of language models and model natural language at great depth. The sentences it returns are sometimes “nonsense”, but at first glance they still read like “human responses”, so the technology has broad application in scenarios requiring massive human-computer interaction.
For now, there are three such scenarios:
First, search engines;
Second, human-computer interaction in banks, law firms, intermediaries of all kinds, shopping malls, hospitals, and government service platforms, such as customer complaint systems, guidance and navigation, and government affairs consultation systems in those places;
Third, the interaction mechanism of smart cars and smart homes (such as smart speakers and smart lights).
A search engine that incorporates AI chat technology such as ChatGPT is likely to take the form of a traditional search engine combined with a neural-network-based language model. Traditional search giants such as Google and Baidu already have deep expertise in neural-network-based language model technology; Google, for example, has Sparrow and LaMDA, which are comparable to ChatGPT.
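The “traditional search + language model” combination can be sketched roughly as follows. Everything in this snippet is an illustrative assumption, not any vendor’s actual architecture: `search` stands in for a real search index, and `summarize` stands in for a real neural language model that writes an answer grounded in the retrieved documents.

```python
def search(query, index):
    """Toy keyword search: return documents sharing words with the query."""
    terms = set(query.lower().split())
    return [doc for doc in index if terms & set(doc.lower().split())]

def summarize(query, documents):
    """Stand-in for a neural language model that composes an answer
    constrained by the retrieved documents."""
    if not documents:
        return "No reliable sources found."
    return f"Answer to '{query}', grounded in {len(documents)} retrieved source(s)."

def answer_with_search(query, index):
    # Step 1: traditional search narrows the corpus to relevant documents.
    docs = search(query, index)
    # Step 2: the language model generates text constrained by those documents,
    # which reduces (but does not eliminate) fabricated answers.
    return summarize(query, docs)

index = [
    "Zhongshan Park is a park in Dalian",
    "Dalian is a coastal city in Liaoning",
]
print(answer_with_search("attractions in Dalian", index))
```

The design point is that grounding generation in retrieved documents gives the model something verifiable to lean on, which is why this hybrid is attractive to search providers despite the residual risk of error.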
Applying AI chat technology such as ChatGPT to customer complaint systems, guidance and navigation in hospitals and shopping malls, and government affairs consultation systems would greatly reduce the staffing costs of the relevant organizations and save communication time. The problem is that statistics-based answers can be completely wrong, and the risk-control implications of that may need further evaluation.
Compared with the two scenarios above, the legal risk of ChatGPT serving as the human-computer interaction mechanism of smart cars and smart home devices is much smaller: the application environment is relatively private, wrong content fed back by the AI is unlikely to cause significant legal risk, the accuracy requirements are lower, and the business model is more mature.
3. A Preliminary Study of ChatGPT’s Legal Risks and Compliance Path
First, the overall regulatory landscape of artificial intelligence in China
Like many emerging technologies, the natural language processing technology represented by ChatGPT faces the “Collingridge dilemma”. This dilemma has two parts: the information dilemma, in which the social consequences of an emerging technology cannot be predicted at its early stage; and the control dilemma, in which, by the time a technology’s adverse social consequences are discovered, the technology has often become part of the entire social and economic structure, so those consequences can no longer be effectively controlled.
With artificial intelligence, and natural language processing in particular, in a stage of rapid development, the technology may well fall into this “Collingridge dilemma”, and legal regulation does not seem to have kept pace. China currently has no national legislation on the artificial intelligence industry, but there have been legislative attempts at the local level. In September last year, Shenzhen announced the “Regulations on the Promotion of the Artificial Intelligence Industry in the Shenzhen Special Economic Zone”, the country’s first special legislation on the artificial intelligence industry, and Shanghai subsequently passed the “Regulations on Promoting the Development of the Artificial Intelligence Industry in Shanghai”.
On the ethical regulation of artificial intelligence, the National New Generation Artificial Intelligence Governance Professional Committee issued the “Ethical Norms for New Generation Artificial Intelligence” in 2021, proposing to integrate ethics into the whole life cycle of AI research, development, and application.
Second, the legal risk of disinformation brought about by ChatGPT
Shifting the focus from the macro to the micro: leaving aside the overall regulatory landscape of the AI industry and the ethical regulation of AI, the practical compliance issues underlying chat AIs such as ChatGPT also demand urgent attention.
As mentioned in Part 2 of this article, ChatGPT’s working mechanism means its replies can be outright “serious nonsense”, which is extremely misleading. A false answer to a question like “What are the tourist attractions in Dalian?” may have no serious consequences, but once ChatGPT is applied to search engines, customer complaint systems, and similar fields, the false information in its replies may pose extremely serious legal risks.
In fact, this legal risk has already materialized. Galactica, a language model for scientific research launched by Meta in November 2022, almost at the same time as ChatGPT, was taken offline after only three days of testing because its answers mixed truth and falsehood. Given that the underlying technical principles cannot be overcome in the short term, if ChatGPT and similar language models are to be applied to search engines, customer complaint systems, and other fields, they must be adapted for compliance: when the system detects that a user may be asking a professional question, it should direct the user to consult the appropriate professional rather than seek the answer from the AI, and it should prominently remind the user that the authenticity of answers returned by a chat AI may need further verification, so as to minimize the corresponding compliance risks.
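One way to implement the compliance adaptation just described is a simple guardrail layer placed in front of the chat model. The sketch below is purely hypothetical: the keyword list, the `ask_model` placeholder, and the routing rules are illustrative assumptions, not any real product’s API. A production system would use a far more robust classifier than keyword matching.

```python
# Illustrative mapping from professional topics to the expert the user
# should be directed to instead of the AI.
PROFESSIONAL_TOPICS = {
    "diagnosis": "a licensed physician",
    "medication": "a licensed physician",
    "lawsuit": "a qualified lawyer",
    "contract": "a qualified lawyer",
    "tax": "a certified tax advisor",
}

DISCLAIMER = ("Note: this answer is generated by AI and may be inaccurate; "
              "please verify it independently before relying on it.")

def ask_model(question: str) -> str:
    # Placeholder for the real chat model call.
    return f"[model answer to: {question}]"

def answer(question: str) -> str:
    """Route professional questions to humans; otherwise answer with a
    prominent verification reminder appended."""
    lowered = question.lower()
    for keyword, expert in PROFESSIONAL_TOPICS.items():
        if keyword in lowered:
            return (f"This looks like a professional question. "
                    f"Please consult {expert} instead of relying on AI.")
    return ask_model(question) + "\n" + DISCLAIMER

print(answer("Can I deduct this expense on my tax return?"))
print(answer("What are the tourist attractions in Dalian?"))
```

The two ideas from the paragraph above map directly onto the two branches: professional questions are deflected to a human expert, and every AI-generated reply carries a prominent reminder that it may need verification.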
Third, the intellectual property compliance issues brought about by ChatGPT
Still at the micro level: beyond the authenticity of the AI’s replies, the intellectual property issues surrounding chat AIs, especially large language models like ChatGPT, also deserve the attention of compliance officers.
The first compliance question is whether “text data mining” requires corresponding intellectual property licensing. As noted above, ChatGPT relies on a huge volume of natural language texts (a corpus). To mine and train on the data in the corpus, ChatGPT must copy the corpus content into its own database; in the field of natural language processing, this behavior is commonly called “text data mining”. Where the corresponding text data may constitute a copyrighted work, whether text data mining infringes the right of reproduction remains controversial.
In comparative law, both Japan and the European Union have expanded the scope of fair use in their copyright legislation, adding “text data mining” by AI as a new instance of fair use. During the 2020 revision of China’s Copyright Law, some scholars advocated changing China’s fair use system from “closed” to “open”, but that proposal was not adopted. China’s Copyright Law therefore still maintains a closed fair use system: only the thirteen circumstances enumerated in Article 24 can be recognized as fair use. In other words, China’s Copyright Law does not currently bring “text data mining” by AI within the scope of fair use, so text data mining in China still requires corresponding intellectual property authorization.
As for whether AI-generated content is original, Sister Sa’s team believes the criteria should be no different from the existing standards: whether a response is produced by an AI or by a human, it should be judged against the same originality standards. That said, under the intellectual property laws of most countries, including China, the author of a work can only be a natural person, so an AI cannot be the author of a work.
Finally, if ChatGPT splices a third party’s work into its reply, how should the intellectual property rights be handled? Sister Sa’s team believes that if ChatGPT’s reply splices in a copyrighted work from the corpus (although, given how ChatGPT works, this is unlikely), then under China’s current Copyright Law, copying it without the copyright owner’s authorization constitutes infringement unless the use qualifies as fair use.