Input or Output? Framing legal arguments in ChatGPT training model copyright cases

Generative AI raises a host of interesting legal issues. One area of major controversy involves the use of large datasets to train the large language models that power systems such as ChatGPT. Professor Festinger has mentioned the case of Margaret Atwood, specifically the world-renowned author’s response when she discovered that pirated copies of her books had been used to train AI.

In a recent episode of Prof Michael Geist’s podcast, guest speaker Andres Guadamuz broke down how large language models are trained on big data, using artistic or literary works as the original source. The process is divided into an input phase and an output phase. During the input phase, training data may be scraped from Wikipedia, Reddit, journal articles, and similar sources, and all of the words are copied into a database. To generate output, machine learning researchers then extract token information from that text; the resulting model consists of statistical and mathematical data rather than the original works. Interestingly, unlike in a search engine, the texts themselves do not play a critical role in generating the output.
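To make that input/output distinction concrete, here is a minimal, purely illustrative Python sketch (my own toy example, not anything described in the podcast): verbatim copying happens at the input stage when the corpus is assembled, while the trained model that generates output retains only token statistics. The two-sentence corpus and the bigram counting below are stand-ins for the vastly larger datasets and neural networks used in practice.

```python
from collections import Counter, defaultdict

# Input phase (toy version): scraped texts are copied verbatim into a
# corpus. This is the step where the actual copying of expression occurs.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

# Training (toy version): extract token statistics, here simple counts of
# which token follows which, standing in for a real model's parameters.
bigram_counts = defaultdict(Counter)
for text in corpus:
    tokens = text.split()
    for current, nxt in zip(tokens, tokens[1:]):
        bigram_counts[current][nxt] += 1

# The resulting "model" is statistics, not the source text itself.
model = {
    token: {nxt: count / sum(followers.values())
            for nxt, count in followers.items()}
    for token, followers in bigram_counts.items()
}

print(model["the"])  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```

The legal significance is that a copying claim attaches most naturally to the corpus-building step: by the time the model exists, the expressive text has been reduced to numbers.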

The recent class action lawsuit against Stability AI highlights how lawyers should frame their legal arguments carefully, with a fundamental understanding of how these training models work. The more promising legal claims lie in the input phase rather than the output phase. In this case, three visual artists accused Stability AI, Midjourney and DeviantArt of misusing their copyrighted works in connection with the companies’ generative AI systems. Only one of the three artists was able to advance her claim, and even some of her claims were dismissed for lack of evidence (for example, no evidence that copyright management information was removed when her work was input into the model). The only promising claim that could potentially move forward is that her pictures were included in the training data. The claims of the other two artists were dismissed because they had not registered copyrights in their works.

In the podcast, Andres raised an important point: the main challenge for these legal arguments is that the training models have yet to produce a clearly infringing output. OpenAI usually has a strong fair use case because an individual artist’s works are subsumed within a large corpus of other works. Because training involves an internal process of extracting information, the intermediate copies generated along the way have no commercial value on their own, and they are never published or communicated to the public. The output produced would not compete with the authors’ works (fair use factor four: the effect on the commercial market for the original work).

Therefore, lawyers for authors considering a claim against OpenAI need to think about the specific way their clients’ work was used in the training model, and the actual damage their clients suffered.

Moreover, Andres proposed some alternative (perhaps better!) solutions to going to court: for instance, technical tools that would let creators remove their works from training datasets, and encouraging content creators to enter into licensing agreements with these corporations.
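The podcast does not name specific tools, but one concrete opt-out mechanism that some AI crawlers already honour is the robots.txt protocol; OpenAI’s crawler, for example, identifies itself as GPTBot. Below is a small, hedged sketch, using Python’s standard library and a placeholder domain, of how a well-behaved crawler could check for such an opt-out before collecting a page for a training corpus.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical opt-out check: before scraping a page for a training
# corpus, consult the site's robots.txt to see whether it disallows
# a given AI crawler's user agent (here, OpenAI's "GPTBot").
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

page = "https://example.com/stories/my-novel.html"  # placeholder page
if rp.can_fetch("GPTBot", page):
    print("Site permits crawling; page may be collected.")
else:
    print("Site opts out; page should be excluded from the corpus.")
```

Of course, a robots.txt opt-out only addresses future scraping; removing works already baked into a trained model is a much harder technical problem, which is part of why licensing may prove the more practical route for creators.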