A class action has been filed against, among others, Microsoft, Open AI and GitHub in the US District Court for the Northern District of California.
The class action primarily relates to the use of software code hosted in GitHub, an internet hosting repository for software development which commonly includes software code licensed under open-source licences such as GPL and Apache. This software code is said to have been utilised to train GitHub’s "Copilot", a programming tool that enables software developers to quickly generate code by generating coding suggestions from a programmer’s natural language comments. Copilot is powered by OpenAI’s "Codex" artificial intelligence system, which is also included in the action. The Copilot tool is a proprietary tool that requires programmers to sign up to a subscription model to use it.
This appears to be the first US class action challenging the training and output of an AI model and is therefore likely to been keenly watched by developers, data scientists and lawyers in the field.
The press release from the claimants’ lawyers states “By training their AI systems on public GitHub repositories […] we contend that the defendants have violated the legal rights of a vast number of creators who posted code or other work under certain open-source licenses on GitHub.”
The action is likely to involve a number of issues, including the following:
Has copyright in software hosted in GitHub been infringed? In that respect:
- Has code been copied from GitHub in order to be used as training data for, or incorporated in, Copilot/Codex?
- Are Copilot coding suggestions derivative works of GitHub code or do they render such derivative works as outputs?
- Is the use made of code hosted in GitHub permitted under the US doctrine of fair use - for example, because it is transformative and used for a different purpose, namely to train Copilot/Codex to generate code? Certainly there have been instances in the US where text and data mining has been upheld as a fair use.
Do the open-source licence terms included with the code hosted in GitHub prevent the use made of the code by Microsoft/OpenAI and GitHub? For example:
- Do the terms preclude use of the code (i) as training data or (ii) which results in it being incorporated verbatim in proprietary code such as Copilot/Codex?
- Have those license terms otherwise been breached – for example, by a failure to attribute the author’s name and copyright in a manner required by the open-source licences?
Do GitHub’s own Terms of Service entitle it to use the GitHub code as it has done? Notably, the lawsuit suggests that the defendants have violated their own Terms of Service.
The claimant lawyers have noted, “[t]his is the first step in what will be a long journey.”
We shall keep a close eye on this one.