It is clear from the recent consultation paper on ‘Copyright and Artificial Intelligence’ that the UK Government would like to introduce a new copyright exception to allow text and data mining for any purpose in a way that broadly mimics the one introduced by the EU under Article 4 of the DSM Directive. This means that the exception would only apply where a rightholder has not ‘reserved their rights.’
The Government shied away from such a solution in 2022, when it proposed a far more liberal exception without the possibility of any such reservation of rights. Perhaps it was no surprise that the Government of the day wished to take a demonstrably different approach from the EU: it was a chance to cast aside the shackles of the EU and show AI model developers that post-Brexit UK was firmly with them. Rightholders marshalled themselves and the short-lived proposal was ultimately abandoned. But that was then, and this is now.
The Labour government’s approach in the new consultation, with its reservation of rights, comes as no surprise to many of us who work in this space. There are, however, significant challenges ahead, as the EU will testify from its own experience. In theory, providing rightholders with the ability to reserve their rights enables them to preserve the ability to license their copyright works, including for AI model training, and to receive remuneration. In practice, reserving rights is anything but straightforward, at least at present.
Can’t I just use robots.txt under the Robots Exclusion Protocol?
You may be able to use a robots.txt file to reserve your rights (indeed, it is the most widely used mechanism at present), but it can be a blunt instrument.
The standard was developed in the 1990s, primarily to address web crawling (in particular, indexing and caching). A robots.txt file operates at the level of domains and URL paths, rather than applying to the individual works/files located at those domains/URLs: it tells a named crawler (user agent) which paths it may and may not fetch. A related but separate mechanism handles indexing: if you do not want a page indexed, you can include a ‘noindex’ robots meta tag in the page (or an X-Robots-Tag HTTP response header) and a compliant crawler will not index it.
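By way of illustration, a minimal robots.txt (served from the root of the domain) might look like the following, with the path shown purely illustrative:

    User-agent: *
    Disallow: /private/

This tells every crawler to stay out of the /private/ path. The separate ‘noindex’ instruction sits in the page itself, for example <meta name="robots" content="noindex"> in the HTML, or an X-Robots-Tag: noindex HTTP response header.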
However, the standard distinguishes between crawlers, not between the purposes for which a crawler collects content, so it cannot be used to prevent some uses of a work while permitting others. For example, you cannot apply the standard to prevent use of a particular copyright work for AI model training while allowing all other uses by the same crawler. Robots.txt therefore offers no solution for a website operator that wishes to be crawled and indexed by search engines but wishes to prevent those same search engines from using its copyright works for AI model training. To that extent, it is a one-size-fits-all approach at the domain/URL level, unlikely to be sufficiently granular to operate as a truly effective reservation of rights in individual copyright works.
There are at least two further difficulties with the standard. First, the rightholder has to specify the reservation of rights for each web crawler. Several web crawlers are likely to be directed at any one domain/URL, so the rightholder needs visibility of them all and then, having identified them, must apply the standard to each (or ask the website operator to do so). Second, there is what could be described as a latency problem. The standard is applied at the stage when the copyright work is crawled/collated, rather than at the subsequent stage when the AI model is trained. This is problematic where a rightholder puts a robots.txt file in place after the work has been crawled/collated but before the AI model is trained, because the ‘reservation’ will not be reflected in the training data.
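To illustrate the first difficulty, a rightholder wishing to exclude the AI-related crawlers it knows about, while remaining open to ordinary search indexing, must list each crawler by its user-agent token. GPTBot (OpenAI), CCBot (Common Crawl) and Google-Extended (Google’s token for AI training) are real examples, but any such list is only as good as the rightholder’s knowledge of the crawlers in operation:

    # Block known AI training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    # All other crawlers (including ordinary search crawlers) may crawl freely
    User-agent: *
    Allow: /

A new or unlisted crawler is simply not caught by the file, and nothing in it reaches works that were crawled before the rules were added, which is the latency problem described above.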
Are there any other viable solutions/standards for reserving rights?
The simple answer is yes, there are. However, there is no single universal standard and, if the reservation of rights/opt-out is to be effective, that is arguably what is required. This is where the UK is likely to run into the same challenges as the EU is currently experiencing. Rightholders need to feel confident that a particular standard is worth investing in and committing to, and AI model developers want certainty. This is a point recognised in the consultation paper.
To address some of the granularity issues associated with the robots.txt standard, an alternative (and possibly complementary) approach is a standard that applies to individual copyright works rather than to domains/URLs, by attaching metadata, such as that defined in the C2PA specifications, to the files in which those works are contained. That metadata can then ‘follow’ the work around the Internet, although it can also be stripped away, which is problematic.
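For example, the C2PA specifications define a ‘training and data mining’ assertion that can be embedded in a file’s cryptographically signed manifest. A simplified sketch of such an assertion follows; the exact field names and values should be checked against the current version of the specification:

    {
      "label": "c2pa.training-mining",
      "data": {
        "entries": {
          "c2pa.ai_training": { "use": "notAllowed" },
          "c2pa.ai_generative_training": { "use": "notAllowed" },
          "c2pa.data_mining": { "use": "notAllowed" }
        }
      }
    }

Because the manifest is signed, tampering with the assertion can be detected; but if the metadata is stripped from the file altogether, the reservation travels no further with the work.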
Some AI model providers, such as Google and OpenAI, also offer their own mechanisms for reserving rights. However, if multiple standards were to evolve across different model providers, rather than a single universal standard, rightholders would have to exercise their reservation of rights in relation to each model, which is likely to be burdensome and less attractive. The standards may also vary greatly, and those based on the robots.txt standard may prove little more effective than what is currently available. For example, Spawning.AI offers ai.txt, a standard addressing web scraping for AI training, which is based on the robots.txt standard but applies only at the level of media type (such as text, images, audio, video or code) rather than to individual copyright works.
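By way of illustration, an ai.txt file borrows the robots.txt syntax but draws its lines at media types rather than individual works. The following sketch is illustrative of the approach rather than a definitive statement of Spawning’s syntax:

    # ai.txt - illustrative sketch only
    User-Agent: *
    Allow: /
    Disallow: /*.jpg
    Disallow: /*.png
    Disallow: /*.mp3

On this approach, all images and audio on the site would be reserved against AI training, but there is still no way to reserve one image while releasing another.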
Spawning.AI also allows rightholders to search the LAION-5B training dataset for images (see https://haveibeentrained.com/) and, if their copyright works are included in it, to ask for them to be added to Spawning’s ‘Do Not Train’ registry so that they are not used in future. This ‘collective reservation of rights’ does not quite give AI model developers a one-stop shop for identifying works to be excluded from training datasets, because it only covers LAION-5B (although Spawning does aggregate opt-out information from some other sources too). While it is not perfect, the registry applies to individual works rather than domains/URLs, and so it could provide a template for a more standardised approach to reservation of rights.
There are other methods of reserving rights (such as the TDM Reservation Protocol, which applies to text and data mining generally rather than to mining for AI model training specifically), but no single method available at present can be considered an effective solution for all reservations of rights in respect of AI model training.
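The TDM Reservation Protocol (TDMRep), a W3C Community Group specification, illustrates what a lightweight machine-readable reservation can look like: rights can be reserved, among other ways, through an HTTP response header or a JSON file at a well-known location on the domain. The following sketch is based on the published specification, though field names should be verified against the current draft, and example.com is a placeholder:

    An HTTP response header:

        tdm-reservation: 1
        tdm-policy: https://example.com/tdm-policy.json

    Or a /.well-known/tdmrep.json file:

        [
          {
            "location": "/",
            "tdm-reservation": 1,
            "tdm-policy": "https://example.com/tdm-policy.json"
          }
        ]

A value of 1 signals that text and data mining rights are reserved, and the optional policy URL can point to licensing terms under which mining may nevertheless be permitted.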
That absence of a single effective solution is reflected in the current second draft of the General-Purpose AI Code of Practice under the EU AI Act, which encourages signatories to support relevant standardisation efforts and to engage with other stakeholders, with the aim of developing interoperable machine-readable standards for rights reservations. The UK consultation paper contains similar overtures and alludes to the possibility of regulation. While the UK has joined the party (perhaps a little late), it seems unlikely that UK regulation will be of much use unless it forms part of a wider international move towards a common standard.
There are also some common challenges across all methods of reservation of rights.
In particular, a given copyright work may have been duplicated on the internet and may therefore be available from multiple sources (some authorised, others not). If a reservation of rights is to be effective, it will need to attach to the work at each source; otherwise a work whose rights have been reserved at source X may remain freely available at sources Y and Z if a machine-readable reservation has not been implemented there too.
Another potential challenge is validation. How is the AI model developer to know that a reservation has been expressed by the copyright holder rather than by an interloper? Some means of verification needs to be built into the standard (or to operate alongside it).
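One conceivable way of building that verification in (a sketch of the idea, not a feature of any existing standard) would be for reservations to be digitally signed with a key verifiably linked to the rightholder, so that a crawler can check the signature before treating the opt-out as genuine. A minimal illustration in Python using the cryptography library; the reservation format and all names here are hypothetical:

    # Hypothetical sketch: signing and verifying a machine-readable rights
    # reservation. Assumes the rightholder's public key is published somewhere
    # authoritative (e.g. a verified registry entry for the domain).
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.exceptions import InvalidSignature

    # The rightholder generates a keypair and publishes the public key.
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    # The machine-readable reservation itself (format purely illustrative).
    reservation = b'{"work": "https://example.com/article-1", "ai-training": "reserved"}'

    # The rightholder signs the reservation...
    signature = private_key.sign(reservation)

    # ...and a crawler verifies the signature before honouring the opt-out.
    try:
        public_key.verify(signature, reservation)
        print("Reservation verified: honour the opt-out.")
    except InvalidSignature:
        print("Signature invalid: the reservation may not come from the rightholder.")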
Is consensus on a universal standard achievable?
It is certainly desirable and should be plausible. However, well over 30 cases are presently being litigated in the US in the field of copyright and AI, and in many of them it is an open question whether the use of copyright works in training data for AI models amounts to a transformative fair use for which no authorisation would be required. AI model developers may feel less incentivised to focus on standards for rights reservations in the UK and EU while that question remains unanswered. However, even if their activities are permitted as fair use in the US, where the majority of AI model training takes place, Article 53(1)(c) and recital 106 of the EU AI Act would require them to put in place a policy to comply with EU copyright law, including observing any reservation of rights under Article 4 of the DSM Directive, where their AI models are placed on the market in the EU. This ‘long-arm’ jurisdiction is controversial. The UK consultation paper presently adopts a slightly softer approach, expressing the desirability of aligning approaches at an international level, but without making any firm proposal at this stage.
There is also some distrust on the rightholder side. Many publishers, for example, feel aggrieved that developers used their portfolios of works to train AI models without consulting them. The relationship between AI model developers and rightholders did not start particularly well in that respect, and the dialogue between them is sometimes strained.
In summary, while there are some technical issues to resolve around a universal standard, perhaps the larger challenge lies in convincing AI model developers to embrace the notion and copyright holders to place their trust in it as a foundation for licensing.