
Generative AI in context of copyright: knowing your hypersurfaces from your stochastic parrots

The European Parliament’s Policy Department for Justice, Civil Liberties and Institutional Affairs has prepared a briefing entitled ‘Technological Aspects of Generative AI in the Context of Copyright’. 

It is a technical paper which explains (albeit in a simplified way) how GenAI systems function by reference to ‘hypersurfaces’. If you are already scratching your head, here is a quote from the paper: “[I]magine the hypersurface as a rubber sheet stretched in many dimensions, with each training point exerting a small pull on it. The result is a smooth but complex surface that passes near many of the original [training] data points”.
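
For the more technically minded, here is a toy illustration (my own, not the paper's) of that ‘rubber sheet’ idea in Python: each training point pulls a smoothed surface towards its own value, so the surface ends up passing close to the training data without storing it as such.

```python
# A minimal sketch of the 'rubber sheet' intuition: a smooth surface pulled
# towards each training point. Gaussian kernel smoothing is used purely as an
# illustration; real GenAI models learn far higher-dimensional surfaces with
# neural networks.
import numpy as np

rng = np.random.default_rng(0)

# Toy 'training data': 20 points in two dimensions, each with a value.
train_x = rng.uniform(-1, 1, size=(20, 2))
train_y = np.sin(3 * train_x[:, 0]) + 0.5 * train_x[:, 1]

def surface(query, bandwidth=0.3):
    """Value of the learned surface at `query`: a weighted average in which
    nearby training points exert the strongest 'pull'."""
    dists = np.linalg.norm(train_x - query, axis=1)
    weights = np.exp(-(dists / bandwidth) ** 2)
    return np.sum(weights * train_y) / np.sum(weights)

# The surface passes near (but not exactly through) the training points.
print(surface(train_x[0]), train_y[0])
```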

GenAI systems differ from traditional machine learning models. Rather than predicting a label or property from inputs (a deterministic mapping), they are trained to synthesise new outputs (probabilistic sampling over learned distributions). This is not like ‘human learning’. It does not involve ‘understanding’. Instead, it relies on statistical approximation: the system learns patterns from large datasets and can reproduce those patterns in generated outputs. You could say that GenAI systems are stupid: they cannot think.
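
A crude sketch of that distinction (mine, not the briefing's): a traditional model returns the same label for the same input, whereas a generative model samples from a learned distribution, so the same prompt can produce different outputs. The toy ‘bigram’ model below is a deliberately simplistic stand-in for a real GenAI system.

```python
import numpy as np
from collections import Counter, defaultdict

rng = np.random.default_rng(0)

# Traditional / discriminative: the same input always yields the same label.
def predict_label(score):
    return "spam" if score > 0.5 else "not spam"

print(predict_label(0.9), predict_label(0.9))  # deterministic

# Generative: learn a distribution over next words, then sample from it.
corpus = "the cat sat on the mat the cat slept on the mat".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sample_next(word):
    options, freqs = zip(*counts[word].items())
    probs = np.array(freqs) / sum(freqs)
    return rng.choice(options, p=probs)

print([sample_next("the") for _ in range(5)])  # probabilistic, can vary
```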

The briefing paper suggests that the core technical challenge lies in the functional dependency between the training data and the learned hypersurface. Put another way, outputs are generated from training data that ‘probabilistically persists’ in the model’s generative function, even though the model does not contain copies of the training data itself.

The paper also considers the challenge of “stochastic parroting”. I have never fully understood why, in an already technical field, the industry chooses to adopt expressions that I have to look up. Anyway, this describes the phenomenon where an AI model produces outputs which are very similar to its training data inputs, perhaps because the hypersurface has been disproportionately influenced by specific training data (for example, where the same work appears several times in the training dataset).
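
A toy example of my own (not drawn from the paper) shows why duplication matters: if one work is heavily duplicated in the training data, the learned distribution is skewed towards it, and sampling will tend to ‘parrot’ it back.

```python
# If the same phrase appears many times in the training data, most of the
# probability mass of this toy counter-based 'model' sits on it, so sampled
# outputs largely echo that one work.
from collections import Counter
import random

random.seed(0)

training_sentences = (
    ["to be or not to be"] * 8              # one work duplicated many times
    + ["a rose by any other name"]
    + ["all that glitters is not gold"]
)

distribution = Counter(training_sentences)
sentences = list(distribution.keys())
weights = list(distribution.values())

samples = random.choices(sentences, weights=weights, k=10)
print(samples.count("to be or not to be"), "of 10 samples echo the duplicated work")
```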

The paper also recognises that the architectures of GenAI systems do not embed traceability of training inputs by design, meaning that there is a disconnect between those training inputs and generated outputs. In other words, it is not possible to determine the influence of any given piece of training data on an output (the ‘traceability gap’). Clearly, this is a significant challenge for any copyright holder who wishes to identify whether their work has been used to generate a specific output. There are, however, some early-stage efforts to address some of these challenges.

The paper concludes that it should be possible to find a technical solution to these challenges, but that this will require research, investment, standardisation and industry adoption. The challenges, it states, are not intrinsic limitations of AI but the result of current design limitations and insufficient investment in transparency. It recommends that developers of GenAI systems should take responsibility for documenting data provenance and enabling transparency audits.
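
What might ‘documenting data provenance’ look like in practice? The briefing does not prescribe a format, but one could imagine a record kept alongside each training item so that a later transparency audit can ask what went into the model. The field names and the log_training_item helper in this sketch are entirely hypothetical, not taken from the briefing or from any existing standard.

```python
import hashlib
import json
from datetime import datetime, timezone

provenance_log = []

def log_training_item(text, source_url, licence):
    """Record a hash of the training text together with its source and licence."""
    provenance_log.append({
        "content_hash": hashlib.sha256(text.encode()).hexdigest(),
        "source_url": source_url,
        "licence": licence,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    })

log_training_item("Example training text...", "https://example.com/work", "CC-BY-4.0")
print(json.dumps(provenance_log, indent=2))
```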

If the hypersurface for a GenAI model is ‘overfitted’ (i.e., it closely follows the training data) and is, as a result, prone to ‘memorisation’ (i.e., replicating training data in generated outputs), then these concerns around generated outputs have some validity. However, I have no sense of the frequency - among GenAI models in general - with which outputs infringe works included in the training data used for the model.

A copyright holder may have concerns that the use of their works in training data inputs may have a ‘cannibalising’ effect at the output stage - for example, if a user can use the GenAI model to produce artistic works in the style of a particular artist. But that may not be a copyright problem at the output stage (works ‘in the style of’ might not infringe copyright). Instead, it could be viewed as something which needs to be factored into a valuation for the use of the copyright works at the input stage. Of course, that itself requires appropriate transparency so that a right holder can determine whether their work has been used in the training data (again, assuming that the use is not covered by a copyright exception).
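
Returning to the ‘overfitting’ and ‘memorisation’ point above, a toy numerical illustration (again my own, not the paper's): a model flexible enough to pass exactly through its training points reproduces them verbatim, while a more constrained fit only follows their general shape. Polynomial regression stands in for a GenAI model here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)

overfit = np.polyfit(x, y, deg=7)   # as many parameters as data points
general = np.polyfit(x, y, deg=3)   # a smoother, less flexible fit

# The overfitted model returns the training values almost exactly
# ('memorisation'); the constrained model only approximates them.
print(np.max(np.abs(np.polyval(overfit, x) - y)))   # close to zero
print(np.max(np.abs(np.polyval(general, x) - y)))   # noticeably larger
```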

Perhaps more controversially, the briefing also recommends that remuneration frameworks should reflect not only literal copying but also statistical usage and cumulative influence over generated outputs. From a copyright perspective, ‘statistical usage’ and ‘cumulative influence’ may be relevant, but only to the extent that they involve the reproduction of the creative expression of the author of the copyright work within the training data, and assuming the use is not covered by any copyright exception. If such rights of remuneration are to be considered, copyright may not be the appropriate legal basis.

