Artificial intelligence (AI), and in particular large language models (LLMs), have been dominating the tech sphere, as well as the mainstream media, over the past few months. AI chatbots, such as ChatGPT, Google Bard and BLOOM from BigScience, have been showcased doing things that are often extremely intellectually challenging for humans, including passing the U.S. Medical Licensing Exams, taking part in Google’s software engineer interview process and passing the US Multistate Bar Exam. These AI chatbots rely on LLMs. This article describes what large language models are and what they may be used for, and sets out some of the potential legal issues which may arise from such use.
What are LLMs?
LLMs are mathematical models which use machine learning methods known as ‘deep learning’ (sometimes called ‘deep structured learning’) to predict sequences of words. Trained on large datasets (such as Wikipedia), an LLM builds a statistical probability model which generates an output or prediction in response to a specific input (e.g., a question) from a ‘user’. Simply put, LLMs use statistics based on the data (i.e., words) that they are trained upon to generate the most likely sequence of words following on from a prompt or question from a user. They are generative (hence the term ‘generative AI’, which is often used to describe them), meaning that they can create new content that can often appear human-like.
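The ‘most likely next word’ idea can be illustrated with a toy sketch. The snippet below is a deliberately simplified illustration, not how real LLMs work: it counts which word follows which in a tiny hypothetical corpus and always picks the most frequent successor, whereas real LLMs use deep neural networks trained on vastly larger datasets.

```python
from collections import Counter, defaultdict

# A tiny hypothetical corpus standing in for the web-scale
# datasets (e.g., Wikipedia) that real LLMs are trained on.
corpus = "the cat sat on the mat . the cat ate . the dog ran .".split()

# Count, for each word, which words follow it and how often.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the statistically most likely word to follow `word`."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" - it follows "the" most often here
print(predict_next("sat"))  # "on"
```

A real LLM replaces these raw counts with a learned probability distribution over its whole vocabulary, conditioned on the entire preceding prompt rather than a single word, but the underlying principle of predicting likely continuations is the same.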
Potential scope for LLMs
Given that LLMs can generate human-like text responses, the potential application for such technology is enormous. They can be used to create chatbots to answer customer service questions, translate different languages in real time or generate content based on specific instructions. Such content could include real-time news, marketing material, summaries of legal judgments or your weekly shopping list.
Another exciting application for LLMs is their integration with internet search engines. Microsoft has recently launched an LLM-powered search engine as part of a major Windows 11 upgrade. The Bing app sits in the PC taskbar and allows users to receive better search results, more complete answers, a new chat experience and the ability to generate content quickly.
Microsoft has also made some recent waves by integrating LLMs into Microsoft Teams, announcing in February this year that Microsoft Teams Premium users can now use OpenAI’s GPT-3.5 AI language model to generate notes, tasks and suggest meetings based on the context of conversations between colleagues.
Potential legal issues
Notwithstanding the potential commercial benefits of this technology, and in common with other tech advances over the years, there are potential legal issues that may arise from the use of LLMs, particularly from an intellectual property or data protection perspective. For example:
An LLM will generate its own written content by using its algorithm to ‘search’ its dataset for likely ‘answers’. This may result in the LLM using or ‘scraping’ data from sources which may be protected by existing IP rights such as copyright. Who would therefore own the copyright in the ‘output’ generated by LLMs and what happens if that generated content infringes a third party’s copyright?
Since an LLM is trained on large amounts of text from varying sources, it could be argued that the authors of the material which feeds into the machine learning algorithm have a claim to copyright in the generated content.
Conversely, the creators of the algorithm which sits behind the LLM could be seen as the legal copyright owners of the generated content. From an infringement perspective, it is also possible that an LLM produces content which infringes the copyright of existing works. In these instances, whether liability would sit with the LLM user or the creator of the algorithm is unclear, and this may be an area of uncertainty for IP experts in the coming years.
The more data an LLM is trained upon, the more accurate and efficient it becomes in predicting the sequencing of words and phrases. However, the more data that is used or ‘scraped’, the more likely it is that there could be data protection issues. For example, if datasets created for use by an LLM contain personal data, the relevant data subjects may not have consented to the processing of such data, contrary to the GDPR.
Further, it is unclear how the ‘right to be forgotten’ under the GDPR would be enforced against an LLM. Whilst it may be possible to remove personal data from content generated by an LLM, it may be practically impossible to remove all traces of an individual’s personal information from the initial dataset used by the LLM to create such content, particularly if the dataset in question is an enormous online word repository such as Wikipedia.
A final area of concern is ‘fake news’ creation and attribution. LLMs are generative and create content based on the most likely sequence of words. However, an LLM’s output is not guaranteed to be factually correct. Consequently, an LLM may produce inaccurate or misleading answers, whether by mistake or design, which could be perceived as genuine output from an individual, corporate and/or state body. This may mislead individual users who take at face value everything they read on a website or computer, and the potential impact on society of mass ‘fake news’ creation by LLMs will undoubtedly be an area of concern for national and international regulators.
LLMs have the potential to transform the way in which individuals search for and access information and produce ‘generative’ works such as books and art. This could blur the line between human and machine, making it more difficult for members of the public to understand who (or what) they are dealing with. Despite this, there is no doubt that the positive aspects of such technology will change the way that we communicate, work and play in the digital world.
Given the speed at which this technology is developing, it is important that regulators closely monitor the impact that LLMs may have on sensitive areas such as intellectual property, data protection and the accurate reporting of facts.
You might also be interested in:
Event: What a time to be (AI) live!
Join us on 25 April when we will explore the commercial, legal and ethical questions prompted by the rapid rise of Generative AI, including text-to-text models like ChatGPT, and text-to-image models like Stable Diffusion.
During this thought-provoking discussion, we’ll be taking a look under the hood of models, providing an introduction to the technology, as well as an overview of the regulatory issues, before taking a deeper dive into the big questions with a panel of industry experts.
For more information, and to register your interest in attending, please click here.
GDPR: General Data Protection Regulation 2016/679 or the UK GDPR as implemented by virtue of section 3 of the European Union (Withdrawal) Act 2018 (where applicable).