Big Data, Big Problems: The Legal Challenges of AI-Driven Data Analysis

Machine learning and artificial intelligence (AI) are having a moment. Some models are busy extracting information: recognizing objects and faces in video, converting speech to text, summarizing news articles and social media posts, and more. Others are making decisions on loan approvals, cyberattack detection, bail and sentencing recommendations, and many other issues. ChatGPT and other large language models are busy generating text, and their image-based counterparts are generating images. Although these models do different things, all of them ingest data, analyze the data for correlations and patterns, and use these patterns to make predictions. This article looks at some legal aspects of using this data.

Defining Machine Learning and AI

Machine learning and AI are not quite the same, but the terms are often used interchangeably. One version of the Wikipedia entry for AI defines it as “intelligence of machines or software, as opposed to the intelligence of other living beings.” Some AI systems use predefined sets of rules (mostly made by human experts) to make their decisions, while other AI systems use machine learning, in which a model is given data and told to figure out the rules for itself.

There are two basic types of machine learning. In supervised learning, the input data used for model training has labels. For instance, if you were training a model to recognize cats in images, you might give the model some images labeled as depicting cats, and some images labeled as depicting items other than cats. During training, the model uses the labeled images to learn how to distinguish a cat from a non-cat. In unsupervised learning, the training data does not have labels, and the model identifies characteristics that distinguish one type of input from another type of input. In either type of learning, training data is used to train a model, and test or validation data is used to confirm that the model does what it is supposed to do. Once trained and validated, the model can be operated using production data.
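The supervised workflow described above (train on labeled data, confirm on held-out validation data, then operate on production data) can be sketched in a few lines of Python. This is a minimal illustration only, not how production image models are built: the two numeric features, the synthetic dataset, and the simple nearest-centroid classifier are all assumptions chosen to keep the example self-contained.

```python
import random


def make_dataset(seed=0):
    """Build a hypothetical labeled dataset: (weight_kg, ear_length_cm) -> label."""
    rng = random.Random(seed)
    data = []
    for _ in range(100):
        data.append(((rng.gauss(4.0, 0.5), rng.gauss(6.0, 0.5)), "cat"))
        data.append(((rng.gauss(20.0, 2.0), rng.gauss(11.0, 1.0)), "not_cat"))
    rng.shuffle(data)
    return data


def split(data, train_fraction=0.8):
    """Hold back part of the labeled data so the model can be checked on unseen examples."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


def train(train_data):
    """Supervised training: learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in train_data:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Assign the label whose centroid is closest to the input."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, data):
    """Fraction of labeled examples the model classifies correctly."""
    correct = sum(predict(model, features) == label for features, label in data)
    return correct / len(data)


data = make_dataset()
train_data, validation_data = split(data)
model = train(train_data)
validation_accuracy = accuracy(model, validation_data)
print(f"validation accuracy: {validation_accuracy:.2f}")

# Production use: a new, unlabeled input is classified by the trained model.
print(predict(model, (4.2, 6.1)))
```

The point of the split is that each stage touches a different body of data (training, validation, and production), which is why contracts and privacy analyses often need to address each category separately.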

Contracting for AI Solutions

Joe Pennell, Technology Transactions Partner at Mayer Brown, notes: “The approach to contracting for AI depends on where your client sits in the AI ecosystem. A typical AI ecosystem contains a number of parties, including talent (e.g., data scientists), tool providers, data sources, AI developers (who may assemble the other parties to deliver an integrated AI system or solution), and the end user, buyer, or licensee of the AI system or solution. The contracts between these parties will each have their own types of issues that will be driven by the unique aspects of specific AI solutions. For example, those might include the training data, training instructions, input/production data, AI output, and AI evolutions to be created during training and production use of the AI.”

Intellectual Property Considerations

In addition to, or in the absence of, clear contract provisions, intellectual property rights may also govern AI models and training data, as well as the models’ inputs and outputs. Patent, copyright, and trade secret issues can all be implicated.

Patents (at least in the United States) protect a new or improved and useful process, machine, article of manufacture, or composition of matter. However, abstract ideas (for example, math, certain methods of organizing human activity, and mental processes), laws of nature, and natural phenomena are not patent-eligible unless they are integrated into a practical application. Case law delineating what is patent-eligible is a moving target. Thus, a model training or testing method, or a model itself, might be patentable, but input data is not (because data is not a process or machine), and neither is output data (because, so far, only humans can be inventors).

Copyright (at least in the United States) protects original works of authorship including literary, dramatic, musical, and artistic works, such as poetry, novels, movies, songs, computer software, and architecture. It does not protect facts, ideas, systems, or methods of operation (although copyright may protect the way in which these things are expressed). Thus, input data, depending on what it is and how it is arranged, might be copyrightable, as alleged in the much-covered copyright lawsuit that the New York Times recently filed against OpenAI. Because only humans can be copyright holders (at least so far), protecting AI output via copyright requires that a human have played a role in generating the AI output, and the output must be sufficiently transformed from copyrighted input data. How much of a role? How much of a transformation? Courts are only beginning to grapple with these questions. In addition, model training/testing methods and the model itself are probably not copyrightable, because they are not original works of authorship.

A trade secret is information that is not generally known to the public and that confers economic benefit on its holder because it is not publicly known. Trade secret protection applies only if the holder makes reasonable efforts to maintain the information’s secrecy. So, a model’s architecture, training data, and training method might be protectable as a trade secret, but having to explain model output can defeat the required secrecy.

Privacy Considerations

AI training and input data often implicate privacy issues. Much of that data would be considered some form of personal data under various federal or state laws.

US enforcement agencies, including the Consumer Financial Protection Bureau, the Equal Employment Opportunity Commission, the Federal Trade Commission (FTC), and the Civil Rights Division of the Department of Justice, have made it clear that they will use privacy as a lever to regulate AI. Seven times in the last four years, the FTC has gone so far as to effectively confiscate AI models trained on data that was obtained or maintained in violation of privacy laws. Beyond federal agencies, however, because the US currently lacks any generally applicable, non-sectoral data privacy law, much of the action to protect consumers may fall to the states. More than a dozen states have passed general data privacy laws. Some of these laws, including the Colorado Privacy Act and, as proposed, the California Consumer Privacy Act, contain detailed requirements on privacy notifications and on obtaining consent for certain forms of what they call “automated decision-making.”

The first state civil complaint concerning data privacy has already been filed, and state attorneys general have begun bringing actions under state unfair and deceptive acts and practices (UDAP) statutes. At current count, forty-one state UDAP laws provide a private right of action. Class action attorneys have used those UDAP laws, along with state constitutional privacy claims, to bring massive actions against data brokers.

From a European perspective, perhaps the greatest risk to businesses comes from training data. If the training data is personal data (and the definition of that term in the General Data Protection Regulation (GDPR) is significantly wider than the definitions generally found in US state laws), the GDPR applies. If the data underlying the AI has been processed in a manner that is not GDPR compliant, the businesses using that data could face significant risk and liability.

Counsel for any organization that uses AI or machine learning should be clear about what information has been collected and the basis of such collection, and they should also ensure that any required permissions have been obtained. With the enactment of the European Union’s Artificial Intelligence Act this year, the penalties for getting it wrong may be significant—and would be in addition to the penalties that might already apply under the GDPR.

AI Bias Risks

In addition to privacy issues, bias in training data can negatively impact the safety and accuracy of deployed AI solutions. Common biases found in datasets are biased labeling, over- or underrepresentation of a demographic, and data that reflects a society’s existing or past prejudices. Biased labeling occurs when a programmer labels or classifies data in a way that incorporates her own biases. Data that reflects a society’s existing or past prejudices creates a similar outcome without manual labeling because the datasets come from a society with systemic exclusion, stereotyping, and marginalization of different groups of people. Over- or underrepresentation in data occurs when the use case of the AI solution is broader or more diverse than the data on which it is trained.

To avoid liability, businesses should confirm that the training dataset for any AI they use mirrors the diversity of the intended use case. Sometimes a particular bias in the dataset does not surface until the model is deployed, which is why pre-deployment testing specifically for bias is crucial. Companies are well advised to implement data governance standards and bias checks at key points, including in connection with dataset collection/selection, algorithm training, pre-deployment testing, and post-deployment monitoring. Risks can be substantially mitigated if anti-bias data governance is made an integral part of creating, training, and monitoring AI and machine learning models.
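Two of the checks mentioned above, comparing a training dataset against the intended use case and testing for bias before deployment, can be illustrated with a short Python sketch. Everything in it is hypothetical: the `age_band` attribute, the record counts, the expected population shares, and the 10-percentage-point tolerance are illustrative assumptions, not legal or regulatory thresholds.

```python
from collections import Counter


def group_shares(records, key):
    """Share of records falling into each group for the given attribute."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_gaps(training_records, expected_shares, key, tolerance=0.10):
    """Flag groups whose share of the training data differs from the share
    expected in the intended use case by more than `tolerance` (an assumed,
    illustrative cutoff)."""
    actual = group_shares(training_records, key)
    gaps = {}
    for group, expected in expected_shares.items():
        diff = actual.get(group, 0.0) - expected
        if abs(diff) > tolerance:
            gaps[group] = round(diff, 2)
    return gaps


def accuracy_by_group(results, key):
    """Pre-deployment bias test: model accuracy computed separately per group."""
    correct, total = Counter(), Counter()
    for record in results:
        total[record[key]] += 1
        correct[record[key]] += int(record["predicted"] == record["actual"])
    return {group: correct[group] / total[group] for group in total}


# Hypothetical training data: 50% / 42% / 8% across three age bands.
training_records = (
    [{"age_band": "18-39"}] * 50
    + [{"age_band": "40-64"}] * 42
    + [{"age_band": "65+"}] * 8
)
# Hypothetical make-up of the population the model is intended to serve.
expected_shares = {"18-39": 0.45, "40-64": 0.35, "65+": 0.20}

gaps = representation_gaps(training_records, expected_shares, "age_band")
print(gaps)  # the 65+ group is underrepresented by 12 percentage points

# Hypothetical pre-deployment test results for two of the groups.
test_results = (
    [{"age_band": "18-39", "predicted": 1, "actual": 1}] * 9
    + [{"age_band": "18-39", "predicted": 0, "actual": 1}] * 1
    + [{"age_band": "65+", "predicted": 1, "actual": 1}] * 6
    + [{"age_band": "65+", "predicted": 0, "actual": 1}] * 4
)
per_group = accuracy_by_group(test_results, "age_band")
print(per_group)
```

A gap in representation does not by itself establish bias, and a balanced dataset does not rule it out, which is why the sketch pairs the dataset comparison with a per-group accuracy check. Running the same checks on post-deployment data would support the monitoring step.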


This article is based on a CLE program titled “Big Data, Big Problems: The Legal Challenges of AI-Driven Data Analysis” that took place during the ABA Business Law Section’s 2023 Fall Meeting. To learn more about this topic, listen to a recording of the program, free for members.


