Data: The Fuel Powering Artificial Intelligence and Machine Learning

By: Pedro Pavón, Alexandra Goumas


  • The quality of the results produced by artificial intelligence and machine learning technologies is only as good as the data provided to them.
  • Data-licensing attorneys will increasingly be at the center of navigating tensions between accessing and securing the most and best data.
  • Attorneys must get involved during the development process so the development team understands the legal ramifications of its design choices, and the attorney can pave the way to accessing the best data assets available.

“Garbage in, garbage out” is a phrase data scientists often use. One can apply it in many contexts, but it arises from the idea that the quality of a computer’s output will only be as good as the quality of its input. In that simple phrase, there is a central truth often overlooked in discussions surrounding artificial intelligence (AI) and machine learning (ML): no matter how skilled your coders, or how great your algorithms, the quality of the results, signals, insights, and “learnings” these technologies provide will only be as good as the data fed to them.

AI and ML are nebulous terms often used interchangeably; however, they are not the same. AI was famously described by Marvin Minsky in Steps Toward Artificial Intelligence as the ability of machines to behave and learn like humans. ML, a subset of AI, has been best described recently by Amanda Levendowski in How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem as the application of mathematics and computer science to create a machine’s ability to improve automatically through experience. Just as humans use massive data sets (our memories) to become more efficient and make better decisions over time, AI and ML systems rely on data to do the same.

Therefore, data is at the core of developing new AI and ML systems. Accessing and securing the most and best data is where the attorneys come in. Companies and organizations developing AI technologies want access to the best data possible with the fewest restrictions possible. At the same time, companies and organizations want to retain exclusive access to data sets that are (or could be) market differentiators. There are tensions between these interests, and data-licensing attorneys will increasingly be at the center of navigating them, including during negotiations with third parties and regulators and in the creation and management of data retention and usage policies.

There are terabytes of free, harvestable data available online. Wikipedia, for example, is a common source of no-cost “general knowledge” data for AI and ML technologies. Many other data sets are freely available to AI and ML developers under open-access or attribution-only licenses. Attorneys must analyze each one of these licenses in the context of the use cases their clients are pursuing (or may pursue later). Conducting such an analysis requires careful planning and interviews with clients so that the attorney accurately understands how the data will be used and can confirm that the use is consistent with the license granted by the licensor of the data set.

The owners and licensors of free data often lack the resources to actively curate or inspect the data to ensure quality and accuracy. Therefore, many free data sets contain errors, bias, and misinformation that can negatively affect the AI and ML systems using them. This quality-control problem has created a large market for curated data sets that sit behind paywalls, restricted licenses, and regulatory hurdles. Data-licensing attorneys are increasingly asked by clients to negotiate and draft license rights to highly protected and expensive data sets. This raises the stakes of the license analysis and negotiations. Given that exactly how AI and ML systems use data to “learn” can be opaque, a simple contractual miscalculation or limitation on use rights in a data-licensing agreement can have catastrophic consequences for development if the developers cannot use the data in the way that is most effective for the AI or ML system they’ve designed. Therefore, it is a good practice for data-licensing attorneys to get involved and provide advice and counsel during the development process so that the development team can understand the legal ramifications of its design choices, and so that the attorney can work in parallel to pave the way toward accessing the best data assets available. Being embedded will also help attorneys counsel clients in real time on regulatory compliance, privacy, and security related to the onboarding and processing of such data.

AI and ML systems have a tremendous appetite for data. Failure to use the best and most accurate data can have both negative business consequences (loss of revenue and failure to remain competitive) and harmful societal implications (biased, unethical, or disparate outcomes). Some of the biggest obstacles to fueling AI and ML with the best data are not technological, but legal in nature. Thus, collaboration among lawyers, product developers, regulators, and data scientists is critical to ensuring that, as AI and ML technologies continue to develop and mature, they have access to the best data available and simultaneously protect the privacy of data subjects and the security and integrity of the data and the technologies that use it.


Washington, DC

Pedro Pavón

Pedro Pavón serves as Senior Corporate Counsel for Oracle Corporation and is based in Washington, D.C. He advises Oracle executives on data licensing, artificial intelligence and machine learning,…

New York City, NY

Alexandra Goumas

Allie Goumas joined Oracle Corporation after graduating from law school in 2014, and she now serves as Counsel in New York, NY. She is on the company’s cloud and ad-tech legal team, and advises on…

