Editor's take: There is little doubt that many people in the tech industry are excited about the potential that Generative AI offers to our work and personal lives. As enthralling as those opportunities may be, however, there are two essential but little-understood principles that need to be addressed in order to use the technology in a safe and responsible way. In a word (or actually, two), those are provenance and governance.
Provenance refers to knowing where a particular text, image, video, snippet of code, or other bit of information comes from, while governance refers to the management and control of how information is created and used.
These two similar-sounding words haven't been a common part of the tech world lexicon until recently.
But the explosive growth of GenAI and the tools and applications associated with it has brought them to the fore. It's also focusing more attention on companies like Adobe and IBM that are addressing these issues in unique and important ways.
"In a world now overflowing with foundation models that generate new material based on the input of enormous amounts of existing data, the provenance, or origin, of a piece of content has multiple meanings"
First is the question of whether that content was created directly by a person or generated by an algorithm. If it indeed comes from an algorithm, there's increasing interest in knowing which foundation model or GenAI tool produced it. Second, and most importantly, are big questions about what original sources of information were used to train the models that generated that content. Finally, there are enormous legal and ethical concerns about using generated content, particularly if it's based on copyrighted material.
Already there have been numerous court cases around these issues, including one in which The New York Times is suing OpenAI for what it believes is copyright infringement, based on generated output that was virtually identical to some NY Times articles (including many behind a paywall). While nothing has been resolved there yet, it will likely be the first of many similar suits, and it is already starting to lead to large licensing deals between content providers and GenAI model makers.
Bus image generated using Stable Diffusion; masthead created by DALL-E.
In the world of generated graphics, the problem is particularly acute, as recent examples involving DALL-E 3, Stable Diffusion, and Midjourney showed what appear to be very obvious cases of infringement involving things like movie scenes and characters. Again, a wide range of legal disputes is likely to arise from these issues.
Some will likely help determine whether using copyrighted material for training is considered fair use or not. More important will be outcomes that clarify what can be done about newly generated content that closely resembles copyrighted content.
Creative software giant Adobe has ended up taking a very different approach to the situation with its new GenAI offerings and, in the process, is seemingly avoiding the copyright concerns that others may face. For years, the company has run a stock image, photo, and video service it calls Adobe Stock, where it pays content creators for their work and offers a marketplace where they can sell it to Adobe users. Over time that library of content – all of which is checked for copyright-related issues before it gets included – has blossomed into millions of images, videos, and more. When it came time to start training its own GenAI image models, the company wisely chose to use that material as its source.
In the process, Adobe has managed to avoid the kinds of legal scrutiny that others are facing. The company both disclosed the content it used for training – something very few GenAI model makers of any kind have done – and made it clear that its models' output is safe for commercial use. It did so via a legal process called indemnification, which is also becoming a bigger issue in the world of GenAI.
Adobe was able to do this easily – and explain it to others – because none of the source material from Adobe Stock carries any copyright-related concerns. In fact, content providers even receive payouts (though some have argued they're too small) for having their content included in the training set.
The net result is an easily explainable and understandable offering that could serve as a good example for others trying to work their way through the potential legal quagmires of GenAI-created content. The work also ties in with the Content Authenticity Initiative (CAI), a group Adobe founded in 2019 that has grown to nearly 2,500 members. The CAI focuses on increasing transparency in the digital ecosystem through tools like Content Credentials, which function as a nutrition label for online content. These labels make it easy for potential users of a piece of content to understand where it came from.
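To make that idea concrete, here's a rough Python sketch of the kind of information such a label might carry. The field names below are simplified stand-ins invented for this example, not the actual Content Credentials (C2PA) schema:

```python
# Illustrative sketch only: these field names are simplified stand-ins,
# not the real C2PA / Content Credentials schema.
content_credential = {
    "title": "bus_at_sunset.png",
    "produced_by": "GenAI model",           # human-created vs. AI-generated
    "generator": "ExampleDiffusion v2",     # hypothetical model name
    "ingredients": [                         # source assets, if any
        {"title": "reference_photo.jpg", "origin": "Adobe Stock"},
    ],
    "edits": ["cropped", "color-corrected"], # edit history
    "signed_by": "Example Creative Suite",   # who cryptographically attests to this
}

def summarize(credential: dict) -> str:
    """Render the 'nutrition label' a viewer of the content might see."""
    origin = (credential["generator"]
              if credential["produced_by"] == "GenAI model"
              else "human-created")
    return f"{credential['title']}: {origin}, signed by {credential['signed_by']}"

print(summarize(content_credential))
```

The key design point is that the label travels with the content and is cryptographically signed, so anyone downstream can check who (or what) made it.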
Not really The Pope
Another critical factor in ensuring the safe use of GenAI is governance: the tracking and management of the data sets and models being used in GenAI-based applications. As a result of its many decades of working with key industries and critical applications, IBM has developed a very mature set of methodologies and best practices around governance, which it has recently started applying to the world of GenAI.
As part of the company's watsonx suite of GenAI tools, watsonx.governance incorporates tools that let organizations record which data sets were used to train which models, what changes are made over time to those data sets and models, the quality of the output that resulted from the various permutations that have been tried, and more. Recent updates to the governance tools can also track operational details of LLM deployments, including things like data size, latency, and throughput.
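To illustrate what that kind of record-keeping might look like in practice, here's a minimal sketch of a single lineage record. The class and field names are assumptions invented for this example, not the watsonx.governance API:

```python
# A minimal sketch of the kind of lineage record a governance tool might keep.
# Class and field names are illustrative assumptions, not the watsonx.governance API.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class GovernanceRecord:
    model_name: str
    model_version: str
    training_datasets: list[str]                  # which data trained this model
    dataset_changes: list[str] = field(default_factory=list)   # changes over time
    eval_scores: dict[str, float] = field(default_factory=dict) # output quality
    avg_latency_ms: float = 0.0                   # operational details
    throughput_tokens_per_s: float = 0.0
    recorded_on: date = field(default_factory=date.today)

record = GovernanceRecord(
    model_name="summarizer",
    model_version="1.3",
    training_datasets=["internal-docs-2023", "licensed-news-corpus"],
    eval_scores={"faithfulness": 0.91, "answer_relevance": 0.88},
    avg_latency_ms=420.0,
    throughput_tokens_per_s=55.0,
)
```

Even a simple structure like this answers the essential audit questions: what went into the model, how it has changed, and how well it performs.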
The idea is to have a thorough understanding of the raw materials that go into the GenAI model and application building process. In so doing, governance tools can help companies avoid potential issues like hallucinations, model drift, and other output-quality problems while also improving performance. Interestingly, IBM refers to its governance capabilities as offering a nutrition label for AI.
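Model drift, in particular, lends itself to a simple illustration. The sketch below uses a common statistical check, the population stability index (PSI), to compare what a model saw at training time with what it sees in production; it's a generic example of drift monitoring, not IBM's implementation:

```python
# Generic drift check (not IBM's implementation): compare the distribution of
# some input metric at training time with what the live system is seeing.
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI near 0 means stable; values above ~0.2 are often treated as drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) on empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
training_lengths = rng.normal(200, 30, 5000)  # e.g. prompt lengths seen in training
live_lengths = rng.normal(260, 45, 5000)      # what production now sees
psi = population_stability_index(training_lengths, live_lengths)
print(f"PSI = {psi:.3f}", "-> investigate drift" if psi > 0.2 else "-> stable")
```

In a real governance pipeline, a check like this would run continuously and feed alerts back into the model's record.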
IBM originally built these governance tools to help improve the quality of its own GenAI models but soon realized the need to make these capabilities work across models made by others as well. As a result, the watsonx.governance tools now work with GenAI models built with tools from Amazon, Microsoft, and Google, running on those companies' platforms as well as OpenAI's, among others. To give potential customers as much flexibility as possible, the governance work can be done either in the cloud or on premises for any of these models.
"Together (provenance and governance) they can bring important legal, ethical, and qualitative enhancements to the creation of GenAI-based models and applications. Even more importantly, they can help enable a sense of security and clarity for organizations that are diving into this rapidly changing field"
Another intriguing part of the watsonx.governance capabilities is its link to the outside world. For example, another new feature is the ability to track regulatory changes that could influence what a model generates. By defining a business strategy for a given model, organizations can have the governance tools notify them of just the relevant regulations they need to know about and tie those changes to the key risks, controls, and policies associated with that model. Collectively, these rules can help enterprises more confidently build or refine their GenAI-based efforts.
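Conceptually, that matching process amounts to filtering incoming regulatory updates against each model's business profile. The sketch below invents a simple tagging scheme purely for illustration; it is not how watsonx.governance actually represents regulations:

```python
# Hypothetical sketch of matching regulatory updates to only the models they
# affect. The tags, names, and rules are invented for illustration.
models = {
    "loan-underwriting-llm": {"domains": {"finance", "credit"}, "regions": {"EU"}},
    "marketing-copy-llm":    {"domains": {"advertising"},       "regions": {"US"}},
}

regulatory_updates = [
    {"name": "EU AI Act amendment", "domains": {"credit"}, "regions": {"EU"},
     "linked_controls": ["bias-audit", "human-review-policy"]},
    {"name": "US ad-disclosure rule", "domains": {"advertising"}, "regions": {"US"},
     "linked_controls": ["ai-content-labeling"]},
]

def relevant_updates(profile: dict) -> list[dict]:
    """Return only the updates that intersect a model's business profile."""
    return [u for u in regulatory_updates
            if u["domains"] & profile["domains"]
            and u["regions"] & profile["regions"]]

for name, profile in models.items():
    for update in relevant_updates(profile):
        print(f"{name}: review '{update['name']}' -> controls {update['linked_controls']}")
```

The payoff of this kind of filtering is that teams only see the regulatory changes tied to the risks, controls, and policies of the models they actually run.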
While provenance and governance probably wouldn't be the first two words that come to mind when someone asks about GenAI, it's becoming increasingly clear that these principles need to be an essential part of any company's GenAI strategy. Together they can bring important legal, ethical, and qualitative enhancements to the creation of GenAI-based models and applications. Even more importantly, they can help enable a sense of security and clarity for organizations that are diving into this rapidly changing field.
Bob O'Donnell is the founder and chief analyst of TECHnalysis Research, LLC, a technology consulting firm that provides strategic consulting and market research services to the technology industry and the professional financial community. You can follow him on Twitter @bobodtech.