[Part 2 of the Demystifying Data Governance Series]

In the first article, we established that governance collapses into four pillars: know the data, secure it, use it properly, and improve it over time. The first pillar has to come first – everything else depends on it. Before we decide who gets access, before we set quality targets, before we translate any data into business value, we have to answer a foundational question: what data do we actually have, and what does it mean?

This may sound obvious, but it is not. In a data estate of any meaningful size – thousands of tables, hundreds of pipelines, multiple cloud environments, and a growing collection of AI-generated artifacts – answering this question without a systematic approach is impossible.

This article is all about answering that question – how to build and maintain an accurate picture of the data estate, what to do with that picture once we have it, and how AI both complicates the inventory and gives us better tools to build it.

We Cannot Govern What We Cannot See

Governance begins with visibility. We cannot protect what we don’t know exists; we cannot limit access if we don’t know whether the data contains sensitive information; we cannot assess quality if we don’t know what the data is supposed to contain.

Knowing the data, at scale, means two things:

Discovery: finding what is there.
Description: recording what it means, where it came from, who owns it, how sensitive it is, and how it is used.

Discovery without description is just an inventory of unknowns. Description without discovery is a catalog that covers only what someone remembered to document. We need both to complete the map of the data estate.

Metadata: The Data about Data

Ancient Mesopotamians used clay tablets to record grain inventories and trade transactions: a record about the goods rather than the goods themselves. Metadata plays the same role in the data world. It records what the data is, where it is stored, and what governance controls are in place for it.

Metadata comes in three layers, each describing the data from a different perspective.

Technical metadata: the truth of the data, such as schemas, data types, table sizes, partition keys, freshness, etc. This type of metadata is usually generated by the systems themselves. It tells us what the data looks like, but not what it means.
Business metadata: the meaning that humans give to the data, such as definition, ownership, business purpose, and anything given to describe the data. This is the layer that turns a schema browser into a governance tool. It is also the hardest to maintain, because it requires human input and decays as organizations change.
Operational metadata: a record of how the data is used, including access logs, query patterns, and lineage. This layer is usually generated automatically as data is used.

Together, these three layers form a data catalog – the organized inventory of the data estate.

However, a catalog without governance information is just a schema browser. A catalog with governance information – classes, quality scores, ownership, sensitivity, usage pattern – becomes the organizational backbone of a modern data environment. It accelerates discovery by making assets searchable through clear descriptions and context. It strengthens quality by capturing origins, transformations, and update history. And because a catalog tracks where sensitive data lives and who interacts with it, it’s also a critical tool for privacy, security, and compliance.
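The three metadata layers can be pictured as one catalog record per table. The sketch below is illustrative only: the class and field names are assumptions made for this example, not the schema of any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TechnicalMetadata:
    # Generated by the systems themselves: what the data looks like.
    columns: dict[str, str]  # column name -> data type
    row_count: int
    last_updated: datetime


@dataclass
class BusinessMetadata:
    # Supplied by people: what the data means and who answers for it.
    description: str = ""
    owner: str = ""
    data_classes: list[str] = field(default_factory=list)  # e.g. ["PII"]


@dataclass
class OperationalMetadata:
    # Generated as the data is used.
    queries_last_30_days: int = 0
    upstream_tables: list[str] = field(default_factory=list)


@dataclass
class CatalogEntry:
    table: str
    technical: TechnicalMetadata
    business: BusinessMetadata = field(default_factory=BusinessMetadata)
    operational: OperationalMetadata = field(default_factory=OperationalMetadata)

    def is_documented(self) -> bool:
        """True only when a human has supplied a description and an owner."""
        return bool(self.business.description and self.business.owner)


# Hypothetical table with technical metadata only.
entry = CatalogEntry(
    table="customer_orders",
    technical=TechnicalMetadata(
        columns={"order_id": "bigint", "email": "varchar"},
        row_count=1_250_000,
        last_updated=datetime(2025, 1, 15, tzinfo=timezone.utc),
    ),
)
print(entry.is_documented())  # False
```

An entry with only technical metadata is the "schema browser" case: it describes the shape of the table but carries no meaning, owner, or sensitivity label to act on.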

The catalog is a dynamic system that we manage continuously. New data comes in, schemas evolve, and ownership changes. A catalog that was accurate six months ago may no longer reflect the current state, so ongoing discovery is essential to keep it up to date.

Turning the Unknown into Known

When there are only a handful of books, we remember their titles, where they sit, and what each one contains. Once the collection grows, memory stops working. We need a way to organize it – index cards, catalogs, shelves arranged by author, genre, or year. The system becomes what we know as the library.

Data works the same way. At scale, we need a way to turn the unknown into the known, which means two things: finding what we have and describing what it means.

Three practical methods help establish what the data means:

Profiling generates a statistical summary of a dataset: field-level null rates, cardinality, value distributions, min/max, and outliers. Profiling surfaces the gap between what the schema says and what the data actually contains. A column labeled user_id that contains email addresses is a misclassification waiting to propagate downstream.
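As a rough illustration of profiling, the sketch below summarizes a single column using only the Python standard library. The function name, the simplified email pattern, and the sample values are assumptions for this example; it also assumes the non-null values in a column are mutually comparable, which real profilers cannot take for granted.

```python
import re
from collections import Counter

# Deliberately loose email pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(values):
    """Return a small statistical summary of one column's values."""
    non_null = [v for v in values if v is not None]
    total = len(values)
    return {
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "cardinality": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top_values": Counter(non_null).most_common(3),
        # Share of non-null values that look like email addresses.
        "email_share": (
            sum(1 for v in non_null if EMAIL_RE.match(str(v))) / len(non_null)
            if non_null
            else 0.0
        ),
    }


# Hypothetical column labeled user_id that actually holds email addresses.
user_id = ["ana@example.com", "li@example.com", None, "ana@example.com"]
summary = profile_column(user_id)
print(summary["null_rate"], summary["cardinality"], summary["email_share"])
# 0.25 2 1.0
```

An `email_share` of 1.0 on a column named user_id is exactly the kind of gap between schema and content that profiling is meant to surface.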

Sampling selects representative rows for inspection. It is especially important for unstructured fields and for tables where the schema is uninformative. A column named notes in a support ticket table could contain names, phone numbers, order details, or pasted-in payment information. Sampling finds this; schema inspection does not.

Usage patterns reveal which datasets are active and which are orphaned. A table queried hourly by production systems is a different governance priority than a table nobody has touched in fourteen months. Usage patterns also reveal undocumented dependencies: if analysts consistently join two tables with no documented relationship, the lineage graph is incomplete.
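One simple way to act on usage patterns is to flag tables that have not been queried within some idle window. The sketch below assumes the last-access dates have already been extracted from access logs; the table names, the dates, and the 365-day threshold are hypothetical.

```python
from datetime import date, timedelta


def find_orphaned(last_access, today, max_idle_days=365):
    """Return tables whose most recent access is older than max_idle_days.

    last_access maps table name -> date of the last query
    (None if the table has never been queried).
    """
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        table
        for table, seen in last_access.items()
        if seen is None or seen < cutoff
    )


today = date(2025, 6, 1)
last_access = {
    "orders_live": date(2025, 5, 31),    # queried yesterday
    "legacy_export": date(2024, 3, 15),  # idle for over fourteen months
    "tmp_backfill": None,                # never queried
}
orphaned = find_orphaned(last_access, today)
print(orphaned)  # ['legacy_export', 'tmp_backfill']
```

The output is a review list, not a deletion list: an idle table may still be held for retention reasons, which is a question for its owner.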

The process of turning the unknown into known is not a one-time exercise. New tables are created. Pipelines are added. Schemas drift. It has to run continuously and feed directly into the catalog so others can act on its output.

Types of Data – What Are We Working With

Governance decisions depend on understanding what kind of data we are dealing with.

Structured vs. unstructured. Structured data lives in tables with known schemas. Unstructured data can include documents, images, videos, logs, and almost anything else. Governance methods differ between the two, but the concepts – discovery, classification, lineage, and access – apply to both. At the time of writing, most governance tooling still handles structured data better than unstructured.

Operational vs. analytical. Operational data, such as orders and transactions, powers the business. Analytical data, such as reports, models, and dashboards, powers business decisions. The same underlying event often appears in both forms, and governance must follow it across the boundary.

Raw vs. curated. Raw data is ingested as it first arrives, whereas curated data has been cleaned, deduplicated, and reshaped. Raw data usually has greater sensitivity and less trust; curated data has greater trust but may have lost fidelity. Both need governance – the risk profile differs.

Personal vs. non-personal. Data that identifies or describes individuals, versus data that does not. This distinction is the single most important one for privacy, compliance, and access control.

AI-generated data

This category has often been overlooked by governance programs, yet it is the fastest-growing segment of the modern data estate, making it worthwhile to have a dedicated section.

AI systems produce data continuously:

Model outputs and inference logs include predictions, classifications, generated text, and confidence scores. These are produced in high volumes and are rarely governed. They carry important provenance information, such as which model and version were used and the specific inputs, as well as quality signals such as calibration and drift that many catalogs fail to capture. When model behavior requires auditing, the absence of these logs makes a thorough review impossible.
Embeddings and vector stores refer to dense numerical representations of documents, images, or structured records that are stored in vector databases for semantic retrieval. These embeddings are derived from the source data and can be reverse-engineered to extract information from that source. For example, a customer support ticket embedded in a vector store retains the sensitivity of its content, even if no one has explicitly labeled the embedding. Therefore, vector stores require classification and access control, similar to traditional tables.
Training datasets and fine-tuning sets are curated subsets of data prepared for model training. They require the same classification and lineage as the source data, along with an additional record linking dataset versions to model versions.
Synthetic data refers to AI-generated records that are created to replicate the statistical properties of real data without containing any actual personal information. This type of data is increasingly used for analytics and model training on sensitive datasets, helping to manage privacy risks. It should be classified separately: although it is not considered personally identifiable information (PII), it is derived from sensitive sources and can still reflect their patterns, so that origin must be recorded.
Agent interaction logs are records of the queries made by AI agents, the information they retrieved, the actions they took, and the results they produced. Without these logs, there is no audit trail for agent behavior.

None of this data governs itself. The absence of governance on AI-generated data is not a niche problem. It is the fastest-growing gap in most organizations’ data estates.

Sensitivity and Classification

Not all data is equal. Some of it requires protection; some does not. The process of deciding which is which is called classification, and the output is called a data class – a label or a tag attached to a dataset (or a column or a field).

Common sensitive data classes include:

PII (Personally Identifiable Information). Information that can identify an individual, either alone or combined with other information, such as mailing and email addresses, phone numbers, or government IDs.
PHI (Protected Health Information). Individually identifiable health information protected under HIPAA, such as medical records, diagnoses, and treatment details.
Financial. Bank accounts, credit card numbers, transaction histories, etc.
Business confidential. Intellectual property, trade secrets, competitive data, internal plans, etc.

The definition of sensitive data classes is usually established by the legal team or the information security team. Once the definition is established, large data volumes call for a scalable, automated method of discovering sensitive data, just as the growing book collection earlier in this article called for a library.

Discover Sensitive Data

For structured data, three methods are commonly combined:

Pattern matching: regular expressions for known formats – social security numbers, credit card numbers, email patterns. Fast and cheap, but misses fields with non-standard formats and context-dependent sensitivity.
Column name matching: heuristic rules based on field names. A column named “email” is likely PII. Misses columns with generic or uninformative names.
Context-aware AI classification: embedding-based or LLM-based classifiers that look at column names, sample values, and schema context together. Catches what the other two methods miss.
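The first two methods are simple enough to sketch directly. The patterns, name hints, and the 80% match threshold below are illustrative assumptions, not a complete or production-grade rule set; real formats vary by country and issuer.

```python
import re

# Illustrative value patterns only.
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}
# Illustrative substrings to look for in column names.
NAME_HINTS = {
    "email": ("email", "e_mail"),
    "us_ssn": ("ssn", "social_security"),
    "phone": ("phone", "mobile"),
}


def classify_column(name, sample_values, min_match_share=0.8):
    """Return the set of candidate data classes for one column."""
    labels = set()
    lowered = name.lower()
    # Column name matching: heuristic rules on the field name.
    for label, hints in NAME_HINTS.items():
        if any(hint in lowered for hint in hints):
            labels.add(label)
    # Pattern matching: share of sampled values that fit a known format.
    values = [str(v) for v in sample_values if v is not None]
    for label, pattern in VALUE_PATTERNS.items():
        if values:
            share = sum(1 for v in values if pattern.match(v)) / len(values)
            if share >= min_match_share:
                labels.add(label)
    return labels


print(classify_column("contact", ["a@b.com", "c@d.org"]))  # {'email'}
print(classify_column("notes", ["call me at 555-0100"]))   # set()
```

The second call shows the limitation described above: a free-text column with a generic name and an embedded phone number slips past both rule-based methods, which is the gap context-aware classification is meant to cover.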

For unstructured data — images, video, documents — rule-based methods do not scale, which is exactly the gap AI-based classification fills.

One edge case most classification systems miss: combinatorial sensitivity. A postal code alone is not sensitive. A birth date alone is not sensitive. Together, a postal code, birth date, and gender can identify a large fraction of the population. Systems that classify columns independently miss this. The classification scheme needs to account for it, and the detection methods need to consider field sets rather than individual fields.
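A minimal way to see combinatorial sensitivity is to measure how many rows become unique when fields are considered together rather than one at a time. The rows below are made-up sample data, and the function name is an assumption for this sketch.

```python
from collections import Counter


def unique_share(rows, fields):
    """Share of rows whose combination of values in `fields` is unique."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Fabricated example rows, not real people.
rows = [
    {"postal_code": "10001", "birth_date": "1990-04-12", "gender": "F"},
    {"postal_code": "10001", "birth_date": "1985-09-30", "gender": "M"},
    {"postal_code": "10001", "birth_date": "1990-04-12", "gender": "M"},
    {"postal_code": "94105", "birth_date": "1985-09-30", "gender": "F"},
]

print(unique_share(rows, ["postal_code"]))  # 0.25
print(unique_share(rows, ["birth_date"]))   # 0.0
print(unique_share(rows, ["gender"]))       # 0.0
print(unique_share(rows, ["postal_code", "birth_date", "gender"]))  # 1.0
```

Each field on its own singles out few or no rows, yet the three together single out every row. A detector that scores columns independently would report nothing here.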

Where Data Comes from and Where It Goes

Every dataset has a history: it is produced by some source, moved through pipelines, transformed and joined with other data, and eventually consumed by downstream use cases. The full path, from origin through transformations to destinations, is called lineage.

Lineage is how we answer questions that no single system can answer on its own. The following are some common examples.

If an upstream table is wrong, which dashboards are wrong?
If a column is sensitive, does that sensitivity propagate to the feature store used by the ML team?
If a customer requests deletion, where has their data actually traveled?
If we change a table’s schema, who (downstream) will be impacted?
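Questions like these reduce to graph traversal once lineage is recorded as edges. The sketch below assumes a tiny in-memory lineage graph with hypothetical asset names; a real lineage system would expose this through its own API.

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> direct downstream assets.
LINEAGE = {
    "raw.orders": ["curated.customer_orders"],
    "curated.customer_orders": ["features.order_stats", "dash.revenue"],
    "features.order_stats": ["model.churn_v3"],
}


def downstream(asset, edges):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen, order, queue = {asset}, [], deque([asset])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:  # guard against cycles and repeats
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream("raw.orders", LINEAGE))
# ['curated.customer_orders', 'features.order_stats', 'dash.revenue', 'model.churn_v3']
```

The same traversal answers the impact question (which dashboards are wrong), the propagation question (does sensitivity reach the feature store), and the schema-change question (who is affected downstream).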

Model Lineage

Model lineage is the newer and largely unsolved extension of this concept.

When a model is trained on data, the training dataset becomes part of the model’s provenance. When that model generates outputs, those outputs inherit lineage from the training data — loosely. This creates a specific problem: if a customer requests deletion under GDPR and their data was in the training set, removing the row from the source table does not remove their influence from the model.

Machine unlearning — the technical process of removing a specific data point’s contribution from a trained model — is expensive and imperfect. The governance implication: training data needs to be tracked at the dataset and version level before organizations face a deletion request under a regulatory deadline, not after.
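Tracking at the dataset and version level can start as a plain record linking each model version to the dataset version and source tables it was trained on. The record layout, model names, and table names below are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRecord:
    model_version: str
    dataset_version: str
    source_tables: tuple  # tables the dataset version was built from


# Hypothetical training history.
RECORDS = [
    TrainingRecord("churn-v2", "churn_train@2024-11", ("crm.customers", "billing.invoices")),
    TrainingRecord("churn-v3", "churn_train@2025-03", ("crm.customers", "support.tickets")),
    TrainingRecord("forecast-v1", "sales_train@2025-01", ("sales.daily_totals",)),
]


def models_trained_on(table, records):
    """Model versions whose training data included `table`."""
    return sorted({r.model_version for r in records if table in r.source_tables})


print(models_trained_on("crm.customers", RECORDS))  # ['churn-v2', 'churn-v3']
```

With a record like this in place, a deletion request that touches a source table immediately yields the list of model versions to review. Without it, that list has to be reconstructed under deadline.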

Using AI to Build Knowledge of the Data

The catalog is only as good as the data that feeds it. Manual documentation does not scale. Engineers do not document tables built under pressure. Analysts do not update descriptions when they leave. Without a systematic input mechanism, the catalog decays toward emptiness.

Fortunately, we can now leverage AI tools to address these specific catalog maintenance problems.

Automated description generation. An LLM, given a table’s schema, a sample of its rows, and its query history, can generate a human-readable description in seconds. Not a perfect description — but dramatically better than an empty field. The engineer reviews and approves the model’s drafts. The catalog stops being a graveyard.

Classification at scale. Context-aware classifiers can scan an entire data estate — millions of fields across thousands of tables — and continuously attach sensitivity labels. Rule-based systems handle the easy cases; AI handles the ambiguous ones. The combination achieves coverage that manual classification cannot.

Ownership inference from usage. Usage patterns reveal who interacts with which dataset. A table primarily queried by finance analysts using finance-related terms likely has a finance owner, even if the catalog entry indicates otherwise. Ownership inference provides catalog administrators with a starting point for ownership reviews, rather than a blank slate.
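The simplest form of ownership inference is a majority vote over who queries a table. The sketch below assumes the query log has already been reduced to (table, team) pairs; the team names and the 50% threshold are hypothetical.

```python
from collections import Counter


def suggest_owner(query_log, table, min_share=0.5):
    """Suggest an owning team from who queries the table most.

    query_log is an iterable of (table, team) pairs. Returns None when the
    table has no queries or no team reaches min_share of them.
    """
    teams = Counter(team for t, team in query_log if t == table)
    if not teams:
        return None
    team, hits = teams.most_common(1)[0]
    return team if hits / sum(teams.values()) >= min_share else None


log = [
    ("gl_entries", "finance"),
    ("gl_entries", "finance"),
    ("gl_entries", "marketing"),
    ("web_events", "marketing"),
]
print(suggest_owner(log, "gl_entries"))  # finance
print(suggest_owner(log, "unused_tbl"))  # None
```

The result is a suggestion for a human reviewer, which matches the role described above: a starting point for ownership reviews rather than an automatic assignment.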

A Conversational Interface to the Data Estate

The capabilities above are AI applied to specific catalog problems — generating descriptions, attaching tags, and suggesting owners. The next layer is AI as a unified interface to the catalog and the data sources it points to.

The architecture is straightforward: an LLM, wrapped in an agent that can call tools and connected via the Model Context Protocol (MCP) — an open standard that lets the agent invoke external systems — to every data source in the estate: storage, databases, permission management system, pipelines, the catalog itself, the lineage system, and access logs.

With this in place, a data engineer, or anyone else, can ask questions in natural language and get answers grounded in the actual systems, such as:

“Where does the customer_orders table come from?” → The agent queries the lineage system and traces the source.
“Who last updated transactions_daily?” → The agent queries the catalog and the underlying warehouse metadata.
“Who has been using the PII tables in the last 30 days?” → The agent queries the access logs.
“Which downstream dashboards depend on this column?” → The agent traverses the lineage graph.

The value is not that the LLM knows the answer — it does not. The value is that the LLM knows which systems to query to compose an answer and can synthesize results across sources that no single tool exposes. The same query that previously required an engineer to check the catalog, the warehouse query history, the access log, and the lineage system can now be answered in a single round trip.

One thing worth noting is that the same governance discipline from earlier in this article still applies. The agent runs under its own scoped service account. Its queries flow through the access control system, not around it. Its actions are logged. Built this way, the conversational catalog is not a workaround for governance — it is the user interface to governance, the thing that finally makes a well-governed data estate as usable as it is correct.

These AI capabilities work only with the right governance scaffolding. A classifier that produces labels without a feedback loop, without human review of high-stakes decisions, and without integration with the access control system generates metadata that nobody acts on. The catalog must be the destination for classifier output, and the access control system must read from it. Without those integrations, AI-driven discovery is decoration.

The Compounding Problem: The Speed of Data Growth

Here is the counterintuitive truth about using AI in general: the more AI we apply to our data workload, the larger the data estate becomes.

The instinct is the opposite. We reach for AI to handle big data — to process volumes humans cannot, to extract value from streams that would otherwise sit untouched, and to automate work that does not scale by hiring. The premise is reasonable. AI is genuinely better at these tasks than the alternatives. However, the trap is that AI is also a data producer. Every model inference writes a log. Every embedding consumes vector storage. Every agent action generates a trace. Every training run produces checkpoints, metrics, and a versioned dataset pointer. Every synthetic data generator produces a new dataset that mirrors the input. AI applied to a workload does not merely consume the input data — it creates new data as a byproduct of doing the work.

As a result, data growth can outpace what AI solutions alone can catch up to. Whatever capacity AI buys us, part of it is consumed by the next layer of data that AI itself creates and that the next AI run will have to address. This is why governing AI usage is not optional. It is the only way to bound the loop.

The Compound Interest Analogy

The math behind this is the same math that drives compound interest — the math that the finance industry spends a lot of energy explaining to consumers in a positive light. In personal finance, compounding is the friend that turns small monthly contributions into meaningful retirement balances over decades. The principal earns interest. Interest earns interest. The growth accelerates without bound.

Data growth with AI works the same way, but the curve is on the wrong side of the ledger. The data estate generates new data through normal business activity (the “principal”). AI processing of that data generates additional data (the “interest”). That additional data is processed by AI, generating even more data (the “interest on interest”). The estate grows exponentially, and the growth rate rises whenever AI adoption rises.

The good news in personal finance is that the saver controls the rate by controlling how much they contribute. The bad news in data is that the rate is driven by AI adoption, which most organizations are accelerating rather than slowing. Without governance to set an explicit envelope on what AI consumes and produces, the compound rate has no ceiling.

We can use the exact formula everyone already knows from their savings account. The compound interest formula is:

$$A = P\left(1 + \frac{r}{n}\right)^{nt}$$

Where $A$ is the final amount, $P$ is the principal (the starting amount), $r$ is the annual rate, $n$ is how many times per year the interest compounds, and $t$ is the number of years. Nothing here is exotic. It is the math behind every retirement account.

Data growth uses the same formula, with the variables renamed and one AI term added to the rate:

$$D_t = D_0\left(1 + \frac{r + \alpha\beta}{n}\right)^{nt}$$

The mapping is one-to-one, apart from the added AI term:

Compound interest	Data estate
$A$ – future value	$D_t$ – data estate size after $t$ years
$P$ – principal (starting amount)	$D_0$ – data estate size today
$r$ – annual interest rate	$r$ – organic data growth rate per year
(no counterpart)	$\alpha\beta$ – extra growth added by AI
$n$ – times compounded per year	$n$ – times per year AI reprocesses the estate
$t$ – years	$t$ – years

Two new variables capture the AI effect. Let $\beta$ be the fraction of the data estate AI systems actively process — the share of tables, documents, streams, and events that pass through models, agents, or classifiers. Let $\alpha$ be the data amplification factor: the amount of new data generated per unit of data processed. For every 1 GB an AI system touches, it produces roughly $\alpha$ GB of new data — inference logs, embeddings, model outputs, agent traces, synthetic records, feedback signals. Together, $\alpha\beta$ is the AI contribution to the growth rate, added directly on top of the organic rate $r$.

The single most important thing this formula shows is the role of $n$: how often AI reprocesses the estate per year. In a savings account, daily compounding ($n = 365$) beats annual compounding ($n = 1$) because the interest starts earning interest sooner. Data works the same way. An organization running AI continuously (agents always on, pipelines always classifying, models always inferring) has a high $n$. Its data compounds faster than that of an organization that runs AI in scheduled batch jobs a few times a year. The more continuously we run AI, the faster the data estate grows. That is not a flaw in the AI; it is the compounding mechanism doing exactly what compounding does.

(For the mathematically inclined: as $n$ grows without bound — AI reprocessing the estate continuously rather than at discrete intervals — the formula approaches the continuous form $D_t = D_0\, e^{(r + \alpha\beta)t}$. The discrete and continuous versions are the same model; the continuous one is just the “compounded infinitely often” limit. The discrete formula above is the more honest one to reason with, because it makes the reprocessing frequency $n$ explicit instead of hiding it.)
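For readers who want the step behind that limit, here is a short sketch. Writing $x = r + \alpha\beta$ for the combined rate (a shorthand introduced only for this derivation), the continuous form follows from the standard limit $\lim_{n \to \infty}\left(1 + \frac{x}{n}\right)^{n} = e^{x}$:

$$
\lim_{n \to \infty} D_0\left(1 + \frac{x}{n}\right)^{nt}
= D_0\left[\lim_{n \to \infty}\left(1 + \frac{x}{n}\right)^{n}\right]^{t}
= D_0\, e^{xt}
= D_0\, e^{(r + \alpha\beta)t}
$$

For any finite $n$ and a positive combined rate, the discrete result sits below this continuous value, so the continuous form acts as an upper bound on what more frequent reprocessing can add.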

What $\alpha$ Looks Like in Practice

The ranges below are first-order engineering estimates, not measured benchmarks — except where noted. They exist to show the relative amplification of different AI activities and to make the point that $\alpha$ is often larger than intuition suggests. The right number for any organization is the one measured from its own pipelines. Treat this table as a prompt to go measure, not as a citation.

AI activity	Data generated	Approximate $\alpha$	Basis
LLM inference, metadata-only logging	Token counts, latency, model ID, request ID	0.01 – 0.1	Estimate
LLM inference, full prompt + completion logging	Full request and response payload retained for audit	0.5 – 2.0	Estimate
Embedding generation	One vector per chunk in a vector DB	1.5 – 24	Computed (see below)
Agent task execution	Tool call logs, retrieved chunks, action traces	0.2 – 1.0	Estimate
Model training run	Checkpoints, training metrics, dataset version pointers	0.3 – 1.0	Estimate
Synthetic data generation	New dataset derived from original	0.5 – 2.0+	Estimate

The embedding row is computed from documented numbers, and it is the one most likely to surprise people. OpenAI’s text-embedding-3-small returns a 1,536-dimension vector; text-embedding-3-large returns 3,072 dimensions (OpenAI embeddings documentation, verified May 2026). At 4 bytes per dimension (float32), that is 6 KB per vector for small and 12 KB for large, before the index overhead and metadata a vector database adds on top. Text is typically embedded in small chunks. A 1 KB chunk embedded with the large model produces a 12 KB vector: $\alpha = 12$. Halve the chunk to 0.5 KB and the same model gives $\alpha = 24$, the top of the range in the table. Even a generous 4 KB chunk with the small model yields $\alpha = 1.5$. Embedding routinely produces more stored data than the text it represents, the opposite of the intuition that AI “summarizes” data down.
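The embedding arithmetic is easy to reproduce. The helper below is a sketch: the function name is made up for this example, it takes 1 KB as 1,024 bytes, and it counts only the raw float32 vector, ignoring the index overhead and metadata mentioned above.

```python
def embedding_alpha(dimensions, chunk_bytes, bytes_per_dim=4):
    """Bytes of vector stored per byte of source text (float32 by default)."""
    return dimensions * bytes_per_dim / chunk_bytes


KB = 1024
print(embedding_alpha(3072, 1 * KB))  # 12.0  (large model, 1 KB chunk)
print(embedding_alpha(1536, 4 * KB))  # 1.5   (small model, 4 KB chunk)
```

Plugging in an organization's own chunk size and embedding dimension gives a first measured value of $\alpha$ for its embedding workload.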

This is exactly why measuring $\alpha$ matters rather than guessing it. The intuition that AI processing shrinks data is wrong for one of the most common AI workloads in production today. For the growth model, the relevant $\alpha$ is an effective average across whatever mix of activities an organization runs, weighted by volume. An organization doing heavy RAG ingestion (lots of embedding) will have a high effective $\alpha$; one doing mostly lightweight inference with metadata-only logging will have a low one. The worked examples below use deliberately conservative averages to show that the compounding problem is real even when $\alpha$ is small.
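A volume-weighted average is one straightforward way to compute an effective $\alpha$. In the sketch below, the workload mix is hypothetical and the per-activity values are picked from within the estimated ranges in the table; they are not measurements.

```python
def effective_alpha(workloads):
    """Volume-weighted average of per-activity amplification factors.

    workloads is a list of (gb_processed, alpha) pairs.
    """
    total = sum(gb for gb, _ in workloads)
    return sum(gb * alpha for gb, alpha in workloads) / total


# Hypothetical mix: mostly lightweight inference, some RAG ingestion.
mix = [
    (900, 0.05),  # LLM inference with metadata-only logging
    (100, 6.0),   # embedding generation
]
print(round(effective_alpha(mix), 3))  # 0.645
```

Even when embedding is only a tenth of the processed volume, it dominates the average, which is why the mix matters more than any single activity's figure.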

Even at the conservative end, take effective $\alpha = 0.1$ and $\beta = 0.5$ (AI touches half the estate), so $\alpha\beta = 0.05$ on top of an organic $r = 0.3$, for a combined rate of 0.35. Over five years with annual compounding ($n = 1$), a data estate of 100 units grows to 371 units through organic growth alone, and to 448 units with the AI contribution. That extra ~77 units is pure AI-generated data (logs, embeddings, traces) that has to be classified, cataloged, monitored, and governed like everything else.

At a more aggressive but still plausible setting, for an AI-heavy organization, effective $\alpha = 0.5$, $\beta = 0.8$, giving $\alpha\beta = 0.40$ and a combined rate of 0.70 — the same 100 units grow to roughly 1,420 units over five years at annual compounding. Run that AI more continuously (monthly reprocessing, $n = 12$) and the figure climbs past 3,000. And note that with the embedding $\alpha$ well above 1, an organization doing heavy RAG ingestion can exceed even this. Whatever governance coverage took years to build now covers a shrinking fraction of a far larger target.
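The worked figures can be checked with a few lines of code. The function names below are assumptions for this sketch; the formula is the discrete growth model given earlier, with the continuous form included for comparison.

```python
import math


def estate_size(d0, r, alpha, beta, years, n=1):
    """D_t = D_0 * (1 + (r + alpha * beta) / n) ** (n * t)."""
    rate = r + alpha * beta
    return d0 * (1 + rate / n) ** (n * years)


def estate_size_continuous(d0, r, alpha, beta, years):
    """Limit of the discrete model as n grows without bound."""
    return d0 * math.exp((r + alpha * beta) * years)


print(round(estate_size(100, 0.3, 0.0, 0.0, 5)))  # 371   organic growth only
print(round(estate_size(100, 0.3, 0.1, 0.5, 5)))  # 448   conservative AI setting
print(round(estate_size(100, 0.3, 0.5, 0.8, 5)))  # 1420  AI-heavy, annual compounding
print(round(estate_size(100, 0.3, 0.5, 0.8, 5, n=12)))       # just over 3,000
print(round(estate_size_continuous(100, 0.3, 0.5, 0.8, 5)))  # upper limit as n grows
```

Swapping in measured values for `r`, `alpha`, `beta`, and `n` turns the illustration into a forecast for a specific estate.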

Why This Means AI Usage Has to Be Governed

The implication is not “use less AI.” AI is genuinely better at this work than the alternatives, and the workloads that justify AI deployment are real. The implication is that AI usage itself becomes a governance object: something that must be measured, bounded, and held accountable, in the same way data is.

Governing AI usage means making three things explicit:

What AI is processing. Which datasets, which fields, which streams? Without this, the input to the growth equation is unknown — $\beta$ cannot even be estimated, much less managed.

What AI is producing. Logs, embeddings, traces, synthetic records, model outputs. Each is a new data type with its own classification, retention, and ownership requirements. The AI-generated data category covered earlier in this article exists for this reason — to make the produced data a first-class governance concern rather than an afterthought.

The rate of growth itself. The current size of the estate, $D_t$, is a number every organization can compute. Its rate of change, $dD/dt$, is the number that matters, and the one most programs do not track. Without it, growth in coverage cannot be compared to growth in the estate.

The mental shift is from “do we govern our data?” to “is our governance growing at least as fast as the data estate is growing?” If governance coverage increases by 20% per year and the data estate grows by 35% per year, governance is losing ground regardless of how much absolute coverage exists. Coverage is a rate, not a state.
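The 20% versus 35% comparison can be made concrete by tracking the governed share of the estate over time. The sketch below assumes both quantities grow at constant annual rates, and the starting point of 80 governed units out of 100 is a hypothetical figure chosen for illustration.

```python
def coverage_share(covered0, estate0, coverage_growth, estate_growth, years):
    """Fraction of the estate under governance after `years`,
    assuming constant annual growth rates for both quantities."""
    covered = covered0 * (1 + coverage_growth) ** years
    estate = estate0 * (1 + estate_growth) ** years
    return covered / estate


# Hypothetical start: 80 of 100 units governed.
for year in range(6):
    share = coverage_share(80, 100, 0.20, 0.35, year)
    print(year, round(share, 2))
# Year 0 prints 0.8; year 5 prints 0.44.
```

Governed volume more than doubles over the five years, yet the governed share falls from 80% to about 44%. That is what it means for coverage to be a rate rather than a state.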

This is why discovery must run continuously, why AI-assisted classification is not optional at scale, and why the catalog has to be a live system fed by automated pipelines. The organic growth rate alone was already winning against manual governance processes. Once AI raises the rate — and especially once it raises the compounding frequency — no version of governance works without systematic automation, and without an explicit accounting of how much AI processing the organization is generating.

Compound interest is a friend when we control the rate. Without governance, AI-driven data growth is compound interest, with the rate controlled by every team that deploys a new model, agent, or pipeline — and the bill is paid by whoever has to maintain the data estate.

Summary

In short, know the data = Discovery + Description, and metadata is the place where we store what we know about the data.

We now have a map of the data estate: what it contains, how sensitive it is, who owns it, where it came from, and where it goes. The next question is obvious: who gets to touch it, and how do we keep it safe?

Next in the series: Who Gets Access and How Do We Keep It Safe?

Self-evaluation: Do We Really Know Our Data? — a checklist for the Know the Data pillar.

Demystifying Data Governance – What Do We Have and What Does It Mean?