Data Governance
Unity Catalog and Enterprise Data Governance for AI-Ready Platforms
Abstract
Unity Catalog is Databricks' unified governance solution for data and AI assets. It introduces a three-level namespace (catalog, schema, table/volume), attribute-based access control, automated lineage tracking, and a centralised asset registry across all Databricks workspaces in an account. This note examines its core capabilities, how they compose into an enterprise governance model, and the specific implications for organisations building AI-ready data platforms where trust, traceability, and access control are non-negotiable.
Why Governance Matters for AI-Ready Platforms
Governance in data engineering is often treated as a compliance activity—something applied after the platform is built to satisfy audit requirements. This ordering is wrong, particularly for platforms intended to support AI workloads. AI systems amplify governance failures: a model trained on data with incorrect access controls can leak PII across business unit boundaries; a feature pipeline without lineage tracking makes it impossible to identify and remediate the impact of an upstream data quality incident.
The governance-first approach requires tooling that makes access control, lineage, and discoverability first-class platform capabilities rather than afterthoughts. Unity Catalog provides this foundation for the Databricks Lakehouse, offering a single governance surface across workload types: SQL warehouses, notebooks, Delta Live Tables pipelines, and ML runtimes.
Under GDPR Article 25, personal data must be processed with appropriate access controls and data minimisation by default—not as a remediation step applied after a breach. Unity Catalog's column-level masking and row filter capabilities provide the technical enforcement mechanism: analysts and ML engineers receive only the data they are authorised to process, regardless of how they construct their queries. For organisations subject to EU AI Act Article 10, which requires documentation of training data governance for high-risk AI systems, Unity Catalog's lineage tracking can provide an auditable record of where training data originates and the transformations applied to it.
Access Control and Permissions
Security Model
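Unity Catalog enforces access control over a hierarchy of securable objects: metastore → catalog → schema → table, view, or volume. The last three levels form the three-level namespace used in queries (catalog.schema.table); the metastore sits above it at the account level. Privileges are granted on objects using standard SQL GRANT/REVOKE syntax, are inherited by child objects, and integrate with Microsoft Entra ID (formerly Azure Active Directory) groups for principal management, with attribute-based policies building on the same hierarchy. This makes Unity Catalog's permission model familiar to data engineers and DBAs while providing the centralised enforcement that was previously absent from Databricks' workspace-scoped Hive metastore.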
Column-level security and row filters are available for tables where data sensitivity varies within a dataset, for example a customer table where phone numbers must be masked for analysts but visible to the data engineering team. These controls are enforced by the query engine, so they protect only access that goes through Unity Catalog. Storage account permissions must therefore be configured so that principals cannot read the underlying Parquet files directly from ADLS.
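The phone-number example can be sketched as generated SQL. This is a minimal illustration, not a definitive implementation: the table, function, and group names are hypothetical, and the statements assume Unity Catalog's documented ALTER TABLE ... SET MASK and SET ROW FILTER syntax together with the is_account_group_member() function.

```python
def column_mask_statements(table, column, mask_fn, unmasked_group, placeholder="***"):
    """Return SQL that creates a masking function and attaches it to a column."""
    create_fn = (
        f"CREATE OR REPLACE FUNCTION {mask_fn}(value STRING) "
        f"RETURN CASE WHEN is_account_group_member('{unmasked_group}') "
        f"THEN value ELSE '{placeholder}' END"
    )
    attach = f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {mask_fn}"
    return [create_fn, attach]


def row_filter_statements(table, column, filter_fn, unrestricted_group, allowed_value):
    """Return SQL that creates a row filter function and attaches it to a table."""
    create_fn = (
        f"CREATE OR REPLACE FUNCTION {filter_fn}(value STRING) "
        f"RETURN is_account_group_member('{unrestricted_group}') "
        f"OR value = '{allowed_value}'"
    )
    attach = f"ALTER TABLE {table} SET ROW FILTER {filter_fn} ON ({column})"
    return [create_fn, attach]


# Hypothetical names, chosen only for illustration.
statements = column_mask_statements(
    table="customer.crm.customers",
    column="phone_number",
    mask_fn="customer.crm.mask_phone",
    unmasked_group="data_engineering",
) + row_filter_statements(
    table="customer.crm.customers",
    column="region",
    filter_fn="customer.crm.region_filter",
    unrestricted_group="data_engineering",
    allowed_value="EU",
)

for statement in statements:
    print(statement + ";")
```

Members of the hypothetical data_engineering group see the raw phone number and every row; everyone else sees the placeholder and only rows where region is 'EU'.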
Dynamic data masking, where sensitive columns are automatically obscured based on the principal's group membership, is implemented as a masking function attached to a column definition. The masking logic runs at query time and is transparent to the querying application.
Operationally, managing permissions at scale requires treating grants as code: storing GRANT and REVOKE statements in version control, applying them through CI/CD pipelines, and reviewing permission sets as part of data product release processes. Ad-hoc permission grants accumulate silently into permission sprawl that is difficult to audit and nearly impossible to clean up retroactively.
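One way to treat grants as code is to keep the desired permission set in version control and let a pipeline compute the difference against what is currently granted. The sketch below assumes a hypothetical Grant record and hypothetical object and group names; how the current grants are read from the platform is left out.

```python
from typing import NamedTuple


class Grant(NamedTuple):
    privilege: str        # e.g. SELECT, MODIFY, USE SCHEMA
    securable_type: str   # e.g. TABLE, SCHEMA, CATALOG
    securable: str        # fully qualified object name
    principal: str        # group or service principal name


def plan_changes(desired, current):
    """Return REVOKE statements for unexpected grants, then GRANTs for missing ones."""
    statements = []
    for g in sorted(current - desired):
        statements.append(
            f"REVOKE {g.privilege} ON {g.securable_type} {g.securable} FROM `{g.principal}`"
        )
    for g in sorted(desired - current):
        statements.append(
            f"GRANT {g.privilege} ON {g.securable_type} {g.securable} TO `{g.principal}`"
        )
    return statements


# Desired state, as it might be stored in version control (hypothetical names).
desired = {
    Grant("SELECT", "TABLE", "finance.ledger.entries", "finance_analysts"),
    Grant("USE SCHEMA", "SCHEMA", "finance.ledger", "finance_analysts"),
}
# Current state, including an ad-hoc grant nobody recorded.
current = {
    Grant("SELECT", "TABLE", "finance.ledger.entries", "finance_analysts"),
    Grant("MODIFY", "TABLE", "finance.ledger.entries", "contractors"),
}

plan = plan_changes(desired, current)
for statement in plan:
    print(statement + ";")
```

Reviewing the generated plan in a pull request makes the ad-hoc MODIFY grant visible before it is revoked, which is the audit trail that manual grants lack.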
Lineage and Discoverability
Unity Catalog's automated lineage captures table-to-table and column-to-column dependencies across notebook runs, SQL queries, Delta Live Tables pipelines, and Databricks Jobs. This lineage is captured passively—it does not require pipeline developers to annotate their code—which means it accumulates over time as normal platform activity occurs rather than requiring a dedicated lineage instrumentation project.
Column-level lineage is particularly valuable for AI governance. When a feature used in a production model traces back to a specific column in a Silver table, and that column contains data from a source system that experiences a quality incident, Unity Catalog makes it possible to answer the question 'which models need to be retrained?' in minutes rather than days.
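The impact question reduces to a graph traversal over lineage edges. The sketch below is illustrative only: the edges are hard-coded, hypothetical (source, target) pairs, and the "model:" prefix is an invented convention for marking model nodes. In practice the edges would come from whatever lineage export the platform provides.

```python
from collections import defaultdict, deque


def downstream_assets(edges, start):
    """Breadth-first search: every asset reachable from `start` via lineage edges."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical column-level lineage edges.
EDGES = [
    ("silver.crm.customers.postcode", "gold.features.customer_geo.region"),
    ("gold.features.customer_geo.region", "model:churn_classifier"),
    ("gold.features.customer_geo.region", "model:ltv_regressor"),
    ("silver.crm.customers.email", "gold.marketing.contacts.email"),
]

impacted_models = sorted(
    asset
    for asset in downstream_assets(EDGES, "silver.crm.customers.postcode")
    if asset.startswith("model:")
)
print(impacted_models)
```

Starting from the affected Silver column, the traversal returns the two models fed by the derived feature and ignores the unrelated email lineage.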
Discoverability is provided through the Databricks workspace search and the Catalog Explorer UI, with the ability to add descriptions, tags, and owner metadata to assets. Some organisations require governance spanning multiple platforms: Unity Catalog for Databricks, Microsoft Purview (formerly Azure Purview) for Azure SQL and Blob Storage, and Collibra or Alation for business glossary and data stewardship. In that case, lineage captured by Unity Catalog can be read from its lineage system tables and REST API and exported to external catalogue tools, enabling a unified data estate view without duplicating the metadata management burden.
Catalogs, Schemas and Data Products
Namespace Diagram Placeholder
To be inserted in the final version.
The three-level namespace (catalog → schema → table) maps naturally to common data product design patterns. A catalog can represent a data domain (finance, operations, customer), a schema represents a data product or subject area within that domain, and tables represent the individual assets exposed by that product. This mapping makes Unity Catalog's namespace a natural surface for implementing data mesh ownership principles.
Cross-catalog querying is fully supported, which means a Gold table in the analytics catalog can join directly with a reference table in the shared catalog without requiring data copying or view proxies. This is a significant improvement over the previous workspace-scoped model, where cross-workspace data sharing required either a shared external storage location with manually managed permissions or Databricks Delta Sharing.
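A cross-catalog join only requires fully qualified three-part names. The sketch below builds such a query; the catalog, schema, table, and column names are hypothetical, and the identifier check is a deliberate simplification that accepts only plain alphanumeric names.

```python
def qualified(catalog, schema, table):
    """Build a catalog.schema.table name, rejecting anything but simple identifiers."""
    parts = (catalog, schema, table)
    for part in parts:
        if not part.isidentifier():
            raise ValueError(f"not a simple identifier: {part!r}")
    return ".".join(parts)


sales = qualified("analytics", "gold", "sales")
rates = qualified("shared", "reference", "currency_rates")

# Join a Gold table with a shared reference table; no copy or view proxy involved.
query = (
    "SELECT s.order_id, s.amount * r.rate AS amount_eur\n"
    f"FROM {sales} AS s\n"
    f"JOIN {rates} AS r ON s.currency = r.currency"
)
print(query)
```

The querying principal still needs the relevant privileges on both catalogs; the namespace removes the data movement, not the access check.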
Catalog proliferation is a real governance risk: if every team creates its own catalog without coordination, the three-level namespace quickly becomes as fragmented as the pre-Unity Catalog state, just with a more structured appearance. A practical constraint is to limit catalog creation authority to the central platform team, while granting domain teams full schema-level autonomy within their allocated catalog. This single policy decision has an outsized effect on long-term governance coherence and is worth establishing formally before the platform reaches more than a handful of data product teams.
Governance Patterns in Enterprise Environments
In a multi-team enterprise environment, Unity Catalog governance is most effective when it is configured and maintained by a central data platform team rather than by individual data product teams. This does not mean central control of data content—data product teams retain ownership of their schemas and tables—but it does mean centralised management of catalogs, top-level permission structures, and audit logging configuration.
A practical governance pattern is the separation of the platform catalog (containing shared infrastructure datasets such as holiday calendars, currency rates, and geography reference data) from domain catalogs (containing data products owned by domain teams). The platform catalog is maintained by the central team; domain teams are granted USE CATALOG and CREATE SCHEMA on their own domain catalog, so they can create and own schemas within it. This separation prevents the governance anti-pattern of domain teams creating objects in shared namespaces.
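The platform/domain split can be written down as a small policy generator. This is a sketch under stated assumptions: the group and catalog names are hypothetical, and it assumes Unity Catalog's CREATE CATALOG, USE CATALOG, USE SCHEMA, CREATE SCHEMA, and SELECT privileges, with privileges granted on a catalog inherited by the objects inside it.

```python
def catalog_policy(platform_group, platform_catalog, domains):
    """Generate grants for a central platform team and a set of domain teams.

    `domains` maps each domain catalog name to the group that owns it.
    """
    # Only the platform team may create new catalogs.
    statements = [f"GRANT CREATE CATALOG ON METASTORE TO `{platform_group}`"]
    for catalog, team in sorted(domains.items()):
        # Domain teams create and own schemas inside their own catalog.
        statements.append(
            f"GRANT USE CATALOG, CREATE SCHEMA ON CATALOG {catalog} TO `{team}`"
        )
        # Shared reference data is read-only for domain teams.
        statements.append(
            f"GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG {platform_catalog} TO `{team}`"
        )
    return statements


policy = catalog_policy(
    platform_group="data_platform_team",
    platform_catalog="platform",
    domains={"finance": "finance_team", "customer": "customer_team"},
)
for statement in policy:
    print(statement + ";")
```

Because domain teams receive no CREATE privilege on the platform catalog, they can read shared reference data but cannot add objects to the shared namespace.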
Unity Catalog's system tables, exposed in the system catalog, record data access events and permission changes (system.access.audit) and query history (system.query.history) in queryable Delta tables. Standard SQL queries against these tables can produce the audit evidence most compliance frameworks require: which principals accessed a given table in the last 30 days, which service accounts hold permissions that were never exercised, which tables have no documented owner. Running these queries on a scheduled basis and routing anomalies to a governance dashboard costs little to build and can surface permission drift before it becomes a compliance problem.
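The first of those audit questions might look like the query built below. Treat it as a sketch: the column names (user_identity.email, service_name, action_name, request_params.full_name_arg, event_date) and the 'getTable' action name are assumptions about the audit table's schema and should be checked against the workspace before use; the table name is hypothetical.

```python
import re

# Accept only plain three-part names so the value can be embedded in SQL safely.
_TABLE_NAME = re.compile(r"^[A-Za-z_]\w*\.[A-Za-z_]\w*\.[A-Za-z_]\w*$")


def table_access_query(table_full_name, days=30):
    """Build a query listing principals that accessed a table in the last `days` days."""
    if not _TABLE_NAME.match(table_full_name):
        raise ValueError(f"expected catalog.schema.table, got {table_full_name!r}")
    if not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    return (
        "SELECT user_identity.email AS principal,\n"
        "       COUNT(*) AS access_events,\n"
        "       MAX(event_time) AS last_access\n"
        "FROM system.access.audit\n"
        "WHERE service_name = 'unityCatalog'\n"
        "  AND action_name = 'getTable'\n"
        f"  AND request_params.full_name_arg = '{table_full_name}'\n"
        f"  AND event_date >= date_sub(current_date(), {days})\n"
        "GROUP BY user_identity.email\n"
        "ORDER BY last_access DESC"
    )


audit_sql = table_access_query("finance.ledger.entries", days=30)
print(audit_sql)
```

Scheduling a query like this and comparing its output against the granted principals is one way to find permissions that were never exercised.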
My Opinion / Critique
Editorial
Unity Catalog is the most significant architectural improvement to the Databricks platform in recent years, and its adoption should be treated as mandatory for any new enterprise deployment. The migration from legacy Hive metastore—while technically involved—is worth prioritising because the governance capabilities it unlocks are foundational to everything else: reliable AI workloads, defensible compliance posture, and a data catalogue that engineers actually use.
The main limitation today is coverage outside the Databricks boundary. Unity Catalog governs Databricks-native assets comprehensively, but organisations with mixed platforms (some tables in Azure Synapse, some in Azure SQL, some in Databricks) will need a separate governance layer to achieve a unified view, with Microsoft Purview being the natural choice in Azure environments. Unity Catalog and Purview are complementary rather than competing tools, but the integration requires deliberate design.
Open Questions
How does Unity Catalog's permission model interact with Databricks clusters running in customer-managed VNets, where network-level access controls may conflict with catalog-level permissions? What is the operational model for rotating service principal credentials used by automated pipelines when those principals have permissions granted across dozens of catalogs and schemas?
The open table format question carries the most strategic weight. Unity Catalog now supports Apache Iceberg and Apache Hudi tables alongside Delta, positioning it as a potential cross-format governance layer rather than a Delta-only solution. ML models can already be registered as Unity Catalog assets, but Kafka topics and BI assets remain outside its scope. Whether the broadening eventually extends to them will determine whether Unity Catalog can serve as the single governance surface for a fully unified data and AI estate, or whether organisations will continue to manage multiple overlapping governance tools for the foreseeable future.
Daniel Conejo Sobrino
Enterprise Data Engineer