Data catalogs are becoming an essential component in the new data world. They are an inventory of an organisation's data assets. The metadata collected helps in finding the most appropriate data at speed and in knowing what data is held, its security levels and its lineage. Microsoft has a tool in preview that helps with data governance, called Azure Purview.
Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multicloud and software-as-a-service (SaaS) data. Easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification and end-to-end data lineage. Empower data consumers to find valuable, trustworthy data.
Once installed, it moves into the hands of data governors. There are predefined data plane roles dictating who can access what:
Purview Data Reader - has access to the Purview portal and can read all content except for scan bindings.
Purview Data Curator - has access to the Purview portal and can read all content except for scan bindings, can edit information about assets, classification definitions and glossary terms, and can apply classifications and glossary terms to assets.
Purview Data Source Administrator - can manage all aspects of scanning data into Azure Purview but does not have read or write access to content beyond that related to scanning. The role does not have access to the Purview portal (the user also needs to be in the Data Reader or Data Curator role).
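To make the way these roles combine more concrete, here is a minimal sketch that models them as plain Python data. It is only an illustration of the descriptions above, not how Purview stores or evaluates permissions; the Role class, its field names and the effective_permissions helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """Hypothetical model of a Purview data plane role (illustration only)."""
    name: str
    portal_access: bool
    read_content: bool
    edit_content: bool
    manage_scans: bool


# Capabilities transcribed from the role descriptions above.
ROLES = {
    "Purview Data Reader": Role("Purview Data Reader", True, True, False, False),
    "Purview Data Curator": Role("Purview Data Curator", True, True, True, False),
    "Purview Data Source Administrator": Role(
        "Purview Data Source Administrator", False, False, False, True
    ),
}


def effective_permissions(role_names):
    """Combine roles: a capability is granted if any assigned role grants it."""
    roles = [ROLES[name] for name in role_names]
    return {
        "portal_access": any(r.portal_access for r in roles),
        "read_content": any(r.read_content for r in roles),
        "edit_content": any(r.edit_content for r in roles),
        "manage_scans": any(r.manage_scans for r in roles),
    }


admin_only = effective_permissions(["Purview Data Source Administrator"])
admin_and_reader = effective_permissions(
    ["Purview Data Source Administrator", "Purview Data Reader"]
)
```

In this sketch admin_only has no portal access, which mirrors why a Data Source Administrator is usually also given the Data Reader or Data Curator role.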
Access to Purview is through Purview Studio. On the home page there are 2 menus. The quick access menu in the centre of the page takes you to
- Knowledge centre
- Register sources
- Browse assets
- Manage glossary
The left-hand menu has 5 icons
Home – to see a summary of sources
Sources – to add and manage sources
Glossary – to add and manage glossary term collections
Insights – the visual dashboards of the assets
Management – a place to perform administrative tasks
There are 3 areas: Data Map, Data Catalog and Data Insights
The Data Map shows your sources in collection groupings of your choice. Many data source types are supported by default, with more being added all the time. Azure Purview is built on top of Apache Atlas, and with its API you can gain extra functionality.
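As a rough idea of what working with the Atlas API can look like, the sketch below prepares (but does not send) a request that creates an entity through the Apache Atlas v2 REST endpoint /api/atlas/v2/entity. The base URL, the token, the qualified name and the build_entity_request helper are placeholders and assumptions; check the endpoint for your own Purview account and how you obtain an Azure AD token before using anything like this.

```python
import json
from urllib.request import Request


def build_entity_request(base_url, token, type_name, qualified_name, name):
    """Prepare a POST to the Atlas v2 entity endpoint without sending it.

    base_url and token are placeholders supplied by the caller.
    """
    payload = {
        "entity": {
            "typeName": type_name,
            "attributes": {
                "qualifiedName": qualified_name,
                "name": name,
            },
        }
    }
    return Request(
        url=base_url.rstrip("/") + "/api/atlas/v2/entity",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_entity_request(
    base_url="https://example-account.invalid/catalog",
    token="<access-token>",
    type_name="DataSet",  # generic type from the Atlas base model
    qualified_name="custom://sales/customers",
    name="customers",
)
```

The request could then be passed to urllib.request.urlopen, or the same payload sent with whichever HTTP client you prefer.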
Once a data source is added and the relevant permissions are granted on the asset, regular scans can be scheduled. Network isolation gives the ability to scan on-premises and Azure data sources behind a VNet using a self-hosted integration runtime (SHIR), keeping the scan isolated end to end.
Governing your scan has these steps
- Register your source
- Apply and set up your credentials
- Set up and run your scan
- Discover your SQL server data
The business glossary contains a set of terms the business uses. There might be multiple terms in the business that mean the same or nearly the same thing.
- synonyms - different terms with the same definition
- related - different terms with similar definitions
There are some default attributes, which can be extended with custom attributes. The default attributes are
- Data stewards
- Data experts
- Related terms
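The difference between synonyms and related terms can be sketched with a small in-memory structure. This is only an illustration of the idea and not Purview's glossary model or API; the GlossaryTerm and Glossary classes, their field names and the sample terms are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class GlossaryTerm:
    """Hypothetical glossary term with default and custom attributes."""
    name: str
    definition: str
    data_stewards: List[str] = field(default_factory=list)
    data_experts: List[str] = field(default_factory=list)
    synonyms: Set[str] = field(default_factory=set)
    related_terms: Set[str] = field(default_factory=set)
    custom_attributes: Dict[str, str] = field(default_factory=dict)


class Glossary:
    def __init__(self):
        self.terms: Dict[str, GlossaryTerm] = {}

    def add(self, term):
        self.terms[term.name] = term

    def link_synonyms(self, a, b):
        """Synonyms: different terms with the same definition (symmetric)."""
        self.terms[a].synonyms.add(b)
        self.terms[b].synonyms.add(a)

    def link_related(self, a, b):
        """Related: different terms with similar definitions (symmetric)."""
        self.terms[a].related_terms.add(b)
        self.terms[b].related_terms.add(a)


glossary = Glossary()
glossary.add(GlossaryTerm("Customer", "A party that buys goods or services",
                          data_stewards=["steward@example.com"]))
glossary.add(GlossaryTerm("Client", "A party that buys goods or services"))
glossary.add(GlossaryTerm("Prospect", "A party that may buy goods or services",
                          custom_attributes={"Department": "Sales"}))
glossary.link_synonyms("Customer", "Client")
glossary.link_related("Customer", "Prospect")
```

Keeping the links symmetric means that looking up either term leads you to the other, which is the behaviour most people expect from a glossary.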
Classification and labelling
Labelling data consistently is important because it helps people communicate about it. Classifications can
- describe a type of data that exists in a data asset or schema, to help identify the content of the asset
- describe a data preparation process
- help with compliance
Azure Purview provides a set of default classification rules, which are applied automatically during scanning. Purview uses the same classifications, also known as sensitive information types, as Microsoft 365. Azure Purview also integrates with Microsoft Information Protection sensitivity labels. There is automated scanning and labelling for files in Azure Blob Storage and Azure Data Lake Storage Gen 1 and Gen 2, and automatic labelling for database columns in SQL Server, Azure SQL Database, Azure SQL Database Managed Instance, Azure Synapse and Azure Cosmos DB. The default classification rules are not editable, although it is possible to define your own custom classification rules using regex or custom expressions. Classification looks at a sample of the data (e.g. select top 100 from customers) for data profiling, custom pattern matching and expression matching, and applies the classification tags afterwards.
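The idea of sampling values and applying a regex rule can be sketched in a few lines of Python. This is not Purview's classification engine; the classify_column function, the rule names, the patterns and the 60% match threshold are all assumptions made for the illustration.

```python
import re

# Hypothetical custom rules: classification name -> regex the whole value must match.
RULES = {
    "EMPLOYEE_ID": r"EMP-\d{6}",
    "EMAIL_LIKE": r"[^@\s]+@[^@\s]+\.[^@\s]+",
}


def classify_column(sample_values, rules, threshold=0.6):
    """Return the rule names matching at least `threshold` of the non-empty samples."""
    values = [str(v) for v in sample_values if v not in (None, "")]
    if not values:
        return []
    tags = []
    for name, pattern in rules.items():
        regex = re.compile(pattern)
        matched = sum(1 for v in values if regex.fullmatch(v))
        if matched / len(values) >= threshold:
            tags.append(name)
    return tags


# A small sample of column values, as a scan might read from the top rows of a table.
employee_tags = classify_column(
    ["EMP-000123", "EMP-004567", "n/a", "EMP-999999"], RULES
)
email_tags = classify_column(["a@b.com", "c@d.org", None], RULES)
free_text_tags = classify_column(["hello", "world"], RULES)
```

Using a threshold rather than requiring every value to match lets a column still be tagged when a few values are dirty, such as the "n/a" in the first sample.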
The Data Catalog enables you to search your data, use workflows in the business glossary and view data lineage for sources in the data ingestion pipeline. It connects with tools such as Power BI, Azure Data Factory and Azure Synapse, and it enables curation and collaboration.
Data Insights is a key pillar in Purview, giving a single pane of glass over the catalog.
The insight reports that currently exist are
- Asset Insights
- Scan Insights
- Glossary Insights
- Classification Insights
- Sensitivity Labelling Insights
- File Extension Insights