Tuesday, 6 July 2021

Azure Purview - A Data Catalog

Data catalogs are becoming an essential component in the new data world. They are an inventory of an organisation's data assets. The metadata collected helps in finding the most appropriate data at speed and in knowing what data is held, its security levels and its lineage. Microsoft have a tool in preview that helps with data governance, called Azure Purview.

Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multicloud and software-as-a-service (SaaS) data. Easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification and end-to-end data lineage. Empower data consumers to find valuable, trustworthy data.




Once installed, it moves into the hands of the data governors. Predefined data plane roles dictate who can access what.

Purview Data Reader - has access to the Purview portal and can read all content except for scan bindings.

Purview Data Curator - has access to the Purview portal and can read all content except for scan bindings, can edit information about assets, classification definitions and glossary terms, and can apply classifications and glossary terms to assets.

Purview Data Source Administrator - can manage all aspects of scanning data into Azure Purview but does not have read or write access to content beyond that related to scanning. The role does not have access to the Purview portal on its own (the user also needs to be in the Data Reader or Data Curator role).
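As a rough illustration of how these roles combine, here is a minimal Python sketch. The permission labels (portal_access, read_content and so on) are made up for this post and are not Purview API identifiers; the point is only that a Data Source Administrator needs a second role before they can open the portal.

```python
# Hypothetical model of the three data plane roles described above.
# The permission labels are illustrative only, not Purview identifiers.
ROLE_PERMISSIONS = {
    "Purview Data Reader": {"portal_access", "read_content"},
    "Purview Data Curator": {
        "portal_access",
        "read_content",
        "edit_assets",
        "apply_classifications",
    },
    "Purview Data Source Administrator": {"manage_scanning"},
}


def effective_permissions(roles):
    """Return the union of permissions across all roles held by a user."""
    perms = set()
    for role in roles:
        perms |= ROLE_PERMISSIONS[role]
    return perms


def can(roles, permission):
    """True if any of the user's roles grants the given permission."""
    return permission in effective_permissions(roles)
```

For example, can(["Purview Data Source Administrator"], "portal_access") is False, but adding "Purview Data Reader" to the list makes it True.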

Access to Purview is through Purview Studio. On the home page there are 2 menus. The quick access menu in the centre of the page takes you to:

  • Knowledge centre
  • Register sources
  • Browse assets
  • Manage glossary

The left-hand menu has 5 icons:

Home – to see a summary of sources

Sources – to add and manage sources

Glossary – to add and manage glossary term collections

Insights – the visual dashboards of the assets

Management – a place to perform administrative tasks

 












 

There are 3 areas: Data Map, Data Catalog and Data Insights.

Data Map

The data map shows your sources in collection groupings of your choice. Many data source types are supported by default, with more being added all the time. Azure Purview is built on top of Apache Atlas, and its API gives you extra functionality.
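To give a feel for what working against an Atlas-style API looks like, here is a small Python sketch that only builds request URLs and makes no network calls. The paths types/typedefs, entity/guid/{guid} and search/basic are standard Apache Atlas v2 REST routes; the base URL is a placeholder, and which routes your Purview account exposes (and how you authenticate) is an assumption to check against the Purview documentation.

```python
from urllib.parse import quote, urlencode

# Standard prefix for the Apache Atlas v2 REST API.
ATLAS_V2 = "/api/atlas/v2"


def atlas_url(base_url, path, params=None):
    """Join a base URL, the Atlas v2 prefix and a route, plus optional query params."""
    url = base_url.rstrip("/") + ATLAS_V2 + "/" + path.lstrip("/")
    if params:
        url += "?" + urlencode(params)
    return url


def typedefs_url(base_url):
    """URL listing all type definitions known to the catalog."""
    return atlas_url(base_url, "types/typedefs")


def entity_url(base_url, guid):
    """URL for a single entity (asset) looked up by its GUID."""
    return atlas_url(base_url, "entity/guid/" + quote(guid, safe=""))


def basic_search_url(base_url, query, limit=10):
    """URL for a basic free-text search over entities."""
    return atlas_url(base_url, "search/basic", {"query": query, "limit": limit})
```

You would pass the resulting URL to an HTTP client of your choice along with a bearer token for the account.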




Once a data source is added and the related permissions are granted on the asset, regular scans can be scheduled. Network isolation gives the ability to scan on-premises and Azure data sources behind a VNet using a self-hosted integration runtime (SHIR), ensuring end-to-end network isolation.

Governing your scan has these steps:

  • Register your source
  • Apply and set up your credentials
  • Set up and run your scan
  • Discover your SQL server data

Glossary

This contains a set of terms the business uses. There might be multiple terms in the business that mean the same thing.

  • synonyms - different terms with the same definition
  • related - different terms with similar definitions

There are some default attributes, and these can be extended with custom attributes. The default attributes are

  • Name
  • Definition
  • Data stewards
  • Data experts
  • Acronym
  • Synonyms
  • Related terms
  • Resources
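The default attributes above map naturally onto a simple record. Here is a minimal Python sketch of a glossary term, assuming a plain dataclass; the field names and the link_synonyms helper are hypothetical and are not how Purview stores terms internally.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    # Default attributes listed above.
    name: str
    definition: str = ""
    data_stewards: list = field(default_factory=list)
    data_experts: list = field(default_factory=list)
    acronym: str = ""
    synonyms: list = field(default_factory=list)
    related_terms: list = field(default_factory=list)
    resources: list = field(default_factory=list)
    # Anything the business adds on top of the defaults.
    custom_attributes: dict = field(default_factory=dict)


def link_synonyms(a, b):
    """Record two terms as synonyms of each other, without duplicates."""
    if b.name not in a.synonyms:
        a.synonyms.append(b.name)
    if a.name not in b.synonyms:
        b.synonyms.append(a.name)
```

So "Customer" and "Client" can be two terms with the same definition, linked in both directions, while a custom attribute such as a review date lives in custom_attributes.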

Classification and labelling

Labelling of data is important to aid with communication. Consistency is important. Classifications can describe

  • The type of data that exists in a data asset or schema, to help identify the content of the asset
  • A data preparation process
  • Information that helps with compliance

Azure Purview provides a set of default classification rules, which are applied automatically during scanning. Purview uses the same classifications, also known as sensitive information types, as Microsoft 365. Azure Purview integrates with Microsoft Information Protection sensitivity labels. There is automated scanning and labelling for files in Azure Blob Storage and Azure Data Lake Storage Gen 1 and Gen 2, and automatic labelling for database columns in SQL Server, Azure SQL Database, Azure SQL Database Managed Instance, Azure Synapse and Azure Cosmos DB.

The default classification rules are not editable, although it is possible to define your own custom classification rules using regex or custom expressions. Classification looks at a sample of the data (e.g. select top 100 * from customers) for data profiling, custom pattern matching and expression matching, and applies the classification tags afterwards.
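To make the custom rule idea concrete, here is a minimal Python sketch of regex-based classification over a sample of column values. It is not how Purview implements scanning: the classify_column function, the rule names, the patterns and the 60% match threshold are all assumptions chosen for illustration.

```python
import re


def classify_column(sample_values, rules, min_match_ratio=0.6):
    """Return the names of rules matched by enough of the sampled values.

    A rule applies when its pattern fully matches at least min_match_ratio
    of the non-empty values in the sample.
    """
    values = [
        str(v).strip()
        for v in sample_values
        if v is not None and str(v).strip()
    ]
    if not values:
        return []
    tags = []
    for name, pattern in rules.items():
        compiled = re.compile(pattern)
        hits = sum(1 for v in values if compiled.fullmatch(v))
        if hits / len(values) >= min_match_ratio:
            tags.append(name)
    return tags


# Hypothetical custom rules: an internal customer reference and a
# deliberately simplified UK postcode pattern.
CUSTOM_RULES = {
    "Customer Reference": r"CUST-\d{6}",
    "UK Postcode (simplified)": r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}",
}
```

Given a sample such as ["CUST-000123", "CUST-004567", "n/a", "CUST-998877"], three of the four values match the first pattern, so the column would be tagged "Customer Reference". The threshold stops a single stray match from tagging a whole column.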



 

 

















Data Catalog

This enables you to search your data, use workflows in the business glossary and view data lineage for sources in the data ingestion pipeline. It connects with tools such as Power BI, Azure Data Factory and Azure Synapse, and it enables curation and collaboration.

Data Insights

This is a key pillar in Purview to enable a single pane of glass in the catalog.

The insight reports that currently exist are

  • Asset Insights
  • Scan Insights
  • Glossary Insights
  • Classification Insights
  • Sensitivity labelling insights
  • File Extension Insights
