Data catalogs are becoming an essential component in the new data world. They are an inventory of an organisation's data assets. The metadata collected helps you find the most appropriate data at speed and know what data is held, its security levels and its lineage. Microsoft has a tool in preview that helps with data governance, called Azure Purview.
Azure Purview is a unified data governance service that helps you manage and govern
your on-premises, multicloud and software-as-a-service (SaaS) data. Easily
create a holistic, up-to-date map of your data landscape with automated data
discovery, sensitive data classification and end-to-end data lineage. Empower
data consumers to find valuable, trustworthy data.
Once installed, it moves into the hands of data governors. There are predefined data plane roles dictating who can access what.
- Purview Data Reader - has access to the Purview portal and can read all content except for scan bindings.
- Purview Data Curator - has access to the Purview portal and can read all content except for scan bindings, can edit information about assets, classification definitions and glossary terms, and can apply classifications and glossary terms to assets.
- Purview Data Source Administrator - can manage all aspects of scanning data into Azure Purview but does not have read or write access to content beyond that related to scanning. The role does not have access to the Purview portal (the user also needs to be in the Data Reader or Data Curator role).
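To make the way the roles combine a little more concrete, here is a minimal sketch in Python. It is not the Purview API: the `Permission` flags, the `ROLE_PERMISSIONS` mapping and the helper functions are hypothetical names made up for illustration, and only the role names and what each role can do come from the descriptions above.

```python
from enum import Flag, auto


class Permission(Flag):
    """Illustrative capabilities taken from the role descriptions (not a Purview type)."""
    PORTAL_ACCESS = auto()   # can open the Purview portal
    READ_CONTENT = auto()    # read all content except scan bindings
    EDIT_CATALOG = auto()    # edit assets, classification definitions, glossary terms
    MANAGE_SCANS = auto()    # manage all aspects of scanning


# Hypothetical mapping: role names are from the text, the flags are an assumption.
ROLE_PERMISSIONS = {
    "Purview Data Reader": Permission.PORTAL_ACCESS | Permission.READ_CONTENT,
    "Purview Data Curator": (
        Permission.PORTAL_ACCESS | Permission.READ_CONTENT | Permission.EDIT_CATALOG
    ),
    "Purview Data Source Administrator": Permission.MANAGE_SCANS,
}


def effective_permissions(roles):
    """Union of the permissions granted by every role a user holds."""
    combined = Permission(0)
    for role in roles:
        combined |= ROLE_PERMISSIONS[role]
    return combined


def can(roles, permission):
    """True if the combined roles include the given permission."""
    return permission in effective_permissions(roles)


# A Data Source Administrator alone cannot open the portal...
print(can(["Purview Data Source Administrator"], Permission.PORTAL_ACCESS))
# ...but can once the Data Reader role is added as well.
print(can(["Purview Data Source Administrator", "Purview Data Reader"],
          Permission.PORTAL_ACCESS))
```

The point of the sketch is the last two lines: someone who only manages scans needs a second role before they can see anything in the portal.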
Access to Purview is through Purview Studio. On the home page there are 2 menus. The quick access menu in the centre of the page takes you to
- Knowledge centre
- Register sources
- Browse assets
- Manage glossary
The left-hand menu has 5 icons
- Home – to see a summary of sources
- Sources – to add and manage sources
- Glossary – to add and manage glossary term collections
- Insights – the visual dashboards of the assets
- Management – a place to perform administrative tasks
There are 3 areas: Data Map, Data Catalog and Data Insights
Data Map
The data map shows your sources in collection groupings of your choice. Many data types are supported by default, with more being added all the time. Azure Purview is built on top of Apache Atlas, and with its API you can gain extra functionality.
Once a data source is added, with the related permissions granted on the asset, regular scans can be scheduled. Network isolation gives the ability to scan on-premises and Azure data sources behind a VNet using a self-hosted integration runtime (SHIR), ensuring end-to-end (E2E) network isolation.
Governing your scan has these steps
- Register your source
- Apply and set up your credentials
- Set up and run your scan
- Discover your SQL Server data
Glossary
This contains a set of terms the business uses. There might be multiple terms in the business that mean the same thing.
- synonyms - different terms with the same definition
- related - different names with similar definitions
There are some default attributes, which can be extended with custom attributes. The default attributes are
- Name
- Definition
- Data stewards
- Data experts
- Acronym
- Synonyms
- Related terms
- Resources
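As a rough picture of what a glossary term carries, here is a minimal Python sketch. It is not how Purview stores terms: the `GlossaryTerm` class, the `custom_attributes` field and the `link_synonyms` helper are hypothetical and only mirror the default attributes listed above, plus the idea that synonyms point at each other.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    # Default attributes from the list above
    name: str
    definition: str
    data_stewards: list = field(default_factory=list)
    data_experts: list = field(default_factory=list)
    acronym: str = ""
    synonyms: list = field(default_factory=list)
    related_terms: list = field(default_factory=list)
    resources: list = field(default_factory=list)
    # Anything the business adds on top of the defaults (assumed shape)
    custom_attributes: dict = field(default_factory=dict)


def link_synonyms(a, b):
    """Record two terms as synonyms of each other, without duplicates."""
    if b.name not in a.synonyms:
        a.synonyms.append(b.name)
    if a.name not in b.synonyms:
        b.synonyms.append(a.name)


# Two terms the business uses for the same thing
customer = GlossaryTerm("Customer", "A person or organisation that buys from us")
client = GlossaryTerm("Client", "A person or organisation that buys from us")
link_synonyms(customer, client)
customer.custom_attributes["Owning department"] = "Sales"  # illustrative custom attribute
print(customer.synonyms, client.synonyms)
```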
Classification and labelling
Labelling of data is important to aid with communication. Consistency is important. Classifications can describe
- A type of data that exists in a data asset or schema, to help identify the content of the data asset
- A data preparation process
- Information that helps with compliance
Azure Purview provides a set of default classification rules which are automatically detected during scanning. Purview uses the same classifications, also known as sensitive information types, as Microsoft 365. Azure Purview integrates with Microsoft Information Protection sensitivity labels. There is automated scanning and labelling for files in Azure Blob Storage and Azure Data Lake Storage Gen 1 and Gen 2, and automatic labelling for database columns in SQL Server, Azure SQL Database, Azure SQL Database Managed Instance, Azure Synapse and Azure Cosmos DB. The default classification rules are not editable, although it is possible to define your own custom classification rules using regex or custom expressions. Classification looks at a sample of the data (e.g. select top 100 from customers) for data profiling, custom pattern matching and expression matching, and applies the classification tags afterwards.
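The idea of sampling rows and matching them against custom regex rules can be sketched in a few lines of Python. This is only an illustration of the technique, not Purview's scanning engine: the rule names, the patterns, the `min_match_ratio` threshold of 0.6 and both function names are assumptions made up for the example.

```python
import re

# Hypothetical custom rules: classification name -> regex a value must fully match
CUSTOM_RULES = {
    "Custom.EmployeeId": re.compile(r"EMP-\d{6}"),
    "Custom.UKPostcode": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
}


def classify_column(sample_values, rules=CUSTOM_RULES, min_match_ratio=0.6):
    """Names of the rules matched by at least min_match_ratio of the non-empty values."""
    values = [v for v in sample_values if v]
    if not values:
        return []
    matched = []
    for name, pattern in rules.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= min_match_ratio:
            matched.append(name)
    return matched


def classify_table(sample_rows):
    """sample_rows is a list of dicts (e.g. the first 100 rows); returns column -> tags."""
    columns = sample_rows[0].keys() if sample_rows else []
    return {
        col: classify_column([str(row.get(col) or "") for row in sample_rows])
        for col in columns
    }


# A tiny stand-in for the sampled rows of a customers table
rows = [
    {"id": "EMP-000123", "postcode": "SW1A 1AA", "note": "hello"},
    {"id": "EMP-000456", "postcode": "M1 1AE", "note": ""},
    {"id": "EMP-999999", "postcode": None, "note": "x"},
]
print(classify_table(rows))
```

Requiring a share of the sampled values to match, rather than a single hit, is one way to avoid tagging a column because of one stray value. The threshold here is just a number picked for the sketch.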
Data Catalog
This enables you to search your data, use workflows in the business glossary and view data lineage for sources in the data ingestion pipeline. It connects with tools such as Power BI, Azure Data Factory and Azure Synapse. It enables curation and collaboration.
Data Insights
This is a key pillar in Purview to enable a single pane of glass in the catalog.
The insight reports that currently exist are
- Asset Insights
- Scan Insights
- Glossary Insights
- Classification Insights
- Sensitivity Labelling Insights
- File Extension Insights