It is such a privilege to have an article published in the Communications of the ACM journal.
July 2021, Vol. 64 No. 7, Page 7
DOI: 10.1145/3464919
It can be read here
Chaos, complexity, curiosity and database systems. A place where research meets industry
Initially published on the Coeo blog.
Data Governance is a core area that businesses need to adopt in the data-driven world. Data has been around since the earliest of times, from the first libraries in the ancient world that started to collect and store information.
The collection of scientific research information, from census information about human populations, weather and spatial data to DNA genetic data, has contributed to the need to store data for analysis. The breadth of information available for analysis covers our entire planet and beyond, its population and its different species. With our life and environment documented to the finest degree, the need for categorisation, data labelling and data management has become ingrained in our society. Where research led the way in documenting and classifying data, business is now at a crucial time of growth and expansion to enable innovation.
With all data comes a continual need for its management, and a core starting place is data governance. The DAMA Dictionary of Data Management defines Data Governance as “The exercise of authority, control and shared decision making (planning, monitoring and enforcement) over the management of data assets”.
The goal of data governance is to help an organisation manage data as an asset efficiently and effectively. It provides the principles, policy, processes, framework, metrics and oversight that are required to drive the most business value. Data governance programs aim to create sustainable data management, to measure and maintain good data quality, and to define policies and practices. A much-needed area to consider is culture, and embedding a culture of data management into the business.
We start with understanding what data assets a business has, from the core known data to dark data: data that is collected but not used. It is also key to document the proliferation of duplicate data around a business. With all the data breaches that keep occurring, compliance is often the first thing that comes to mind with data governance these days. The areas one thinks of here are:
These require data inventories and audits to understand what personal data your organisation collects, where it is stored, how it is protected and who may have access to it. This is part of the picture that needs to be considered.
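As a rough illustration of what one inventory record might capture, here is a minimal Python sketch. The class name, field names and sample entries are all hypothetical and chosen only to mirror the four questions above (what is collected, where it is stored, how it is protected, who has access); a real inventory would carry far more detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InventoryEntry:
    """One row of a hypothetical personal-data inventory."""
    dataset: str      # what personal data is collected
    location: str     # where it is stored
    protection: str   # how it is protected (empty if nothing is recorded)
    access: List[str] = field(default_factory=list)  # who may have access


def unprotected(entries):
    """Return the names of datasets with no recorded protection measure."""
    return [e.dataset for e in entries if not e.protection.strip()]


# Illustrative sample data, not taken from any real system.
inventory = [
    InventoryEntry("customer_emails", "crm_db", "encrypted at rest", ["sales"]),
    InventoryEntry("newsletter_signups", "shared_drive", "", ["marketing", "sales"]),
]

print(unprotected(inventory))
```

An audit could start from simple checks like this one, flagging entries where a question has no answer yet.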
DAMA-DMBOK is an international guiding framework for the management of data. The framework includes areas such as:
Consideration for the allocation of roles and responsibilities within an operating model helps guide the adoption of best practices.
In conclusion, managing data assets within a business requires data management to be embedded in the culture of the organisation. Having high quality data leads to better business decisions. A core oversight function provided by a Chief Data Officer helps keep the day-to-day running of data at the forefront of everyone’s minds, and you never know where the next innovation will come from.
How exciting to receive this. Thank you SQLBits for being the most amazing data conference. Looking forward to when face to face events return.
Starting with Azure Purview requires a few prerequisites. The checklist sets out 4 phases.
Identify where to start to establish a Governance baseline foundation for your organization for general cloud governance.
Azure Purview Readiness Checklist
Purview-Deployment-Checklist will help prepare for data governance and data democratization in your environment.
There are four sections
The Azure Purview deployment prerequisites can help with those proof of concept deployments
Azure Purview automated readiness checklist
A set of scripts has been written to help evaluate your existing environment for missing configuration that might prevent data sources being scanned. The PowerShell scripts are
Data Catalogs are becoming an essential component in the new data world. They are an inventory of an organisation's data assets. The metadata collected helps you find the most appropriate data at speed and know what data is held, its security levels and its lineage. Microsoft has a tool in preview that helps with data governance called Azure Purview.
Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multicloud and software-as-a-service (SaaS) data. Easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification and end-to-end data lineage. Empower data consumers to find valuable, trustworthy data.
Once installed, it moves into the hands of data governors. There are predefined data plane roles dictating who can access what.
Purview Data Reader - has access to the Purview portal and can read all content except for scan bindings.
Purview Data Curator - has access to the Purview portal and can read all content except for scan bindings, can edit information about assets, classification definitions and glossary terms, and can apply classifications and glossary terms to assets.
Purview Data Source Administrator - can manage all aspects of scanning data into Azure Purview but does not have read or write access to content beyond that related to scanning. The role does not have access to the Purview portal (the user also needs to be in the Data Reader or Data Curator role).
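To make the interaction between the roles concrete, here is a small Python sketch of the descriptions above. The permission labels (such as portal_access and manage_scans) are hypothetical names invented for this example and are not Azure Purview identifiers; the sketch only models the rules as described in the text.

```python
# Hypothetical model of the three data plane roles described above.
# Permission labels are illustrative, not real Azure Purview identifiers.
ROLE_PERMISSIONS = {
    "Purview Data Reader": {"portal_access", "read_content"},
    "Purview Data Curator": {
        "portal_access",
        "read_content",
        "edit_assets",
        "apply_classifications",
    },
    "Purview Data Source Administrator": {"manage_scans"},
}


def effective_permissions(roles):
    """Union of the permissions granted by every role a user holds."""
    perms = set()
    for role in roles:
        perms |= ROLE_PERMISSIONS[role]
    return perms


def can_use_portal(roles):
    """A user reaches the portal only if some role grants portal access."""
    return "portal_access" in effective_permissions(roles)


scanner_only = ["Purview Data Source Administrator"]
scanner_and_reader = scanner_only + ["Purview Data Reader"]

print(can_use_portal(scanner_only))
print(can_use_portal(scanner_and_reader))
```

The second user shows why a Data Source Administrator is usually paired with the Data Reader or Data Curator role.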
Access to Purview is through Purview Studio. On the home page there are 2 menus. The quick access menus in the centre of the page take you to
The left-hand menu has 5 icons
Home - to see a summary of sources
Sources - to add and manage sources
Glossary - to add and manage glossary term collections
Insights - the visual dashboards of the assets
Management - a place to perform administrative tasks
There are 3 areas: Data Map, Data Catalog and Data Insights
Data Map
The data map shows your sources in collection groupings of your choice. Many data types are added by default, with more coming all the time. Azure Purview is built on top of Apache Atlas and with its API you can gain extra functionality.
Once a data source is added, with the related permissions granted on the asset, regular scans can be scheduled. There is also the ability to scan on-premises and Azure data sources behind a virtual network (VNet) using a self-hosted integration runtime (SHIR), ensuring end-to-end network isolation.
Governing your scan has these steps
Glossary
This contains a set of terms the business uses. There might be multiple terms in the business that mean the same thing.
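One way to picture terms that mean the same thing is a lookup from each synonym to a single agreed term. The Python sketch below is illustrative only: the terms and the SYNONYMS mapping are hypothetical examples, and this is not how the Purview glossary is implemented.

```python
# Hypothetical synonym table: each business term maps to one agreed term.
SYNONYMS = {
    "client": "customer",
    "account holder": "customer",
    "turnover": "revenue",
}


def canonical_term(term):
    """Normalise case and spacing, then resolve a synonym to its agreed term."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


print(canonical_term("Client"))
print(canonical_term("  Account   Holder "))
print(canonical_term("Revenue"))
```

Terms without an entry in the table are returned in their normalised form, so they can be reviewed and added later.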
There are some default attributes that exist and these can be enhanced by custom attributes. The default attributes are
Classification and labelling
Labelling of data is important to aid with communication. Consistency is important. Classifications can describe
Azure Purview provides a set of default classification rules which are automatically detected during scanning. Purview uses the same classifications, also known as sensitive information types, as Microsoft 365. Azure Purview integrates with Microsoft Information Protection sensitivity labels. There is automated scanning and labelling for files in Azure Blob storage and Azure Data Lake Storage Gen 1 and Gen 2, and automatic labelling for database columns in SQL Server, Azure SQL Database, Azure SQL Database Managed Instance, Azure Synapse and Azure Cosmos DB. The default classification rules are not editable, although it is possible to define your own custom classification rules using Regex or custom expressions. Classification looks at the data, e.g. select top 100 * from customers, for data profiling, customer pattern matching and expression matching, and applies the classification tags afterwards.
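The idea of sampling some values and applying a Regex rule can be sketched in Python as follows. This is a minimal illustration and not Purview's implementation: the rule names, the patterns and the min_match_ratio threshold are all assumptions made for the example.

```python
import re

# Hypothetical custom rules: names and patterns are illustrative only.
CUSTOM_RULES = {
    "CUSTOMER_ID": re.compile(r"CUST-\d{6}"),
    "EMAIL_LIKE": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def classify_column(sample_values, rules=CUSTOM_RULES, min_match_ratio=0.6):
    """Tag a column when enough sampled values fully match a rule's pattern.

    min_match_ratio is an assumed threshold for this sketch.
    """
    values = [v for v in sample_values if v]  # ignore empty values
    if not values:
        return []
    tags = []
    for name, pattern in rules.items():
        matches = sum(1 for v in values if pattern.fullmatch(v))
        if matches / len(values) >= min_match_ratio:
            tags.append(name)
    return tags


# Sampled values standing in for the first rows of a customers table.
id_sample = ["CUST-000123", "CUST-004567", "n/a", "CUST-999999"]
email_sample = ["ann@example.com", "bob@example.org", ""]

print(classify_column(id_sample))
print(classify_column(email_sample))
```

Using a ratio rather than requiring every value to match lets a column still be tagged when a few sampled values are placeholders such as "n/a".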
Data Catalog
This enables you to search your data, use workflows in the business glossary and view data lineage for sources in the data ingestion pipeline. It connects with tools such as Power BI, Azure Data Factory and Azure Synapse. It enables curation and collaboration.
Data Insights
This is a key pillar in Purview to enable a single pane of glass in the catalog.
The insight reports that currently exist are
The answer lies in insights from data analytics, whether diagnostic, predictive or prescriptive analytics.
The Gartner model looks at 4 areas to gain competitive advantage.
While I was watching a presentation at Microsoft Build, this slide was shared, adding an additional dimension of cognitive analytics. It is interesting to see the deeper insight shown in a diagram with AI as the driving force through the advanced analytics areas.
So excited, such amazing news to receive my 4th MVP award. Thank you #Microsoft. Just so humbled to receive this at this time. There is no better time to be a part of such an amazing community #SQLfamily #DataToboggan #AzureSynapse #MVPBuzz
So many exciting things going on that I'm involved in with the community after my PhD: Data Toboggan (its 3 conferences and user group), Data Relay, SQLBits, SQL Saturday and data research #data #ai #bigdata #analytics #datascience #dataanalytics #research #innovation #datastrategy #datagovernance #phd #artificialintelligence