Monday, 29 April 2024

Open Lakes



This is an insightful article entitled Open Lakes, Not Walled Gardens by Raghu Ramakrishnan and Josh Caplan.  

The Fabric design principles consider the following:

Open Ecosystem

Ensuring there are no proprietary barriers to data in OneLake, allowing integration with other services.

Security and Governance 

Data in OneLake must be secure and governed, integrating with Microsoft Purview for global policies.

Creating accessible data with no silos

Making the entire data estate easily accessible in OneLake without unnecessary data duplication.

SaaS Simplicity

Providing a suite of analytic engines in a secure, governed environment with single sign-on.

The article discusses the concept of open lakes for analytics, emphasizing the need for a unified view of data across an enterprise’s data estate to draw true insights. It points to advancements in big data tools, cloud storage, machine learning, and AI models, which offer opportunities to analyze core assets and processes through data in the Golden Age of Analytics.

The Microsoft implementation of the open lake vision, with OneLake and Fabric, focuses on data storage, analytics, sharing, and governance. The article outlines the importance of securing and governing enterprise data, and details how OneLake and Fabric address these needs with built-in features and integration with Microsoft Purview for governance across the whole data estate.

Governance is a core tenet, covering the organization, the wider data estate, and policy enforcement and data sharing. Within Fabric and OneLake it spans organizational governance and estate-level governance, where Microsoft Purview provides a global view of the entire data estate: a central catalog for all assets across all sources, global policies to secure sensitive data, and support for managing critical data risks and regulatory compliance. Policy enforcement and data sharing are also discussed.

Thursday, 25 April 2024

Data Governance, Data Ethics and Responsible AI video series

I wanted to share some thoughts on three of my favourite topics: Data Governance, Data Ethics and Responsible AI. There are many tools that help frame the subject area from a data management perspective, and there are useful Microsoft tools to help you down the Responsible AI and governance route. There is a wealth of information available, and I wanted each video, in under 5 minutes, to give people useful tips to move forward quickly in this important space. The result is an easily digestible, time-efficient series in which each episode stands alone within an overall theme:
  • Data Governance to help govern and manage that data to improve trust and data quality 
  • Data Ethics to help mitigate issues with data integrity and provenance
  • Responsible AI to look at bias, fairness and efficacy in decisions

Episode 1 Introduction

Episode 2 What is Data Governance

Episode 3 What is Data Ethics

Episode 4 What is Responsible AI

Episode 5 Responsible AI Tools Microsoft Standard v2

Episode 6 Responsible AI Tools Impact Assessment and guide

Episode 7 Responsible AI Tools HAX Toolkit

Episode 8 Responsible AI Tools Maturity Model

Episode 9 The EU Act

Episode 10 UK Government Assurance

Episode 11 Content Safety

Episode 12 Responsible AI Dashboard

Watch this space as the next set of videos will cover how this fits in with data quality and how Microsoft Purview can help with data preparation.

The Age of Data Governance

Microsoft Purview is changing rapidly in the data governance space. It offers data value creation (offense) alongside essential defense and response. This new addition helps businesses address the issue that AI outputs are only as good as the quality of the data behind them.

Peter Aiken's new definition of data governance is 'Managing data decisions with guidance'.


Suma Manohar has written a great article about data quality in the era of AI. Microsoft Purview has introduced domains and data products, adding clear business context and terminology mapping. An enhanced search capability using Copilot is available to provide more understanding. It can also help by suggesting data quality rules; these autogenerated rules are context specific.

Creating data quality rules manually in Purview should follow the standard data quality rule types below.

  • Freshness – confirms that all values are up to date.
  • Duplicate rows – checks rows to find repeated values across two or more columns.
  • Empty/blank fields – looks for blank and empty fields in a column where there should be values.
  • Unique values – confirms that values in a column are unique.
  • Data type match – confirms that values in a column match data type requirements.
  • String format match – confirms that text values in a column match a specific format or other requirements.
  • Table lookup – confirms that a value in one table can be found in a specific column of another table
  • Custom – create a custom rule with the visual expression builder.
  • Regular expressions can be used for pattern matching in the above.
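
To make a few of these rule types concrete, here is a minimal sketch in plain Python. It is not the Purview API: the function names, the sample rows and the email pattern are all made up for illustration, and it only mirrors the intent of the empty/blank fields, unique values, string format match and table lookup rules.

```python
import re
from collections import Counter

# Hypothetical sample rows; the column names are illustrative only.
customers = [
    {"id": "C001", "email": "ana@example.com", "country": "GB"},
    {"id": "C002", "email": "", "country": "FR"},
    {"id": "C002", "email": "bob@example", "country": "ZZ"},
]
countries = [{"code": "GB"}, {"code": "FR"}]


def blank_rows(rows, column):
    """Empty/blank fields: indexes of rows with no value in the column."""
    return [i for i, r in enumerate(rows) if not str(r.get(column) or "").strip()]


def non_unique_values(rows, column):
    """Unique values: values that occur more than once in the column."""
    counts = Counter(r[column] for r in rows)
    return sorted(v for v, n in counts.items() if n > 1)


def format_mismatches(rows, column, pattern):
    """String format match: indexes of non-blank values that fail the regex."""
    regex = re.compile(pattern)
    return [i for i, r in enumerate(rows)
            if r[column] and not regex.fullmatch(r[column])]


def lookup_misses(rows, column, ref_rows, ref_column):
    """Table lookup: indexes of rows whose value is absent from the reference column."""
    allowed = {r[ref_column] for r in ref_rows}
    return [i for i, r in enumerate(rows) if r[column] not in allowed]


# A deliberately simple email pattern, chosen only for this example.
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"

print(blank_rows(customers, "email"))
print(non_unique_values(customers, "id"))
print(format_mismatches(customers, "email", EMAIL_PATTERN))
print(lookup_misses(customers, "country", countries, "code"))
```

Each function returns the offending rows or values rather than a pass/fail flag, which makes it easier to see why a rule failed.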

When working on data quality there are standard guidelines that can help. The method I use draws first on the DAMA-DMBOK and then on the Data Management Capability Assessment Model (DCAM).

Scans produce the quality scores and trends shown in the data quality dashboard, and the scores are also shown on the data product page.

The rollout of the new solution across the regions is shared here.

Tuesday, 9 April 2024

Fabric Mirroring Overview

Mirroring in Fabric, a feature announced last year and developed since, became Public Preview in March 2024. It enables you to bring your databases into Fabric.


Fabric mirroring is a feature within Microsoft Fabric that allows seamless, near real-time data replication from various databases into OneLake, Fabric's centralized data lake. This process is designed to be frictionless, eliminating the need for complex Extract, Transform, Load (ETL) pipelines, which are traditionally used to move and transform data from one system to another.

The primary advantage of Fabric mirroring is its ability to provide near real-time insights by continuously updating the data in OneLake as changes occur in the source databases. It uses Change Data Capture (CDC) technology to capture data changes and replicate them to OneLake, keeping the data current and synchronized.
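
The general idea behind change data capture can be sketched in a few lines of Python. This is only a conceptual illustration, not how Fabric implements mirroring: the event shape (`op`, `key`, `row`) and the `apply_changes` helper are assumptions made up for the example.

```python
def apply_changes(replica, changes):
    """Apply ordered change events to a replica keyed by primary key."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Inserts and updates both carry the full new row image.
            replica[key] = change["row"]
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return replica


# Hypothetical initial snapshot of a source table.
replica = {1: {"name": "Ana", "city": "Leeds"}}

# Hypothetical change feed captured from the source, in commit order.
changes = [
    {"op": "insert", "key": 2, "row": {"name": "Bob", "city": "York"}},
    {"op": "update", "key": 1, "row": {"name": "Ana", "city": "Bath"}},
    {"op": "delete", "key": 2},
]

apply_changes(replica, changes)
print(replica)
```

Only the changes travel to the replica, not the whole table, which is why this approach can keep a copy close to current without a full reload.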

By mirroring data into OneLake, organizations can break down data silos and unify their data estate, allowing for more efficient data governance and analysis. The mirrored data can then be used directly for a range of analytical tasks.

Fabric mirroring simplifies the data access process by allowing databases to be securely accessed and managed within Fabric without the need to switch database clients or install additional software. It is possible for a mirrored database to be cross-joined with other databases, warehouses or lakehouses, whether the data comes from Azure Cosmos DB, Azure SQL DB, Snowflake, or elsewhere.

In summary, Fabric mirroring is a transformative feature that streamlines data replication and analysis, providing businesses with a modern, fast, and safe way to access and ingest data, thereby accelerating the journey to valuable insights and informed decision-making.

Further Reading

https://blog.fabric.microsoft.com/en-US/blog/announcing-the-public-preview-of-database-mirroring-in-microsoft-fabric/

https://learn.microsoft.com/en-us/fabric/database/mirrored-database/overview

https://aka.ms/FabricRoadmap

https://aka.ms/MirrorSQLDBPublicPreviewBlog

https://devblogs.microsoft.com/cosmosdb/public-preview-mirroring-azure-cosmos-db-in-microsoft-fabric

Unify your data across domains, clouds, and engines in OneLake

Wednesday, 3 April 2024

Microsoft Purview Fabric announcements

There were a number of announcements at the Microsoft Fabric Community Conference, including the new Microsoft Purview for modern data governance. With businesses moving towards federated governance models, managed by lines of business to provide more local understanding as data volumes increase, Microsoft has launched in Purview the capability for organizations to create subdomains to refine the way the data estate is structured in Fabric. Security has also become easier with the ability to set security groups for default domains.

Microsoft Fabric is now natively integrated with the Microsoft Purview Data Governance solution. There is a reimagined experience for data estate governance, which includes data curation and an important new feature: data quality with insights. The new experience is available in preview from 8 April 2024. It aims to accelerate measurable business value with key results, to simplify governance, and to improve efficiency with natural language recommendations.

Purview enables business terminology linkage to:

  • Data Products (a collection of data assets used for a business function) 
  • Business Domains (ownership of Data Products) 
  • Data Quality (assessment of quality) 
  • Data Access, Actions 
  • Data Estate Health (reports and insights)

A really exciting new feature we have all been waiting for is the data quality capabilities. There is now a data quality model to set rules top down across business domains, data products, and data assets. The model generates data quality scores at the asset, data product, or business domain level from the policies on terms or rules. The scores show on the dashboard as red/yellow/green indicators. The two capabilities in this data quality model are:

  • Profiling – quick sample set insights
  • Data quality scans – in-depth scans of full data sets
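
As a rough illustration of how scores could roll up from rules to assets to a data product and then map to a red/yellow/green indicator, here is a small Python sketch. The unweighted averaging and the thresholds are assumptions made for this example; the post does not state the formula or cut-offs Purview actually uses, and the asset names and rule results are invented.

```python
from statistics import mean

# Hypothetical thresholds; the real dashboard cut-offs are not given here.
GREEN_AT = 80.0
YELLOW_AT = 50.0


def rule_score(passed, total):
    """Percentage of rows that passed one rule."""
    return 100.0 * passed / total if total else 0.0


def roll_up(scores):
    """Unweighted average of child scores (an assumed aggregation)."""
    return mean(scores) if scores else 0.0


def indicator(score):
    """Map a 0-100 score to a traffic-light label."""
    if score >= GREEN_AT:
        return "green"
    if score >= YELLOW_AT:
        return "yellow"
    return "red"


# Hypothetical rule results per asset: (rows passed, rows checked).
data_product = {
    "customers": [(95, 100), (85, 100)],
    "orders": [(40, 100), (60, 100)],
}

asset_scores = {
    asset: roll_up([rule_score(p, t) for p, t in results])
    for asset, results in data_product.items()
}
product_score = roll_up(list(asset_scores.values()))

for asset, score in asset_scores.items():
    print(asset, score, indicator(score))
print("data product", product_score, indicator(product_score))
```

The same roll-up step could be repeated once more to get from data products to a business domain score.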

It is great to see that Microsoft Purview continues to align to the EDM Council set of 14 rules.

There is now an actions centre showing current governance health and summarising actions by role, data product or business domain. This actions centre aims to help improve the governance posture of the business.

There is a partnership with Ernst & Young LLP, who will share playbooks and reports for US financial services customers on Azure Marketplace throughout the preview.


In summary there is a shift away from traditional IT-centric data architecture to federated architectures such as data mesh. The automated way to deal with Data Quality is a game changer for business. 

References

Announcements from the Microsoft Fabric Community Conference

Easily implement data mesh architecture with domains in Fabric

Introducing modern data governance for the era of AI 

The foundation for responsible analytics with Microsoft Purview

Watch: The Unified Data Platform for the Era Of AI | Microsoft Fabric Community Conference Day 1 Keynote

Crash Course in Microsoft Purview (azureedge.net)

Learning

Monday, 1 April 2024

Responsible AI dashboard training

There is a new MS Learn course on how to debug an AI model using the Responsible AI dashboard in Azure Machine Learning studio, to ensure the model performs responsibly and is less harmful. It is important to learn how to use the dashboard to set projects up for success.

Train a model and debug it with Responsible AI dashboard

The objectives are 

  • Create a responsible AI dashboard.
  • Identify where the model has errors.
  • Discover data over or under representation to mitigate biases.
  • Understand what drives a model outcome with explainability and interpretability.
  • Mitigate issues to meet compliance regulation requirements.
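
To give a feel for two of these objectives, finding where a model makes errors and spotting over or under representation, here is a tiny plain-Python sketch. It does not use the Responsible AI dashboard or any Azure Machine Learning API; the groups, labels and the `cohort_summary` helper are invented for illustration.

```python
from collections import defaultdict

# Hypothetical predictions: (group, true label, predicted label).
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("A", 1, 1), ("A", 0, 0), ("B", 1, 0), ("B", 0, 0),
]


def cohort_summary(rows):
    """Return {group: (share of data, error rate)} for each group."""
    counts = defaultdict(int)
    errors = defaultdict(int)
    for group, actual, predicted in rows:
        counts[group] += 1
        errors[group] += int(actual != predicted)
    total = len(rows)
    return {g: (counts[g] / total, errors[g] / counts[g]) for g in counts}


summary = cohort_summary(records)
for group, (share, error_rate) in summary.items():
    print(f"{group}: share={share:.2f} error_rate={error_rate:.2f}")
```

In this made-up data, group B is a quarter of the rows but has a much higher error rate than group A, which is the kind of pattern the dashboard's error analysis and data analysis views are intended to surface.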

You do need the ability to understand beginner level Python.