Welcome

Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous, the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk. Microsoft Data Platform MVP

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein



Wednesday 22 May 2024

Microsoft Build Fabric: What's new and what's next

The Microsoft Fabric announcements were covered by Amir Netz, Arun Ulagaratchagam, Flavien Daussy and Adam Penhaul. The session is recorded and can be seen here: Microsoft Fabric: What's new and what's next.

I live blogged this great main data session at Microsoft Build.

AI is changing the world. The AI revolution is based on data: data is the fuel that powers AI. This is hard because of the amount of innovation, diversity and complexity involved.




Purpose-built workloads. AI is built into Fabric. Governance is particularly important; it is built in and driven through Microsoft Purview.

Aka.ms/try-fabric



There are weekly Fabric releases, with 60-80 pages of blogs. The roadmap for these features can be found at

Aka.ms/FabricRoadmap


What is the point of having data in the lake if no one is using it? It is about immediate business access to the data.

A SaaS product that looks like Office.  No knobs to optimise Fabric. Results in hours.

  • Starts with built-in CI/CD
  • Creating deployment pipelines
  • And Taskflows (public Preview) to provide help to create things like the medallion architecture.


In Fabric you can now bring in partner workloads such as MDM and ESRI. The Microsoft Fabric Workload Development Kit was announced as Public Preview.


All your data, all your teams in one place. You can publish to the workload hub for a native Fabric workload experience. Aka.ms/FabDevKit

There are multiple methods to get data into Fabric for multi-clouds. Shortcuts to On-Premises Sources for OneLake was announced as Public Preview.


Not everything is stored in open formats (databases, for example), so Mirroring helps with this. There is free Mirroring storage for replicas.


Delta is not the only open format. Iceberg is another major storage format. Transparent, simultaneous support for the Delta Lake and Iceberg formats has just been announced. It is now also possible to connect to Salesforce without moving the data. There is also an expanded partnership with Snowflake and Adobe.

There is also a unified API: the public preview of the developer-friendly API for GraphQL gives access to all data in OneLake. (GraphQL uses JSON structures.)
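As a rough illustration of what "GraphQL uses JSON structures" means in practice, the sketch below builds the JSON body that a GraphQL request over HTTP conventionally carries (a `query` string plus `variables`). It does not call Fabric; the `customers` type, its fields and the `TopCustomers` operation name are hypothetical and would depend on the schema exposed for your own data.

```python
import json

# Hypothetical query: "customers", "items", "id" and "name" are made-up
# schema names used only to show the shape of a GraphQL request.
QUERY = """
query TopCustomers($first: Int!) {
  customers(first: $first) {
    items { id name }
  }
}
"""


def build_graphql_request(query, variables=None):
    """Serialise a GraphQL query and its variables into a JSON request body."""
    return json.dumps({"query": query, "variables": variables or {}})


body = build_graphql_request(QUERY, {"first": 5})
print(body)
```

The resulting string is what a client would POST to a GraphQL endpoint; the response also comes back as JSON, typically under a top-level `data` key.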

A unified data culture requires real-time data. Microsoft announced Real-Time Intelligence. It uses the Real-Time hub, powered by AI, for data in motion. (The OneLake data hub is for data at rest.)


So Real-Time Intelligence in the real world.

Copilot is integrated in every Microsoft Fabric experience. Copilot in Fabric is now Generally Available. This means AI-driven insights out of the box, along with custom generative AI for your data.


The Public Preview of AI Skills in Fabric was announced. It allows you to build your own generative AI in Fabric.

Simple to get started

  • Create AI Skill
  • Add data – ground in data
  • Select tables to ground the data

Query in natural language

In conclusion, come and join the Microsoft Fabric team in Stockholm, Sweden, 24-27 September 2024.

Aka.ms/FabCon-Europe



Tuesday 7 May 2024

Responsible AI Transparency Report

Microsoft have shared how they work with AI responsibly in this paper, Responsible AI Transparency Report: How we build, support our customers, and grow. The report outlines Microsoft’s approach to building generative AI applications responsibly, adhering to six core values of transparency, accountability, fairness, inclusiveness, reliability and safety, and privacy and security. The framework is based around the govern, map, measure and manage cycle.

Govern 

Establishes the context for AI risk management, including adherence to policies and pre-deployment reviews.

  • Policies and principles
  • Procedures for pre-trained models
  • Stakeholder coordination
  • Documentation
  • Pre-deployment reviews

Map 

Involves identifying and prioritizing AI risks and conducting impact assessments to inform decisions.

  • Responsible AI Impact Assessments
  • Privacy and security review
  • Red teaming

Measure

Implements procedures to assess AI risks and the effectiveness of mitigations through established metrics.

  • Metrics for identified risks
  • Mitigations performance testing

Manage

Focuses on mitigating identified risks at both the platform and application levels, with ongoing monitoring and user feedback.

  • User agency
  • Transparency
  • Human review and oversight
  • Managing content risks
  • Ongoing monitoring
  • Defense in depth

These are all depicted in the diagram in the paper, which is a very informative read.



References

Responsible AI Transparency Report How we build, support our customers, and grow

https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RW1l5BO

Thursday 2 May 2024

Responsible AI – A Data Governance Approach

I am speaking at the Bath Azure User Group meeting about Responsible AI - a Data Governance approach. I see Responsible AI as a subset of Data Governance. This session covers where we are with legislation and tools, why good data quality is a must for AI and how to get started.

Data Governance and Responsible AI, and the embellishment of AI within Microsoft Purview, aid and prepare businesses for using AI. Moving forward, I believe that combining Data Governance and Responsible AI into one actionable framework will bring immediate rewards to every business use case.

Hope you can join us on 22 May 2024, 18-20, in Bath.

https://lnkd.in/eRT8RijE 



Monday 29 April 2024

Open Lakes



This is an insightful article entitled Open Lakes, Not Walled Gardens by Raghu Ramakrishnan and Josh Caplan.  

The Fabric design principles consider the following:

Open Ecosystem

Ensuring there are no proprietary barriers to data in OneLake, allowing integration with other services.

Security and Governance 

Data in OneLake must be secure and governed, integrating with Microsoft Purview for global policies.

Creating accessible data with no Silos 

Making the entire data estate easily accessible in OneLake without unnecessary data duplication.

SaaS Simplicity

Providing a suite of analytic engines in a secure, governed environment with single sign-on.

The article discusses the concept of open lakes for analytics, emphasizing the need for a unified view of data across an enterprise’s data estate to draw true insights. It highlights the advancements in big data tools, cloud storage, machine learning, and AI models, which offer opportunities to analyze core assets and processes through data in the Golden Age of Analytics.

The Microsoft implementation of the open lake vision with OneLake and Fabric focuses on data storage, analytics, sharing, and governance integrated with Microsoft Purview for data estate-wide governance. It outlines the importance of securing and governing enterprise data, detailing how OneLake and Fabric address these needs with built-in features and integration with Microsoft Purview for global data estate governance.

Governance for the organization, at estate level, and through policy enforcement and sharing of data is a core tenet. Governance within Fabric and OneLake covers organizational governance and estate-level governance, where Microsoft Purview provides a global view of the entire data estate, offering a central catalog for all assets across all sources, global policies to secure sensitive data, and support for managing critical data risks and regulatory compliance. Policy enforcement and data sharing are also discussed.

Thursday 25 April 2024

Data Governance, Data Ethics and Responsible AI video series

I wanted to share some thoughts on 3 of my favourite topics: Data Governance, Data Ethics and Responsible AI. There are many tools that help frame the subject area from a data management perspective, and there are useful Microsoft tools to help you down the Responsible AI and governance route. There is a wealth of information available, and I wanted each video, in under 5 minutes, to give people useful tips to move forward quickly in this important space. So it is an easily digestible series that is time efficient and has standalone content with an overall theme.
  • Data Governance to help govern and manage that data to improve trust and data quality 
  • Data Ethics to help mitigate issues with data integrity and provenance
  • Responsible AI to look at bias, fairness and efficacy in decisions

Episode 1 Introduction

Episode 2 what is data governance

Episode 3 what is data ethics

Episode 4 What is Responsible AI

Episode 5 Responsible AI Tools Microsoft Standard v2

Episode 6 Responsible AI Tools Impact Assessment and guide

Episode 7 Responsible AI Tools HAX Toolkit

Episode 8 Responsible AI Tools Maturity Model

Episode 9 The EU Act

Episode 10 UK Government Assurance

Episode 11 Content Safety

Episode 12 Responsible AI Dashboard

Watch this space as the next set of videos will cover how this fits in with data quality and how Microsoft Purview can help with data preparation.

The Age of Data Governance

Microsoft Purview is rapidly changing in the data governance space. It is offering data value creation with essential defense and response offense. This new addition helps businesses address the issue that AI outputs are only as good as the quality of the data behind them.

Peter Aiken's new definition of data governance is 'Managing data decisions with guidance'.


Suma Manohar has written a great article about data quality in the era of AI. Microsoft Purview has introduced domains and data products, adding clear business context and terminology mapping. An enhanced search capability using Copilot is available to provide more understanding. It can also help by suggesting data quality rules. These autogenerated rules are context specific.

Creating data quality rules manually in Purview should follow the 6 standard data quality metrics.

  • Freshness – confirms that all values are up to date.
  • Duplicate rows – checks rows to find repeated values across two or more columns.
  • Empty/blank fields – looks for blank and empty fields in a column where there should be values.
  • Unique values – confirms that values in a column are unique.
  • Data type match – confirms that values in a column match data type requirements.
  • String format match – confirms that text values in a column match a specific format or other requirements.
  • Table lookup – confirms that a value in one table can be found in a specific column of another table
  • Custom – create a custom rule with the visual expression builder.
  • Regular expressions can be used for pattern matching in the above.
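To make a few of these rule types concrete, here is a minimal sketch in plain Python of what empty/blank, unique value, string format and duplicate row checks do conceptually. It is not how Purview implements or exposes its rules; the sample rows, column names and the deliberately simple email pattern are all made up for illustration.

```python
import re
from collections import Counter

# Made-up sample data; in practice these would be rows from a scanned table.
rows = [
    {"id": 1, "email": "ann@example.com", "city": "Bath"},
    {"id": 2, "email": "", "city": "Bath"},
    {"id": 2, "email": "bob@example", "city": "Leeds"},
]


def empty_or_blank(rows, column):
    """Indexes of rows where the column is missing, None or only whitespace."""
    return [i for i, r in enumerate(rows)
            if r.get(column) is None or str(r[column]).strip() == ""]


def non_unique_values(rows, column):
    """Values that appear more than once in the column."""
    counts = Counter(r.get(column) for r in rows)
    return [value for value, n in counts.items() if n > 1]


def format_mismatches(rows, column, pattern):
    """Indexes of rows whose value does not fully match the regular expression."""
    regex = re.compile(pattern)
    return [i for i, r in enumerate(rows)
            if not regex.fullmatch(str(r.get(column) or ""))]


def duplicate_rows(rows, columns):
    """Indexes of rows that share the same values across the given columns."""
    keys = [tuple(r.get(c) for c in columns) for r in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] > 1]


# Simplified email shape: something@something.something
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"

print(empty_or_blank(rows, "email"))
print(non_unique_values(rows, "id"))
print(format_mismatches(rows, "email", EMAIL_PATTERN))
print(duplicate_rows(rows, ["id", "city"]))
```

Each function returns the offending rows or values rather than a pass/fail flag, which makes it easy to turn the result into a score such as the share of rows that pass.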

When working on data quality there are standard guidelines that can help. I start with the DAMA-DMBOK and then use the Data Management Capability Assessment Model (DCAM).

Scans take place to show the quality score and trends in the data quality dashboard, and scores are shown on the data product page.

The rollout of the new solution across the regions is shared here.

Tuesday 9 April 2024

Fabric Mirroring Overview

Mirroring in Fabric is a feature that was announced last year and has been developing since; it became Public Preview in March 2024. It enables you to bring your databases into Fabric.


Fabric mirroring is a feature within Microsoft Fabric that allows for seamless and real-time data replication from various databases into a centralized analytics platform known as OneLake. This process is designed to be frictionless, eliminating the need for complex Extract, Transform, Load (ETL) pipelines, which are traditionally used to move and transform data from one system to another.

The primary advantage of fabric mirroring is its ability to provide near real-time insights by continuously updating the data in OneLake as changes occur in the source databases. It uses Change Data Capture (CDC) technology to capture and replicate data changes to OneLake, so that the data is always current and synchronized.
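The idea behind Change Data Capture can be sketched in a few lines: the source emits an ordered stream of insert, update and delete events, and the replica is kept current by applying them in order. This is only a conceptual toy, not how Fabric mirroring is implemented; the event shape (`op`, `key`, `row`) and the sample data are assumptions made for the example.

```python
def apply_changes(replica, changes):
    """Apply an ordered list of change events to a replica keyed by primary key."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Upsert: the latest version of the row wins.
            replica[key] = change["row"]
        elif op == "delete":
            # Ignore deletes for keys the replica never saw.
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return replica


# Hypothetical change feed captured from a source table.
changes = [
    {"op": "insert", "key": 1, "row": {"name": "Ann", "city": "Bath"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "city": "Leeds"}},
    {"op": "update", "key": 1, "row": {"name": "Ann", "city": "Bristol"}},
    {"op": "delete", "key": 2},
]

replica = apply_changes({}, changes)
print(replica)
```

Because only the changes travel, the replica stays close to the source without repeatedly copying whole tables, which is what removes the need for a scheduled ETL pipeline.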

By mirroring data into OneLake, organizations can break down data silos and unify their data estate, allowing for more efficient data governance and analysis. Mirrored data can then be used with ease for a range of analytical tasks.

Fabric mirroring simplifies the data access process by allowing databases to be securely accessed and managed within Fabric without the need to switch database clients or install additional software. It is possible for a mirrored database to be cross joined with other databases, warehouses or lakehouses whether that be data in Azure Cosmos DB, Azure SQL DB, Snowflake, etc.   

In summary, fabric mirroring is a transformative feature that streamlines data replication and analysis, providing businesses with a modern, fast, and safe way to access and ingest data, thereby accelerating the journey to valuable insights and informed decision-making.

Further Reading

https://blog.fabric.microsoft.com/en-US/blog/announcing-the-public-preview-of-database-mirroring-in-microsoft-fabric/

https://learn.microsoft.com/en-us/fabric/database/mirrored-database/overview

https://aka.ms/FabricRoadmap

https://aka.ms/MirrorSQLDBPublicPreviewBlog

https://devblogs.microsoft.com/cosmosdb/public-preview-mirroring-azure-cosmos-db-in-microsoft-fabric

Unify your data across domains, clouds, and engines in OneLake