Welcome

Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous, the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk . Microsoft Data Platform MVP

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein



Monday, 26 July 2021

Responsible Innovation: A Best Practices Toolkit

Responsible innovation is a toolkit that helps developers become good stewards for the future of science and its effect on society.  

There are 3 areas
  • Judgment Call
  • Harms Modelling
  • Community Jury

This toolkit provides a set of practices, currently in development, for anticipating and addressing the potential negative impacts of technology on people. This is an early release.

Judgment Call 

Judgment Call is an award-winning game and team-based activity that puts Microsoft’s AI principles of fairness, privacy and security, reliability and safety, transparency, inclusion, and accountability into action. The game cultivates stakeholder empathy through scenario-imagining. Game participants write product reviews from the perspective of a particular stakeholder, describing what kind of impact and harms the technology could produce from their point of view.

To prepare for this game, download the printable Judgment Call game kit.

Harms Modelling 

Harms Modelling is a framework for product teams, grounded in four core pillars of responsible innovation, that examines how people's lives can be negatively impacted by technology: injuries, denial of consequential services, infringement on human rights, and erosion of democratic & societal structures. Similar to Security Threat Modelling, this modelling enables product teams to anticipate potential real-world impacts of technology.

Community Jury

Community Jury is a technique that brings together diverse stakeholders impacted by a technology. It is an adaptation of the citizen jury. The stakeholders are provided an opportunity to learn from experts about a project, deliberate together, and give feedback on use cases and product design. This responsible innovation technique allows project teams to collaborate with researchers to identify stakeholder values, and understand the perceptions and concerns of impacted stakeholders.

These 3 new tools are still under development but are quite interesting to look at.

References

Citizens Juries

The Ethics of AI Ethics: An Evaluation of Guidelines

Hagendorff, T. The Ethics of AI Ethics: An Evaluation of Guidelines. Minds & Machines 30, 99–120 (2020) 

Saturday, 10 July 2021

Purview Readiness one-pager checklist

Starting with Azure Purview requires a few prerequisites. The checklist sets out 4 phases.

Identify where to start to establish a Governance baseline foundation for your organization for general cloud governance. 












Azure Purview Readiness Checklist

Purview-Deployment-Checklist will help prepare for data governance and data democratization in your environment.

There are four sections

  • Readiness
  • Build foundation
  • Register data sources
  • Curate and consume data

The Azure Purview deployment prerequisites can help with proof-of-concept deployments.


Azure Purview automated readiness checklist

A set of scripts has been written to help evaluate your existing environment for missing configuration that might prevent data sources from being scanned. The PowerShell scripts are

  • Azure-Purview-automated-readiness-checklist.ps1
  • Azure-Purview-automated-readiness-checklist-csv-Input.ps1





















Tuesday, 6 July 2021

Azure Purview - A Data Catalog

Data Catalogs are becoming an essential component in the new data world. They are an inventory of an organisation's data assets. The metadata collected helps in finding the most appropriate data at speed and in knowing what data is held, its security levels and its lineage. Microsoft have a tool in preview, called Azure Purview, that helps with data governance.

Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multicloud and software-as-a-service (SaaS) data. Easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification and end-to-end data lineage. Empower data consumers to find valuable, trustworthy data.




Once installed it moves into the hands of data governors. There are predefined data plane roles dictating who can access what.

Purview Data Reader - access to the Purview portal and can read all content except for scan bindings

Purview Data Curator - access to the Purview portal and can read all content except for scan bindings, can edit information about assets, classification definitions and glossary terms, and can apply classifications and glossary terms to assets.

Purview Data Source Administrator - Can manage all aspects of scanning data into Azure Purview but does not have read or write access to content beyond those related to scanning. The role does not have access to the Purview Portal (the user needs to also be in the Data Reader or Data Curator roles)
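
The way these three roles combine can be sketched in a few lines of Python. This is only an illustration of the descriptions above, not how Azure Purview implements access control: the `Capability` flags, the `ROLE_CAPABILITIES` table and the `can` helper are all hypothetical names made up for this example.

```python
from enum import Flag, auto


class Capability(Flag):
    """Hypothetical capability flags distilled from the role descriptions."""
    NONE = 0
    PORTAL_ACCESS = auto()    # can open the Purview portal
    READ_CONTENT = auto()     # read all content except scan bindings
    EDIT_CONTENT = auto()     # edit assets, classification definitions, glossary terms
    MANAGE_SCANNING = auto()  # manage all aspects of scanning


# Assumed mapping of each predefined role to the capabilities described above.
ROLE_CAPABILITIES = {
    "Purview Data Reader": Capability.PORTAL_ACCESS | Capability.READ_CONTENT,
    "Purview Data Curator": (
        Capability.PORTAL_ACCESS | Capability.READ_CONTENT | Capability.EDIT_CONTENT
    ),
    "Purview Data Source Administrator": Capability.MANAGE_SCANNING,
}


def effective_capabilities(roles):
    """Combine the capabilities of every role a user holds."""
    combined = Capability.NONE
    for role in roles:
        combined |= ROLE_CAPABILITIES[role]
    return combined


def can(roles, capability):
    """Return True if the given set of roles grants the capability."""
    return capability in effective_capabilities(roles)
```

For example, `can(["Purview Data Source Administrator"], Capability.PORTAL_ACCESS)` is false, which is why a source administrator also needs the Data Reader or Data Curator role to use the portal.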

Access to Purview is through Purview Studio. On the home page there are 2 menus. The quick access menu in the centre of the page takes you to

  • Knowledge centre
  • Register sources
  • Browse assets
  • Manage glossary

The left-hand menu has 5 icons

Home – to see a summary of sources

Sources – to add and manage sources

Glossary – to add and manage glossary term collections

Insights – the visual dashboards of the assets

Management – a place to perform administrative tasks

 












 

There are 3 areas: Data Map, Data Catalog and Data Insights

Data Map

The data map shows your sources in collection groupings of your choice. Many data types are added by default, with more coming all the time. Azure Purview is built on top of Apache Atlas, and with its API you can gain extra functionality.




Once a data source is added, with the related permissions granted on the asset, regular scans can be scheduled. Network isolation gives the ability to scan on-premises and Azure data sources behind a VNet using a self-hosted integration runtime (SHIR), ensuring end-to-end network isolation.

Governing your scan has these steps

  • Register your source
  • Apply and set up your credentials
  • Set up and run your scan
  • Discover your SQL server data

Glossary

This contains a set of terms the business uses. There might be multiple terms in the business that mean the same thing.

  • synonyms - different terms with the same definition
  • related - different terms with similar definitions

There are some default attributes that exist and can be enhanced by custom attributes. The default attributes are

  • Name
  • Definition
  • Data stewards
  • Data experts
  • Acronym
  • Synonyms
  • Related terms
  • Resources
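
As a rough illustration, a glossary term with these default attributes could be modelled as below. This is a sketch only and not the Purview or Apache Atlas data model: the `GlossaryTerm` class, its field names, the `custom_attributes` dictionary and the `link_synonyms` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    """Hypothetical model of a glossary term using the default attributes."""
    name: str
    definition: str
    data_stewards: list = field(default_factory=list)
    data_experts: list = field(default_factory=list)
    acronym: str = ""
    synonyms: list = field(default_factory=list)       # names of terms with the same definition
    related_terms: list = field(default_factory=list)  # names of terms with a similar definition
    resources: list = field(default_factory=list)
    custom_attributes: dict = field(default_factory=dict)  # extra attributes added by the business


def link_synonyms(first, second):
    """Record a two-way synonym link between two terms."""
    if second.name not in first.synonyms:
        first.synonyms.append(second.name)
    if first.name not in second.synonyms:
        second.synonyms.append(first.name)


# Example: two business terms that mean the same thing.
customer = GlossaryTerm(name="Customer", definition="A party that buys goods or services")
client = GlossaryTerm(name="Client", definition="A party that buys goods or services")
link_synonyms(customer, client)
customer.custom_attributes["Owning department"] = "Sales"
```

Keeping the custom attributes in a separate dictionary leaves the default attributes fixed while letting each business add its own.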

Classification and labelling

Labelling of data is important to aid with communication. Consistency is important. Classifications can

  • describe a type of data that exists in a data asset or schema, to help identify the content of a data asset
  • describe a data preparation process
  • help with compliance

Azure Purview provides a set of default classification rules which are automatically detected during scanning. Purview uses the same classifications, also known as sensitive information types, as Microsoft 365. Azure Purview integrates with Microsoft Information Protection sensitivity labels. There is automated scanning and labelling for files in Azure Blob Storage and Azure Data Lake Storage Gen 1 and Gen 2, and automatic labelling for database columns in SQL Server, Azure SQL Database, Azure SQL Database Managed Instance, Azure Synapse and Azure Cosmos DB. The default classification rules are not editable, although it is possible to define your own custom classification rules using Regex or custom expressions. Classification looks at the data, e.g. select top 100 from customers, for data profiling, customer pattern matching and expression matching, and applies the classification tags afterwards.
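
The idea of sampling a column and applying a Regex-based custom rule can be sketched in Python. This is a simplified illustration, not Purview's actual scanning logic: the rule names, the patterns, the 100-row sample and the 60% match threshold are all assumptions chosen for the example.

```python
import re

# Hypothetical custom classification rules: a tag name mapped to a regex.
CUSTOM_RULES = {
    "UK_POSTCODE": re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$"),
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def classify_column(values, rules=CUSTOM_RULES, sample_size=100, threshold=0.6):
    """Tag a column when enough sampled values match a rule.

    Takes the first `sample_size` non-null values (in the spirit of a
    "select top 100" sample) and returns every tag whose pattern matches
    at least `threshold` of that sample. The threshold is an assumed value.
    """
    sample = [v for v in values if v is not None][:sample_size]
    if not sample:
        return []
    tags = []
    for tag, pattern in rules.items():
        matches = sum(1 for v in sample if pattern.match(str(v).strip()))
        if matches / len(sample) >= threshold:
            tags.append(tag)
    return tags


emails = ["a@example.com", "b@example.org", "not an email"]
postcodes = ["SW1A 1AA", "M1 1AE", None]
```

Here `classify_column(emails)` tags the column as `EMAIL` because two of the three sampled values match, and `classify_column(postcodes)` tags it as `UK_POSTCODE`. The tags are applied after the sample has been checked, as in the description above.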



 

 

















Data Catalog

This enables you to search your data, use workflows in the business glossary and view data lineage for sources in the data ingestion pipeline. It connects with tools such as Power BI, Azure Data Factory and Azure Synapse. It enables curation and collaboration.

Data Insights

This is a key pillar in Purview to enable a single pane of glass in the catalog.

The insight reports that currently exist are

  • Asset Insights
  • Scan Insights
  • Glossary Insights
  • Classification Insights
  • Sensitivity labelling insights
  • File Extension Insights

Sunday, 4 July 2021

Operational Intelligence

The answer lies in insights from data analytics, whether diagnostic, predictive or prescriptive.

The Gartner model looks at 4 areas to gain competitive advantage.



While I was watching a presentation at Microsoft Build, a slide was shared that added an additional dimension of cognitive analytics. It was interesting to see the deeper insight shown in the diagram, with AI as the driving force through the advanced analytics areas.


 

Thursday, 1 July 2021

2021-2022 Microsoft Most Valuable Professional

So excited, such amazing news to receive my 4th MVP award. Thank you #Microsoft Just so humbled to receive this at this time. There is no better time to be a part of such an amazing community #SQLfamily #DataToboggan #AzureSynapse #MVPBuzz







So many exciting things going on that I'm involved in with the community after my PhD: Data Toboggan (its 3 conferences and user group), Data Relay, SQLBits, SQL Saturday and data research #data #ai #bigdata #analytics #datascience #dataanalytics #research #innovation #datastrategy #datagovernance #phd #artificialintelligence



Tuesday, 29 June 2021

Data Culture

Data culture is a term that has been talked about over many years. It is about using data to drive an organization's decisions. McKinsey state there are seven principles that underpin a healthy data culture

  • data culture is a decision culture
  • data culture is a C-Suite imperative, and that of the board
  • the democratization of data
  • data culture puts risk at its core
  • culture catalysts with people bridging data science and on-the-ground operations
  • sharing data beyond company walls is shifting from in-house competitive advantage to assembling the breadth of the best data assets in the market
  • marrying talent and culture 

Alation have started doing quarterly State of Data Culture reports. The latest report is June 2021. Within that report they are sharing a Data Culture Index (DCI), which is a quantitative assessment of how well organisations are positioned to enable data-driven decision making. The index is based on data search and discovery, data literacy and data governance. I do think that data culture is also about data ethics.












The report states the top initiative to foster data culture is managing data governance and improving data quality.

Monday, 28 June 2021

Distinguished Engineer and Research Fellows

I have always been fascinated by the job roles distinguished engineer and research fellow. To me it seems these roles transcend research and industry. A distinguished engineer is a title applied to someone who is thought (by those conferring this title) to have achieved noteworthy technical and professional accomplishments while working as an engineer. A research fellow is an academic research position at a university or research institution that is usually held by academic staff or faculty members.

I came across a slide show on behaviours and qualities of an IBM Distinguished Engineer. An interesting quote is on the diagram "I want to be distinguished from the rest; to tell the truth, a friend to all mankind is not a friend for me" 





















Distinguished Engineer IBM Fellows are world famous inventors and theorists. A Distinguished Engineer has a unique and fascinating job that transcends many boundaries. A few of the attributes they mention:
 
  • Eminence takes responsibility 
  • Be learned, erudite 
  • Integrity and trust in all things
  • Learn from your mistakes
  • Apply common sense
  • Make decisions
  • Have a point of view
  • Provide hope
  • Inspire others
  • Be collaborative
  • Be optimistic and cheerful
  • Adapt proactively
  • Be curious and fearless
  • Build a track record and keep notes!
  • Know yourself and be true to you
  • Enhance your communication skills and image
  • Listen actively
  • Be a mentor and coach
  • Lead diverse teams
  • Be a member of a professional body 

Tuesday, 22 June 2021

Combining research and industry learning

I am very privileged to have an article about my career published in the ACM journal. Computing enabled me to... obtain a PhD. and a Career in Data.

The DOI reference for my paper is https://doi.org/10.1145/3464919 .

Communications of the ACM Volume 64 Issue 7 pp 7

About the journal

ACM, the world's largest educational and scientific computing society, delivers resources that advance computing as a science and a profession. ACM provides the computing field's premier Digital Library and serves its members and the computing profession with leading-edge publications, conferences, and career resources. They see a world where computing helps solve tomorrow’s problems – where we use our knowledge and skills to advance the profession and make a positive impact.

Research 

Being part of the research world is a huge part of who I am and it is very important to have research and industry working together to help shape the future of data innovation.

Wednesday, 16 June 2021

Data Toboggan Cool Runnings 21 Summary

The 12-hour event ran on 12 June. We used Teams Live Events. The tool has evolved since we used it in January, with some plus points and some behaviour we hadn't seen before. There were some amazing speakers from around the world: Finland, Malta, Australia, Hungary, Serbia, UK (including Scotland), The Netherlands, USA (Seattle, New Mexico), China (Shanghai), Canada and Norway.

It was amazing having Robin Sutara, the Microsoft UK Chief Data Officer, present the Keynote. We were also privileged to have a number of product group speakers sharing in-depth details about the Azure Synapse features. Then there were community speakers and MVPs speaking. All in all a lot of content, with the 45-minute main sessions wrapped with 5-minute pre-recorded lightning talks.

We had our first expert panel discussion session, which I found really interesting and informative. People were able to ask questions in advance using Microsoft Forms and during the event using Slido. I would like to thank the people who watched the presentations live on the day and contributed to such a lively chat on Slack.

I was interested to see where our Twitter followers are located. It is interesting to see where Azure Synapse is being used or investigated.














There were many discussions on the Slack channel. It is nice to have free, open discussions on Azure Synapse all day. The links shared in the Slack channel were

Presentation tricks : http://blogs.lobsterpot.com.au/2021/01/30/presentation-trickery-online-glassboard-like-lightboard-but-using-just-free-software/ 

Set the UTF-8 collation after the database has been created, as described by Jovan here: https://techcommunity.microsoft.com/t5/azure-synapse-analytics/always-use-utf-8-collations-to-read-utf-8-text-in-serverless-sql/ba-p/1883633

To learn more about the distributed execution flow that serverless uses, read this VLDB article: https://www.vldb.org/pvldb/vol13/p3204-saborit.pdf

Andy C Slides for Turbocharge here https://www.datahai.co.uk/power-bi/turbocharge-power-bi-using-azure-synapse-analytics-session/

Using file metadata in queries : https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-specific-files

Best Practices for serverless SQL pool https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool

Article that explains the cost management for serverless pools in detail: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/data-processed

Synapse Link for Dataverse: https://docs.microsoft.com/en-us/powerapps/maker/data-platform/export-to-data-lake

Roman Pijacek: Cost control for serverless SQL pool - the source code of the views I'll be showing during the lightning talk is available at https://github.com/RomanPijacek/DataToboggan/tree/main/CostControlForServerlessSqlPool

The official MS doc that describes Cost management for serverless SQL pool: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/data-processed

Craig Porteous: GitHub repo for his session

https://github.com/cporteou/Presentations/tree/main/Adventures%20in%20CICD%20with%20Azure%20Synapse

Data Profiler Summary Stats in ADF Data Flows: https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-save-your-data-profiler-summary-stats-in-adf-data-flows/ba-p/1243251

Data Factory to Synapse user voice: https://feedback.azure.com/forums/307516-azure-synapse-analytics/suggestions/41026642-allow-to-share-a-self-hosted-ir-from-data-factory

Well-Architected Framework session slides  https://www.datahai.co.uk/synapse-analytics/applying-the-azure-well-architected-framework-to-azure-synapse-analytics-session/

Mark PM: SQLPackage.exe https://docs.microsoft.com/en-us/sql/tools/sqlpackage/sqlpackage?view=sql-server-ver15 

SqlPackage for Azure Synapse Analytics - SQL Server https://docs.microsoft.com/en-us/sql/tools/sqlpackage/sqlpackage-for-azure-synapse-analytics?view=sql-server-ver15

Wolfgang: First preview this Summer https://devblogs.microsoft.com/visualstudio/visual-studio-2022/

From Drew Skwiers-Koballa  "Build a Data Warehouse with SQL Database Projects": https://github.com/dzsquared/synapse-sqlproj-demo

Finishing with the Azure Synapse Analytics Blog: https://techcommunity.microsoft.com/t5/azure-synapse-analytics/bg-p/AzureSynapseAnalyticsBlog