Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous; the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk. Microsoft Data Platform MVP

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein

Wednesday, 11 July 2018

Microsoft Research Open Data Sets

The Microsoft Research Outreach team has worked with the community to enable adoption of cloud-based research. As a result they have launched Microsoft Research Open Data, a new data repository for the global research community. Microsoft wishes to bring processing to the data rather than rely on data movement through the internet. This useful addition allows the data sets to be copied directly to the Azure-based Data Science virtual machines. More details can be read here. The aim is to provide anonymized, curated and meaningful datasets that are findable, accessible, interoperable and reusable. This follows on from the data-intensive science fourth paradigm of discovery discussed by Jim Gray.

The open data set categories can be seen below. 

Saturday, 7 July 2018

The Future State - Serendipitous Data Management

Gone are the days when companies can survive on existing products and services. The need to continually innovate to stay ahead in a fluid world requires a change in direction. Many articles have been written, in both academic research and industry, trying to predict the future state of data technology and this year's trends.

Currently research meets industry in a rebirth of industry-based research teams, consisting of organisation-only teams or industry collaborating with universities. Guzdial shares his thoughts in the Communications of the ACM March 2018 journal that "for the majority of new computer science PhDs, the research environment in industry is currently more attractive". In particular, the need within industry to continually innovate cries out for more research divisions in industry. Part of this change is due to the rapid expansion of emerging technology, but also the realization of what data science and artificial intelligence (AI) can add to a business. Data science requires collaboration between people, teams and organisations, as interdisciplinary skills are needed to solve today’s problems.

There is an emerging trend whereby more research institutes are being created and existing ones are hiring more staff. Microsoft has created a new organization, Microsoft Research AI (MSR AI), to pursue game-changing advances in artificial intelligence. The research team combines advances in machine learning with innovations in language and dialog, human-computer interaction, and computer vision to solve some of the toughest challenges in AI.

AI and machine learning intelligence, based on big data, is a complex problem to solve in order to empower people for the future. In the current world there is a need for collaboration. Greengard, in the Communications of the ACM March 2018 journal, raised a concern that "mountains of data produce incremental gains, and coordinating all the research groups and silos is a complex endeavour". Managing data is complex, and the key areas that I think will define the next revolution are shown in the graph.

Telling stories from the data is increasingly important in this ever-changing holistic environment. Skills need to be developed in this area, as communicating the meaning of data is crucial. Improvements in business, science, robotics, space and health can initially appear through intelligent automation and can produce further actionable insights.

Data visualization is a key component of telling the story and seeing anomalies. Parameswaran discussed at SIGMOD 2018 that it is scale that brings databases and visualization together. He highlighted two problem areas: too many tuples and too many visualizations. It is an interesting point to consider how to address the excessive data points and how to find the right visualization for the data, to gain insight at speed.

Innovation is key to the next step. I believe that step is making beneficial discoveries by design, through scientific experiments on quality data, in a continuous and autonomous fashion. I call this Serendipitous Data Management. This improvement and innovation will come from having sound practices for big data management that enable actionable data insights at speed.

Another trend I am seeing in research and industry is looking at how data is processed in centralised data lakes and moving that processing to the edge, particularly for IoT at the moment. If the data can remain at source, this increases security and also reduces the volume of data in transit, which is currently unsustainable. How to consolidate these distributed data sources and produce analysis across disparate systems is an interesting challenge to solve. In conclusion, systems built on data create a rapidly changing landscape, and these are what I see as the key components defining revolutionary changes to society and culture.

Thursday, 5 July 2018

2018 MVP Reactions

Peter Laker published an article about reactions from 2018's newest Most Valuable Professional (MVP) award winners. What a great set of reactions from some amazing people.

Privileged to have my comment listed on the reactions list.

Sunday, 1 July 2018

Tutorial on Tree Based Modeling

I found a useful tutorial on tree-based learning.

The tutorial includes:

  • What is a Decision Tree? How does it work?
  • Regression Trees vs Classification Trees
  • How does a tree decide where to split?
  • What are the key parameters of model building and how can we avoid over-fitting in decision trees?
  • Are tree based models better than linear models?
  • Working with Decision Trees in R and Python
  • What are the ensemble methods of tree-based models?
  • What is Bagging? How does it work?
  • What is Random Forest? How does it work?
  • What is Boosting? How does it work?
  • Which is more powerful: GBM or Xgboost?
  • Working with GBM in R and Python
  • Working with Xgboost in R and Python
  • Where to Practice?
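As a taste of the "how does a tree decide where to split?" topic, here is a minimal sketch in Python of choosing a split point by Gini impurity. The data and helper names are my own illustration, not taken from the tutorial:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Find the threshold on one numeric feature that minimises the
    weighted Gini impurity of the two resulting partitions."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if right and score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data: the class changes once x goes above 2,
# so the tree should split at threshold 2.
xs = [1, 2, 3, 4]
ys = ["a", "a", "b", "b"]
print(best_split(xs, ys))  # -> (2, 0.0)
```

A real decision tree repeats this search over every feature at every node; the tutorial's over-fitting discussion is about when to stop doing so.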

Monday, 25 June 2018

Research Method and Robustness

During my research I had to review and assess many types of research methods, designs and analysis. The choices made aligned with the type of research questions under investigation. A key consideration was ensuring the research was robust. The areas I reviewed are below.

My research followed a mixed-methods approach with a sequential explanatory design. The quantitative research used a survey as the data collection method, and the qualitative research used focus groups. As part of the qualitative analysis I used Thematic Analysis. I found the two methods complementary. To read more about my method see my thesis, A Study into Best Practices and Procedures used in the Management of Database Systems.

Wednesday, 20 June 2018

Microsoft Inspire in 2018

Microsoft Inspire, the Microsoft partner conference, takes place next month. Microsoft personnel and industry experts from around the globe will come together for a week of networking and learning. It is a good conference for hearing partners share their success stories and seeing real-life use cases for technology. Last year the keynotes were broadcast live, so I am hoping this will happen again this year.

Microsoft Ready and Microsoft Inspire will be co-located for the first time.

Tuesday, 12 June 2018

Apache Calcite: A Foundational Framework

At ACM SIGMOD/PODS 2018 this week, Hortonworks are talking about Calcite, a foundational framework for optimized query processing over heterogeneous data sources. It seems an interesting dynamic data management framework that deliberately omits some key functions: storage of data, algorithms to process data, and a repository for storing metadata.

The original goal was to improve Apache Hive along three different axes: latency, scalability, and SQL support. Hive and Calcite are more integrated now, and the new features of its optimizer aim to generate better plans for query execution.

It can be used for data virtualization/federation. It supports heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is an attractive choice for adoption in big-data frameworks.

Sunday, 10 June 2018

24 Hours of PASS Summit Preview 2018

Free Microsoft Data Platform training is to be held on 12 June 2018, starting at 12:00 UTC. It is a great opportunity to hear some amazing sessions: 24 hours of back-to-back one-hour content.

Topics covered in this edition include Performance Tuning, Azure Data Lake, Digital Storytelling, Advanced R, Power BI and more! Join in on sessions from Kendra Little, Melissa Coates, Rob Sewell, Brent Ozar, Dejan Sarka, Devin Knight, Mico Yuk and many more experts in their fields.

Saturday, 9 June 2018

ACM SIGMOD/PODS Conference 2018

It is that time of year again: the annual ACM SIGMOD/PODS Conference will be held 10-15 June 2018 in Houston. The ACM SIGMOD/PODS Conference is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences.

Microsoft has contributed to the programming of SIGMOD with several researchers serving on committees and inclusion in workshops, research sessions, industry sessions, demo sessions, and poster sessions.

Thursday, 7 June 2018

Microsoft Graph - a useful tool

Microsoft Graph contains digital artifacts on life and work, and also ensures privacy and transparency. An overview of Microsoft Graph explains how it is a gateway to data and intelligence in Microsoft 365. 
The Microsoft Graph exposes APIs for:
  • Azure Active Directory
  • Office 365 services: SharePoint, OneDrive, Outlook/Exchange, Microsoft Teams, OneNote, Planner, and Excel
  • Enterprise Mobility and Security services: Identity Manager, Intune, Advanced Threat Analytics, and Advanced Threat Protection.
  • Windows 10 services: activities and devices
  • Education
The quick start page explains usage and provides examples.

Tuesday, 5 June 2018

Splunk Structure

The Splunk infrastructure is made up of various components:

Indexer – processes incoming machine data and stores the results in indexes for searching. Raw data is compressed and indexes point to the data
Search Head – takes the search request and distributes it to the indexers, which search the data, then consolidates the results and displays them. Knowledge objects on the search head can be used to create additional fields and transform the data
Forwarder – consumes data and forwards it to the indexers for processing
Deployment Server – distributes content and configurations
Cluster Master – coordinates the replicating activities of the peer nodes and tells the search head where to find data
License Master – shares licenses with other nodes
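As a rough mental model of how the main components fit together (a toy sketch of my own, not Splunk code), forwarders push events to indexers, and the search head fans a search out across the indexers and consolidates the results:

```python
# Toy model of the Splunk data path: forwarders send events to indexers,
# the search head distributes a query and merges what comes back.
class Indexer:
    def __init__(self):
        self.index = []  # in Splunk: compressed raw data plus index files

    def ingest(self, event):
        self.index.append(event)

    def search(self, term):
        return [e for e in self.index if term in e]

class Forwarder:
    def __init__(self, indexers):
        self.indexers = indexers

    def forward(self, event):
        # real forwarders load-balance; here we hash the event to pick one
        self.indexers[hash(event) % len(self.indexers)].ingest(event)

class SearchHead:
    def __init__(self, indexers):
        self.indexers = indexers

    def search(self, term):
        # distribute the search, then consolidate the results
        results = []
        for indexer in self.indexers:
            results.extend(indexer.search(term))
        return results

indexers = [Indexer(), Indexer()]
fwd = Forwarder(indexers)
for line in ["ERROR disk full", "INFO started", "ERROR timeout"]:
    fwd.forward(line)

head = SearchHead(indexers)
print(head.search("ERROR"))  # both ERROR events, order depends on placement
```

The deployment server, cluster master and license master sit outside this data path, managing configuration, replication and licensing.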

The Folder Structure

The structure of the folders within Splunk is as follows:
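Roughly, and assuming a default installation (paths can vary by version and platform), the key folders under $SPLUNK_HOME are:

```
$SPLUNK_HOME/
├── bin/                  # splunk CLI and helper executables
├── etc/
│   ├── system/
│   │   ├── default/      # shipped defaults - never edit these
│   │   └── local/        # site-wide configuration overrides
│   ├── apps/             # installed apps, each with default/ and local/
│   └── users/            # per-user knowledge objects
└── var/
    ├── lib/splunk/       # the indexes themselves
    └── log/splunk/       # Splunk's own internal logs
```

The default/local split matters: configuration in local/ survives upgrades, while default/ is replaced.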

Friday, 1 June 2018

Azure Cloud Collaboration Center

In May Microsoft in Redmond shared the news of their Azure Cloud Collaboration Center. This facility is a first of its kind. The Cloud Collaboration Center space shows customers a snapshot of what is happening with their data 24/7 and enables real-time troubleshooting of issues by multiple teams simultaneously from across the organization.  It combines innovation and scale to address operational issues and unexpected events in order to drive new levels of customer responsiveness, security and efficiency. 

It is great to see a space for correlation of information with the possibility to pull the data up on individual workstations. I think this kind of collaboration is great news for customers.

Monday, 28 May 2018

CODEX: The Control of Data EXpediently

Complexity exists in database systems whether managing data or databases. The ability to understand this complexity and use this understanding to improve and innovate in the management of these database systems is an important step.  My PhD research investigations centred around understanding best practices and the complexity that exists around management of database systems. 

I used the analogy of a CODEX. The CODEX is a blueprint for database systems management. The acronym CODEX was selected by analogy with the revolutionary introduction of the Codex (Netz & Noel 2007, pp.69–85) in the first century AD which changed the storage medium from a roll to a Codex (book format). This brought challenges migrating the data, but significant benefits of increased speed of data access, reference and durability (of the parchment). Not all texts were migrated from rolls to Codex and those that were not migrated became defunct. Text case was changed from capitals to lowercase and minuscule copies made, resulting in further change; original majuscule manuscripts have not survived. This scholarly activity led to a revival in reading classic documents and a development of a centre of culture.

The CODEX is one of the outputs from the research, based on interpretation of the data. The components stated in the CODEX are the most prevalent components that are connected when managing database systems.

The CODEX (Control of Data EXpediently) blueprint acronym is constructed as: C for control; O for control of Operations; D for data; E for expediently; and X for unpredictable events. It is an acronym for a system or way of controlling operations and data in a rapid, efficient and accurate manner.

The five inputs into the CODEX are required for every piece of data or database management work. These five inputs are described in the paragraphs below.

C. An important step in any database system is the control system: defining the business needs, budget, controlling the people and time factors. People are important in the management of the database system. It involves the stakeholders and the teams working together to achieve a single goal. The culture driving this collaborative venture forward will undoubtedly raise conflict, but this should be integrated with a high level of communication with all levels in management, the stakeholders, the teams and data and database staff. Also, the governance related to data and data quality should be controlled.

O. Control of operations of the database system is the core day to day running of management tasks, the processes and the performance of the system, orchestrating management through automated and self-managing systems where possible. Technical management needs to understand how internal and external technologies integrate. This is vital when using cloud technologies because internal managers have no control of the details. All of these operations require security to be considered to protect the data.

D. The increasing volume of data acquired today requires storage in various forms. Thus, databases or big data solutions have developed to satisfy the current demand not only for storage but also to provide information quickly and accurately. The variety of data, big or small, requires governance and has a purpose. The reporting and visualization of data is key to enhance business ability to grow, adapt and understand the complexity. Data is continually changing and more of it needs to be stored to meet the demands of society. Being able to understand the data for it to be available and useful, is a core requirement to improve and innovate.

E. Expediency is driven by the need to have efficient control over costs, speed of delivery and change. Designing database systems that are easy to manage, simple and agile, utilising reference architectures and blueprints, is key to performing expediently. The critical factors for being able to proceed are knowledge, skills and learning, leading to understanding and allowing planning to unfold unhindered. Development can lead to fast-performing applications and efficient management through automation.

X. With any system and particularly in a diverse and ubiquitous database system, systems change is always happening, be it with the number and type of database platforms, the new technologies, global business or environment change. Change is rapid and diverse. Using patterns and always establishing best practice will help in the management of database systems. These best practices need to be able to rapidly change as requirements change or are not known at the outset. Producing documentation that can be automatically created is key for accuracy and ensuring documentation is available. Also, unpredictable events can occur and any changes to the components must be documented and changes to all respective components made. There is continuous feedback over time. 

I will discuss in another blog post, how the suggested pattern, CODEX (Control of Data EXpediently), can help improvement in managing database systems. Using AI to improve management, incorporating telemetry, and systems diagramming will aid with the change to come.

Holt, Victoria (2017). A Study into Best Practices and Procedures used in the Management of Database Systems. PhD thesis The Open University

Wednesday, 23 May 2018

Deep learning

Deep Learning is a subset of machine learning which aims to solve thought-related problems. To understand this bleeding-edge technology, here are a few links.

With the Microsoft AI School, learn an intuitive approach to building the complex models that help machines solve real-world problems with human-like intelligence.

Free webinar
On 31 May there is a free webinar with Jen Stirrup on Deep Learning and Artificial Intelligence in the Workplace. The webinar asks what Microsoft’s approach to Deep Learning is, and how it differs from open source alternatives. The session will look at Deep Learning and how it can be implemented in Microsoft and Azure technologies with the Cognitive Toolkit, TensorFlow on Azure and CaffeOnSpark on Azure HDInsight.

How deep learning will change customer experience
This article discusses artificial neural networks for machines which will allow machines and devices to function in some ways as humans do.

Monday, 14 May 2018

MVP Award

My Microsoft MVP (Most Valuable Professional) Data Platform award has now arrived. What a privilege it is to have received it. I like to share my passion for data and Microsoft technology. There is something about data technology that excites my curiosity. I feel privileged to have found such a career focus. The award is for exceptional community leadership and expertise in a technology focus area.

Monday, 7 May 2018

Microsoft Build Azure Cosmos DB

Microsoft Build is underway sharing many useful features. The Azure Cosmos DB API is a versatile tool with a number of options. There are some quickstart tutorials and samples for these.

Azure Cosmos DB now has multi-master write support. Multi-master in Azure Cosmos DB provides single-digit millisecond latency to write data, and availability with built-in flexible conflict resolution support. There are some good examples in the article to help understand this functionality better.

Azure Operational Data Services includes Azure SQL DB; PostgreSQL; MySQL; Redis Cache; and Cosmos DB.

Saturday, 5 May 2018

Azure CosmosDB Change Feed

The Azure CosmosDB change feed can provide a persistent log of records within an Azure CosmosDB container. You can learn about this from this concise presentation.

Tuesday, 1 May 2018

Microsoft MVP Award

I received my first Data Platform MVP award yesterday. What an honour it is to be a part of an amazing community. I am humbled by the enormity of the award and it is a privilege to be able to share my passion for data. I am listed here.

Big Data Exploration

The big data landscape is growing and exploration of the data can help make better decisions. I came across this great infographic from IBM.

Monday, 30 April 2018

Machine Learning Algorithm Cheat Sheet

Another machine learning cheat sheet to help you choose your algorithm. The cheat sheet is designed for beginner data scientists and analysts.

The types of learning.

Wednesday, 25 April 2018


GDPR

The General Data Protection Regulation (GDPR) comes into effect on 25 May 2018, one month from now. The EU General Data Protection Regulation is the most important change in data privacy regulation in 20 years. GDPR is fundamentally about protecting and enabling the privacy rights of the individual.

A Guide to enhancing privacy and addressing GDPR requirements with the Microsoft SQL platform is an interesting read. The obligations related to controls and security around the handling of personal data are some of the concepts discussed in the document.

GDPR Article 25, "Data protection by design and default": Control exposure to personal data.
  • Control accessibility: who is accessing data and how.
  • Minimize data being processed in terms of amount of data collected, extent of processing, storage period, and accessibility.
  • Include safeguards for control management integrated into processing.

GDPR Article 32, "Security of processing": Security mechanisms to protect personal data.
  • Employ pseudonymization and encryption.
  • Restore availability and access in the event of an incident.
  • Provide a process for regularly testing and assessing the effectiveness of security measures.

GDPR Article 33, "Notification of a personal data breach to the supervisory authority": Detect and notify of a breach in a timely manner (72 hours).
  • Detect breaches.
  • Assess the impact on, and identify, the personal data records concerned.
  • Describe measures to address the breach.

GDPR Article 30, "Records of processing activities": Log and monitor operations.
  • Maintain an audit record of processing activities on personal data.
  • Monitor access to processing systems.

GDPR Article 35, "Data protection impact assessment": Document risks and security measures.
  • Describe processing operations, including their necessity and proportionality.
  • Assess risks associated with processing.
  • Apply measures to address risks and protect personal data, and demonstrate compliance with the GDPR.
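As a toy illustration of the Article 32 pseudonymization point (my own sketch, not from the Microsoft guide): replacing direct identifiers with keyed hashes lets records still be joined and analysed without exposing the raw values, while the key holder can re-identify where lawful.

```python
import hmac
import hashlib

def pseudonymize(value: str, key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).
    The same value always maps to the same token under one key,
    so joins and aggregations still work on the pseudonymized data."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Illustrative record and key; the key must be stored apart from the data.
key = b"keep-this-key-separate-from-the-data"
record = {"customer_id": "C-1001", "email": "jo@example.com", "spend": 42.50}

safe_record = {
    "customer_id": pseudonymize(record["customer_id"], key),
    "email": pseudonymize(record["email"], key),
    "spend": record["spend"],  # non-identifying fields stay usable
}
print(safe_record)
```

Unlike plain hashing, the secret key prevents an attacker from rebuilding the mapping by hashing guessed identifiers themselves.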

Friday, 20 April 2018

DataWorks Summit 2018

This was the first time I had attended the DataWorks Summit ("Ideas. Insights. Innovation.") for big data. I had the privilege of attending the Luminaries dinner on arrival at the conference. The dinner was held for the European data heroes award. The Hortonworks data heroes initiative recognizes the data visionaries, data scientists, and data architects transforming their businesses and organizations through Big Data.

Each day started with a set of keynotes.

Day 1 Opening Keynotes
The Single Most Important Formula for Business Success – Scott Gnau, Hortonworks
Changing the Data Game with Open Metadata and Governance – Mandy Chessell, IBM
Big Data Success In Practice: The Biggest Mistakes To Avoid Across The Top 5 Business Use Cases – Bernard Marr, Bernard Marr & Co.
Munich Re: Driving a Big Data Transformation – Andreas Kohlmaier, Munich Re

Scott Gnau opened his talk with a hypothesis: "Data is your cloud is your business". Connecting disparate data to provide real-time information enables us to innovate fast. A data strategy is imperative; it needs to include governance and security, and adopt rapid change. Data drives our lives every day, from smart edge devices to all businesses.

He concluded with: your data strategy is your cloud strategy is your business strategy; if (A) = (B) and (B) = (C) then (A) = (C).

Bernard Marr then shared his insights about AI automating more things faster and the fourth industrial revolution. He mentioned the top 5 business use cases as:

  • Informing: to make better decisions
  • Understanding: know your customers better
  • Improvement: customer value proposition
  • Automation: key business processes
  • Monetization: data as an asset

A couple of interesting points raised were about specialist data-hunting units to find new data sources, and automation requirements to improve operations. Data diversity is key to improving analytics, along with data governance.

Day 2 Keynotes
Renault: A Data Lake Journey – Kamelia Benchekroun, Renault Group
Are You Ready For GDPR? – Jamie Engesser and Srikanth Venkat, Hortonworks
Embracing GDPR to Improve Your Business Practices in the Digital Age – Enza Iannopollo, Forrester Research
Driving High Impact Business Outcomes from Artificial Intelligence – Frank Saeuberlich, Teradata

On Day 2, Forrester's Enza Iannopollo discussed embracing GDPR to improve your business practices in the digital age. Privacy by design and by default requires new business processes to be established and cultural change to happen. GDPR requires compliance across the organization and with external partners. The compliance strategies are only as good as your risk assessment and mitigation. The classification of data is a key place to start. She concluded the session with a quote:
"Good data protection normally enables you to do more things with data, not less" – Tim Gough, Head of Data Protection, Guardian News and Media

Data Steward Studio (DSS) was launched at the conference. It is one of several services available for Hortonworks DataPlane Service; it provides a suite of capabilities that allows users to understand and govern data across enterprise data lakes.

Saturday, 14 April 2018

SQL Information Protection with Data Discovery and Classification

The public preview of SQL Information Protection brings advanced capabilities built into Azure SQL Database for discovering, classifying, labeling, and protecting the sensitive data in your databases. SQL Data Discovery and Classification are also added to SQL Server Management Studio.

These tools will help meet data privacy standards and regulatory compliance requirements, such as GDPR. They will enable data-centric security scenarios, such as monitoring (auditing) and alerting on anomalous access to sensitive data, to be viewed in dashboards. They will help with controlling access to, and hardening the security of, databases containing highly sensitive data.

SQL Information Protection (SQL IP) introduces a set of advanced services and new SQL capabilities, forming a new information protection paradigm in SQL aimed at protecting the data. The four areas covered are:

  • Discovery and recommendations
  • Labeling
  • Monitoring/Auditing (Azure SQL DB only)
  • Visibility
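To make the "discovery and recommendations" idea concrete, here is a rough sketch, entirely my own and far simpler than the real feature, of scanning sampled column values for patterns that suggest sensitive data:

```python
import re

# Illustrative patterns only - the real feature uses richer classification.
PATTERNS = {
    "Email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "Credit Card": re.compile(r"^(?:\d[ -]?){13,16}$"),
}

def classify_column(values, threshold=0.8):
    """Suggest a sensitivity label if most sampled values match a pattern."""
    values = [v for v in values if v]
    for label, pattern in PATTERNS.items():
        if values and sum(bool(pattern.match(v)) for v in values) / len(values) >= threshold:
            return label
    return None

# Hypothetical sampled data from two columns.
sample = {
    "contact": ["jo@example.com", "sam@example.org", "li@example.net"],
    "notes": ["called monday", "follow up", "closed"],
}
for column, values in sample.items():
    print(column, "->", classify_column(values))
```

The threshold guards against labelling a whole column on the strength of one stray matching value; the real service then recommends labels for a human to confirm.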

Ph.D Graduation

“A story has no beginning or end: arbitrarily one chooses that moment of experience from which to look back or from which to look ahead.”
― Graham Greene, The End of the Affair

After 7 years of hard work, bringing industry and research together, I was excited to attend my Ph.D graduation. What an awesome and humbling day. Words can't express how it felt as a Ph.D graduate, with a Doctor of Philosophy, to sit on the stage alongside the university academic staff. It is something I will never forget.

Now it is time to utilize the research skills gained throughout the Ph.D and begin something new. My aspirations in the academic field are to write many papers, share my research findings and become a research fellow.

Thursday, 12 April 2018

Microsoft Professional Program for Artificial Intelligence

With Artificial Intelligence (AI) defining the next generation, this Microsoft course seems a great way to jump-start your skills.

The course covers these modules:

  • Introduction to AI
  • Use Python to Work with Data
  • Use Math and Statistics Techniques
  • Consider Ethics for AI
  • Plan and Conduct a Data Study
  • Build Machine Learning Models
  • Build Deep Learning Models
  • Build Reinforcement Learning Models
  • Develop Applied AI Solutions
  • Final Project
At the end you gain the Microsoft Professional Program Certificate in Artificial Intelligence.

Tuesday, 10 April 2018

Leverage data for building

The leverage data to build intelligent apps presentation gives an insightful overview of the Microsoft Data Platform and how to innovate with analytics and AI. 

Monday, 9 April 2018

Advice and guidance on becoming a speaker or volunteer

I watched this great session giving 'advice and guidance on becoming a speaker or volunteer' from SQLBits this year. 

I felt humbled when I listened to the SQLBits session recording as I am named as an absolute legend for attending all 16 SQLBits and helping for over 8 years. I had never spoken, never presented or been involved in the public facing side of the conference. It is such a great feeling helping the conference be successful, helping others enjoy what working with data brings and being a part of the sqlfamily. Thanks to SQLBits for enabling me to be a part of such an amazing event for all of these years.

The PhD Bookshelf

Following on from the creation of a literature map for my PhD, I started to formulate a plan of literature to read. These are some of the books on my bookshelf.

I also read many academic papers, stored in seven box files and in Mendeley.

Mendeley is a free reference manager. It enables you to manage your research, showcase your work, and connect and collaborate with over six million researchers worldwide.

I found the Communications of the ACM journal, and the journal of SIGMOD (the ACM Special Interest Group on Management of Data), great reads.

Friday, 6 April 2018

Demystify complex relationships with SQL Server 2017 and graph

This great infographic shows some quick tips about SQL Server 2017 and graph databases. The picture demonstrates nodes and edges and provides a clear example of the code changes between a traditional SQL query and a graph query.

Tuesday, 3 April 2018

Cosmos DB SQL query cheat sheet

The new Azure Cosmos DB: SQL Query Cheat Sheet helps you write queries for SQL API data by displaying common database queries, keywords, built-in functions, and operators in an easy to print PDF reference sheet. Reference information for the MongoDB API, Table API, and Gremlin/Graph API are also included.

Sunday, 1 April 2018

Literature Map

When you start any research project, you need to set the research in the context of the current literature. This will establish a framework for the importance of the study. This document was the starting place for organizing the literature of interest in my research.

Thesis Title: A Study into Best Practices and Procedures used in the Management of Database Systems

Friday, 30 March 2018

Comparison of big data engines

A comparison of big data querying engines is below.

Apache HBase is the Hadoop database, a distributed, scalable, big data store. HBase is an open-source, non-relational, distributed database modelled after Google's Bigtable and is written in Java.

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Splunk turns machine data into answers with the leading platform to tackle the toughest IT, IoT and security challenges.

Tuesday, 27 March 2018

Machine Learning

Predictive analytics uses various statistical techniques, such as machine learning, to analyze collected data for patterns or trends to forecast future events. Machine learning uses predictive models that learn from existing data to forecast future behaviors, outcomes, and trends.

Machine learning libraries enable data scientists to use dozens of algorithms, each with its strengths and weaknesses. Download the machine learning algorithm cheat sheet to help identify how to choose a machine learning algorithm.
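As a minimal illustration of "learning from existing data to forecast" (a sketch of my own, not tied to any particular library), ordinary least squares fits a line to past observations and then predicts an unseen value:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# "Existing data": monthly sales figures (illustrative numbers).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

a, b = fit_line(months, sales)
forecast = a * 6 + b  # predict month 6
print(a, b, forecast)  # -> 2.0 8.0 20.0
```

Every algorithm on the cheat sheet follows this same shape, fit on known data, then predict, differing in how flexible the fitted model is allowed to be.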