Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous, the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein

Wednesday, 25 April 2018


The General Data Protection Regulation (GDPR) comes into effect on 25 May 2018, one month from now. The EU General Data Protection Regulation is the most important change in data privacy regulation in 20 years. GDPR, is fundamentally about protecting and enabling the privacy rights of the individual.

A Guide to enhancing privacy and addressing GDPR requirements with the Microsoft SQL platform is an interesting read. The obligations related to controls and security around handling of personal data are some of the the concepts discussed in the the document.

GDPR Article 25—“Data protection by design and default”: Control exposure to personal
• Control accessibility—who is accessing data and how.
• Minimize data being processed in terms of amount of data collected, extent of
processing, storage period, and accessibility.
• Include safeguards for control management integrated into processing.
GDPR Article 32—“Security of processing”: Security mechanisms to protect personal data.
• Employ pseudonymization and encryption.
• Restore availability and access in the event of an incident.
• Provide a process for regularly testing and assessing effectiveness of security
GDPR Article 33—“Notification of a personal data breach to the supervisory authority”:
Detect and notify of breach in a timely manner (72 hours).
• Detect breaches.
• Assess impact on and identification of personal data records concerned.
• Describe measures to address breach.
GDPR Article 30—“Records of processing activities”: Log and monitor operations.
• Maintain an audit record of processing activities on personal data.
• Monitor access to processing systems.
GDPR Article 35—“Data protection impact assessment”: Document risks and security
• Describe processing operations, including their necessity and proportionality.
• Assess risks associated with processing.
• Apply measures to address risks and protect personal data, and demonstrate
compliance with the GDPR.

Saturday, 14 April 2018

SQL Information Protection with Data Discovery and Classification

The public preview of SQL Information Protection brings advanced capabilities built into Azure SQL Database for discovering, classifying, labeling, and protecting the sensitive data in your databases. SQL Data Discovery and Classification are also added to SQL Server Management Studio.

This tools will help meet data privacy standards and regulatory compliance requirements, such as GDPR. It will enable data-centric security scenarios, such as monitoring (auditing) and alerting on anomalous access to sensitive data to be viewed in dashboards. It will help with controlling access to and hardening the security of databases containing highly sensitive data.

The SQL Information Protection (SQL IP) introduces a set of advanced services and new SQL capabilities, forming a new information protection paradigm in SQL aimed at protecting the data. The four areas covered are:

  • Discovery and recommendations 
  • Labeling
  • Monitoring/Auditing (Azure SQL Db only)
  • Visibility

Ph.D Graduation

“A story has no beginning or end: arbitrarily one chooses that moment of experience from which to look back or from which to look ahead.”
― Graham Greene, The End of the Affair

After 7 years of hard work, bringing industry and research together, I was excited to attend my Ph.D graduation. What an awesome and humbling day. Words can't express how it felt as a Ph.D graduate, with a Doctor of Philosophy, to sit on the stage along side the university academic staff. It is something I will never forget.

Now it is time to utilize my research skills gained throughout the Ph.D and begin something new. My aspirations in the academic field, are to write many  papers, share my research findings and to become a research fellow. 

Thursday, 12 April 2018

Microsoft Professional Program for Artificial Intelligence

With Artificial Intelligence (AI) defining the next generation this Microsoft course seems a great way to jump start your skills.

The course covers these modules

  • Introduction to AI
  • Use Python to Work with Data
  • Use Math and Statistics Techniques
  • Consider Ethics for AI
  • Plan and Conduct a Data Study
  • Build Machine Learning Models
  • Build Deep Learning Models
  • Build Reinforcement Learning Models
  • Develop Applied AI Solutions
  • Final Project
At the end you gain the Microsoft Professional Program Certificate in Artificial Intelligence.

Tuesday, 10 April 2018

Leverage data for building

The leverage data to build intelligent apps presentation gives an insightful overview of the Microsoft Data Platform and how to innovate with analytics and AI. 

Monday, 9 April 2018

Advice and guidance on becoming a speaker or volunteer

I watched this great session giving 'advice and guidance on becoming a speaker or volunteer' from SQLBits this year. 

I felt humbled when I listened to the SQLBits session recording as I am named as an absolute legend for attending all 16 SQLBits and helping for over 8 years. I had never spoken, never presented or been involved in the public facing side of the conference. It is such a great feeling helping the conference be successful, helping others enjoy what working with data brings and being a part of the sqlfamily. Thanks to SQLBits for enabling me to be a part of such an amazing event for all of these years.

The PhD Bookshelf

Following on from the creation of a literature map for my PhD, I started to formulate a plan of literature to read. These are some of the books on my bookshelf.

I also read many academic papers, stored in seven box files and in Mendeley

Mendeley is a free reference manager.  It enables you to manage your research, showcase your work, connect and collaborate with over six million researchers worldwide.

I found the Communications of the ACM journal and SIGMOD, the ACM Special Interest Group on Management of Data journal great reads.

Friday, 6 April 2018

Demystify complex relationships with SQL Server 2017 and graph

This great infographic shows some quick tips about SQL Server 2017 and graph databases. You can view this at: . The picture demonstrates nodes and edges and provides a clear example of the code changes between the Traditional SQL query and the Graph query.

Tuesday, 3 April 2018

Cosmos DB SQL query cheat sheet

The new Azure Cosmos DB: SQL Query Cheat Sheet helps you write queries for SQL API data by displaying common database queries, keywords, built-in functions, and operators in an easy to print PDF reference sheet. Reference information for the MongoDB API, Table API, and Gremlin/Graph API are also included.

Sunday, 1 April 2018

Literature Map

When you start any research project, you need to set the research in the context of the current literature. This will establish a framework for the importance of the study. This document was the starting place for organizing the literature of interest in my research.

Thesis Title: A Study in Best Practices and Procedures for the Management of Database Systems

Friday, 30 March 2018

Comparison of big data engines

A comparison of big data querying engines is below.

Apache HBase is the Hadoop database, a distributed, scalable, big data store. HBase is an open-source, non-relational, distributed database modelled after Google's Bigtable and is written in Java.

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Splunk turns machine data into answers with the leading platform to tackle the toughest IT, IoT and security challenges.

Tuesday, 27 March 2018

Machine Learning

Predictive analytics uses various statistical techniques such as, machine learning to analyze collected data for patterns or trends to forecast future events. Machine learning uses predictive models that learn from existing data to forecast future behaviors, outcomes, and trends.

Machine Learning libraries enable data scientists to use dozens of algorithms, each with their strengths and weaknesses. Download the machine learning algorithm cheat sheet to help identify how to choose a machine learning algorithm.

Saturday, 24 March 2018

SQL Relay Session Submission Open

SQLRelay session submission is open https://sessionize.com/sqlrelay2018/. Please submit a session to this travelling conference. Speakers can present at a single or multiple events, it's up to you.

We are running 5 events in the same week in 2018, Monday to Friday, covering 5 different cities within the UK.

    Mon 8 Oct - Newcastle
    Tue 9 Oct - Leeds
    Wed 10 Oct - Birmingham
    Thu 11 Oct - Reading
    Fri 12 Oct - Bristol

We cover a broad range of topics at different levels, from SQL Server DBA to Advanced Analytics in Azure, taking in all aspects of the Microsoft Data Platform. 

Friday, 16 March 2018

Big Data LDN Keynotes

The 2018 opening  keynotes of Big Data LDN have been announced.

Jay Kreps and Michael Stonebraker will be delivering the two opening keynotes.

Jay Kreps, opens the event on day 1, Tuesday 13th November. The Ex-Lead Architect for Data Infrastructure at LinkedIn, Co-creator of Apache Kafka and Co-founder & CEO of Confluent will take to the stage in the keynote theatre at 09:30.

Michael Stonebraker, the Turing Prize winner, IEEE John von Neumann Medal Holder, Co-founder of Tamr and Professor at MIT will address the keynote theatre at 09:30 on day 2, Wednesday 14th November.

Tuesday, 13 March 2018

Azure Cosmos DB Data Explorer

A new tool is available to use. Data Explorer provides a rich and unified experience for inserting, querying, and managing Azure Cosmos DB data within the Azure portal. The data explorer brings together 3 tools, Document Explorer, Query Explorer, and Script Explorer.

Monday, 12 March 2018

Apache Hive and HDInsight

Apache Hive is a data warehouse system for Hadoop. Hive enables data summarization, querying, and analysis of data. Hive queries are written in HiveQL, which is a query language similar to SQL.  Hive only maintains metadata information about your data stored on HDFS. Apache Spark has built-in functionality for working with Hive and HiveQL can be used to query data stored in HBase. Hive can handle large data sets. The data must have some structure. The query execution can be via Apache TezApache Spark, or MapReduce.

There are two types of tables within Hive.
  • Internal: Data is stored in the Hive data warehouse. The data warehouse is located at /hive/warehouse/ on the default storage for the cluster. This is for mainly temporary data.
  • External: data is stored outside the data warehouse. The data is also used outside of Hive or the data needs to stay in the underlying location

A Hive table consists of a schema stored in the metastore and the data is stored on HDFS. The supported file formats are Text File, SequenceFile, RCFile, Avro Files, ORC Files, Parquet, Custom INPUTFORMAT and OUTPUTFORMAT.

Apache Hive and Hive QL is on Azure HDInsight.  

Friday, 2 March 2018

The Magic of Data

It was the 10th anniversary of SQLBits this year, with the conference tag line being, the magic of data. The conference was held at Olympia, London between 21-24 February. I am proud to have attended every conference since its inception and that I have been a helper for the last 8 years. At the start of each conference it is always an interesting challenge to understand the new venue layout and what things we can do to make this the best SQLBits conference ever for the attendees and speakers.

There were the usual two days of expert instructor led training days. I looked after a PowerBI day with Adam Saxton and Python day with Dejan Sarka. The Python training included Machine Learning for in-database SQL Server. It was a very helpful to have an overview of data mining, machine learning and statistics for data scientists. Having an appreciation of the maths and algorithms is important in this new diverse data world.

Friday arrived with a mix of general sessions running in multiple tracks. I initially attended a session by Mark Wilcock on Text Analytics Of A Bank's Public Reports. The recording can be seen here. The text analysis in R that was demonstrated was very similar to the type of qualitative analysis I undertook in my PhD. The session an introduction to HDInsight was a great starting point for managing big data.

The date that every data person has in their head this year, is 25 May 2018. That is the GDPR deadline. The big question for Microsoft was understanding the telemetry data collected and its pipeline to ensure that they comply to GDPR. It was great to hear about all the work they have done to address GDPR for the data platform.

I attended more sessions in data science and SQL Graph. Graph databases are very useful in certain scenarios. The on-premises SQL Server 2017 graph engine is different to that of the graph API in Cosmos DB and has different syntax. There are many new features still to come for SQL Graph.

Other very interesting sessions were on performance tuning with the tiger tool box, R in PowerBI, the flexibility of SQL Server 2017, inside the classic machine learning algorithms with Professor Mark Whitehorn, and a session on don't cross the streams, a closer look at Stream Analytics by Johan Ludvig BrattÃ¥s. That concluded the breath of topics I  covered in this years conference. The conference covers an amazing breadth and depth of topics from database management, development, BI, data management and data science. My lightning talk experience from this year is shared here.

The rest of my time was spent mingling and sharing data experiences. It was an honor to have been able to be a part of the conference again.
There are revised patterns available for big data advanced analytical capabilities using the Azure Databricks platforms with Azure Machine Learning.

The new capabilities will enable advance analytics to be carried out using Azure Machine Learning. The different types of data requirements and consumption are integrated using CosmosDB.

Databricks in Azure

Databricks is a big data unified analytics platform that harness the power of AI. It is built on top of Spark, serverless and is highly elastic cloud based. Azure Databricks is in preview currently. This new Azure service aims to accelerate innovation by enabling data science with a high-performance analytics platform that’s optimized for Azure. It has native integration with other Azure services such as Power BI, SQL Data Warehouse, Cosmos DB as well as from enterprise-grade Azure security, including Active Directory integration, compliance, and enterprise-grade SLAs. More information can be found in these two links

A technical overview of Azure Databricks

Introduction to Azure Databricks

Databricks is a collaborative workspace.

Databricks have an ebook Simplifying Data Engineering to Accelerate Innovation which covers

  • The three primary keys to better data engineering
  • How to build and run faster and more reliable data pipelines
  • How to reduce operational complexity and total cost of infrastructure ownership
  • 5 examples of enterprises building reliable and highly performant data pipelines

Thursday, 1 March 2018

An Introduction to HDInsight

I attended a great session at SQLBits 2018 covering the basics of HDInsight by Edinson Medina. He introduced his talk explaining the term Big Data and that it is too complex for analysis in traditional databases. There are 2 types of processing batch processing, to shape the data for analysis and real time processing to capture streams of data for low latency querying.

Hadoop is described on the Hortonworks site as "Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware."

A Hadoop cluster looks like

The underlying structure uses map reduce. The Tez engine is a newer faster engine for map reduce. The model is explained in the paper on "Analyzing performance of Apache Tez and MapReduce with hadoop multinode cluster on Amazon cloud"

HDInsight is 100% Apache Hadoop, but powered by the cloud.

There are many tools within the Hadoop ecosystem.

A meta data service that projects tabular schemas over folders and enables the folders to be queried as tables using a SQL like query.

Pig (an ETL Tool)
Performs a series of transformations to data relations based on Pig Latin statements.

A workflow engine for actions in a Hadoop cluster supporting parallel work streams.

A database integration service which enables bi-directional data transfer between an Hadoop cluster and databases via JDBC.

A low latency NoSQL database built on Hadoop modeled on Googles's BigTable. HBase stores files on HDFS.

An event processor for data streams such as real time monitoring and for event aggregation and logging. It defines a streaming topology that consists of spouts and bolts.

A fast general purpose computation engine that supports in memory operations. It is a unified stack for interactive, streaming and predictive analysis.

A management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.

Zepplin notebooks 
A multi-purposed web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark.

Wednesday, 28 February 2018

SQLBits Lightning Talks

This year at SQLBits I was asked early Friday afternoon if I would put my name forward for the lightning talks. As a helper for 8 years I wanted to be helpful and agreed stating I had no laptop, access to material or pre-prepared slides. I quickly thought of the only topic I could reasonably cover with very little preparation, that being ‘how I became a doctor’.  As a first-time speaker it was quite daunting to stand up in front of lots of people. Without slides you have nothing to distract the attendees, at the session, from watching you.

I received a mail late afternoon confirming I had been chosen to present. I was originally down to room monitor another session at the same time. This resulted in my frantic search to find another helper who would swap room monitoring sessions. Luckily another helper kindly agreed.
Friday evening, knowing I had to get up early Saturday morning to help and monitor a few sessions, it left no time to prepare. I left the party early Friday evening to pack and to write some semblance of order for a 5 minute presentations.

I felt quite embarrassed to speak alongside others who had prepared slides, those who had presented before or those who had given lightning talks before. I was surprised with the first run though live I finished 10 seconds early, thus keeping to the strict 5 minute time limit.

After the end of the presentation the judges gave their constructive feedback to the room. I knew before they provided feedback that the lack of slides and laptop was against me obtaining great feedback. As another lightning speaker said, this was all a very daunting experience for a first-time speaker and wouldn’t encourage new speakers.

I would say standing on the stage takes courage and all the lightning talk presenters should be proud they did it. Now after returning home after a very exhilarating and exhausting week of helping, learning and networking, I have created the set of slides  for the presentation in case anyone wanted more information. 

Saturday, 24 February 2018

SQL Relay 2018

If you enjoyed SQLBits come to SQLRelay in October 8-12. We are visiting 5 venues around the UK. More details will follow in due course. We look forward to seeing you there,

Friday, 23 February 2018

Become an Azure Cosmos DB hero

There is an Azure Cosmos DB Technical Training Series available to sign up for. It has 7 parts covering a range of topics including a technical deep dive.

  • Technical overview of Azure Cosmos DB
  • Build real-time personalized experiences with AI and serverless technology
  • Using Graph API and Table API with Azure Cosmos DB 
  • Build or migrate your Mongo DB app to Azure Cosmos DB
  • Understanding Operations of Cosmos DB
  • Build Serverless Apps with Azure Cosmos DB and Azure Functions
  • Apply real-time analytics with Azure Cosmos DB and Spark

The training series has interactive Q&A throughout.