Welcome

Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous; the database universe is dichotomous (structured and unstructured), expanding and complex. Find my database research at SQLToolkit.co.uk. Microsoft Data Platform MVP.

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein



Tuesday, 18 December 2018

Microsoft Power BI Roadmap

Microsoft have created a new Power BI roadmap site. The public product roadmap provides a glimpse into what will be made available in the next wave of product updates.

The roadmap priorities are:


  • Unified platform for both self-service and enterprise BI; to enable organizations to create a unified, scalable, global, governed, and secured BI platform.
  • Agile, self-service data prep with big data; to facilitate collaboration and reusability among business analysts, data engineers, and data scientists.
  • Pervasive application of AI; to make it easier for business users to determine what truly matters, by automatically uncovering hidden insights and identifying key drivers.

Thursday, 6 December 2018

Microsoft Connect() 2018

Microsoft shared yet more innovations at Connect() 2018, which considered ubiquitous computing and the fact that technology is transforming business.



Watch the keynote and many other sessions.

They announced the general availability of Azure Machine Learning service, which enables developers and data scientists to efficiently build, train and deploy machine learning models.

Azure Kubernetes Service (AKS) virtual node public preview was announced for serverless Kubernetes. This new feature enables you to elastically provision additional compute capacity in seconds.

Friday, 16 November 2018

Big Data LDN Day 2

The Fourth Industrial Revolution Report – Download for Free

Keynote

The day 2 keynote, entitled Big Data, Disruption and the 800 Pound Gorilla in the Corner, was given by Michael Stonebraker: Turing Award winner, IEEE John von Neumann Medal holder, Professor at MIT and co-founder of Tamr.

A few vignettes were mentioned: Hamilton, DeWitt and Amadeus. Hadoop (meaning MapReduce) started to be adopted in 2010, yet Google stopped using MapReduce in 2011. Hadoop now effectively means the HDFS file system. Cloudera's big problem is that no one wants MapReduce; MapReduce is no longer used for anything.

The data warehouse is yesterday's problem. BI is simple SQL. Data science has complex problems and is a different skill set: it is based on deep learning, machine learning and linear algebra, and has nothing to do with SQL. Deep learning is all the rage, but you need vast amounts of training data. It is not possible to explain why the black box gives certain recommendations, so it is not good when data provenance is required.

Big velocity is a growing problem over time. Pattern matching and CEP (Complex Event Processing) engines like Storm are not competitive. Don't run Oracle for this; instead run MongoDB, Cassandra or Redis. NoSQL means no standards and no ACID, yet ACID is a good idea. NoSQL means you always give up something, as per the CAP theorem. Declarative languages are a great idea.

Data discovery is a big problem: you spend 90% of your time finding and cleaning data, and then 10% finding and cleaning the errors, which leaves very little time for anything else. It is fundamentally a data integration challenge.

Graphs

Jim Webber from Neo4j gave an insightful talk about how useful graphs are for solving problems and predicting outcomes. There were some great examples of how to use graphs. He talked about triadic closure and strong and weak ties, and mentioned a couple of papers to read:

Effects of Organizational Support on Organizational Commitment, Fakhraei M, Imami R, Manuchehri S (2015)
Semi-Supervised Classification with Graph Convolutional Networks, Thomas N. Kipf, Max Welling (2017)

and a free ebook

Free book: Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem

It is important to have semantic domain knowledge for inference and understanding in graphs, as graphs depend on context. With graph convolutional networks, the graph may well become the data structure for AI.
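
To make the triadic closure idea concrete, here is a minimal sketch using networkx (an illustrative library choice on my part, not a tool named in the talk): it scores unconnected pairs by counting common neighbours, which is the intuition behind an open triad being likely to close.

    import networkx as nx

    # Toy social graph: an open triad (alice-bob, alice-carol) plus an extra tie.
    G = nx.Graph()
    G.add_edges_from([("alice", "bob"), ("alice", "carol"), ("bob", "dave")])

    # Triadic closure intuition: the more common neighbours two unconnected
    # people share, the more likely that edge is to form.
    candidates = [(u, v) for u in G for v in G
                  if u < v and not G.has_edge(u, v)]
    scores = sorted(
        ((u, v, len(list(nx.common_neighbors(G, u, v)))) for u, v in candidates),
        key=lambda t: t[2], reverse=True)

    for u, v, c in scores:
        print(f"{u} - {v}: {c} common neighbour(s)")
    # bob-carol (via alice) and alice-dave (via bob) are the open triads
    # most likely to close.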

The Joy of Data
The closing session of the event was delivered by Dr Hannah Fry, Associate Professor in the Mathematics of Cities at UCL. This was an amazing session exploring what visualization and insights can be achieved from understanding the data.

She started the talk with the strange Wikipedia phenomenon that all routes lead to philosophy: by repeatedly clicking the first proper link on a page, you will eventually end up on the Philosophy page.

There are two parallel universes: the real one, where people click the links, and the mathematical one. Data is the bridge between them.

She showed how data could be used to investigate why the bicycle hire scheme in London was seeing all the bikes end up in the wrong place, so that vans have to go around moving bikes back to the right places during the day. This turned out to be the result of people liking to cycle down the hills but not up them.

Another example showed that Islington station was a bottleneck that caused cascading problems because of the lack of onward transport routes from there. There were many other interesting examples, including how network science can be used to track down how gossip spreads.

Big Data LDN had some amazing sessions and insightful content. Big Data LDN will be back next year, 13-14 November 2019.

Wednesday, 14 November 2018

Big Data LDN day 1

I attended Big Data LDN 13-14 November 2018.

The event was busy with vendor product sessions and technical sessions. All the sessions were 30 minutes, so there was a quick turnaround after each one. The sessions ran throughout the day with no break for lunch.

One of the sessions discussed the fourth industrial revolution and the fact that it is causing a cultural shift. The areas of importance that were mentioned were:
  • Skills
  • Digital Infrastructure
  • Search and resilience
  • Ethics and digital regulation
Two institutions were mentioned as leading the way: the Alan Turing Institute, the national institute for data science and artificial intelligence, and the Ada Lovelace Institute, an independent research and deliberative body with a mission to ensure data and AI work for people and society.

Text Analytics
I attended an interesting session on text analysis. Text analytics processes unstructured text to find patterns and relevant information to transform business. It is far harder than image analysis due to:

  • The quantity of data
  • Polymorphy of language – where words have many forms
  • Polysemy of language – where words have many meanings
  • Misspellings
Accuracy in sentiment analysis is hard. Sentiment analysis determines the degree of positive, negative or neutral expression, and some tools are biased. Topic modelling using latent Dirichlet allocation (LDA) was also discussed. Topic modelling is a form of unsupervised learning that seeks to categorize documents by topic; a small sketch follows below.
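
As a rough illustration of the idea (a minimal sketch using scikit-learn, which was not necessarily the tooling discussed in the session; the documents are made up), LDA can be fitted over a handful of documents like this:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "sql server database backup recovery",
        "database index query performance tuning",
        "deep learning neural network training data",
        "machine learning model training features",
    ]

    # Bag-of-words counts, then fit a 2-topic LDA model.
    vec = CountVectorizer()
    counts = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    # Show the top words per discovered topic.
    terms = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [terms[j] for j in topic.argsort()[-4:][::-1]]
        print(f"Topic {i}: {', '.join(top)}")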

Governance

The changing face of governance has created a resurgence and rebirth of data governance. It is important that data is classified, reusable and trustworthy. A McKinsey survey about data integrity and trust in data was mentioned, which talked about defensive data strategies (a single source of truth) and offensive ones (multiple versions of the truth).

The Great Data Debate

At the end of the first day, Big Data LDN assembled a unique panel of some of the world's leaders in data management for The Great Data Debate.

The panelists included:
  • Dr Michael Stonebraker, Turing Award winner, inventor of Ingres, Postgres, Illustra, Vertica and StreamBase, and now CTO of Tamr
  • Dan Wolfson, Distinguished Engineer and Director of Data & Analytics, IBM Watson Media & Weather
  • Raghu Ramakrishnan, Global CTO for Data at Microsoft
  • Doug Cutting, co-creator of Hadoop and Chief Architect of Cloudera
  • Phillip Radley, Chief Data Architect at BT
There is a growing challenge of complexity and agility in architecture. When data scientists start looking at the data, 80% of their time is spent cleaning data and a further 10% cleaning errors out of the data integration. Data scientists end up being data unifiers rather than data scientists. There are two things to consider:

  • How to do data unification with lots of tools
  • Everyone will move to the cloud at some point due to economic pressures.

Data lineage is important and privacy needs to be by design. It is possible to have self-service for easy analytics, but not for more complicated things. A question to also consider is why data is not cleaned at source before migrating it. Democratizing data will require data that is always on and always clean.

There will be no one-size-fits-all. Instead, bundled packages will appear, such as SQL Server 2019 bundling external tools like Spark and HDFS. Going forward there is likely to be:

  • A database management regime within a large database management ecosystem.
  • A need for best-of-breed tools and a uniform lens to view all lineage, all data and all tasks.

The definition of what a database is has evolved over time. There are a few things to consider going forward:

  • Diversity of engines for storage and processing.
  • Keeping track of metadata after cleaning; data enrichment and provenance are important.
  • Keeping training data attached to the machine learning (ML) model.
  • The need for enterprise catalog management.
  • ML brings competitive advantage.
  • Separating data from compute.

It is a data unification problem in a data catalog era.

  

Friday, 9 November 2018

SQLBits 2019: The Great Data Heist

SQLBits 2019 registration is open. Next year it runs from 27 February to 2 March 2019 at Manchester Central. There are many amazing reasons to attend this data conference. I hope to see you there.

Thursday, 8 November 2018

PASS Summit 2018 Day 2 Keynote

The day 2 keynote was given by Microsoft Data Platform CTO Raghu Ramakrishnan on the internals of the next evolution in engine architecture, which will form the foundation for the next 25 years of the Microsoft data platform.

It covered Azure SQL DB Hyperscale. The changing landscape of data brings many challenges: how to leverage unbounded storage and elastic compute, as well as the perennial problems that size-of-data operations are slow and recovery times are long and painful, all while masking network latencies.

There isn’t one database system that can do it all well. Users need to move data across systems which is slow and complicates governance.

The most challenging requirements for state management are ACID properties, transactional updates, a high velocity of data changes and the lowest response times. These issues led to SQL Hyperscale.

There are various technical themes: full separation of compute and storage, the complexity of the quorum (log), a uniquely skewed access pattern, and the network simply extending the memory hierarchy. He also shared a newsflash about Multi-Version Timestamp concurrency control rules, alongside two-phase locking, MVCC (Hekaton) and lock-free data structures.

Persistent Version Store (PVS)
This technical, in-depth talk was packed full of content about SQL Hyperscale, and I would recommend watching the recording to learn about this new product and era of database delivery.

Wednesday, 7 November 2018

PASS Summit 2018 Keynote Day 1

The first keynote of PASS Summit was delivered by Rohan Kumar, entitled SQL Server and Azure Data Services: Harness the Ultimate Hybrid Platform for Data and AI.





Customer priorities for a modernized data estate are: modernizing on-premises, modernizing to the cloud, building cloud-native apps and unlocking insights.






The announcements follow:

SQL Server 2019
The SQL Server 2019 public preview is a great way to celebrate the 25th anniversary of SQL Server.

Big data clusters are introduced, which combine Apache Spark and Hadoop with SQL Server in a single data platform. This combines the power of Spark with SQL Server over the relational and non-relational data sitting in SQL Server, HDFS and other systems like Oracle, Teradata and Cosmos DB.

There are new capabilities around performance, availability and security for mission critical environments along with capability to leverage hardware innovations like persistent memory and enclaves.

Hadoop, Apache Spark, Kubernetes and Java are native capabilities in the database engine.

Accelerated database recovery (ADR) was demonstrated and is incredible. It is in public preview. The benefits of ADR are (a small sketch of enabling it follows the list below):
  • Fast and consistent Database Recovery
  • Instantaneous Transaction rollback
  • Aggressive Log Truncation
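
As a rough sketch (assuming a SQL Server 2019 preview instance reachable from Python via pyodbc; the server, database and credentials below are placeholders), ADR can be switched on per database:

    import pyodbc

    # Placeholder connection details for a SQL Server 2019 preview instance.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=master;UID=myuser;PWD=mypassword",
        autocommit=True)

    # Enable accelerated database recovery for one database.
    conn.execute("ALTER DATABASE [MyDatabase] SET ACCELERATED_DATABASE_RECOVERY = ON;")

    # Check the setting.
    row = conn.execute(
        "SELECT name, is_accelerated_database_recovery_on FROM sys.databases "
        "WHERE name = 'MyDatabase';").fetchone()
    print(row.name, row.is_accelerated_database_recovery_on)
    conn.close()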

Azure HDInsight 4.0

HDInsight 4.0 is now available in public preview.

There are several Apache Hadoop 3.0 innovations. Hive LLAP (Low Latency Analytical Processing known as Interactive Query in HDInsight) delivers ultra-fast SQL queries. The Performance metrics provide useful insight.

There is integration with Power BI DirectQuery, Apache Zeppelin, and other tools. To learn more, watch HDInsight Interactive Query with Power BI.

Data quality and GDPR compliance enabled by Apache Hive transactions
Improved ACID capabilities handle data quality (update/delete) issues at row level. This means that GDPR compliance requirements can now be met, with the ability to erase data at row level. Spark can read and write to Hive ACID tables via the Hive Warehouse Connector; a small sketch follows below.
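
As an illustrative sketch only (the Python package name, session builder and connector class of the Hive Warehouse Connector vary by version and distribution, and the database and table names here are placeholders), reading and writing a Hive ACID table from PySpark looks roughly like this:

    from pyspark.sql import SparkSession
    # The HWC Python module and API below are assumptions based on the
    # connector shipped with HDInsight/HDP; they differ between releases.
    from pyspark_llap import HiveWarehouseSession

    spark = SparkSession.builder.appName("hive-acid-sketch").getOrCreate()
    hive = HiveWarehouseSession.session(spark).build()

    # Read from a managed (ACID) Hive table through LLAP; names are placeholders.
    df = hive.executeQuery("SELECT id, status FROM sales.orders WHERE status = 'open'")

    # Write the result back to another ACID table via the connector class
    # (fully qualified name taken from HWC documentation; may differ by release).
    (df.write
       .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
       .option("table", "sales.open_orders_snapshot")
       .mode("append")
       .save())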

Apache Hive LLAP + Druid = single tool for multiple SQL use cases

Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is optimized for sub-second queries to slice-and-dice, drill down, search, filter, and aggregate event streams. Druid is commonly used to power interactive applications where sub-second performance with thousands of concurrent users is expected.

Hive Spark Integration
Apache Spark gets updatable tables and ACID transactions with Hive Warehouse Connector


Apache HBase and Apache Phoenix
Apache HBase 2.0 and Apache Phoenix 5.0 get new performance and stability features, and all of the above have enterprise-grade security.

Azure
Azure Event Hubs for Kafka is generally available.
Azure Data Explorer is in public preview.

Azure Databricks Delta is in public preview
  • Connect data scientists and engineers
  • Prepare and clean data at massive scale
  • Build/train models with pre-configured ML

Azure Cosmos DB multi-master replication was demoed with a drawing app, Azure Cosmos DB PxDraw.
Azure SQL DB Managed Instances will be at General Availability (GA) on Dec 1st. This provides Availability Groups managed by Microsoft.

Power BI

The new Dataflows feature is an enabler for self-service data prep in Power BI.

Power BI Desktop November Update
  • Follow-up questions for Q&A explorer – it is possible to ask follow-up questions inside the Q&A explorer pop-up, which take into account the previous questions you asked.
  • Copy and paste between PBIX files
  • New modelling view makes it easier to work with large models.
  • Expand and collapse matrix row headers


Friday, 2 November 2018

Future Decoded Day 2

Live stream updates had this  great picture summary of the keynote.

The Day 2 Keynote at Future Decoded by Satya Nadella was inspiring. He talked around the simple, self-evident formula Tech Intensity = (Tech Adoption) ^ Tech Capability, and the Intelligent Cloud and Intelligent Edge in an era of digital transformation.

In any society you need three actors for growth. You need government, academia and entrepreneurs & the private sector.

The core areas to consider and build on in the future are:

Privacy
We need to protect privacy as a basic fundamental human right. Trust and GDPR are important to achieve this.

Security
We need to act with collective responsibility across the tech sector to help keep the world safe. Cybersecurity threat detection and removal are core capabilities to embed in any platform. Microsoft have been leading the Tech Accord.

Ethical AI 
We need to ask ourselves not only what computers can do, but what computers should do

Thursday, 1 November 2018

Future Decoded The AI Future

Future Decoded, at London ExCeL on 31 October - 1 November, is an exciting place to be. The event is packed full of AI innovations. AI is groundbreaking and will change the face of the marketplace, and it needs substantial learning for businesses and people to maximise its capability. Three takeaways from today:

Maximising the AI Opportunity



Artificial intelligence is changing the UK so fast that nearly half of today's business models won't exist by 2023, a new Microsoft report has revealed. The article can be read here:
UK companies at risk of falling behind due to a lack of AI strategy, Microsoft research reveals
The report, Maximising the AI Opportunity, shares insights on the potential of AI - including skills and learning - based on a survey and interviews with thousands of UK leaders.

Microsoft AI Academy
A new addition to Microsoft's commitment to advancing digital skills in the UK, the Microsoft AI Academy will run face-to-face and online training sessions for business and public sector leaders, IT professionals, developers and start-ups.

aka.ms/learn

Microsoft Research and Cambridge University

Some amazing news from Microsoft is that it is partnering with the University of Cambridge to boost the number of AI researchers in the UK. The Microsoft Research-Cambridge University Machine Learning Initiative will provide support for PhD students at the world-leading university and offer a postdoctoral research position at the Microsoft Research Lab, Cambridge. The aim is to realise artificial intelligence's potential in enhancing the human experience and to nurture the next generation of researchers and talent in the field.

Read More:
Microsoft Research and Cambridge University strengthen their commitment to AI innovation and the field’s future leaders



Wednesday, 31 October 2018

Always Encrypted with Secure Enclaves

In the SQL Server 2019 preview, Always Encrypted uses an enclave technology called Virtualization Based Security (VBS) memory enclaves. A VBS enclave is an isolated region of memory within the address space of a user-mode process.

The capabilities this brings are

  • In-place encryption. Encrypt a column, rotate a column encryption key, or change the encryption type of a column, without moving your data out of the database.
  • Rich computations. The engine can delegate some operations on encrypted database columns to the enclave, which can decrypt the sensitive data and execute the requested operations in a query on the plaintext values.

Always Encrypted with secure enclaves allows computations on plain text data inside a secure enclave on the server side. Microsoft define a secure enclave as a protected region of memory within the SQL Server process. It acts as a trusted execution environment for processing sensitive data inside the SQL Server engine. A secure enclave is a black box to SQL Server and other processes on the server. It is not possible to view any data or code inside the enclave from the outside, even with a debugger.

You can now try and evaluate Always Encrypted with secure enclaves in the preview of SQL Server 2019.
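
As a rough sketch of what client code looks like (assuming Python with pyodbc and the ODBC Driver 17 for SQL Server; server, table and column names are placeholders, and enclave attestation has to be configured separately in a driver-version-specific way), the key point is that the driver, not your code, encrypts parameters transparently when column encryption is enabled on the connection:

    import pyodbc

    # ColumnEncryption=Enabled tells the driver to transparently encrypt
    # parameters and decrypt results for Always Encrypted columns.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=Clinic;UID=myuser;PWD=mypassword;"
        "ColumnEncryption=Enabled")

    cursor = conn.cursor()
    # The SSN column is assumed to be protected with Always Encrypted.
    # The literal must be passed as a parameter so the driver can encrypt it;
    # with a secure enclave the server can also evaluate richer predicates.
    cursor.execute("SELECT PatientId, Name FROM dbo.Patients WHERE SSN = ?",
                   "795-73-9838")
    for row in cursor.fetchall():
        print(row.PatientId, row.Name)
    conn.close()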








This shows what an admin would see when browsing the enclave memory using a debugger (note the question marks, as opposed to the actual memory content).

Reading

Always Encrypted with Secure Enclaves – Try It Now in SQL Server 2019 Preview! 
Always Encrypted with Secure Enclaves

Monday, 29 October 2018

Open Data Initiative

At Microsoft Ignite a groundbreaking partnership was announced with a new vision for renewable data and intelligent applications. It is a jointly developed vision by Adobe, Microsoft, and SAP to deliver unparalleled business insight from your behavioral, transactional, financial, and operational data. It provides a single view of data built on one data model, artificial intelligence driven insights and an open extensible platform.

Reading
Announcing the Open Data Initiative

Friday, 26 October 2018

Machine Learning on Azure

At Microsoft Ignite there were many data announcements. Azure AI is another such area that covers the next wave of innovation aimed at transforming business. There are 3 solution areas. 

Predictive models to optimise business process

These include a set of pretrained models for Azure Cognitive Services, and ONNX (Open Neural Network Exchange), which enables model interoperability across frameworks. Machine learning is available with Azure Databricks, Azure Machine Learning and Machine Learning VMs; a small sketch of the ONNX idea follows below.
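
To illustrate the interoperability point (a minimal sketch using scikit-learn, skl2onnx and onnxruntime, which are my choice of example libraries rather than tools named in the announcement), a model trained in one framework can be exported to ONNX and scored by a different runtime:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType
    import onnxruntime as ort

    # Train a small model in scikit-learn.
    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=500).fit(X, y)

    # Export it to the framework-neutral ONNX format.
    onnx_model = convert_sklearn(
        model, initial_types=[("input", FloatTensorType([None, 4]))])

    # Score it with ONNX Runtime, independently of scikit-learn.
    session = ort.InferenceSession(onnx_model.SerializeToString())
    preds = session.run(None, {"input": X[:3].astype(np.float32)})[0]
    print(preds)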

AI powered apps to integrate vision, speech and language

There are now services specifically designed to help build AI powered apps & agents.

Knowledge mining to uncover insight from documents
There is valuable information hidden in documents, forms, pdfs and images. Azure Cognitive Search (in preview) adds Cognitive Services on top of Azure Search. 

  



Reading
Azure AI – Making AI real for business

Wednesday, 24 October 2018

Azure SQL Database Hyperscale in preview

Azure SQL Database Hyperscale is a new, highly scalable service tier. It adapts on demand to different workloads and auto-scales up to 100 TB per database, which eliminates the need to pre-provision storage resources. The new tier provides the ability to scale compute and storage resources independently, giving the flexibility to optimize performance for each workload. Azure SQL Database Hyperscale will initially be available for single database deployments. It is useful for apps not to be limited by storage size; a small sketch of creating or moving to the tier follows below.
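
As a rough sketch (assuming pyodbc with the ODBC Driver 17 for SQL Server, connected to the logical server's master database; the server, database and service objective names are placeholders and the objectives available may vary), a database can be created in or moved to the Hyperscale tier with T-SQL:

    import pyodbc

    # Connect to the logical server's master database (placeholder credentials).
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=master;"
        "UID=myuser;PWD=mypassword",
        autocommit=True)

    # Create a new Hyperscale database...
    conn.execute(
        "CREATE DATABASE [MyHyperscaleDb] "
        "(EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_2');")

    # ...or move an existing single database to the Hyperscale tier.
    conn.execute(
        "ALTER DATABASE [MyExistingDb] "
        "MODIFY (EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_2');")
    conn.close()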
 

Further Reading
Announcing Azure SQL Database Hyperscale public preview

Saturday, 20 October 2018

Microsoft Learn

There is a new approach to learning, with hands-on training at Microsoft Learn. At the Ignite keynote, Scott Guthrie announced the availability of Microsoft Learn. This new learning site says it will help you achieve your goals faster. Microsoft have launched more than 80 hours of learning for Azure, Dynamics 365, Power BI, PowerApps, and Microsoft Flow. It is module based. The new learning platform should help up-level your skills, prepare for new role-based certification exams, and explore additional training offerings such as instructor-led training and Pluralsight. There are two tracks so far: Learn Azure and Learn Business Applications. To save your learning progress you need to first log on, and your username, display name, achievements and activities will be publicly visible at https://techprofile.microsoft.com/en-gb/

There are other learning options: Microsoft Virtual Academy (MVA), which is video based, and edX, which also provides courses where you can learn about Microsoft technologies and follow different career paths.

Monday, 15 October 2018

The Data Relay Journey

This year I was Head of Marketing and Social Media for Data Relay, the conference previously known as SQL Relay. It is a privilege to be able to help put on an event which, over 5 consecutive days, travels to 5 different UK cities across the breadth of the country. This year we invested a lot of time trying to improve the conference. We re-branded to Data Relay to be more in keeping with the breadth of the Microsoft Data Platform, and we introduced a Code of Conduct to help encourage diversity at the conference, among several other things. Our aim is to keep improving the conference.

I was the Bristol Venue Event Owner and this time delivered a session about the end-to-end process of data management, things to consider when improving data quality, and data science in industry, based on findings from my PhD research. The session also covered data collection areas that are often led by marketing teams. I shared details of my research findings about the complexity of managing database systems, the use of the Microsoft data platform for research, and possible future AI developments to help people manage database systems with greater ease.

Thursday, 11 October 2018

Ignite 2018 key announcements

Here are the Microsoft Ignite 2018 news and highlights from the event.


Research skills for industry experts


It is great to see the name of the SQL Relay conference change to Data Relay. The conference encompasses the full breadth of the Microsoft data platform and provides free Microsoft Data, AI & Analytics training conferences on your doorstep.








I am privileged to be speaking on Friday at Data Relay in Bristol to share my experience of providing high quality data analytics for my research using the Microsoft Data Platform. 



Monday, 1 October 2018

SQL Relay 2018

I am speaking at SQL Relay 2018 in Bristol, Friday 12 October. My session is: Research skills for industry experts.

Abstract
The need to present analytic outcomes is ever increasing. Analytics are only ever as good as the robustness of the data collection and analysis. This session will cover the raft of research skills that can be applied in industry to improve the quality of your investigative work.


Sunday, 30 September 2018

MVP Wall

At Microsoft Ignite 2018 Microsoft devoted an entire wall to list all the names of the MVPs. I felt very humbled to have my name on the MVP wall with so many amazing people. It is such a privilege to be a part of the Microsoft Data Community. #datafamily #MVPbuzz




And there is my name.


Tuesday, 25 September 2018

SQLBits 2019

SQLBits 2019 has been announced. It is in the heart of Manchester. The last time it was in Manchester was in 2009. I am already excited about this next event. 




Monday, 24 September 2018

Azure SQL Database Managed Instance GA

At Microsoft Ignite it was announced that Azure SQL Database Managed Instance will be generally available on October 1, 2018.


Azure SQL Database Managed Instance is a deployment model of Azure SQL Database. This service enables customers to migrate existing databases to a fully managed PaaS cloud environment. It is possible to use the Azure Database Migration Service (DMS) to lift and shift an on-premises SQL Server. This can be a useful way to run secure databases while reducing the management overhead, including automatic patching and version updates, automated backups and high availability.

Reading
Azure SQL Database Managed Instance, General Purpose tier general availability
Azure Database Migration Service and tool updates – Ignite 2018


SQL Server 2019, Big Data and AI


At Microsoft Ignite SQL Server 2019 was launched: an amazing product for the future, combining SQL Server with big data and analytics. It is great to see the combining of multiple tools in one place, a one-stop shop for large and small data, structured and unstructured, from multiple sources.

There are 3 major components to SQL Server 2019.

The creation of a data virtualization layer that handles the complexity of all data sources and formats, enabling the integration of structured and unstructured data without moving the data. A small sketch follows below.
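
As a rough sketch of the data virtualization idea (illustrative T-SQL issued via pyodbc; the data source, credential, table and Oracle object names are placeholders, a database scoped credential is assumed to exist already, and the exact PolyBase options depend on the SQL Server 2019 build), an external table can expose a remote Oracle table so it can be queried, and joined to local tables, without moving the data:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=Sales;UID=myuser;PWD=mypassword",
        autocommit=True)

    # Point PolyBase at the remote Oracle instance (placeholder names throughout;
    # OracleCredential is an existing database scoped credential).
    conn.execute("""
        CREATE EXTERNAL DATA SOURCE OracleSales
        WITH (LOCATION = 'oracle://oraclehost:1521',
              CREDENTIAL = OracleCredential);
    """)

    # Expose a remote table as if it were local; no data is copied.
    conn.execute("""
        CREATE EXTERNAL TABLE dbo.RemoteOrders
        (OrderId INT, CustomerId INT, Amount DECIMAL(10, 2))
        WITH (LOCATION = 'SALESDB.ORDERS', DATA_SOURCE = OracleSales);
    """)

    # Query and join it like any other table.
    for row in conn.execute(
            "SELECT TOP 5 OrderId, Amount FROM dbo.RemoteOrders ORDER BY Amount DESC;"):
        print(row.OrderId, row.Amount)
    conn.close()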

The streamlining of data management, with SQL Server 2019 big data clusters deployed in Kubernetes, integrating HDFS and Spark. The architecture is explained in more depth here.




The creation of a complete AI platform that can use Spark to analyse both structured and unstructured data anywhere, using SQL Server Machine Learning Services and Spark ML.




In summary, SQL Server big data clusters allow you to deploy scalable clusters of SQL Server, Spark, and HDFS Docker containers running on Kubernetes.

Read More