Welcome

Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous, the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk . Microsoft Data Platform MVP

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein



Friday, 16 November 2018

Big Data LDN Day 2





















The Fourth Industrial Revolution Report – Download for Free

Keynote

The day 2 keynote was given by Michael Stonebraker, Turing Prize winner, IEEE John von Neumann Medal Holder, Professor at MIT, Co-founder of Tamr entitled Big Data, Disruption and the 800 pound gorilla in the corner.

A few vignettes were mentioned Hamiltons, Dewitts and Amadeus.  Hadoop (meaning map reduce) was started to be used in 2010 and Google stopped using it in 2011. Hadoop now means a HDFS file system.  Cloudera's big problem is that no one wants map reduce. Map reduce is not used for anything.

The data warehouse is yesterdays problem. BI is simple SQL. Data Science has complex problems and it is a different skill set. It is based on deep learning, machine learning and linear algebra, nothing to do with SQL. Deep learning is all the rage but you need vast amounts of training data.  It is not possible to explain why the black box gives certain recommendations so it is not good when data providence is required. 

Big velocity is a big problem over time. Pattern matching and CEP (Complex Event Processing)
like Storm is not competitive. Don’t run Oracle but instead run Mongodb, Cassandra or Redis. NoSQL means no standards and no ACID. ACID is a good idea. NoSQL means you always give up something as per CAP theorem. Declarative languages are a great idea.

Data discovery is a big problem. You spend 90% of your time finding and cleaning data. Then 10% finding and cleaning the errors. Very little time is spent doing data integration. It is a data integration challenge.

Graphs

Jim Webber from Neo4J gave an insightful talk about how useful graphs are to solve problems and predict outcomes. There was some great examples of how to use graphs. He talked about triad closure and strong and week ties. Also mentioning a couple of papers to read

Effects of organizational support on organizationalcommitment  Fakhraei M, Imami R, Manuchehri S (2015)
Semi-Supervised Classification with Graph ConvolutionalNetworks Thomas N. Kipf , Max Welling  (2017) 

and a free ebook

Free Book: Graph Databases By Ian Robinson, Jim Webber, and Emil Eifrém

It is important to have semantic domain knowledge for inference and understanding in graphs as graphs depend on the context. Graphml convolutional network graph will be the data structure for AI.

The Joy of Data
The closing session of the event was delivered by Dr Hannah Fry, Associate Professor in the mathematics of Cities – UCL. This was an amazing session exploring what visualization and insights can be achieved from understanding the data.

She started the talk with the strange Wikipedia phenomenon that all routes lead to philosophy. So by clicking the first proper link on a page you will eventually end up on the philosophy page. 

There are 2 parallel universes where people click the link and the mathematical universe.  Data is the bridge.

She showed how data could be used to investigate why the bicycle transport scheme in London was seeing all the bikes ending up in the wrong place. Vans have to go round moving bikes into the right places during the day. This was the result of people liking to cycle down the hills but not up.

Another example showed that Islington station was a bottleneck which caused a cascading problem because it has a lack of transport routes from there. There were many other interesting examples and how gossip can pay by using network science to track the problem down.

Big Data LDN had some amazing sessions and insightful content. Big Data LDN will be back next year 13-14 Nov 2019.

Wednesday, 14 November 2018

Big Data LDN day 1




















I attended Big Data LDN 13-14 November 2018.

The event was busy with vendor product session and technical sessions.  All the sessions were 30 mins so there was a quick turn around following each session. The sessions ran throughout the day with no break for lunch. 

One of the sessions discussed the fourth industrial revolution and the fact that it is causing a cultural shift. The areas of importance that were mentioned were
  • Skills
  • Digital Infrastructure
  • Search and resilience
  • Ethics and digital regulation
Two institutions were mentioned as leading the way. The Alan Turing Institute as the national institute for data science and artificial intelligence and the Ada Lovelace Institute, an independent research and deliberative body with a mission to ensure data and AI work for people and society.

Text Analytics
I attended an interesting session on text analysis. Text analytics process unstructured text to find patterns and relevant information to transform business.  It is far harder than image analysis due to

  • Obstalele - the quantity of data
  • Polymorphy of language
  • Polysemy of language – where words have many forms and meaning
  • Misspellings
Accuracy of sentiment analysis is hard. Sentiment analysis determines the degree of positive, negative or neutral expression. Some tools are bias. Topic modelling was a method discussed for latent dirichlet allocation (LDA). Topic modeling is a form of unsupervised learning that seeks to categorize documents by topic.

Governance

The changing face of governance has created a resurgence and rebirth of data governance. Data is important to classify, reuse and be trustworthy. A McKinsey survey about data integrity and trust of data was mentioned that the talked about defensive (single source of trust) and offensive (multi versions of the truth).

The Great Data Debate

The end of the first day Big Data LDN assembled a unique panel of some of the world’s leaders in data management for The Great Data Debate.

The panelists included
  • Dr Michael Stonebraker Turing Award winner the inventor of Ingres, Postgres, Illustra, Vertica, Streambase and now CTO of Tamr.
  • Dan Wolfson, Distinguished Engineer and Director of Data & Analytics, IBM Watson Media & Weather,
  • Raghu Ramakrishnan, Global CTO for Data at Microsoft
  • Doug Cutting co-creator of Hadoop
  • Chief Architect of Cloudera
  • Phillip Radley Chief Data Architect at BT
There is a growing challenge of complexity and agility in architecture. When data scientists start looking at the data, 80% of time is spent data cleaning and then a further 10% of the time cleaning errors from the data integration. Data scientists are data unifiers not data scientists.  There are two things to consider

  • How to do data unification with lots of tools
  • Everyone will move to the cloud at some point due to economic pressures.

Data lineage is important and privacy needs to be by design. It is possible to have self service for easy analytics but not for more complicated things. A question to also consider is why not clean data at source before migrating it. Democratizing data will require data that is always on and always clean.

There will be no one size fits all. Instead packages will come, such as SQL Server 2019 bundling tools outside such as Spark and HDFS. Going forward there is likely to be 

  • A database management regime in a large database management ecosystem. 
  • A need a best of breed of tools and a uniform lens to view all lineage, all data and all tasks.

The definition of what is a database is, has evolved over time.   There are a few things to consider going forward

  • Diversity of engines for storage and processing.  
  • Keep track of data meta systems after cleaning, data enrichment and provenance is important. 
  • Keep training data attached to the machine learning (ML)  model. 
  • Need enterprise catalog management. 
  • ML brings competitive advantage
  • Separate data from compute

It is a data unification problem in a data catalog era.

  

Friday, 9 November 2018

SQLBits 2019: The Great Data Heist

SQLBits 2019 registration is open. Next year it runs between 27 February 2019 and 02 March 2019 at Manchester Central. There are many amazing reasons to attend this data conference. Hope to see you there.

Thursday, 8 November 2018

PASS Summit 2018 Day 2 Keynote

The day 2 keynote today was given by Microsoft Data Platform CTO Raghu Ramakrishnan on the internals of our next evolution in engine architecture which will form the foundation for the next 25 years of the Microsoft data platform.














It covered Azure SQL DB Hyperscale. The changing landscape of data has many challenges. How to leverage unbounded storage and elastic compute as well as the perennial problems: size of data operations are slow with long painful recovery times while masking network latencies.




















There isn’t one database system that can do it all well. Users need to move data across systems which is slow and complicates governance.




















The most challenging for state management is ACID properties, transactional updates, high velocity of data changes and lowest response times. These issue lead to the SQL Hyperscale.













There are various technical themes: full separation of compute and storage, the quorum (log) is complex, uniquely skewed access pattern and network simply extends the memory hierarchy. He shared a newsflash about Multi-Version Timestamp CC rules resulting from 2 phase locking and MVCC (Hekaton) and lock free data structures.

Persistent Version Store (PVS)
This technical in depth talk was packed full of technical content about SQL Hyperscale and I would recommend watching the recording about this new product and era of database delivery.

Wednesday, 7 November 2018

PASS Summit 2018 Keynote Day 1












The first keynote of PASS summit was delivered by Rohan Kumar entitled SQL Server and Azure Data Services: Harness the ultimate hybrid platform for data and AI





Customer priorities for a modernized data estate are: modernizing on-premises, modernizing to cloud, build cloud native apps and unlocking insights.






The announcements follow:

SQL Server 2019
SQL Server 2019 Public Preview  is a great way to celebrate the 25th anniversary of SQL Server

There is the introduction of big data clusters which combines Apache Spark and Hadoop into a single data platform called SQL Server. This combines the power of Spark with SQL Server over the relational and non-relation data sitting in SQL Server, HDFS and other systems like Oracle, Teradata, CosmosDB.

There are new capabilities around performance, availability and security for mission critical environments along with capability to leverage hardware innovations like persistent memory and enclaves.

Hadoop, ApacheSpark, Kubernetes and Java are native capabilities in the database engine.

Accelerated data recovery (ADR) was demonstrated and is incredible. It is at public preview.  The benefits of ADR are
  • Fast and consistent Database Recovery
  • Instantaneous Transaction rollback
  • Aggressive Log Truncation

Azure HDInsight 4.0

HDInsight 4.0 is now available in public preview.

There are several Apache Hadoop 3.0 innovations. Hive LLAP (Low Latency Analytical Processing known as Interactive Query in HDInsight) delivers ultra-fast SQL queries. The Performance metrics provide useful insight.

Integration with Power BI direct Query, Apache Zeppelin, and other tools. To learn more HDInsight Interactive Query with Power BI.

Data quality and GDPR compliance enabled by Apache Hive transactions
Improved ACID capabilities handle data quality (update/delete) issues at row level. This means that GDPR compliance requirements can now be meet with the ability to erase the data at row level. Spark can read and write to Hive ACID tables via Hive Warehouse Connector.

Apache Hive LLAP + Druid = single tool for multiple SQL use cases

Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is optimized for sub-second queries to slice-and-dice, drill down, search, filter, and aggregate event streams. Druid is commonly used to power interactive applications where sub-second performance with thousands of concurrent users are expected.

Hive Spark Integration
Apache Spark gets updatable tables and ACID transactions with Hive Warehouse Connector

There are several Apache Hadoop 3.0 innovations. Hive LLAP (Low Latency Analytical Processing called Interactive Query in HDInsight) for ultra-fast SQL queries. The Performance metrics provide useful insight.

Integration with Power BI Direct Query, Apache Zeppelin, and other tools. To learn more watch HDInsight Interactive Query with Power BI.

Better data quality and GDPR compliance enabled by Apache Hive transactions
Improved ACID capabilities handle data quality (update/delete) issues at row level. GDPR compliance requirements can now be meet with the ability to erase the data at row level. Spark can read and write to Hive ACID tables via Hive Warehouse Connector

Apache Hive LLAP + Druid = single tool for multiple SQL use cases

Druid is a high-performance, column-oriented, distributed data store, which is suited for user-facing analytic applications and real-time architectures. Druid is optimized for sub-second queries to slice-and-dice, drill down, search, filter, and aggregate event streams. Druid is commonly used to power interactive applications where sub-second performance with thousands of concurrent users are expected.

Hive Spark Integration
Apache Spark gets updatable tables and ACID transactions with Hive Warehouse Connector.



















Apache HBase and Apache Phoenix
Apache HBase 2.0 and Apache Phoenix 5.0 get new performance and stability features and all of the above have enterprise grade security.

Azure
Azure event hubs for Kafka is generally available
Azure Data Explorer is in public preview.

Azure Databricks Delta is in public preview
  • Connect data scientist and engineers
  • Prepare and clean data at massive scales
  • Build/train models with pre-configured ML

Azure Cosmos DB multi master replication was demoed with a drawing app, Azure Cosmos DB PxDraw
Azure SQL DB Managed Instances will be at General Availability (GA) on Dec 1st. This provides Availability Groups managed by Microsoft.

Power BI
















The new Dataflows is an enabler for self-service data prep in Power BI

Power BI Desktop November Update
  • Follow-up questions for Q&A explorerIt is possible to ask follow-up questions inside the Q&A explorer pop-up, which take into account the previous questions you asked.
  • Copy and paste between PBIX files
  • New modelling view makes it easier to work with large models.
  • Expand and collapse matrix row headers


Friday, 2 November 2018

Future Decoded Day 2

Live stream updates had this  great picture summary of the keynote.

The Day 2 Keynote at Future Decoded by Satya Nadella was inspiring. He talked around this simple self-evident formula Tech intensity = (Tech adoption) ^ Tech capability and the Intelligent Cloud and Intelligent Edge in an era of digital transformation.

In any society you need three actors for growth. You need government, academia and entrepreneurs & the private sector.

The core areas to consider and build on in the future

Privacy
We need to protect privacy as a basic fundamental human right. Trust and GDPR are important to achieve this.

Security
We need to act with collective responsibility across the tech to help keep the world safe. Cyber Security threat detection and removal are core to have embedded in any platform. Microsoft have been leading the Tech Accord.

Ethical AI 
We need to ask ourselves not only what computers can do, but what computers should do

Thursday, 1 November 2018

Future Decoded The AI Future

Future Decoded in London ExCel 31 October - 1 November is an exciting place to be. The event is packed full of AI innovations.  AI is groundbreaking and will change the face of the market place. It needs substantial learning for business and people to maximise its capability. Three takeaways from today

Maximising the AI Opportunity



Artificial intelligence is changing the UK so fast that nearly half of today's business model won’t exist by 2023, a new Microsoft report has revealed. The article can be read here
UK companies at risk of falling behind due to a lack of AI strategy, Microsoft research reveals
and the report  Maximising the AI Opportunity shares insights on the potential of AI - including Skills & Learning - based on a survey & interviews with 1000s of UK leaders

Microsoft AI Academy
A new addition to Microsoft's commitment to advancing Digital skills in the UK, the Microsoft AI academy will run face-to-face and online training sessions for business and public sector leaders, IT professionals, developers and start ups.

aka.ms/learn

Microsoft Research and Cambridge University

Some amazing news from Microsoft is that is partnering with the University of Cambridge to boost the number of AI researchers in the UK.The Microsoft Research-Cambridge University Machine Learning Initiative will provide support for Ph.D. students at the world-leading university, and offer a postdoctoral research position at Microsoft Research Lab, Cambridge . Our aim is to realise artificial intelligence’s potential in enhancing the human experience and to nurture the next generation of researchers and talent in the field.

Read More:
Microsoft Research and Cambridge University strengthen their commitment to AI innovation and the field’s future leaders