Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous, the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk. Microsoft Data Platform MVP

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein

Wednesday 14 November 2018

Big Data LDN day 1

I attended Big Data LDN 13-14 November 2018.

The event was busy, with vendor product sessions and technical sessions. All the sessions were 30 minutes long, so there was a quick turnaround after each one. The sessions ran throughout the day with no break for lunch.

One of the sessions discussed the fourth industrial revolution and the cultural shift it is causing. The areas of importance mentioned were:
  • Skills
  • Digital Infrastructure
  • Search and resilience
  • Ethics and digital regulation
Two institutions were mentioned as leading the way: the Alan Turing Institute, the national institute for data science and artificial intelligence, and the Ada Lovelace Institute, an independent research and deliberative body with a mission to ensure data and AI work for people and society.

Text Analytics
I attended an interesting session on text analysis. Text analytics processes unstructured text to find patterns and relevant information that can transform a business. It is far harder than image analysis due to:

  • Obstalele - the quantity of data
  • Polymorphy of language, where words have many forms
  • Polysemy of language, where words have many meanings
  • Misspellings
Accurate sentiment analysis is hard. Sentiment analysis determines the degree of positive, negative or neutral expression in a text, and some tools are biased. Topic modelling was also discussed, with latent Dirichlet allocation (LDA) as the method. Topic modelling is a form of unsupervised learning that seeks to categorise documents by topic.
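To illustrate why sentiment scoring is fragile, here is a minimal sketch of a lexicon-based scorer. It was not shown in the session. The word list, the weights and the function names are my own assumptions for illustration, and real tools use much larger lexicons or trained models.

```python
import re

# Hypothetical toy lexicon (an assumption, not from the session).
# Positive words carry positive weights, negative words negative weights.
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "poor": -1.0, "terrible": -2.0,
}
NEGATIONS = {"not", "no", "never"}
MAX_WEIGHT = 2.0  # largest absolute weight in the lexicon


def sentiment_score(text):
    """Return a score in [-1, 1]; 0.0 means neutral or no known words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    hits = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            # A negation flips only the word that immediately follows it.
            negate = True
            continue
        if token in LEXICON:
            value = LEXICON[token]
            total += -value if negate else value
            hits += 1
        negate = False
    if hits == 0:
        return 0.0
    # Divide by the largest possible magnitude to stay within [-1, 1].
    return total / (hits * MAX_WEIGHT)


def sentiment_label(score):
    """Map a numeric score to positive, negative or neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ["The service was great",
                     "The service was not good",
                     "The servce was graet"]:
        score = sentiment_score(sentence)
        print(sentence, "->", sentiment_label(score), score)
```

The last sentence is scored as neutral only because "graet" is a misspelling that the lexicon does not know, which is exactly the kind of problem listed above.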


The changing face of governance has created a resurgence and rebirth of data governance. It is important that data is classified, reusable and trustworthy. A McKinsey survey about data integrity and trust in data was mentioned, which talked about defensive (single source of truth) and offensive (multiple versions of the truth) approaches.

The Great Data Debate

At the end of the first day, Big Data LDN assembled a unique panel of some of the world’s leaders in data management for The Great Data Debate.

The panelists included:
  • Dr Michael Stonebraker, Turing Award winner, the inventor of Ingres, Postgres, Illustra, Vertica and Streambase, and now CTO of Tamr
  • Dan Wolfson, Distinguished Engineer and Director of Data & Analytics, IBM Watson Media & Weather
  • Raghu Ramakrishnan, Global CTO for Data at Microsoft
  • Doug Cutting, co-creator of Hadoop and Chief Architect of Cloudera
  • Phillip Radley, Chief Data Architect at BT
There is a growing challenge of complexity and agility in architecture. When data scientists start looking at the data, 80% of their time is spent on data cleaning, and a further 10% on cleaning errors introduced by data integration. Data scientists end up as data unifiers rather than data scientists. There are two things to consider:

  • How to do data unification with lots of tools
  • Everyone will move to the cloud at some point due to economic pressures.
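As a small illustration of what data unification involves, here is a sketch that groups near-duplicate company names using only the Python standard library. This is not a tool the panel discussed. The sample names, the function names and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def unify(records, threshold=0.85):
    """Greedily group records that are similar to a group's first member.

    The threshold is an assumed value; in practice it needs tuning.
    """
    groups = []
    for record in records:
        for group in groups:
            if similarity(record, group[0]) >= threshold:
                group.append(record)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([record])
    return groups


if __name__ == "__main__":
    names = ["Acme Ltd.", "ACME Ltd", "Acme  Ltd", "Globex Corporation"]
    for group in unify(names):
        print(group)
```

Even this toy version shows why unification eats so much time: every source needs normalising, and the matching threshold is a judgment call that differs from dataset to dataset.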

Data lineage is important and privacy needs to be by design. Self-service is possible for simple analytics but not for more complicated work. Another question to consider is why data is not cleaned at source before it is migrated. Democratizing data will require data that is always on and always clean.

There will be no one size fits all. Instead, packages will emerge, such as SQL Server 2019, which bundles external tools such as Spark and HDFS. Going forward there is likely to be:

  • A database management regime in a large database management ecosystem. 
  • A need for best-of-breed tools and a uniform lens to view all lineage, all data and all tasks.

The definition of what a database is has evolved over time. There are a few things to consider going forward:

  • Diversity of engines for storage and processing.  
  • Keep track of metadata after cleaning; data enrichment and provenance are important.
  • Keep training data attached to the machine learning (ML) model.
  • Need enterprise catalog management. 
  • ML brings competitive advantage
  • Separate data from compute
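One lightweight way to keep training data attached to an ML model is to store a fingerprint of the data alongside the model's metadata. The sketch below is my own illustration, not something presented by the panel. The record layout and the function names are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """SHA-256 over a canonical JSON encoding of the training rows.

    The hash depends on row order, so reordering rows changes it.
    """
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_model_record(model_name, training_rows, source):
    """Hypothetical catalog entry linking a model to its training data."""
    return {
        "model": model_name,
        "source": source,
        "training_row_count": len(training_rows),
        "training_data_sha256": fingerprint(training_rows),
    }


def training_data_matches(record, rows):
    """Check whether rows are the data the model record was built from."""
    return record["training_data_sha256"] == fingerprint(rows)


if __name__ == "__main__":
    rows = [
        {"text": "great service", "label": "positive"},
        {"text": "terrible delay", "label": "negative"},
    ]
    record = build_model_record("sentiment-v1", rows, "support tickets")
    print(record)
    print(training_data_matches(record, rows))
```

A record like this could sit in an enterprise catalog, so that provenance questions about a model can be answered without hunting for the original files.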

It is a data unification problem in a data catalog era.

