Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous; the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk. Microsoft Data Platform MVP

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein

Friday 16 November 2018

Big Data LDN Day 2



The day 2 keynote, entitled "Big Data, Disruption and the 800 pound gorilla in the corner", was given by Michael Stonebraker, Turing Award winner, IEEE John von Neumann Medal holder, Professor at MIT and co-founder of Tamr.

A few vignettes were mentioned: Hamiltons, Dewitts and Amadeus. Hadoop (meaning MapReduce) started to be used in 2010, and Google stopped using MapReduce in 2011. Hadoop now means the HDFS file system. Cloudera's big problem is that no one wants MapReduce. MapReduce is not used for anything.

The data warehouse is yesterday's problem. BI is simple SQL. Data Science has complex problems and it is a different skill set. It is based on deep learning, machine learning and linear algebra, nothing to do with SQL. Deep learning is all the rage but you need vast amounts of training data. It is not possible to explain why the black box gives certain recommendations, so it is not good when data provenance is required.

Big velocity is a big problem over time. Pattern matching and CEP (Complex Event Processing) like Storm are not competitive. Don't run Oracle but instead run MongoDB, Cassandra or Redis. NoSQL means no standards and no ACID. ACID is a good idea. NoSQL means you always give up something, as per the CAP theorem. Declarative languages are a great idea.

Data discovery is a big problem. You spend 90% of your time finding and cleaning data. Then 10% finding and cleaning the errors. Very little time is spent doing data integration. It is a data integration challenge.


Jim Webber from Neo4j gave an insightful talk about how useful graphs are to solve problems and predict outcomes. There were some great examples of how to use graphs. He talked about triadic closure and strong and weak ties. He also mentioned a couple of papers to read

Effects of organizational support on organizational commitment. Fakhraei M, Imami R, Manuchehri S (2015)
Semi-Supervised Classification with Graph Convolutional Networks. Thomas N. Kipf, Max Welling (2017)

and a free ebook

Free Book: Graph Databases By Ian Robinson, Jim Webber, and Emil Eifrém

It is important to have semantic domain knowledge for inference and understanding in graphs, as graphs depend on the context. With graph machine learning and graph convolutional networks, the graph will be the data structure for AI.
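The triadic closure idea from the talk (if two people share a friend, they are likely to become connected themselves) can be sketched in a few lines of Python. This is only an illustrative sketch, not anything shown in the session: the `open_triads` function, the names and the tiny edge list are all made up for the example, and a real graph database would do this with a query rather than in-memory dictionaries.

```python
from itertools import combinations


def open_triads(edges):
    """Count common neighbours for every pair of nodes that is not yet linked.

    edges is a list of undirected (a, b) pairs. The result maps each
    unconnected pair (a, c) to the number of neighbours they share.
    A higher count suggests the triangle is more likely to close.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    candidates = {}
    for node, friends in neighbours.items():
        # Every pair of this node's friends forms a triad through the node.
        for a, c in combinations(sorted(friends), 2):
            if c not in neighbours[a]:  # the triad is still open
                candidates[(a, c)] = candidates.get((a, c), 0) + 1
    return candidates


# Hypothetical friendship graph, invented for illustration.
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "dave"),
    ("dave", "carol"),
    ("carol", "erin"),
]

predictions = open_triads(edges)
# Rank the predicted links by how many mutual friends support them.
for pair, shared in sorted(predictions.items(), key=lambda kv: -kv[1]):
    print(pair, shared)
```

Pairs with more mutual friends come out on top, which is the simplest form of the link prediction described in the talk. Distinguishing strong from weak ties would need edge weights, which this sketch leaves out.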

The Joy of Data
The closing session of the event was delivered by Dr Hannah Fry, Associate Professor in the Mathematics of Cities at UCL. This was an amazing session exploring what visualization and insights can be achieved from understanding the data.

She started the talk with the strange Wikipedia phenomenon that all routes lead to philosophy: by repeatedly clicking the first proper link on each page, you will eventually end up on the Philosophy page.
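The first-link walk is easy to picture as following a chain of pointers until you hit the target or go round in a circle. The sketch below is a toy model only: the `follow_first_links` function and the `toy_first_link` map are invented for illustration and are not real Wikipedia data, and nothing here fetches pages from the web.

```python
def follow_first_links(start, first_link, target="Philosophy", max_steps=100):
    """Follow the 'first link' of each page until the target, a dead end or a loop.

    first_link maps a page title to the title of the first link on that page.
    Returns the list of pages visited, starting with `start`.
    """
    path = [start]
    seen = {start}
    while path[-1] != target and len(path) <= max_steps:
        nxt = first_link.get(path[-1])
        if nxt is None or nxt in seen:
            break  # dead end, or we have entered a loop
        path.append(nxt)
        seen.add(nxt)
    return path


# Made-up first-link map, NOT taken from Wikipedia.
toy_first_link = {
    "Bicycle": "Vehicle",
    "Vehicle": "Machine",
    "Machine": "Science",
    "Science": "Knowledge",
    "Knowledge": "Philosophy",
    "Ping": "Pong",
    "Pong": "Ping",  # a two-page loop that never reaches the target
}

print(follow_first_links("Bicycle", toy_first_link))
print(follow_first_links("Ping", toy_first_link))
```

The `seen` set matters because some chains loop back on themselves instead of reaching the target, and without it the walk would never stop.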

There are two parallel universes: the one where people click the link, and the mathematical universe. Data is the bridge.

She showed how data could be used to investigate why the bicycle transport scheme in London was seeing all the bikes ending up in the wrong place. Vans have to go round moving bikes into the right places during the day. This was the result of people liking to cycle down the hills but not up.

Another example showed that Islington station was a bottleneck, which caused a cascading problem because of the lack of transport routes from there. There were many other interesting examples, including how gossip can pay and how network science can be used to track a problem down.

Big Data LDN had some amazing sessions and insightful content. Big Data LDN will be back next year 13-14 Nov 2019.
