Welcome

Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous, the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk. Microsoft Data Platform MVP

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein



Wednesday, 11 July 2018

Microsoft Research Open Data Sets

The Microsoft Research Outreach team worked with the community to enable adoption of cloud-based research. As a result they have launched Microsoft Research Open Data, a new data repository for the global research community. Microsoft wish to bring processing to the data rather than rely on data movement through the internet. This useful addition allows the data sets to be copied directly to the Azure-based Data Science virtual machines. More details can be read here. The aim is to provide anonymized, curated and meaningful datasets that are findable, accessible, interoperable and reusable. This follows on from the fourth paradigm of discovery, data-intensive science, discussed by Jim Gray.

The open data set categories can be seen below. 


Saturday, 7 July 2018

The Future State - Serendipitous Data Management

Gone are the days when companies could survive on existing products and services. The need to continually innovate to stay ahead in a fluid world requires a change in direction. Many articles have been written, in both academic research and industry, to try to predict the future state of data technology and this year's trends.

Currently, research meets industry in a rebirth of industry-based research teams, consisting either of organisation-only teams or of industry collaborating with universities. Guzdial shares his thoughts in the March 2018 Communications of the ACM that "for the majority of new computer science PhD's, the research environment in industry is currently more attractive". In particular, the need within industry to continually innovate cries out for more research divisions in industry. Part of this change is due to the rapid expansion of emerging technology, but also to the realization of what data science and artificial intelligence (AI) can add to a business. Data science requires collaboration between people, teams and organisations, as interdisciplinary skills are needed to solve today’s problems.

There is an emerging trend whereby more research institutes are being created and existing ones are hiring more staff. Microsoft have created a new organization, Microsoft Research AI (MSR AI), to pursue game-changing advances in artificial intelligence. The research team combines advances in machine learning with innovations in language and dialog, human-computer interaction, and computer vision to solve some of the toughest challenges in AI.

Using AI and machine learning, based on big data, to empower people for the future is a complex problem to solve. In the current world there is a need for collaboration. Greengard, in the March 2018 Communications of the ACM, raised a concern that "mountains of data produce incremental gains, and coordinating all the research groups and silos is a complex endeavour". Managing data is complex, and the key areas that I think will define the next revolution are in the graph.


Telling stories from the data is increasingly important in this ever-changing holistic environment. Skills need to be developed in this area, as communicating the meaning of data is crucial. Improvements in business, science, robotics, space and health can initially come through intelligent automation, which can in turn produce further actionable insights.

Data visualization is a key component of telling the story and seeing anomalies. Parameswaran discussed at SIGMOD 2018 that it is scale that brings databases and visualisation together. He highlighted two problem areas: too many tuples and too many visualisations. It is interesting to consider how to address the excessive data points and how to find the right visualization for the data, to gain insight at speed.

Innovation is key to the next step. I believe it will come from making beneficial discoveries by design, through scientific experiments on quality data, in a continuous and autonomous fashion. I call this Serendipitous Data Management. This improvement and innovation will come from having sound practices for big data management that enable actionable data insights at speed.

Another trend I am seeing in research and industry is a move away from processing data in centralised data lakes towards processing at the edge, particularly for IoT at the moment. As well as increasing security, keeping the data at source reduces the volume of data in transit, which is currently unsustainable. How to consolidate these distributed data sources and produce analysis across disparate systems is an interesting challenge to solve. In conclusion, systems built on data create a rapidly changing landscape, and I see these areas as the key components in defining revolutionary changes to society and culture.

Thursday, 5 July 2018

2018 MVP Reactions

Peter Laker published an article about reactions from 2018's newest Most Valuable Professional (MVP) award winners. What a great set of reactions from some amazing people.

Privileged to have my comment listed on the reactions list.

Sunday, 1 July 2018

Tutorial on Tree Based Modeling

I found a useful tutorial on tree-based learning.

The tutorial includes:

  • What is a Decision Tree? How does it work?
  • Regression Trees vs Classification Trees
  • How does a tree decide where to split?
  • What are the key parameters of model building and how can we avoid over-fitting in decision trees?
  • Are tree-based models better than linear models?
  • Working with Decision Trees in R and Python
  • What are the ensemble methods of tree-based models?
  • What is Bagging? How does it work?
  • What is Random Forest? How does it work?
  • What is Boosting? How does it work?
  • Which is more powerful: GBM or XGBoost?
  • Working with GBM in R and Python
  • Working with XGBoost in R and Python
  • Where to Practice?
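
To give a flavour of the "How does a tree decide where to split?" question, here is a minimal sketch of my own, not code taken from the tutorial. It assumes a single numeric feature and uses Gini impurity, which is one common splitting criterion among several. The function names gini and best_split and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split.

    Tries the midpoint between each pair of adjacent distinct values and keeps
    the one with the lowest size-weighted impurity of the two child nodes.
    Returns (None, parent_impurity) if no split improves on the parent.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        threshold = (lo + hi) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: customer age and whether they bought a product.
ages = [22, 25, 30, 35, 40, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(ages, bought)
print(threshold, impurity)  # prints: 32.5 0.0
```

On this toy data the split at 32.5 separates the two classes perfectly, so the weighted impurity drops from 0.5 at the parent node to 0.0. A real decision tree repeats this search over every feature and then recurses into each child node.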