I attended Big Data LDN on 13-14 November 2018.
The event was busy with vendor product sessions and
technical sessions. All the sessions were 30 minutes, so there was a quick
turnaround after each session. The sessions ran throughout the day with no
break for lunch.
One of the sessions discussed the fourth industrial revolution and the cultural
shift it is causing. The areas of importance mentioned were:
- Skills
- Digital Infrastructure
- Search and resilience
- Ethics and digital regulation
Two institutions were mentioned as leading the way: the Alan Turing Institute, the national institute for data science and artificial intelligence, and the Ada Lovelace Institute, an
independent research and deliberative body with a mission to ensure data and AI
work for people and society.
Text Analytics
I attended an interesting session on text analysis. Text
analytics processes unstructured text to find patterns and relevant information
that can transform a business. It is far harder than image analysis due to:
- The sheer quantity of data
- Polymorphy of language – words take many forms
- Polysemy of language – words have many meanings
- Misspellings
Achieving accuracy in sentiment analysis is hard. Sentiment
analysis determines the degree of positive, negative or neutral expression.
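As an illustration of what such a tool returns (a minimal sketch using NLTK's VADER analyser, which was not a tool named in the session; the sentences are made up):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-off download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

# Each result has negative, neutral and positive components plus a compound summary score
print(sia.polarity_scores("The keynote was brilliant and well worth the trip"))
print(sia.polarity_scores("The wifi was terrible and the room was far too hot"))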
Some tools are biased. Topic modelling using latent Dirichlet
allocation (LDA) was also discussed. Topic modelling is a form of unsupervised learning that seeks to categorize documents by topic.
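As an illustration only (not the tool shown in the session), here is a minimal topic-modelling sketch using scikit-learn's LatentDirichletAllocation; the documents and topic count are made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A few toy documents (hypothetical examples, not data from the talk)
docs = [
    "stock markets fell as interest rates rose",
    "the bank raised interest rates again this quarter",
    "the team scored a late goal to win the match",
    "fans celebrated the championship win at the stadium",
]

# Convert raw text to a bag-of-words matrix
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA with two latent topics (unsupervised - no labels needed)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words for each discovered topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")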
Governance
The changing face of governance has created a resurgence
and rebirth of data governance. It is important that data can be classified, reused and
trusted. A McKinsey survey about data integrity and trust in data was mentioned, which talked
about defensive data strategies (a single source of truth) and offensive ones (multiple versions of the
truth).
The Great Data Debate
At the end of the first day, Big Data LDN assembled a unique
panel of some of the world’s leaders in data management for The Great Data
Debate.
The panelists included:
- Dr Michael Stonebraker, Turing Award winner, inventor of Ingres, Postgres, Illustra, Vertica and Streambase, and now CTO of Tamr
- Dan Wolfson, Distinguished Engineer and Director of Data & Analytics, IBM Watson Media & Weather
- Raghu Ramakrishnan, Global CTO for Data at Microsoft
- Doug Cutting, co-creator of Hadoop and Chief Architect of Cloudera
- Phillip Radley, Chief Data Architect at BT
There is a growing challenge of complexity and agility in
architecture. When data scientists start looking at the data, 80% of their time is spent cleaning it, and a further 10% is spent cleaning errors out of the data integration. Data scientists end up being data unifiers rather than data scientists. There are two things to consider:
- How to do data unification with lots of tools
- Everyone will move to the cloud at some point due to economic pressures.
Data lineage is important, and privacy needs to be by design. Self-service is possible for easy analytics but not for more complicated tasks. Another question to consider is why data is not cleaned at source before it is migrated. Democratizing data will require data that is always on and always clean.
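As a rough illustration of the kind of cleaning that could happen at source before migration (entirely hypothetical data and column names, sketched in pandas):

import pandas as pd

# Hypothetical source extract with common problems: duplicates,
# inconsistent casing, stray whitespace in keys and missing values
raw = pd.DataFrame({
    "customer_id": [" C001", "c001", "C002", "C003", None],
    "country": ["UK", "uk", "United Kingdom", "FR", "FR"],
    "spend": [120.0, 120.0, 85.5, None, 42.0],
})

clean = (
    raw
    .dropna(subset=["customer_id"])  # drop rows with no usable key
    .assign(
        customer_id=lambda d: d["customer_id"].str.strip().str.upper(),
        country=lambda d: d["country"].str.upper().replace({"UNITED KINGDOM": "UK"}),
    )
    .drop_duplicates(subset=["customer_id"])  # keep one record per customer
)

print(clean)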
There will be no one size fits all. Instead, bundled packages will
appear, such as SQL Server 2019 bundling external tools like Spark and HDFS. Going forward there is likely to be:
- A database management regime within a larger database management ecosystem.
- A need for best-of-breed tools and a uniform lens to view all lineage, all data and all tasks.
The definition of what a database is has evolved over time. There are a few things to consider going forward:
- Diversity of engines for storage and processing.
- Keeping track of metadata after cleaning and data enrichment; provenance is important.
- Keeping training data attached to the machine learning (ML) model (see the sketch at the end of this section).
- The need for enterprise catalog management.
- ML brings competitive advantage.
- Separating data from compute.
It is a data unification problem in a data catalog era.
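To illustrate the point about keeping training data attached to an ML model, here is a minimal sketch (made-up file names, and a built-in dataset standing in for real training data; a production setup would more likely use a model registry or data catalog for this). The idea is simply to persist the model alongside a record of exactly which data produced it:

import hashlib
import json
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a model on a known dataset (iris here, purely as a stand-in)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Fingerprint the exact training data so the model's provenance can be verified later
data_hash = hashlib.sha256(X.tobytes() + y.tobytes()).hexdigest()

# Save the model and its provenance record side by side
joblib.dump(model, "model.joblib")
with open("model.provenance.json", "w") as f:
    json.dump({
        "training_data": "iris (scikit-learn built-in)",
        "training_data_sha256": data_hash,
        "n_samples": int(X.shape[0]),
        "model_type": type(model).__name__,
    }, f, indent=2)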