Welcome

Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous, the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk . Microsoft Data Platform MVP

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein



Friday 14 October 2011

Big Data – What is the Big Deal?

The 3rd SQL PASS Keynote was given by David J. DeWitt of the Data and Storage Platform Division. It was a brilliant, insightful session.

The session started by defining Big Data: a massive collection of records. To some, “Big Data” means using a new NoSQL system such as Hadoop and MapReduce; to others, it means using a traditional parallel relational DBMS to manage the data. Data is the currency of this generation, and there is a growing realization that data is too valuable to delete.

NoSQL
Not Only SQL - It's about recognizing that for some problems other storage solutions are better suited. NoSQL offers a flexible data model, faster time to deliver, a relaxed consistency model (such as eventual consistency), a willingness to trade consistency for availability, and low upfront software costs. Some data is just not worth the effort of storing in a relational database, validating, cleansing, running through ETL, analyzing or quality controlling.

There are two types of NoSQL system:

Key/Value Stores
Examples: MongoDB, CouchBase, Cassandra, Windows Azure.
These handle single-value retrievals by key - think NoSQL OLTP.

Hadoop
This stores large volumes of data in a distributed file system - think NoSQL data warehousing.
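To make the key/value access pattern concrete, here is a minimal sketch in Python. It is a toy in-memory store, not the API of any of the products named above; the class name `KeyValueStore` and the `customer:42` key format are assumptions made for illustration.

```python
class KeyValueStore:
    """Toy in-memory key/value store (hypothetical, for illustration only).

    Every operation addresses exactly one key, which is the
    OLTP-style access pattern described above.
    """

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store (or overwrite) the value held under a single key.
        self._data[key] = value

    def get(self, key, default=None):
        # Retrieve the single value held under a key, if any.
        return self._data.get(key, default)


store = KeyValueStore()
store.put("customer:42", {"name": "Ada", "city": "Leeds"})

customer = store.get("customer:42")  # found: returns the stored record
missing = store.get("customer:99")   # not found: returns None
```

There are no joins and no scans here; the application already knows the key and asks for one value at a time.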

SQL is sometimes termed 'schema first' and NoSQL 'schema later'.
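The difference can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's `sqlite3` as a stand-in for a relational database and plain JSON lines as a stand-in for data landed in a NoSQL store. The `events` table, its columns and the `read_actions` helper are hypothetical names.

```python
import json
import sqlite3

# Schema first: the table structure is declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER NOT NULL, action TEXT NOT NULL)"
)
conn.execute("INSERT INTO events VALUES (?, ?)", (1, "login"))
sql_rows = conn.execute("SELECT user_id, action FROM events").fetchall()

# Schema later: raw records are stored as they arrive, and the records
# do not all have to carry the same fields.
raw_lines = [
    '{"user_id": 1, "action": "login"}',
    '{"user_id": 2, "action": "search", "query": "hadoop"}',
]


def read_actions(lines):
    """Apply a schema at read time by picking out the fields a query needs."""
    return [(rec["user_id"], rec["action"]) for rec in map(json.loads, lines)]


nosql_rows = read_actions(raw_lines)
```

In the first half the database rejects anything that does not fit the declared columns; in the second half the extra `query` field is simply stored and ignored until some later reader cares about it.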

The other idea presented throughout the session was that there are two universes in the new reality: structured vs. unstructured.


This is not a paradigm shift. The world has changed, and the new reality is that RDBMS and NoSQL databases need to work together in a complementary fashion to address current requirements.

The rest of the session explained Hadoop and its ecosystem, and how the two technologies work together.

Hadoop = HDFS (file system store) + MapReduce (programming paradigm, process)
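As a rough illustration of the MapReduce paradigm, the classic word count can be sketched in plain Python. This runs in a single process and does not use Hadoop or HDFS; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are assumptions chosen to mirror the stages a real framework would distribute across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big deal", "data is big"])
# counts == {"big": 3, "data": 2, "deal": 1, "is": 1}
```

The programmer only writes the map and reduce functions; splitting the input, grouping by key and running the work in parallel is what the framework contributes.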

Some applications need data from both universes in the new world. Where this is the case, Sqoop is used to connect the unstructured universe (Hadoop) to the structured one (RDBMS).


