Welcome

Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous; the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk. Microsoft Data Platform MVP.

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein



Thursday 1 March 2018

An Introduction to HDInsight

 
I attended a great session by Edinson Medina at SQLBits 2018 covering the basics of HDInsight. He opened his talk by explaining the term Big Data: data that is too complex for analysis in traditional databases. There are two types of processing: batch processing, which shapes the data for analysis, and real-time processing, which captures streams of data for low-latency querying.

The Hortonworks site describes Hadoop as follows: "Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware."

A Hadoop cluster looks like this:

[Image: Hadoop cluster diagram]

The underlying processing model is MapReduce. The Tez engine is a newer, faster engine for running MapReduce-style jobs. The model is explained in the paper "Analyzing performance of Apache Tez and MapReduce with hadoop multinode cluster on Amazon cloud".
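To make the MapReduce model a little more concrete, here is a minimal single-process sketch of the classic word count in plain Python. It is only an illustration of the map, shuffle and reduce phases, not how Hadoop or Tez actually execute a job; the function names (map_phase, shuffle, reduce_phase, word_count) are my own and not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel on different nodes, and the shuffle moves data across the network between them; here everything runs in one process purely to show the shape of the model.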



HDInsight is 100% Apache Hadoop, but powered by the cloud.



There are many tools within the Hadoop ecosystem.

Hive
A metadata service that projects tabular schemas over folders and enables the folders to be queried as tables using a SQL-like query language.

Pig (an ETL Tool)
Performs a series of transformations to data relations based on Pig Latin statements.

Oozie
A workflow engine for actions in a Hadoop cluster supporting parallel work streams.

Sqoop
A database integration service which enables bi-directional data transfer between a Hadoop cluster and databases via JDBC.

HBase
A low-latency NoSQL database built on Hadoop and modeled on Google's Bigtable. HBase stores its files on HDFS.

Storm
An event processor for data streams, used for tasks such as real-time monitoring, event aggregation and logging. It defines a streaming topology that consists of spouts and bolts.

Spark
A fast, general-purpose computation engine that supports in-memory operations. It is a unified stack for interactive, streaming and predictive analysis.

Ambari
A management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.

Zeppelin notebooks
A multi-purpose web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark.
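Of the tools above, the spouts and bolts topology that Storm defines is probably the easiest to picture in code. The following is a rough sketch in plain Python, using generators to stand in for a spout (the source of the stream) and two bolts (processing steps). It does not use the Storm API, and the names sensor_spout, filter_bolt, count_bolt and run_topology, along with the event fields, are hypothetical ones chosen for illustration.

```python
from collections import Counter


def sensor_spout(events):
    """Spout: the source of the stream; here it replays a fixed list."""
    for event in events:
        yield event


def filter_bolt(stream, min_level):
    """Bolt: pass through only events at or above a severity level."""
    for event in stream:
        if event["level"] >= min_level:
            yield event


def count_bolt(stream):
    """Bolt: aggregate the number of events seen per source."""
    counts = Counter()
    for event in stream:
        counts[event["source"]] += 1
    return counts


def run_topology(events, min_level=2):
    """Wire the spout into the two bolts, in order."""
    return count_bolt(filter_bolt(sensor_spout(events), min_level))


if __name__ == "__main__":
    sample = [
        {"source": "web", "level": 3},
        {"source": "db", "level": 1},
        {"source": "web", "level": 2},
    ]
    print(dict(run_topology(sample)))
```

A real Storm topology runs continuously over an unbounded stream and distributes spouts and bolts across the cluster; this sketch processes a small finite list in one process just to show how the pieces chain together.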
