Passionately curious about Data, Databases and Systems Complexity. Data is ubiquitous, the database universe is dichotomous (structured and unstructured), expanding and complex. Find my Database Research at SQLToolkit.co.uk

"The important thing is not to stop questioning. Curiosity has its own reason for existing" Einstein

Monday, 12 March 2018

Apache Hive and HDInsight

Apache Hive is a data warehouse system for Hadoop. Hive enables data summarization, querying, and analysis. Hive queries are written in HiveQL, a query language similar to SQL. Hive itself maintains only metadata about the data, which is stored on HDFS. Apache Spark has built-in support for working with Hive, and HiveQL can also be used to query data stored in HBase. Hive can handle large data sets, but the data must have some structure. Queries can be executed with Apache Tez, Apache Spark, or MapReduce.
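As a rough illustration of choosing an execution engine, the sketch below builds the HiveQL SET statement for the hive.execution.engine property, whose values are mr, tez, and spark. The helper names and the weblogs table with its status column are hypothetical and used only for illustration; the script prints statements and does not connect to a cluster.

```python
# Friendly engine names mapped to values of Hive's hive.execution.engine property.
ENGINES = {"mapreduce": "mr", "tez": "tez", "spark": "spark"}


def engine_statement(engine: str) -> str:
    """Return the HiveQL SET statement that selects the given execution engine."""
    key = engine.strip().lower()
    if key not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    return f"SET hive.execution.engine={ENGINES[key]};"


def summary_query(table: str = "weblogs") -> str:
    """Return a simple summarization query (table and column are hypothetical)."""
    return f"SELECT status, COUNT(*) AS hits FROM {table} GROUP BY status;"


if __name__ == "__main__":
    # Print the statements as they would be submitted in a Hive session.
    print(engine_statement("Tez"))
    print(summary_query())
```

The two printed statements could be pasted into a Hive session; which engines are actually available depends on how the cluster is configured.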

There are two types of tables within Hive.
  • Internal: data is stored in the Hive data warehouse, which is located at /hive/warehouse/ on the default storage for the cluster. This is mainly for temporary data.
  • External: data is stored outside the data warehouse. Use this when the data is also used outside of Hive or needs to stay in its underlying location.
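The difference between the two table types shows up in the DDL: an external table uses the EXTERNAL keyword and a LOCATION clause, while an internal table uses neither. The sketch below generates both forms; the function name, table names, columns, and storage path are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple


def create_table_ddl(name: str,
                     columns: List[Tuple[str, str]],
                     location: Optional[str] = None) -> str:
    """Build a HiveQL CREATE TABLE statement.

    With no location the table is internal, so Hive stores the data in its
    warehouse directory. With a location the table is external and the data
    stays where it already lives.
    """
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    if location is None:
        return f"CREATE TABLE {name} ({cols});"
    return f"CREATE EXTERNAL TABLE {name} ({cols}) LOCATION '{location}';"


if __name__ == "__main__":
    cols = [("id", "INT"), ("message", "STRING")]
    # Internal table: data managed by Hive.
    print(create_table_ddl("staging_logs", cols))
    # External table: the path below is a made-up example.
    print(create_table_ddl("raw_logs", cols, "/example/data/logs"))
```

Dropping an internal table removes its data along with the metadata, whereas dropping an external table removes only the metadata, which is why external tables suit data shared with other tools.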

A Hive table consists of a schema, stored in the metastore, and the data, stored on HDFS. The supported file formats are Text File, SequenceFile, RCFile, Avro Files, ORC Files, Parquet, and custom INPUTFORMAT and OUTPUTFORMAT.
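The file format is chosen per table with a STORED AS clause. As a small sketch, the mapping below pairs the format names listed above with the corresponding HiveQL keywords; the helper name and the example table are hypothetical. Custom formats instead name their Java classes with STORED AS INPUTFORMAT '...' OUTPUTFORMAT '...', which is left out here because the class names depend on the format in use.

```python
# Format names from the list above mapped to HiveQL STORED AS keywords.
STORED_AS = {
    "Text File": "TEXTFILE",
    "SequenceFile": "SEQUENCEFILE",
    "RCFile": "RCFILE",
    "Avro Files": "AVRO",
    "ORC Files": "ORC",
    "Parquet": "PARQUET",
}


def storage_clause(file_format: str) -> str:
    """Return the STORED AS clause for one of the built-in file formats."""
    try:
        return f"STORED AS {STORED_AS[file_format]}"
    except KeyError:
        raise ValueError(f"no built-in keyword for {file_format!r}") from None


if __name__ == "__main__":
    # Hypothetical table stored as ORC.
    print(f"CREATE TABLE events (id INT, payload STRING) {storage_clause('ORC Files')};")
```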

Apache Hive and HiveQL are available on Azure HDInsight.
