Apache Hive is a data warehouse system built on top of Hadoop. Hive enables data summarization, querying, and analysis of large data sets. Queries are written in HiveQL, a query language similar to SQL. Hive itself maintains only metadata about your data; the data is stored on HDFS. Apache Spark has built-in support for Hive, and through storage handlers HiveQL can also query data stored in HBase. Hive can handle large data sets, but the data must have some structure. Query execution can run on Apache Tez, Apache Spark, or MapReduce.
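As a rough sketch, a HiveQL query reads much like SQL, and the execution engine can be chosen per session. The `log_events` table and its columns below are hypothetical examples:

```sql
-- Choose the execution engine for this session (tez, spark, or mr).
SET hive.execution.engine=tez;

-- A SQL-like aggregation over a hypothetical log_events table.
SELECT level, COUNT(*) AS event_count
FROM log_events
GROUP BY level
ORDER BY event_count DESC;
```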
There are
two types of tables within Hive.
- Internal (managed): Data is stored in the Hive data warehouse, located at /hive/warehouse/ on the default storage for the cluster. Dropping an internal table also deletes its data, so these tables are mainly used for temporary or intermediate data.
- External: Data is stored outside the data warehouse. Use external tables when the data is also used outside of Hive or needs to stay in its underlying location; dropping the table removes only the metadata, not the files.
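The two table types differ only in the `EXTERNAL` keyword and the `LOCATION` clause. A minimal sketch, with hypothetical table names, columns, and HDFS path:

```sql
-- Internal (managed) table: Hive stores the data under the warehouse directory.
CREATE TABLE logs_internal (
  ts    STRING,
  level STRING,
  msg   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- External table: data stays at the given HDFS location; DROP TABLE
-- removes only the metadata, not the underlying files.
CREATE EXTERNAL TABLE logs_external (
  ts    STRING,
  level STRING,
  msg   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';
```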
A Hive table consists of a schema stored in the metastore and data stored on HDFS. The supported file formats are Text File, SequenceFile, RCFile, Avro Files, ORC Files, Parquet, and custom
INPUTFORMAT and OUTPUTFORMAT.
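The file format is declared with a `STORED AS` clause when the table is created. A brief sketch with hypothetical table names:

```sql
-- Store the table data as ORC files.
CREATE TABLE events_orc (
  id      BIGINT,
  payload STRING
)
STORED AS ORC;

-- The same schema stored as Parquet files.
CREATE TABLE events_parquet (
  id      BIGINT,
  payload STRING
)
STORED AS PARQUET;
```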