
Hive + HCatalog

By Amru Eliwat
CS 157B at San Jose State University
www.linkedin.com/in/amrue/
Agenda
What is Hive?
What is HCatalog?
- Using it with Hive
Setting up Hive + HCat locally
Setting up Hive + HCat in a Virtual Machine
Demo
- Loading data into HCat manually
- Loading data into HCat using Hive
- Basic Hive queries
Total time: Approximately 30 minutes
Hive
Apache Hive is a data warehouse infrastructure
built on top of Hadoop for providing data
summarization, query, and analysis.
Hive
Runs SQL-like queries written in HiveQL, which are
implicitly converted into map-reduce jobs.
Because of HiveQL's declarative nature, Hive excels
at ad-hoc analysis.
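For example, a single declarative HiveQL statement like the sketch below is compiled into a map-reduce job behind the scenes; the posts table and its posttypeid column are borrowed from the demo later in this deck and are assumptions here:

```sql
-- Ad-hoc analysis in HiveQL: count posts per type.
-- Hive compiles this into a map-reduce job; no Java
-- MapReduce code is written by hand.
SELECT posttypeid, COUNT(*) AS num_posts
FROM posts
GROUP BY posttypeid;
```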
Hive
Hive acts on metadata in the Hive metastore.
Metadata is stored in an Apache Derby database by
default, but a MySQL database can be used instead.
- When using the default Derby database, only one
process can connect to the metastore at a time, so
this is only ideal for testing purposes.
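Switching the metastore from Derby to MySQL is done in hive-site.xml; a minimal sketch, where the host, database name, and credentials are placeholders you would replace with your own:

```xml
<!-- hive-site.xml: point the metastore at MySQL instead of embedded Derby -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```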
Although the metastore presents the data relationally,
we do not get the efficiency and optimization of an
RDBMS, since queries are converted into map-reduce jobs.
Hive
Peter Jamack, on the IBM developerWorks blog, asks: "Hive
for ETL or ELT?"
You can extract, transform, then load your data with Hive, but
Jamack suggests it is better to extract, load, then transform
with Hive.
Hive works better for some types of data than others.
Obviously, choosing between adopting an ELT or ETL philosophy
requires thought. This decision can account for more than 70
percent of the planning time required for many data warehouse,
master data management, and other database projects.
HCatalog
Apache HCatalog is a table and storage
management layer for Hadoop that enables users
with different data processing tools (Apache Pig,
MapReduce, and Apache Hive) to more easily
read and write data on the grid.
HCatalog
HCatalog presents a relational view of data. Data is
stored in tables and these tables can be placed into
databases.
Hive can read data in HCatalog directly, because
HCatalog is based on Hive's metastore. Other tools
require interfaces, such as HCatLoader and HCatStorer
for Pig.
In other words, HCatalog can be seen as a project
enabling non-Hive scripts to access Hive metastore
tables.
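For instance, a Pig script can read a Hive metastore table through HCatLoader. A sketch (run with pig -useHCatalog; the posts table and posttypeid column are assumptions, and the HCatLoader package name differs across HCatalog versions):

```pig
-- Load the Hive-managed 'posts' table through HCatalog's Pig interface
posts = LOAD 'posts' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Filter using the schema HCatalog supplies; no schema declared in Pig
questions = FILTER posts BY posttypeid == 1;
DUMP questions;
```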
HCatalog
As mentioned earlier, we do not get the efficiency of
an RDBMS despite the data being presented in a
relational view.
Setup
Installing Hive requires some work; however, it comes with HCatalog out of the box
to use as the metastore.
1. Download and unpack the tarball (.tar.gz).
2. Set the environment variable HIVE_HOME to point to the installation directory:
export HIVE_HOME=/usr/local/hive-0.12.0
3. Add $HIVE_HOME/bin to your PATH:
export PATH=$HIVE_HOME/bin:$PATH
4. You will need Hadoop installed to continue. Create the following folders for
Hive's metastore, then make them group-writable (chmod g+w) in HDFS like so:
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
Setup
5. Copy the file hive-default.xml.template in /hive-0.12.0/conf and
rename it hive-site.xml.
6. Finally, run $HIVE_HOME/hcatalog/sbin/hcat_server.sh start in
the terminal.
7. Create a table in Hive using CREATE TABLE.
8. Load data:
LOAD DATA LOCAL INPATH './files/stackoverflow.txt'
OVERWRITE INTO TABLE posts;
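Step 7's CREATE TABLE might look like the sketch below; the column names and the tab delimiter are assumptions about the layout of stackoverflow.txt, so adjust them to match your file:

```sql
-- Hypothetical schema for the posts table loaded in step 8
CREATE TABLE posts (
  id         INT,
  posttypeid INT,
  title      STRING,
  body       STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```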
Hive + HCatalog
If you've made it this far, you can use your knowledge of SQL to
run HiveQL queries.
SELECT column_name FROM table_name;
SELECT a.* FROM a JOIN b ON (a.id = b.id);
Only equality joins, outer joins, and left semi joins are supported in
Hive. Hive does not support join conditions that are not equality
conditions as it is very difficult to express such conditions as a
map/reduce job.
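For example, a left semi join is Hive's way of expressing an IN/EXISTS-style check; the table and column names here are illustrative:

```sql
-- Returns rows of a that have at least one match in b.
-- Columns of b cannot appear in the SELECT list of a
-- LEFT SEMI JOIN.
SELECT a.id, a.name
FROM a LEFT SEMI JOIN b ON (a.id = b.id);
```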

http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
Setup #2
Download the correct version of HortonWorks
Sandbox for your virtual machine setup from the
HortonWorks webpage.
Double-click the HortonWorks Sandbox virtual
machine launch file (it's the only file in the folder you
just downloaded).
Point your browser to 127.0.0.1:8888
Demo
Loading data into HCat and analyzing it with Hive
Demo
Fire up HortonWorks in VirtualBox
Click on HCatalog on the upper toolbar
On the left-hand side, choose Create New Table
Manually and give it a name.
Finally, select a file to upload.
Alternatively, use Hive for the whole process:
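That Hive-only flow might look like the following sketch; the post_etl name comes from the demo query below, while its columns and delimiter are assumptions:

```sql
-- Create the table and load the file without touching the HCatalog UI
CREATE TABLE post_etl (id INT, posttypeid INT, title STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH './files/stackoverflow.txt'
OVERWRITE INTO TABLE post_etl;
```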
Once the data is loaded into HCat, we can take a
closer look at the data.
SELECT * FROM post_etl WHERE posttypeid = 1;
Q&A
