
Hive Overview

Feb 05, 2013

YAHOO! CONFIDENTIAL

What is Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides:

Logical data partitioning

Metastore (command line and web interfaces)

Query Language (Hive QL)

Libraries to handle different serialization formats (SerDes) http://sva.ti.com/assets/en/other/designcon2004_serdes.pdf

JDBC interface

Extensibility: Types, Functions, Formats, Scripts

Data Units
Databases.

Tables.
Partitions.

Buckets (or Clusters).
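As a minimal sketch of how these units nest, the Hive QL below creates a database, a table inside it, partitions on a date string, and buckets within each partition. The database, table, and column names and the bucket count are hypothetical and do not come from the slides.

```sql
-- Hypothetical names throughout; chosen only for illustration.
CREATE DATABASE IF NOT EXISTS web_logs_db;

CREATE TABLE web_logs_db.page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (ds STRING)               -- one partition per distinct value of ds
CLUSTERED BY (user_id) INTO 32 BUCKETS;  -- rows are hashed on user_id into 32 buckets
```

The partition column (ds) is declared separately from the regular columns, while the bucketing column (user_id) must be one of the regular columns.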


Data Types:
Primitive types
Integers: TINYINT, SMALLINT, INT, BIGINT.
Boolean: BOOLEAN.
Floating point numbers: FLOAT, DOUBLE.
String: STRING.
BINARY, TIMESTAMP, DECIMAL

Complex types
Structs: {a INT; b INT}.
Maps: M['group'].
Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
Union: UNIONTYPE<datatype, datatype>
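To make the type notation concrete, here is a small sketch that declares columns of several of these types and reads from the complex ones. The table name type_demo and its columns are hypothetical.

```sql
-- Hypothetical table illustrating primitive and complex column types.
CREATE TABLE type_demo (
  id    INT,
  name  STRING,
  point STRUCT<a:INT, b:INT>,
  props MAP<STRING, STRING>,
  tags  ARRAY<STRING>
);

-- Struct fields use dot notation, maps are indexed by key,
-- and arrays use a zero-based index (tags[1] is the second element).
SELECT point.a, props['group'], tags[1] FROM type_demo;
```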

Physical Layout
Warehouse directory in HDFS, e.g., /user/hive/warehouse

Tables stored in subdirectories of warehouse

Partitions form subdirectories of tables; buckets are stored as files within them

Actual data stored in flat files

Control char-delimited text, or SequenceFiles
With custom SerDe, can use arbitrary format


Hive QL: DDL Operations


CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);

SHOW TABLES;

DESCRIBE sample;

ALTER TABLE sample ADD COLUMNS (new_col INT);


DROP TABLE sample;


Hive QL: DDL Operations cont.


hive> SHOW TABLES;

hive> CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

hive> DESCRIBE shakespeare;


Data Load Statement:


LOAD DATA [LOCAL] INPATH '/path/to/file' [OVERWRITE] INTO TABLE table_name [PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)]

You can load data from the local filesystem or anywhere in HDFS (cf. CREATE EXTERNAL TABLE)

If you don't specify OVERWRITE, data will be appended to existing table
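A short sketch of the two loading modes against the sample table from the DDL slide may help. The file paths and the partition value are hypothetical.

```sql
-- Paths and partition value are hypothetical.
-- LOCAL: copy a file from the local filesystem; OVERWRITE replaces
-- whatever the target partition already holds.
LOAD DATA LOCAL INPATH '/tmp/sample_2013-01-31.txt'
OVERWRITE INTO TABLE sample PARTITION (ds='2013-01-31');

-- No LOCAL: the file is already in HDFS and is moved into the table.
-- No OVERWRITE: it is added alongside the existing data.
LOAD DATA INPATH '/user/hive/staging/sample_more.txt'
INTO TABLE sample PARTITION (ds='2013-01-31');
```

Because sample is partitioned by ds, each load names the partition it targets.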


SELECTS and FILTERS:


SELECT foo FROM sample WHERE ds='2013-01-31';

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2013-01-31';

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;


SELECTS and FILTERS cont.


hive> SELECT * FROM shakespeare LIMIT 10;

hive> SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;

hive> SELECT freq, COUNT(1) AS f2 FROM shakespeare GROUP BY freq SORT BY f2 DESC LIMIT 10;

hive> EXPLAIN SELECT freq, COUNT(1) AS f2 FROM shakespeare GROUP BY freq SORT BY f2 DESC LIMIT 10;



Built-in Functions
Mathematical: round, floor, ceil, rand, exp...
Collection: size, map_keys, map_values, array_contains.
Type Conversion: cast.
Date: from_unixtime, to_date, year, datediff...
Conditional: if, case, coalesce.
String: length, reverse, upper, trim...
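As an illustration, a few of these functions can be combined in one query over the shakespeare table (freq INT, word STRING) created on the DDL slide. The column aliases and the threshold of 100 are arbitrary choices for this sketch.

```sql
-- Aliases and the 100 threshold are illustrative only.
SELECT
  upper(word)                      AS word_upper,    -- string
  length(word)                     AS word_len,      -- string
  round(freq / 100.0, 1)           AS freq_hundreds, -- mathematical
  if(freq > 100, 'common', 'rare') AS freq_label,    -- conditional
  cast(freq AS STRING)             AS freq_text      -- type conversion
FROM shakespeare
LIMIT 10;
```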


Q&A


UAD: Important Links


Main Hive Project Page - http://hive.apache.org/

Installation and Setup Docs:
http://hortonworks.com/blog/set-up-apache-hadoop-in-minutes-with-rpms/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
http://hadoop.apache.org/common/docs/current/single_node_setup.html#Local

Webcast: http://www.cloudera.com/content/cloudera/en/resources/library/training/introduction-to-apache-hive.html

Twiki: http://twiki.corp.yahoo.com/view/DBAOps/Hadoop_hive_testing
