
Hive

Getting started
ssh olap-app02.lo
> hive
show databases; show tables;

Most important databases:
- oltp: most MySQL tables
- scribe: imported Scribe logs
- stats: Rat logs and aggregated stats such as pmp_interactions
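For a quick smoke test once connected (the table name is taken from the stats list above; the query itself is purely illustrative):

select * from stats.pmp_interactions limit 10;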
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
https://cwiki.apache.org/confluence/display/Hive/Tutorial
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Hadoop jobs page: http://dfs001.lo:50030/jobtracker.jsp

Using Hive in Besar


with_hive(:db => :cms) do |hive|
  params = { :foo => ... }
  hive.execute_template('my_template.q', params)
end

- see HiveAdapter (e.g., helpers to update/insert rows)
- my_template.q is stored in templates/
- params are accessible as <%= foo %>
- the current date is passed in automatically as <%= dt %>
- to run a job every night, add it to app/warehouse/warehouse.rb
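A sketch of what such a template might contain (table and column names are made up; only the <%= foo %> and <%= dt %> substitutions come from the notes above):

-- my_template.q (hypothetical)
insert overwrite table daily_foo partition (dt = '<%= dt %>')
select id, kind from source_events where kind = '<%= foo %>';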

Getting data into HDFS


-- add a MySQL table
app/warehouse/lib/tables.yml.erb

-- add a Scribe log
app/scribe_log/tables.yml

-- one-time import
load data [local] inpath '/path/to/input.txt' overwrite into table foo;

-- map an external table
create external table foo (a string, ...)
  row format delimited
  fields terminated by '\t'
  lines terminated by '\n'
  stored as textfile
  location 'hdfs://dfs001.local:9000/path/to/data/';

Exporting data

hive -e "select ..." > file.out
hive -f query.q > file.out

insert overwrite local directory '/path/data' select ...;

or use an external table (see previous slide)

Tips and Tricks


-- test a query on a random subset
create table foo_sample as
  select * from foo order by rand() limit 1000;

-- show column headers in the CLI
set hive.cli.print.header=true;

-- run dfs commands inside Hive's CLI
hive> dfs -ls /user/alienhard;

Tips and Tricks 2


-- hive command history
~/.hivehistory

-- get a setting
set hive.cli.print.header;

-- dump all settings
set -v;

-- save your own settings
~/.hiverc

Performance Tuning
Strategy: split large queries and optimize them one by one.

- There are no indexes! To avoid full table scans, create and correctly use partitions where possible, but don't over-partition (example after this list).
- Optimize joins: in all joins, put the biggest table last; use map-side only joins (next slide).
- Compress data and use an optimized file format.

-- run jobs in parallel (default = false). Be careful!
set hive.exec.parallel=true;

-- turn off speculative execution for long-running jobs
set mapred.map.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.reduce.tasks.speculative.execution=false;
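A minimal sketch of the partitioning advice (the events table and its columns are made up for illustration):

-- partition by date so queries filtering on dt only read matching directories
create table events (user_id bigint, action string)
  partitioned by (dt string);

-- load one day's data into its own partition
load data inpath '/path/to/2014-02-28' into table events partition (dt='2014-02-28');

-- the predicate on dt prunes partitions instead of scanning the full table
select count(*) from events where dt = '2014-02-28';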

Performance: Optimizing Joins


-- map-side only join, if one join table is small (3-5x speedup!)
select /*+ MAPJOIN(a) */ ... from foo_a a join foo_b b on ...;

If both tables are large, map-side joins are still possible with tables clustered into buckets, but this is more complex (sketch below).
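A hedged sketch of the bucketed variant (table and column names are made up; both tables must be clustered on the join key, with bucket counts that are multiples of each other):

create table big_a (id bigint, ...) clustered by (id) into 64 buckets;
create table big_b (id bigint, ...) clustered by (id) into 64 buckets;

set hive.optimize.bucketmapjoin=true;
select /*+ MAPJOIN(b) */ ... from big_a a join big_b b on a.id = b.id;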

Performance: Compression
Options: BZip2 and LZO (splittable); Gzip and Snappy (not splittable).
LZO and Snappy are fast but have a low compression ratio; Gzip and BZip2 have a high compression ratio but are slow.
Don't use non-splittable codecs (unless you know you're doing it right).

set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
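Intermediate map output between job stages can be compressed too; since that data is consumed per-task and never split, a fast non-splittable codec like Snappy is safe there. A sketch, using the same Hadoop-1 era property names as above:

-- compress intermediate output between MapReduce stages
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;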

Performance: File Formats

- Textfile (default; useful for external tables)
- SequenceFile (binary; block compression and hence splittable)
- RCFile (binary; record-columnar format)
- Parquet (the best)

create table foo stored as rcfile as select ...;
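Parquet works the same way, assuming a Hive version with native Parquet support (0.13+):

create table foo stored as parquet as select ...;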

Scheduler
-- set env variable when starting a Besar job
BESAR_HADOOP_QUEUE=lowpri

-- in Besar
with_hive(:hadoop_queue => 'lowpri') do ... end

-- in Hive
set mapred.fairscheduler.pool=lowpri;

-- in Hadoop, change it during execution:
http://dfs001.local:50030/scheduler

Views
Views to reduce query complexity. Instead of nested queries or temporary tables:

create view subquery as select ...;

-- just use it as a regular table
select ... from subquery where ...;

A view doesn't materialize any data.

Streaming
select transform (a, b) using '/bin/cat' AS new_a, new_b from ...;

or your Ruby script:


#!/usr/bin/env ruby
ARGF.each do |line|
  line = line.chomp
  next if line.empty?  # skip blank lines
  out = ...
  puts out.join("\t")
end
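To wire a script into a query, ship it to the cluster with add file first (the script path and name here are illustrative):

add file /home/alienhard/my_script.rb;
select transform (a, b) using 'my_script.rb' as new_a, new_b from ...;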

Custom UDFs
Example: https://github.com/livingsocial/HiveSwarm

-- compile, scp the jar to olap-app02, and add it in hive:
add jar /home/alienhard/HiveSwarm.jar;
create temporary function dayofweek as 'com.livingsocial.hive.udf.DayOfWeek';
select dayofweek(to_date(created_at)) from oltp.payments_orders limit 10;

Debugging and Caveats


- Some Hive error messages are cryptic. Go to the jobtracker and open the failed tasks list to see the real exception.
- Ctrl-C doesn't immediately kill a job; instead run `hadoop job -kill job_201402281210_0670`
- describe <table> returns `compressed: false` even if the table is compressed
- the CLI doesn't like SQL comments containing ;
- Hadoop job progress isn't always linear

"My trust in Hive's ability to plan a query sanely isn't too strong." -- Stephen Veiss

Don't use Hive

if you don't have to.
