Getting started
ssh olap-app02.lo
> hive
show databases;
show tables;

Most important databases:
- oltp: most MySQL tables
- scribe: imported Scribe logs
- stats: Rat logs and aggregated stats such as pmp_interactions
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
https://cwiki.apache.org/confluence/display/Hive/Tutorial
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Exporting data
hive -e "select ..." > file.out
hive -f query.q > file.out
insert overwrite local directory '/path/data' select ...
or use an external table (see previous slide)
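The external-table route can be sketched like this (the table name export_tmp and the HDFS path are hypothetical):

```sql
-- create an external table whose data lives at a known HDFS path
create external table export_tmp (id int, total double)
row format delimited fields terminated by '\t'
location '/tmp/export_tmp';

-- populate it; the files under /tmp/export_tmp can then be
-- fetched directly with hadoop fs -get
insert overwrite table export_tmp
select id, total from oltp.payments_orders;
```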
Performance Tuning
Strategy: split large queries and optimize one by one
- There are no indexes! To avoid full table scans, create and correctly use partitions if possible (but don't over-partition)
- Optimize joins: in all joins, put the biggest table last; map-side only joins (next slide)
- Compress data and use an optimized file format

-- run jobs in parallel (default = false). Be careful!
set hive.exec.parallel=true;
-- turn off speculative execution for long-running jobs
set mapred.map.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.reduce.tasks.speculative.execution=false;
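A minimal sketch of the partitioning advice above, assuming a hypothetical logs table partitioned by day:

```sql
create table logs (msg string)
partitioned by (ds string);

-- load data into one specific partition
insert overwrite table logs partition (ds='2013-01-01')
select msg from staging_logs;

-- filtering on the partition column lets Hive prune partitions
-- instead of doing a full table scan
select count(*) from logs where ds = '2013-01-01';
```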
Performance: Compression
Options: BZip2 and LZO (splittable); Gzip and Snappy (not splittable)
- LZO and Snappy are fast but have a low compression ratio
- Gzip and BZip2 have a high compression ratio but are slow

Don't use non-splittable codecs (unless you know you're doing it right)
set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
- Textfile (default, useful for external tables)
- SequenceFile (binary, block compression and hence splittable)
- RCFile (binary, record-columnar format)
- Parquet (the best)
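To use one of these formats, declare it when creating the table. A sketch with a hypothetical table, using block-compressed SequenceFile output (which stays splittable):

```sql
set hive.exec.compress.output=true;
-- compress whole blocks rather than individual records
set mapred.output.compression.type=BLOCK;

create table stats_seq (k string, v bigint)
stored as sequencefile;
```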
Scheduler
-- set env variable when starting a Besar job
BESAR_HADOOP_QUEUE=lowpri

-- in Besar
with_hive(:hadoop_queue => 'lowpri') do
  ...
end

-- in Hive
set mapred.fairscheduler.pool=lowpri;

-- in Hadoop: change it during execution at
http://dfs001.local:50030/scheduler
Views
Views to Reduce Query Complexity

Instead of nested queries or temporary tables:
create view subquery as select ... ;
-- just use it as a regular table
select ... from subquery where ...;

A view doesn't materialize any data.
Streaming
select transform (a, b) using '/bin/cat' AS new_a, new_b from ...;
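Streaming can run any executable, not just cat. A sketch with a hypothetical script and table (to_upper.py, some_table); the script must first be shipped to the cluster nodes:

```sql
-- distribute the script to all nodes
add file /home/alienhard/to_upper.py;

-- each row's columns arrive tab-separated on the script's stdin;
-- whatever it writes to stdout becomes the output rows
select transform (a, b)
using 'python to_upper.py' as (upper_a, upper_b)
from some_table;
```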
Custom UDFs
Example: https://github.com/livingsocial/HiveSwarm

-- compile, scp the jar to olap-app02, and add it in Hive:
add jar /home/alienhard/HiveSwarm.jar;
create temporary function dayofweek as 'com.livingsocial.hive.udf.DayOfWeek';
select dayofweek(to_date(created_at)) from oltp.payments_orders limit 10;
My trust in Hive's ability to plan a query sanely isn't too strong -- Stephen Veiss