
Yahoo! White Paper

Oracle & Hadoop
Direct Data Load from HDFS
to Oracle Database

By
Dean Brundige
0BA Engineering
&
Apun Hiran
Y! DBA Team
Introduction

Apache Hadoop is a new way for enterprises to store and analyze data.
Hadoop is an open-source project administered by the Apache Software
Foundation. Enterprises today collect and generate more data than ever before.
Relational and data warehouse products excel at OLAP and OLTP workloads over
structured data. Hadoop, however, was designed to solve a different problem: the
fast, reliable analysis of both structured data and complex data. Hadoop consists
of two key services: reliable data storage using the Hadoop Distributed File
System (HDFS) and high-performance parallel data processing using a technique
called MapReduce.

An Oracle External Table is a read-only table whose metadata is stored in the
database but whose actual data resides in files outside the database. The database
uses this metadata (the external table definition) to access data residing in external
files as if they were relational tables stored inside the database. The external tables
feature is a complement to the existing SQL*Loader functionality.

With Oracle 10g, there is the ability to write to an external table using the Data
Pump access driver, making it possible to move data from within the database to
an external flat file outside of the database. This ability is restricted to the use of
the create table as select command only; DML operations are therefore not
supported. The resulting flat file is in an Oracle proprietary format that is
independent of the underlying OS the file is created on.

The primary benefit of this feature is the ability to load/unload tables, making it
possible to perform data transforms in either direction.

This makes external tables more flexible than Oracle's Data Pump utilities (impdp
and expdp) because it is possible not only to move data but also to perform
complex transformations while it is being loaded/unloaded.

Additionally, joins can be created on the data while it is loaded/unloaded, which
cannot be done with the Data Pump utilities.


Purpose

As the amount of data available for analytics these days is increasing multi-fold,
Hadoop is increasingly being leveraged for processing/crunching/aggregation of
such data. In most of our data warehousing systems, the data aggregated by
Hadoop is loaded into an Oracle database for further aggregation & analysis
before availability for reporting.

A real world case would be DAD, where we get the Dimension data from the TAC
database and FACT data from the Grid. Both these data sources get loaded into the
DAD database, where a few levels of aggregations are performed to produce
MSTR reports.

Currently, the data from Hadoop/Grid is pulled into a network file system, which
acts as a staging area, and then gets loaded into the Oracle database via external
tables.

Current challenges with this method:
1.) A considerable amount of time is spent copying the files over from grid to the
shared file system.
2.) As the data volume is high, grid has to output compressed files, which have
to be uncompressed before loading into the Oracle Database.
3.) Shared network space is required.

We will try to access the data stored in the Hadoop (Grid) cluster from within the
Oracle Database. We will compare the time it takes to directly load this data into
Oracle vs. the current method.

This test is purely a proof of concept and a lot more tests related to the reliability
of this method need to be performed.

The outlined method additionally leverages Oracle External Table and its
preprocessor feature. You can get more details about Oracle external table from:
http://dba.yahoo.com/documents/Whitepapers/Oracle_External_tables.pdf



Hardware

Hardware details of the hosts we ran the tests on:

Host: IBM x3850 M2, 4 x Xeon E7450 2.40GHz
CPU: 4 x Xeon E7450 2.40GHz (24 cores)
Memory: 64 GB
Database: Oracle 3 node RAC, version 11.2.0.2

Conclusion
PROS:

• Parallelism can be used with external tables.
• If we want to use compressed files, the best option is to have multiple files in
an external table rather than one file per external table. We cannot benefit
from parallelism on a single compressed file.
• An Oracle external table can access data directly from Hadoop/Grid.
• Oracle RAC dynamically load balances the connections to grid depending on
the system load vs. external processes managing connections to the database.
• This method is scalable: as the number of parallel threads (DOP) increases, the
loading time decreases significantly and the rows per second increase.
• A very large amount of data can be loaded with this method.
• Hadoop data is easily available to any client processes such as sqlplus, odbc,
jdbc etc. that can select from Oracle tables. We can avoid loading data into
Oracle altogether by just creating external tables over Hadoop data.
• Transformations/conditions can be applied on the data being loaded.
• An API can be written that could automate the process of creation of Oracle
external tables and dependencies (preprocessor script etc.).
CONS:

• If the data needs to be re-read multiple times over a time period, the overhead
of pulling it over the network multiple times can be costly compared to having it in
persistent storage in Oracle data segments.
• As with any Oracle external table read, the underlying data read is essentially a
full table scan.
• We cannot leverage the benefit of indexes with external tables.


These conclusions were the result of the following test cases:

1. Single grid output part file load from Hadoop to Oracle Database
   a. Uncompressed data load
   b. Compressed data load
   c. Oracle degree of parallelism of 0, 2, 4, & 8. (Details of the case can be
      found in case study 1.)

2. Parallel load of 10 grid output part files from Hadoop to Oracle Database
   a. Uncompressed data load
   b. Compressed data load
   c. Oracle degree of parallelism of 0, 2, 4, & 8. (Details of the case can be
      found in case study 2.)

3. Parallel load of 100 grid output part files from Hadoop to Oracle Database
   a. Uncompressed data load
   b. Compressed data load
   c. Oracle degree of parallelism of 0, 2, 4, & 8. (Details of the case can be
      found in case study 3.)

4. Parallel load of 200 grid output part files from Hadoop to Oracle Database
   a. Uncompressed data load
   b. Compressed data load
   c. Oracle degree of parallelism of 0, 2, 4, & 8. (Details of the case can be
      found in case study 4.)

5. Parallel load of 200 compressed grid output files from Hadoop to Oracle
   database with "order by" clause. (Details of the case can be found in case study 5.)
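The test matrix for cases 1 to 4 can be enumerated mechanically. The following is a minimal sketch, not the authors' tooling: it prints one CTAS statement per combination of file count, format, and degree of parallelism, using the table naming seen in the case studies (h_<files>_p<dop> and hive_raw_<files>_part_<txt|bz2>). The function names ctas_statement and print_test_matrix are hypothetical.

```shell
#!/bin/sh
# Sketch only: ctas_statement and print_test_matrix are illustrative names,
# not part of the original test harness.

# Print the CTAS statement for one test.
# $1 = number of part files, $2 = format (txt or bz2), $3 = degree of parallelism
ctas_statement() {
    files=$1; fmt=$2; dop=$3
    if [ "$dop" -eq 0 ]; then
        hint="noparallel(a)"        # DOP 0 is the serial baseline
    else
        hint="parallel(a,$dop)"
    fi
    printf 'create table h_%s_p%s nologging as select /*+ %s */ * from hive_raw_%s_part_%s a;\n' \
        "$files" "$dop" "$hint" "$files" "$fmt"
}

# Enumerate every combination used in cases 1 to 4.
print_test_matrix() {
    for files in 1 10 100 200; do
        for fmt in txt bz2; do
            for dop in 0 2 4 8; do
                ctas_statement "$files" "$fmt" "$dop"
            done
        done
    done
}
```

Calling print_test_matrix emits 32 statements. Because the uncompressed and compressed runs reuse the same target table names, each target table would presumably need to be dropped between runs.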


Test Case:

External Table Definition:

CREATE TABLE HIVE_RAW_PARTS_TXT_00000_00000
(
ETL_SEQ NUMBER,
ENTITY_ID NUMBER,
SEGMENT_IDS NUMBER,
YMDH DATE,
CREATIVE_ID NUMBER,
BUYER_LINE_ITEM_ID NUMBER,
SELLER_LINE_ITEM_ID NUMBER,
IMPRESSIONS NUMBER,
CLICKS NUMBER,
CONVERSIONS NUMBER,
AMT_PAID_TO_MEDIA_SELLER NUMBER,
AMT_PAID_TO_DATA_SELLER NUMBER,
DATA_REVENUE_FROM_BUYER NUMBER,
MEDIA_REVENUE_FROM_BUYER NUMBER,
AMT_PAID_TO_BROKER NUMBER,
SECTION_ID NUMBER,
TRIGGER_BY_CLICK NUMBER,
SITE_ID NUMBER,
SIZE_ID NUMBER,
BUYER_ENTITY_ID NUMBER,
CONVERSION_ID NUMBER,
BUCKET_ID NUMBER,
ADVERTISER_SK NUMBER,
PUBLISHER_SK NUMBER,
SEGMENT_SK NUMBER,
SHARED_FLAG NUMBER
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY HIVE_EXT
ACCESS PARAMETERS
( records delimited by '\n'
nobadfile
nologfile
preprocessor hive_ext: '[loaduncompress.sh/loadcompress.sh]' *
fields terminated by 0X'01'
ldrtrim
missing field values are null
reject rows with all null fields
(
ETL_SEQ INTEGER EXTERNAL,
ENTITY_ID INTEGER EXTERNAL,
SEGMENT_IDS INTEGER EXTERNAL,
YMDH DATE 'YYYY-MM-DD HH24:MI:SS',
CREATIVE_ID INTEGER EXTERNAL,
BUYER_LINE_ITEM_ID INTEGER EXTERNAL,
SELLER_LINE_ITEM_ID INTEGER EXTERNAL,
IMPRESSIONS INTEGER EXTERNAL,
CLICKS INTEGER EXTERNAL,
CONVERSIONS INTEGER EXTERNAL,
AMT_PAID_TO_MEDIA_SELLER INTEGER EXTERNAL,
AMT_PAID_TO_DATA_SELLER INTEGER EXTERNAL,
DATA_REVENUE_FROM_BUYER INTEGER EXTERNAL,
MEDIA_REVENUE_FROM_BUYER INTEGER EXTERNAL,
AMT_PAID_TO_BROKER INTEGER EXTERNAL,
SECTION_ID INTEGER EXTERNAL,
TRIGGER_BY_CLICK INTEGER EXTERNAL,
SITE_ID INTEGER EXTERNAL,
SIZE_ID INTEGER EXTERNAL,
BUYER_ENTITY_ID INTEGER EXTERNAL,
CONVERSION_ID INTEGER EXTERNAL,
BUCKET_ID INTEGER EXTERNAL,
ADVERTISER_SK INTEGER EXTERNAL,
PUBLISHER_SK INTEGER EXTERNAL,
SEGMENT_SK INTEGER EXTERNAL,
SHARED_FLAG INTEGER EXTERNAL
)
)
LOCATION (HIVE_EXT:'part-r-00000', 'part-r-00001', ..., 'part-r-00nnn') **
)
REJECT LIMIT UNLIMITED
NOMONITORING;

NOTES:
* - Clause "PREPROCESSOR" was changed to loaduncompress.sh when loading .TXT
(non-compressed) files from grid, and loadcompress.sh was used while loading .bz2 files
from the grid.
** - Clause "LOCATION" was used to define the number of files loaded for each test
case. E.g. if we have 200 part files for one logical output, we need to specify 200 files in
the LOCATION clause.
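Listing 200 file names by hand is error-prone, so the LOCATION clause is a natural candidate for generation. The sketch below is an illustration under stated assumptions rather than the authors' method: it assumes the part files follow the part-r-NNNNN naming shown above with a five-digit, zero-based suffix, and that unprefixed names resolve against the table's DEFAULT DIRECTORY. The function name location_list is hypothetical.

```shell
#!/bin/sh
# Sketch only: location_list is an illustrative helper, not an Oracle utility.
# Assumes part files are named part-r-00000, part-r-00001, ... (five digits).

# Print a LOCATION clause listing the first $1 part files, each quoted separately.
location_list() {
    count=$1
    i=0
    list=""
    while [ "$i" -lt "$count" ]; do
        name=$(printf "'part-r-%05d'" "$i")
        if [ -z "$list" ]; then
            list=$name
        else
            list="$list, $name"
        fi
        i=$((i + 1))
    done
    printf 'LOCATION (%s)\n' "$list"
}
```

For example, location_list 3 prints LOCATION ('part-r-00000', 'part-r-00001', 'part-r-00002'); the same call with 200 would produce the clause needed for the 200-file test case.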
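Preprocessor Scripts:
loaduncompress.sh
#!/bin/sh
# Take the fourth "/"-separated field of the path in $1 as the file name.
fn=`echo "$1" | /bin/awk -F "/" '{print $4}'`
# Stream the part file from the HDFS proxy to standard output.
/usr/bin/curl -s -S -b "<file to read cookies from>" \
  -c "<writes cookies to this file after operation>" \
  --cacert "<location of certificate file>" \
  --header "Yahoo-App-Auth: ${val}" \
  "https://<hdfs proxy host>:4443/fs/<file location>/$fn"

loadcompress.sh
#!/bin/sh
# Take the fourth "/"-separated field of the path in $1 as the file name.
fn=`echo "$1" | /bin/awk -F "/" '{print $4}'`
# Stream the .bz2 part file from the HDFS proxy and decompress it to standard output.
/usr/bin/curl -s -S -b "<file to read cookies from>" \
  -c "<writes cookies to this file after operation>" \
  --cacert "<location of certificate file>" \
  --header "Yahoo-App-Auth: ${val}" \
  "https://<hdfs proxy host>:4443/fs/<file location>/$fn.bz2" | /usr/bin/bzcat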
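Both scripts take the fourth "/"-separated field of $1 as the file name, which only works when the directory holding the location files sits exactly two levels below the root. A depth-independent alternative is sketched here, assuming the preprocessor receives the full path of the location file as its first argument. It uses POSIX parameter expansion, so no external binary is needed; the function name part_name is hypothetical.

```shell
#!/bin/sh
# Sketch only: part_name is an illustrative helper.
# Assumes $1 is the full path of the location file handed to the preprocessor.

# Print the last path component of $1, regardless of directory depth.
part_name() {
    printf '%s\n' "${1##*/}"
}
```

In the scripts, fn=$(part_name "$1") would then replace the awk pipeline. For a two-level path such as /data/hive_ext/part-r-00000 both approaches yield part-r-00000.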
Figure 1: Accessing Hadoop Output via Oracle External Table.
The Process:

Since Oracle 10g, we have the option of specifying a script as part of the
preprocessor in the external table definition. With this option Oracle has
empowered us to build our own executable/script to transform the data residing
in flat files before loading it into the database. The output of such a
script/executable should be written to the standard output of the OS for Oracle to read it.

In this case, the shell script runs the curl command to access files on the Hadoop
Cluster and redirects the output to the standard output on the database host via
the part-r-00000-n files, which in turn is loaded into the database using CTAS
(CREATE TABLE AS SELECT).

[Figure 1 diagram: HDFS output part files (part_r_00000, part_r_00001,
part_r_00002, ..., part_r_nnnnn) are each read by curl through the external table
and loaded by CTAS into a table in Oracle Database 10g/11g.]

Compressed Hadoop output (.bz2 files) can be loaded in Oracle tables
by using "bzcat" in the external table preprocessor.
Case Studies:

1.) Loading single file (uncompressed/compressed) from Hadoop to Oracle Database:

-- For uncompressed data load:

create table h_1_p0 nologging as select /*+ noparallel(a) */ * from hive_raw_1_part_txt a;
create table h_1_p2 nologging as select /*+ parallel(a,2) */ * from hive_raw_1_part_txt a;
create table h_1_p4 nologging as select /*+ parallel(a,4) */ * from hive_raw_1_part_txt a;
create table h_1_p8 nologging as select /*+ parallel(a,8) */ * from hive_raw_1_part_txt a;

-- For compressed data load:

create table h_1_p0 nologging as select /*+ noparallel(a) */ * from hive_raw_1_part_bz2 a;
create table h_1_p2 nologging as select /*+ parallel(a,2) */ * from hive_raw_1_part_bz2 a;
create table h_1_p4 nologging as select /*+ parallel(a,4) */ * from hive_raw_1_part_bz2 a;
create table h_1_p8 nologging as select /*+ parallel(a,8) */ * from hive_raw_1_part_bz2 a;

DOP   Uncompressed Data (Rows=767,208)    Compressed Data (Rows=833,742)
      Elapsed Time    Rows Per Second     Elapsed Time    Rows Per Second
0     00:11.0         69,746              00:13.68        60,946
2     00:12.0         63,934              00:13.52        61,677
4     00:12.0         69,746              00:13.22        63,067
8     00:12.2         63,934              00:13.82        60,328

Findings/Conclusions:
1.) Oracle can access a compressed/uncompressed file on the Hadoop cluster via an
external table.
2.) Though a single file can be accessed in parallel (if the file resides on a file
system), effectively Oracle uses only 1 session/thread to read the file from grid,
and hence there is no benefit in reading/loading a single Hadoop file
(compressed/uncompressed) in parallel.

2.) Loading 10 files (uncompressed/compressed) from Hadoop to Oracle Database:

-- For uncompressed data load:

create table h_10_p0 nologging as select /*+ noparallel(a) */ * from hive_raw_10_part_txt a;
create table h_10_p2 nologging as select /*+ parallel(a,2) */ * from hive_raw_10_part_txt a;
create table h_10_p4 nologging as select /*+ parallel(a,4) */ * from hive_raw_10_part_txt a;
create table h_10_p8 nologging as select /*+ parallel(a,8) */ * from hive_raw_10_part_txt a;

-- For compressed data load:

create table h_10_p0 nologging as select /*+ noparallel(a) */ * from hive_raw_10_part_bz2 a;
create table h_10_p2 nologging as select /*+ parallel(a,2) */ * from hive_raw_10_part_bz2 a;
create table h_10_p4 nologging as select /*+ parallel(a,4) */ * from hive_raw_10_part_bz2 a;
create table h_10_p8 nologging as select /*+ parallel(a,8) */ * from hive_raw_10_part_bz2 a;

DOP   Uncompressed Data (Rows=7,668,872)  Compressed Data (Rows=8,328,412)
      Elapsed Time    Rows Per Second     Elapsed Time    Rows Per Second
0     01:36.43        79,884              02:05.11        66,569
2     00:55.15        139,434             01:07.47        123,439
4     00:38.29        201,812             00:47.66        174,746
8     00:35.00        219,111             00:40.46        205,843

3.) Loading 100 files (uncompressed/compressed) from Hadoop to Oracle Database:

-- For uncompressed data load:

create table h_100_p0 nologging as select /*+ noparallel(a) */ * from hive_raw_100_part_txt a;
create table h_100_p2 nologging as select /*+ parallel(a,2) */ * from hive_raw_100_part_txt a;
create table h_100_p4 nologging as select /*+ parallel(a,4) */ * from hive_raw_100_part_txt a;
create table h_100_p8 nologging as select /*+ parallel(a,8) */ * from hive_raw_100_part_txt a;

-- For compressed data load:

create table h_100_p0 nologging as select /*+ noparallel(a) */ * from hive_raw_100_part_bz2 a;
create table h_100_p2 nologging as select /*+ parallel(a,2) */ * from hive_raw_100_part_bz2 a;
create table h_100_p4 nologging as select /*+ parallel(a,4) */ * from hive_raw_100_part_bz2 a;
create table h_100_p8 nologging as select /*+ parallel(a,8) */ * from hive_raw_100_part_bz2 a;

DOP   Uncompressed Data (Rows=76,663,790) Compressed Data (Rows=83,276,476)
      Elapsed Time    Rows Per Second     Elapsed Time    Rows Per Second
0     16:26.91        77,752              20:37.25        67,308
2     09:07.28        140,153             10:29.13        132,367
4     04:51.23        263,449             06:03.97        225,090
8     04:00.69        319,432             05:32.11        250,750

4.) Loading 200 files (uncompressed/compressed) from Hadoop to Oracle Database:

-- For uncompressed data load:

create table h_200_p0 nologging as select /*+ noparallel(a) */ * from hive_raw_200_part_txt a;
create table h_200_p2 nologging as select /*+ parallel(a,2) */ * from hive_raw_200_part_txt a;
create table h_200_p4 nologging as select /*+ parallel(a,4) */ * from hive_raw_200_part_txt a;
create table h_200_p8 nologging as select /*+ parallel(a,8) */ * from hive_raw_200_part_txt a;

-- For compressed data load:

create table h_200_p0 nologging as select /*+ noparallel(a) */ * from hive_raw_200_part_bz2 a;
create table h_200_p2 nologging as select /*+ parallel(a,2) */ * from hive_raw_200_part_bz2 a;
create table h_200_p4 nologging as select /*+ parallel(a,4) */ * from hive_raw_200_part_bz2 a;
create table h_200_p8 nologging as select /*+ parallel(a,8) */ * from hive_raw_200_part_bz2 a;


DOP   Uncompressed Data (Rows=153,334,394) Compressed Data (Rows=165,699,995)
      Elapsed Time    Rows Per Second      Elapsed Time    Rows Per Second
0     32:51.56        77,795               40:14.23        68,635
2     18:29.71        138,264              23:18.62        118,474
4     08:59.08        284,479              12:36.02        219,174
8     07:04.28        361,638              09:58.51        276,854

Findings/Conclusions:
1.) Oracle can access multiple files simultaneously from Hadoop and load them into
database tables.
2.) With the use of parallelism and multiple files we can speed up the loading process,
and timings are comparable to loading files from the file system.
3.) DOP and processing times are inversely proportional, resulting in high scalability.
4.) Extremely large datasets can be pulled via this process.
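The scaling claim can be checked against the reported timings with a little arithmetic. The sketch below assumes elapsed times are in the mm:ss.ff form used in the result tables; the helper names to_seconds, rows_per_second, and speedup are hypothetical. It divides by fractional seconds, which reproduces the compressed figures; some uncompressed entries in the tables appear to have been computed from whole seconds instead, so those may differ slightly.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative. Elapsed times are mm:ss.ff strings.

# Convert mm:ss.ff to seconds with two decimals.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# Rows loaded per second, rounded to the nearest whole row.
# $1 = row count (no thousands separators), $2 = elapsed time
rows_per_second() {
    awk -v rows="$1" -v secs="$(to_seconds "$2")" \
        'BEGIN { printf "%.0f\n", rows / secs }'
}

# Ratio of a baseline elapsed time to a faster elapsed time.
# $1 = baseline (e.g. DOP 0), $2 = comparison (e.g. DOP 8)
speedup() {
    awk -v base="$(to_seconds "$1")" -v cmp="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", base / cmp }'
}
```

For instance, rows_per_second 8328412 02:05.11 prints 66569, matching the 10-file compressed run at DOP 0. For the 200-file uncompressed run, speedup 32:51.56 07:04.28 prints 4.65, so on these figures elapsed time falls steadily with DOP, though by less than the full factor of 8.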

5.) Loading 200 compressed files from Hadoop to Oracle Database with order by
clause:

In TAC production an order by is applied at load time. To compare this test to an
actual production data load, an order by was included in the following test.

Production Timings:

PROCESS_NAME: C2C_619163_SEGMENTS
ROWS_LOADED: 165699995
CURL/Data pull from GRID: 00:01:37
CTAS: 00:40:39

Direct Load with "order by" for the above dataset:

create table h_200_p0_ordered nologging as select /*+ noparallel(a) */ * from
hive_raw_200_part_bz2 a order by 2,3,4;

It took 30 mins 48 sec to load 165,699,995 rows with DOP 8.
Findings/Conclusions:

1.) We can apply conditions (where, order by) while loading data directly from
grid.
