Simulating a Data Science Pipe-Line on your Laptop

Ed Bullen, Oracle UK

Copyright © 2016, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 1
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.

Oracle

Open Source Projects at Oracle
http://openjdk.java.net/projects/graal/

Motivation
Data Science and Engineering

[Diagram labels: Maths, Science, Programming, Engineering]

A Simple Data Science Pipe-Line
Engineering a Data Processing Pipe-Line

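Pipeline stages: Source -> Raw Data -> Pre-Process -> Summarise -> Consumers

Source:       UK Crime Data
Pre-Process:  Hadoop Map (Python)
Summarise:    Hadoop Reduce (Python)
Consumers:    HDFS, Hive (R Studio)

The Python Map and Reduce steps run through the Hadoop Streaming API.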


A simple approach: well known, not the latest cutting-edge tech, but
stable: effective, easy to implement, static technology components
The Oracle Big Data Lite VM
Free, Simple to Install – Fast Track Access to Hadoop Stack Technologies
Main Download Site:
http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html

Personal Blog – Additional Assistance and Network Configuration Tips:


https://pygot.wordpress.com/2016/07/08/getting-started-with-the-oracle-hadoop-vm/

Map Reduce
A Quick Refresher: Map, Shuffle, Reduce

[Diagram: data held in HDFS is mapped on Node 1 and Node 2, shuffled by key, then reduced on Node 1 and Node 2 to per-key counts (=3, =1, =2, =2)]
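The three phases in the refresher above can be sketched in plain Python, without Hadoop. This is only an illustration: the function names and the two input splits are hypothetical, and a real cluster distributes each phase across nodes rather than running it in one process.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, 1) pair per input record."""
    for record in records:
        yield record, 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input splits, one list per map node.
node1_split = ["burglary", "robbery", "burglary"]
node2_split = ["robbery", "burglary", "theft"]

pairs = list(map_phase(node1_split)) + list(map_phase(node2_split))
counts = reduce_phase(shuffle_phase(pairs))
print(counts)
```

Each map node only sees its own split; the shuffle is what brings every value for a given key to a single reducer.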

Hadoop Streaming API
Deploy Python and R Straight to Hadoop
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file my_python_mapper.py -mapper "python my_python_mapper.py" \
-file my_python_reducer.py -reducer "python my_python_reducer.py" \
-input /user/hadoopuser/source_HDFS_dir \
-output /user/hadoopuser/dest_HDFS_dir

Flow: HDFS -> Hadoop STD-IN -> Mapper (executed in OS shell) -> STD-OUT
      -> Hadoop SORT and NODE PARTITION
      -> STD-IN -> Reducer (executed in OS shell) -> STD-OUT -> HDFS
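The standard-input and standard-output contract above can be tried locally before submitting a job. The sketch below is a minimal stand-in, assuming tab-separated key/value lines (the Hadoop Streaming default); the function names are hypothetical and it is not the code from the talk. In a real job each function would live in its own script that reads `sys.stdin` and prints its results.

```python
from itertools import groupby


def run_mapper(lines):
    """Mapper: emit 'key<TAB>1' for each non-empty input line."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t1"


def run_reducer(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def simulate_streaming(lines):
    """Local stand-in for: cat input | mapper | sort | reducer."""
    return list(run_reducer(sorted(run_mapper(lines))))


result = simulate_streaming(["robbery\n", "burglary\n", "robbery\n", "\n"])
print(result)
```

Because the reducer relies on sorted input, it only has to compare each key with the previous one and never needs to hold the whole data set in memory.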

Hadoop Streaming API
Sample Code – Python Map-Reduce for UK Crime Data
https://data.police.uk/data/

Input (raw CSV):

Crime ID,Month,Reported by,Falls within,..,LSOA, Crime type...
,2012-01,Avon and Somerset Constabulary,..,E01014399, Anti-social behaviour...
,2012-02,Avon and Somerset Constabulary,..,E01014400, Burglary...

Output (one row per month and LSOA, one count column per crime type):

DATE   , LSOA     , LSOA_Name , crime[0], crime[1], crime[2], ... crime[n]
2012-01, e01014399, LSOA Desc , 1       , 2       , 0       , ... 4
2012-02, e01014400, LSOA Desc , 1       , 2       , 0       , ... 4
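The reshaping from raw rows to one summary row per month and LSOA might look roughly like the sketch below. It is not the code from the GitHub repository: the column names (`Month`, `LSOA code`, `LSOA name`, `Crime type`) are assumptions about the CSV header, the three crime types are a hypothetical subset, and the sample rows are made up for illustration.

```python
import csv
from collections import Counter, defaultdict

# Assumed subset of crime categories; the real list is longer.
CRIME_TYPES = ["Anti-social behaviour", "Burglary", "Robbery"]


def map_rows(csv_lines):
    """Emit ((month, lsoa, lsoa_name), crime_type) for each usable row."""
    for row in csv.DictReader(csv_lines):
        lsoa = (row.get("LSOA code") or "").strip().lower()
        if not lsoa:
            continue  # skip rows that carry no location code
        key = (
            (row.get("Month") or "").strip(),
            lsoa,
            (row.get("LSOA name") or "").strip(),
        )
        yield key, (row.get("Crime type") or "").strip()


def reduce_rows(pairs):
    """Pivot crime types into one count column per type."""
    counts = defaultdict(Counter)
    for key, crime_type in pairs:
        counts[key][crime_type] += 1
    for key in sorted(counts):
        yield list(key) + [counts[key][c] for c in CRIME_TYPES]


# Made-up sample rows using the assumed header.
sample = [
    "Month,LSOA code,LSOA name,Crime type",
    "2012-01,E01014399,Example 001A,Burglary",
    "2012-01,E01014399,Example 001A,Burglary",
    "2012-01,E01014399,Example 001A,Robbery",
    "2012-02,E01014400,Example 001B,Anti-social behaviour",
    "2012-02,,,Burglary",
]
summary = list(reduce_rows(map_rows(sample)))
for line in summary:
    print(line)
```

Keying on month and LSOA together means the sort step between map and reduce delivers all crimes for one area and month to the same reducer call.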

Example Code on GitHub


https://github.com/edbullen/py-mapred

Accessing the Data in R Studio
A Simple Approach

# Load Libraries and setup Java ClassPath
library("DBI")
library("rJava")
library("RJDBC")

# Java ClassPath for HIVE Access
cp = c("./hive-jdbc.jar"
     , "./hadoop-common.jar"
     , "./libthrift-0.9.2.jar"
     , "./hive-service.jar"
     , "./httpclient-4.2.5.jar"
     , "./httpcore-4.2.5.jar"
     , "./hive-jdbc-standalone.jar")

# Connect to Hive datastore in Hadoop
.jinit(classpath=cp)
drv <- JDBC("org.apache.hive.jdbc.HiveDriver"
          , "hive-jdbc.jar")
conn <- dbConnect(drv
                , "jdbc:hive2://bigdatalite:10000/default"
                , "oracle", "")

# Query Data using SQL
ukcrimesum <- dbGetQuery(conn
                       , "select * from ukcrimesum")

Personal Blog – Connecting R Studio to Hadoop via Hive:


https://pygot.wordpress.com/2016/10/13/connecting-r-studio-to-hadoop-via-hive/

Analysis of the Data-Set
A quick first-pass…

Analysis of the Data-Set
Correlation and Clustering

ukcrimesum <- dbGetQuery(conn
                       , "select * from ukcrimesum")

# which crimes show correlation?
crimesM <- data.matrix(ukcrimesum[,4:17])
corM <- cor(crimesM)
diag(corM) <- 0
heatmap(corM)
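For readers less familiar with R, the same calculation can be sketched in plain Python: a Pearson correlation for every pair of crime columns, with the diagonal set to zero, presumably so that each column's self-correlation of 1 does not dominate the heatmap. The column values below are made up and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations as nested dicts, with the diagonal set to 0."""
    names = list(columns)
    return {
        a: {b: 0.0 if a == b else pearson(columns[a], columns[b]) for b in names}
        for a in names
    }


# Made-up counts per area for three crime types.
columns = {
    "burglary": [1, 2, 3, 4],
    "robbery": [2, 4, 6, 8],
    "bicycle_theft": [4, 3, 2, 1],
}
matrix = correlation_matrix(columns)
print(matrix["burglary"])
```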

Analysis of the Data-Set
Seasonality

# Sum the selected crime types per month
monthagg <- aggregate(cbind(robbery
                          , burglary
                          , bicycle_theft
                          , social) ~ date
                    , data=monthcrimes
                    , FUN=sum)

# Centre each series on its mean to show the seasonal deviation
centered <- cbind(monthagg$date
                , as.data.frame(apply(monthagg[-1]
                                    , 2
                                    , function(y) y - mean(y))) )

# One bar plot per crime type, stacked vertically
par(mfrow = c(4,1))
attach(centered)

for (name in names(centered)[-1] ) {
  barplot(as.vector(centered[name][,1])
        , main = paste(name))
}
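The two steps behind the seasonality plots, summing per month and then subtracting the series mean, can be written out in plain Python as follows. This is a sketch only; the rows are made up and the function names are hypothetical.

```python
from collections import defaultdict


def monthly_totals(rows, column):
    """Sum one crime column per month; rows are dicts with a 'date' key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += row[column]
    return dict(sorted(totals.items()))


def centre(totals):
    """Subtract the mean so each month shows its deviation from average."""
    mean = sum(totals.values()) / len(totals)
    return {month: value - mean for month, value in totals.items()}


# Made-up rows: one per month and area.
rows = [
    {"date": "2015-01", "burglary": 10},
    {"date": "2015-01", "burglary": 20},
    {"date": "2015-02", "burglary": 60},
]
totals = monthly_totals(rows, "burglary")
centred = centre(totals)
print(centred)
```

Centring puts months above and below average on opposite sides of zero, which makes a seasonal pattern easier to spot in a bar plot.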

Analysis of the Data-Set
Mapping
library("rgeos")
library("maptools")

# Load the LSOA boundary shapefile and keep only the London LSOAs
ukshapefileDETAIL <- "./LSOA_2011_EW_BFE_V2.shp"
ukmap <- readShapeSpatial(ukshapefileDETAIL)
lonmap <- ukmap[match(lonLSOA, ukmap@data$LSOA11CD),]

# Total classified crimes per LSOA
loncrime <- dbGetQuery(conn, "select LSOA,
                       sum(total_classified) from ukcrimesum
                       where date in <...>
                       and lsoa in <...> group by LSOA")

# Combined Map Data (shapeFile) with added data
lonmap.crime <- SpatialPolygonsDataFrame(lonmap
                                       , loncrime, match.ID=FALSE)

# Colour each LSOA by the interval its count falls into
plot(lonmap.crime
   , col = countcols[findInterval(counts
                                , breaks, all.inside = TRUE)]
   , axes = FALSE
   , border = "transparent"
   , main = "2015 Total Crimes" )
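The colour lookup in the plot call relies on R's `findInterval` with `all.inside = TRUE`, which assigns each count to a bin between consecutive break points and clamps out-of-range values into the first or last bin. A rough Python equivalent, using zero-based bin indices and made-up breaks, colours and counts, could be:

```python
from bisect import bisect_right


def find_interval(values, breaks):
    """Zero-based bin index for each value, clamped to the valid bins.

    Bin i covers breaks[i] <= value < breaks[i + 1]; values outside the
    breaks fall into the first or last bin.
    """
    last_bin = len(breaks) - 2
    return [min(max(bisect_right(breaks, v) - 1, 0), last_bin) for v in values]


# Hypothetical breaks and palette: three bins need four break points.
breaks = [0, 10, 50, 100]
countcols = ["yellow", "orange", "red"]
counts = [5, 10, 75, 250, -3]

bins = find_interval(counts, breaks)
colours = [countcols[i] for i in bins]
print(colours)
```

Clamping matters for choropleth maps: without it, a count above the top break would index past the end of the palette.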

Analysis of the Data-Set
Mapping
library("rgeos")
library("maptools")

ukshapefileDETAIL <- "./LSOA_2011_EW_BFE_V2.shp"
ukmap <- readShapeSpatial(ukshapefileDETAIL)
lonmap <- ukmap[match(lonLSOA, ukmap@data$LSOA11CD),]

loncrime <- dbGetQuery(conn, "select LSOA,
                       sum(bicycle_theft) from ukcrimesum
                       where date in <...>
                       and lsoa in <...> group by LSOA")

# Combined Map Data (shapeFile) with added data
lonmap.crime <- SpatialPolygonsDataFrame(lonmap
                                       , loncrime, match.ID=FALSE)

plot(lonmap.crime
   , col = countcols[findInterval(counts
                                , breaks, all.inside = TRUE)]
   , axes = FALSE
   , border = "transparent"
   , main = "2015 Bicycle Theft" )

Thank You
ed.bullen@oracle.com
Social Media and Blog: ** all personal views, not representing my employer **

@bullened

http://pygot.wordpress.com

http://github.com/edbullen

https://www.meetup.com/Oracle-UK-BigData/
