
Using R for Scalable Data Analytics:

Single Machines to Spark Clusters


John-Mark Agosta, Hang Zhang, Robert Horton, Mario Inchiosa, Srini
Kumar*, Katherine Zhao, Vanja Paunic and Debraj GuhaThakurta

Microsoft
Acknowledgements: Gopi Kumar, Paul Shealy, Ali-Kazim Zaidi

* Currently at: LevaData


TUTORIAL MATERIAL & SLIDES:

tinyurl.com/Strata2017R
ROOM: LL21 C/D, San Jose Convention Center
TIME: 9:00am - 12:30pm, March 14th, 2017
1
Key learning objectives

How to scale R code with distributed, parallel, and off-memory processing

How to develop a scalable, end-to-end (E2E) R data-science process

How to easily operationalize code and models written in R

How to use cloud infrastructure (single nodes or clusters) to develop, scale, and operationalize

2
Tutorial Outline

Introduction & Orientation [15 mins]

Scaling R on Spark: hands-on tutorials w/ presentations [150 mins]
  SparkR & sparklyr [75 mins]
  RevoScaleR [75 mins]

Approaches not covered in hands-on [15 mins]

Wrap-up, summary, Q&A [15 mins]

15-minute break after ~1.5 hours (between the two hands-on parts)

3
Introduction - Scaling your R scripts

Katherine Zhao

4
Introduction

What is R?

What limits the scalability of R scripts?

What functions and techniques can be used to overcome those limits?

5
What is R?

Language: the most popular statistical programming language; also a data visualization tool
Platform: open source

Community:
  2.5+ million users
  Taught in most universities
  Many common use cases across industry
  Thriving user groups worldwide
  Ranked 5th in the 2016 IEEE Spectrum language ranking
  42% of professional analysts prefer R (highest among R, SAS, and Python)

Ecosystem:
  10,000+ contributed packages
  Rich application & platform integration
6
R adoption is on a tear, but there are several issues regarding scalability:

In-memory operation

Expensive data movement & duplication

Lack of parallelism

7
A couple of scalable R solutions

R packages for distributed computing [hands-on]:
  SparkR
  sparklyr
  RevoScaleR (Microsoft R Server)
  h2o
  and more!

R packages with big-data support on single machines:
  The bigmemory project
  ff and related packages
  foreach with doParallel, doSNOW, and doNWS backends

8
Hands-on Tutorials w/ Presentations

Part I: SparkR and sparklyr [75 mins]

Katherine Zhao
Debraj GuhaThakurta
Srini Kumar
Acknowledgements: Ali-Kazim Zaidi, Hang Zhang
9
Distributed computing on Spark
Brief intro to Spark, its APIs, and open-source R packages

10
Scale on Spark clusters

What is Spark?
A unified, open-source, parallel data-processing framework for big data analytics

11
SparkR 2.0: a Spark API

12
Data processing and modeling with SparkR

13
General analytical workflow in Spark
(across multiple toolkits)

1. Ingest data into a Spark DataFrame
2. Transformation, wrangling, and cleanup (Spark SQL, dplyr)
3. Exploration and visualization (for visualization, sampled data may need to be converted to an R data frame)
4. Featurization (Spark SQL + featurization functions)
5. Creation of ML models in Spark, followed by model evaluation
6. Save models for deployment

Spark dataframes used multiple times in the workflow should be cached in memory
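To make the workflow concrete, here is a minimal SparkR sketch of these steps, assuming an airline-style CSV; the file paths, column names (ArrDelay, DepDelay, DayOfWeek), and sampling fraction are illustrative placeholders, not the tutorial's exact code.

library(SparkR)
sparkR.session(appName = "workflow-sketch")            # connect to Spark

# 1. Ingest a CSV into a Spark DataFrame (path is a placeholder)
flights <- read.df("/data/airline.csv", source = "csv",
                   header = "true", inferSchema = "true")

# 2. Wrangling / cleanup with SparkR verbs (executed as Spark SQL)
flights <- filter(flights, isNotNull(flights$ArrDelay))
cache(flights)                                          # reused below, so cache it

# 3. Exploration: aggregate in Spark; collect only a small sample back to R
byDay <- summarize(groupBy(flights, flights$DayOfWeek),
                   meanDelay = mean(flights$ArrDelay))
localSample <- collect(sample(flights, withReplacement = FALSE, fraction = 0.01))

# 4-6. Featurize and model in Spark, evaluate, then persist the model
model <- spark.glm(flights, ArrDelay ~ DepDelay + DayOfWeek, family = "gaussian")
summary(model)
write.ml(model, "/models/arrdelay_glm")                 # saved for later deployment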

14
Platforms & Services for
Hands-on

15
Single-node Azure Linux DSVM w/ Spark (for hands-on)

Data Science Virtual Machine, including (among other tools):
  Microsoft R Server (developer edition)
  Spark 2.0.2 with local HDFS and YARN
  Vowpal Wabbit
  xgboost
  Rattle
  CNTK

http://aka.ms/dsvm
16
Spark clusters in Azure HDInsight

Provisions Azure compute resources with Spark 2.0.2 installed and configured.

Supports multiple versions (e.g. Spark 1.6).

Stores data in Azure Blob storage (WASB), Azure Data Lake Store, or local HDFS.
17
GitHub repository for all code and scripts

tinyurl.com/Strata2017R

18
SparkR Hands-on

Debraj GuhaThakurta
Srini Kumar
19
Model deployment using R Server operationalization services

Easy deployment: the data scientist publishes R code and models from Microsoft R Client (mrsdeploy package, publishService) to a Microsoft R Server configured for operationalizing R analytics.

Easy consumption: data scientists consume the published web services from R, again via Microsoft R Client (mrsdeploy package).

Easy setup: in-cloud or on-prem; add nodes to scale; high availability & load balancing; remote execution server.

Easy integration: developers can call the same web services from their applications.
20
Deployment
Turn R analytics into web services easily, and consume them in R

Build the model first, then deploy it as a web service instantly (package: mrsdeploy)

https://msdn.microsoft.com/en-us/microsoft-r/operationalize/configuration-initial
https://msdn.microsoft.com/en-us/microsoft-r/operationalize/admin-utility
21
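A minimal sketch of the publish-and-consume round trip with mrsdeploy; the endpoint URL, credentials, model, and service name are illustrative assumptions, not the tutorial's exact service.

library(mrsdeploy)

# Log in to the R Server operationalization endpoint (URL/credentials are placeholders)
remoteLogin("http://localhost:12800", username = "admin", password = "<password>",
            session = FALSE)

# A previously trained model plus a scoring function to expose (airlineXdf is assumed)
delayModel <- rxLinMod(ArrDelay ~ DepDelay + CRSDepTime, data = airlineXdf)
scoreDelay <- function(DepDelay, CRSDepTime) {
  newdata <- data.frame(DepDelay = DepDelay, CRSDepTime = CRSDepTime)
  rxPredict(delayModel, data = newdata)$ArrDelay_Pred
}

# Publish the function (and the model it uses) as a versioned web service
api <- publishService("delayService",
                      code = scoreDelay, model = delayModel,
                      inputs  = list(DepDelay = "numeric", CRSDepTime = "numeric"),
                      outputs = list(answer = "numeric"),
                      v = "v1.0.0")

# Consume the service from R; the returned object exposes the published function by name
result <- api$scoreDelay(DepDelay = 30, CRSDepTime = 9.5)
print(result$output("answer"))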
mrsdeploy Hands-on

Debraj GuhaThakurta

22
sparklyr: R interface for Apache Spark

Easy installation from CRAN

Connects to both local instances of Spark and remote Spark clusters

Loads data into Spark DataFrames from local R data frames, Hive tables, CSV, JSON, and Parquet files
Source: http://spark.rstudio.com/
23
dplyr and ML in sparklyr

Provides a complete dplyr back end for data manipulation, analysis, and visualization; %>% pipelines are translated to Spark SQL

Includes three families of functions for building machine-learning pipelines (see the sketch after this list):

ml_*: Machine learning algorithms for analyzing data, provided by the spark.ml package.
  K-Means, GLM, linear/logistic regression, survival regression, decision trees, random forests, gradient-boosted trees, PCA, Naive Bayes, multilayer perceptron, LDA

ft_*: Feature transformers for manipulating individual features.

sdf_*: Functions for manipulating SparkDataFrames.
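
A minimal sparklyr sketch combining the dplyr back end with an ml_* learner, assuming a local Spark instance and the nycflights13 sample data; exact ml_* signatures vary somewhat across sparklyr versions.

library(sparklyr)
library(dplyr)

# Connect to Spark (local here; e.g. master = "yarn-client" on a cluster)
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark; spark_read_csv() handles files in HDFS/WASB
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in the cluster
flights_tbl %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarise(n = n(), mean_dep_delay = mean(dep_delay)) %>%
  arrange(desc(mean_dep_delay)) %>%
  collect()                                   # bring only the small result back to R

# sdf_* functions prepare Spark DataFrames, ml_* functions wrap spark.ml estimators
flights_clean <- flights_tbl %>% filter(!is.na(arr_delay), !is.na(dep_delay))
partitions <- sdf_partition(flights_clean, training = 0.7, test = 0.3, seed = 42)
fit <- ml_linear_regression(partitions$training, arr_delay ~ dep_delay + distance)
summary(fit)

spark_disconnect(sc)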


24
h2o: prediction engine in R

Optimized for in-memory processing of distributed, parallel machine learning algorithms on clusters.

Sparkling Water = h2o + Spark

Data manipulation and modeling: R functions + h2o-prefixed functions.
  Transformations: h2o.group_by(), h2o.impute()
  Statistics: h2o.summary(), h2o.quantile(), h2o.mean()
  Algorithms: h2o.glm(), h2o.naiveBayes(), h2o.deeplearning(), h2o.kmeans()

http://www.h2o.ai/product/
25
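A minimal h2o sketch of the same load / wrangle / model pattern; the file path, column names, and delay threshold are placeholders (on Spark, Sparkling Water / rsparkling provides the bridge, but the h2o.* calls look the same).

library(h2o)

# Start or connect to an H2O cluster, using all available cores
h2o.init(nthreads = -1)

# Import data directly into a distributed H2OFrame (path is a placeholder)
airlines <- h2o.importFile("/data/airline.csv")
summary(airlines)
airlines <- airlines[!is.na(airlines$ArrDelay), ]   # drop rows with a missing response

# h2o-prefixed transformations and statistics run inside the cluster
byDay <- h2o.group_by(airlines, by = "DayOfWeek", mean("ArrDelay"), nrow("ArrDelay"))

# Distributed model training with an h2o algorithm
airlines$IsDelayed <- as.factor(airlines$ArrDelay > 15)
fit <- h2o.glm(x = c("DepDelay", "Distance", "DayOfWeek"), y = "IsDelayed",
               training_frame = airlines, family = "binomial")
h2o.performance(fit)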
sparklyr Hands-on

Debraj GuhaThakurta

26
15 min break

27
Hands-on Tutorials w/ Presentation

Part II: RevoScaleR [75 mins]

Mario Inchiosa
Robert Horton
Vanja Paunic
John-Mark Agosta
Katherine Zhao
28
Hands-on Tutorial:
Airline Arrival Delay Prediction using
R Server and SparkR

Mario Inchiosa
29
R Server 9.0: scale-out R, enterprise class!

100% compatible with open-source R
  Any code/package that works today with R will work in R Server.

Ability to parallelize any R function
  Ideal for parameter sweeps, simulation, and scoring.

Wide range of scalable and distributed rx-prefixed functions in the RevoScaleR package (see the sketch below):
  Transformations: rxDataStep()
  Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()
  Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()
  Parallelism: rxSetComputeContext()
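
A small RevoScaleR sketch tying these rx functions together; file paths, column names, and the delay threshold are illustrative assumptions. The same calls run unchanged on a cluster after switching the compute context.

library(RevoScaleR)   # ships with Microsoft R Server / Microsoft R Client

# Import a CSV into the chunked XDF format, adding a label column along the way
csvData <- RxTextData("airline.csv")
xdfData <- RxXdfData("airline.xdf")
rxDataStep(inData = csvData, outFile = xdfData, overwrite = TRUE,
           transforms = list(IsDelayed = as.numeric(ArrDelay > 15)))

# Scalable summary statistics, computed chunk by chunk over the whole file
rxSummary(~ ArrDelay + DayOfWeek, data = xdfData)

# Distributed model fit and scoring; rxSetComputeContext(RxSpark(...)) would
# push exactly the same calls out to a Spark cluster
fit <- rxLogit(IsDelayed ~ DepDelay + DayOfWeek, data = xdfData)
rxPredict(fit, data = xdfData, outData = RxXdfData("scored.xdf"), overwrite = TRUE)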
30
Azure HDInsight + R Server: managed Hadoop for advanced analytics in the cloud

Easy setup, elastic, SLA
Ubuntu Linux; Spark and Hadoop; cloud storage (Blob storage, Data Lake Store)
R Server on the cluster exposes both SparkR functions and RevoScaleR functions

Leverage R skills with massively scalable algorithms and statistical functions
Reuse existing R functions over multiple machines
R Server Hadoop architecture

[Diagram: a master R process on the edge node coordinates worker R processes on the data nodes via Apache YARN and Spark, with the data held in distributed storage]

R Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows, and terabytes of data
Logistic regression on the NYC Taxi dataset (2.2 TB)

[Chart: elapsed time vs. billions of rows]
Typical advanced analytics lifecycle

Prepare → Model → Operationalize

Airline Arrival Delay Prediction demo
  Clean/join: using SparkR from R Server
  Train/score/evaluate: scalable R Server functions
  Deploy/consume: using mrsdeploy from R Server

Airline data set
  Passenger flight on-time performance data from the US Department of Transportation's TranStats data collection
  20+ years of data
  300+ airports
  Every carrier, every commercial flight
  http://www.transtats.bts.gov

Weather data set
  Hourly land-based weather observations from NOAA
  2,000+ weather stations
  http://www.ncdc.noaa.gov/orders/qclcd/
Provisioning a cluster with R Server
Scaling a cluster
Clean and join using SparkR in R Server
Train, score, and evaluate using R Server
Publish a web service from R

Demo Technologies Review
  HDInsight Premium: Hadoop cluster
  Data Science Virtual Machine
  Spark on YARN: distributed computing
  R Server: R interpreter
  SparkR: data manipulation functions
  RevoScaleR: statistical & machine learning functions
  mrsdeploy: web service operationalization
Distributed model training and
parameter optimization:
Learning Curves on Big Data

Robert M. Horton, PhD MS


Senior Data Scientist
44
Learning Curve
Simulated Data
A B C D E F G H I J y
a00002 b00001 c00003 d00002 e00026 f00011 g00043 h00142 i00049 j00161 -19.4032
a00001 b00002 c00004 d00013 e00024 f00047 g00037 h00139 i00068 j00164 28.2963
a00002 b00002 c00002 d00004 e00017 f00002 g00086 h00141 i00059 j00447 -8.9377
a00001 b00002 c00001 d00003 e00012 f00004 g00066 h00050 i00163 j00714 -27.9605
a00001 b00003 c00001 d00002 e00004 f00016 g00011 h00097 i00163 j00246 27.3483
a00002 b00001 c00001 d00003 e00023 f00006 g00002 h00072 i00249 j00188 4.7853
a00001 b00003 c00007 d00010 e00002 f00006 g00036 h00031 i00250 j00179 25.9673
a00002 b00003 c00004 d00016 e00017 f00004 g00029 h00077 i00168 j00020 27.1069
a00001 b00001 c00002 d00011 e00003 f00033 g00047 h00115 i00310 j00280 9.5063
a00001 b00001 c00004 d00006 e00006 f00040 g00086 h00014 i00002 j00374 -19.5206
a00001 b00002 c00001 d00002 e00004 f00028 g00044 h00005 i00431 j00646 -4.0899
a00001 b00003 c00002 d00006 e00018 f00044 g00040 h00232 i00254 j00261 19.7420
a00002 b00002 c00007 d00003 e00011 f00012 g00081 h00071 i00291 j00023 7.9582
a00002 b00003 c00004 d00012 e00005 f00006 g00056 h00182 i00430 j00615 -37.2846
a00001 b00002 c00007 d00001 e00026 f00022 g00033 h00157 i00067 j00039 3.6434

(Factor cardinality increases from column A to column J; y is the numeric response.)
Parameter Table
model_class training_fraction with_formula test_set_kfold_id KFOLDS cube
rxLinMod 0.0150000 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0219736 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0321893 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0471543 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0690766 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.1011907 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.1482349 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.2171503 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.3181049 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.4659939 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.6826375 y ~ D+C+B+A 1 3 TRUE
rxLinMod 1.0000000 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0150000 y ~ E+D+C+B+A 1 3 TRUE
rxLinMod 0.0219736 y ~ E+D+C+B+A 1 3 TRUE
rxLinMod 0.0321893 y ~ E+D+C+B+A 1 3 TRUE

Dynamic Sampling
row_tagger (run on each chunk of data):

# Seed on the chunk number so per-chunk sampling is reproducible
set.seed(chunk_num + salt)
# Assign every row in the chunk to one of k folds
kfold <- sample(1:kfolds, size = num_rows, replace = TRUE)
in_test_set <- kfold == kfold_id
# Keep a fraction `prob` of the non-test rows as training data
num_training_candidates <- sum(!in_test_set)
keepers <- sample(rowNums[!in_test_set],
                  prob * num_training_candidates)
data_list$in_training_set <- rowNums %in% keepers
data_list$in_test_set <- in_test_set
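
As a hedged sketch of how such a row tagger can be applied chunk by chunk, the snippet below wraps the logic above as an rxDataStep transformFunc; the object names and the values passed via transformObjects are placeholders, not the exact demo settings.

# RevoScaleR supplies .rxChunkNum inside a transformFunc; parameters such as
# salt, kfolds, kfold_id and prob arrive via transformObjects
tagRows <- function(data_list) {
  chunk_num <- .rxChunkNum
  num_rows  <- length(data_list[[1]])
  rowNums   <- seq_len(num_rows)
  set.seed(chunk_num + salt)
  kfold <- sample(1:kfolds, size = num_rows, replace = TRUE)
  in_test_set <- kfold == kfold_id
  keepers <- sample(rowNums[!in_test_set], round(prob * sum(!in_test_set)))
  data_list$in_training_set <- rowNums %in% keepers
  data_list$in_test_set <- in_test_set
  data_list
}

rxDataStep(inData = airlineXdf, outFile = taggedXdf, overwrite = TRUE,
           transformFunc = tagRows,
           transformObjects = list(salt = 1, kfolds = 3, kfold_id = 1, prob = 0.05))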
Dynamic Scoring
On each chunk (test-set rows only):

residual <- rxPredict(model, <selected cases>)    # pseudocode: residuals for the held-out rows
SSE <- SSE + sum(residual^2, na.rm = TRUE)
rowCount <- rowCount + sum(!is.na(residual))

On the overall results:
sqrt(SSE / rowCount)    # root mean squared error
Demo
Running learning curves with R Server
Airline Flight Delay:
varying cardinality

Columns added:
Origin33, Dest33,
Origin50, Dest50,
Origin75, Dest75,
Origin100, Dest100,
Origin125, Dest125,
Origin150, Dest150,
Origin250, Dest250
Tuning Boosted Trees
Hierarchical Time Series

Vanja Paunic
57
Comparisons

Katherine Zhao
58
Base and scalable approaches comparison

Approach     Scalability                      Spark   Hadoop   SQL Server   Teradata   Support
CRAN R [1]   Single machines                  -       -        -            -          Community
SparkR       Single + distributed computing   X       -        -            -          Community
sparklyr     Single + distributed computing   X       -        -            -          Community
h2o          Single + distributed computing   X       X        -            -          Community
RevoScaleR   Single + distributed computing   X       X        X            X          Enterprise

[1] CRAN R indicates no additional R packages installed


59
R Server on Spark: faster and more scalable

E2E process:
  Load data from .csv
  Transform features
  Split data: train + test
  Fit model: logistic regression (no regularization)
  Predict and write outputs

Configuration:
  1 edge node: 16 cores, 112 GB
  4 worker nodes: 16 cores, 112 GB
  Dataset: duplicated Airlines data (.csv)
  Number of columns: 26

tinyurl.com/Strata2017R/Performance_Comparison
60
SparkR outperforms when loading data

Load Data:
MRS on Spark: XDF
SparkR: Spark DF
sparklyr: Spark DF
h2o: H2OFrame
CRAN R: DF

Configuration:
1 Edge Node: 16 cores,
112GB
4 Worker Nodes: 16 cores,
112GB
Dataset: Duplicated Airlines
data (.csv)
Number of columns: 26

61
MRS is faster when fitting big data

Configuration:
1 Edge Node: 16 cores,
112GB
4 Worker Nodes: 16 cores,
112GB
Dataset: Duplicated Airlines
data (.csv)
Number of columns: 26

62
MRS saves time when making predictions

Predict:
Outputs predictions
into files in HDFS

Configuration:
1 Edge Node: 16 cores,
112GB
4 Worker Nodes: 16 cores,
112GB
Dataset: Duplicated Airlines
data (.csv)
Number of columns: 26

63
Other Options for Scaling R Scripts

Katherine Zhao

64
The bigmemory project
Created by Michael Kane and John Emerson at Yale University

Works with massive matrix-like objects in R

Combines in-memory and file-backed data structures to analyze numerical data larger than RAM

The data structures may be allocated to shared memory
Source: The Bigmemory Project by Michael Kane and John Emerson: April 29, 2010.
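
A small sketch of a file-backed big.matrix; the file names and dimensions are arbitrary placeholders.

library(bigmemory)
library(biganalytics)

# A file-backed big.matrix keeps the data on disk; only the pages being touched
# are mapped into RAM, so the object can be larger than available memory
x <- filebacked.big.matrix(nrow = 1e7, ncol = 3, type = "double",
                           backingfile = "demo.bin", descriptorfile = "demo.desc")
x[, 1] <- rnorm(1e7)          # ordinary matrix indexing works
x[, 2] <- runif(1e7)
x[, 3] <- x[, 1] + x[, 2]

colmean(x)                    # biganalytics summary, computed on the big.matrix

# A second R process can attach the same data via the descriptor file:
# y <- attach.big.matrix("demo.desc")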
Sister packages and related work
biganalytics: provides exploratory data analysis functionality on
big.matrix
bigtabulate: adds table-, tapply-, and split-like behavior for
big.matrix
bigalgebra: performs linear algebra calculations on big.matrix and R
matrix
synchronicity: supports synchronization and may eventually
support interprocess communication (ipc) and message passing
biglm: provides linear and generalized linear models on big.matrix
Rdsm: enables shared-memory parallelism with big.matrix

Source: The Bigmemory Project by Michael Kane and John Emerson: April 29, 2010.
66
ff package
Provides data structures that are stored on disk but behave as if they were in RAM

Maps only a section of the data into main memory at a time, for efficient access

Accepts numeric and character data as input

Source: ff: memory-efficient storage of large data on disk and fast access functions.
67
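A minimal ff / ffbase sketch; the CSV path and column name are placeholders.

library(ff)
library(ffbase)

# An ff vector lives in a file on disk; only small sections are mapped into RAM
x <- ff(vmode = "double", length = 1e8)
x[1:1e6] <- rnorm(1e6)

# ffdf is the on-disk data.frame analogue; read.csv.ffdf ingests a CSV in chunks
flights <- read.csv.ffdf(file = "airline.csv", header = TRUE, next.rows = 500000)

mean(flights$ArrDelay, na.rm = TRUE)       # ffbase supplies mean.ff, sum.ff, etc.
delayed <- subset(flights, ArrDelay > 15)  # subset.ffdf keeps the result on disk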
ff related packages
ffbase: adds basic statistical functionality to ff. (Note: *.ff functions
apply to ff vectors, and *.ffdf functions apply to ffdf data frames.)
Coercions: as.character.ff(), as.Date_ff_vector(), as.ffdf.ffdf(), as.ram.ffdf()
Selections: subset.ffdf(), ffwhich(), transform.ffdf(), within.ffdf(),
with.ffdf()
Aggregations: quantile.ff(), hist.ff(), sum.ff(), mean.ff(), range.ff(),
tabulate.ff()
Algorithms: bigglm.ffdf()

biglars: provides least-angle regression, lasso and


stepwise regression on ff.
Source: CRAN Task View: High-Performance and Parallel Computing with R, by Dirk Eddelbuettel.
68
Parallel programming with foreach
Provides the foreach() function and two operators, %do% (sequential) and %dopar% (parallel), for running loop iterations as independent tasks
The %dopar% operator relies on a pre-registered parallel backend:
  doParallel (parallel), doSNOW (snow), doMC (multicore), doMPI (Rmpi), etc.

Source: foreach package.
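
A minimal foreach/doParallel sketch; the backend, worker count, and toy parameter sweep are illustrative.

library(foreach)
library(doParallel)

# Register a parallel backend; %dopar% dispatches iterations to these workers
cl <- makeCluster(4)
registerDoParallel(cl)

# A small parameter sweep run in parallel; per-iteration results combined with rbind
fits <- foreach(k = 2:6, .combine = rbind) %dopar% {
  km <- kmeans(iris[, 1:4], centers = k, nstart = 10)
  data.frame(k = k, tot_withinss = km$tot.withinss)
}
print(fits)

stopCluster(cl)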


69
Q&A

CONTACT INFORMATION
Vanja Paunic (vanja.paunic@microsoft.com)
Robert Horton (rhorton@microsoft.com)
Hang Zhang (hangzh@microsoft.com)
Srini Kumar (srini.kumar.private@gmail.com)
Mengyue (Katherine) Zhao (mez@microsoft.com)
John-Mark Agosta (joagosta@microsoft.com)
Mario Inchiosa (marioinc@yahoo.com)
Debraj GuhaThakurta (deguhath@microsoft.com)
70
THANK YOU
Backups
Parallelized & Distributed Analytics

ETL / data import:
  Data import: delimited, fixed-width, SAS, SPSS, ODBC
  Variable creation & transformation
  Recode variables
  Factor variables
  Missing value handling
  Sort, merge, split
  Aggregate by category (means, sums)

Descriptive statistics:
  Min / max, mean, median (approx.)
  Quantiles (approx.)
  Standard deviation, variance
  Correlation, covariance
  Sum of squares (cross-product matrix for a set of variables)
  Pairwise cross tabs
  Risk ratio & odds ratio
  Cross-tabulation of data (standard tables & long form)
  Marginal summaries of cross tabulations

Statistical tests:
  Chi-square test
  Kendall rank correlation
  Fisher's exact test
  Student's t-test

Predictive statistics:
  Sum of squares (cross-product matrix for a set of variables)
  Multiple linear regression
  Generalized linear models (GLM): exponential-family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
  Covariance & correlation matrices
  Logistic regression
  Predictions/scoring for models
  Residuals for all models

Variable selection:
  Stepwise regression

Machine learning:
  Decision trees
  Decision forests
  Gradient-boosted decision trees
  Naive Bayes

Clustering:
  K-Means

Sampling:
  Subsample (observations & variables)
  Random sampling

Simulation:
  Simulation (e.g. Monte Carlo)
  Parallel random number generation

Custom parallelization:
  rxDataStep
  rxExec
  PEMA-R API

73
Portable across multiple platforms

The R Server portfolio runs on: Cloud, Hadoop & Spark, EDW, RDBMS, Desktops & Servers

R Server technology:
  includes fully-parallelized analytics
  provides high-speed and direct connectors to HDFS, Teradata, SAS, etc.
  its distributed computing framework establishes cross-platform portability
  mrsdeploy establishes remote execution and publishes & manages web services
74
ScaleR: parallel + Big Data

Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed

Data is streamed into blocks from sources such as Hive tables, CSV, Parquet, XDF, ODBC, and SQL Server

Interim results are collected and combined analytically to produce the output on the entire data set

The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing
75
Write Once - Deploy Anywhere
ScaleR models can be deployed from a server or edge node to run in Spark/Hadoop without any functional R model re-coding.

Compute context R script - sets where the model will run:

# Local parallel processing - Linux or Windows
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"

### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)

### CREATE LINUX DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = linuxFS)

# In Spark/Hadoop
### SETUP SPARK/HADOOP ENVIRONMENT VARIABLES ###
mySparkCC <- RxSpark()
myHadoopCC <- RxHadoopMR()

### SPARK (OR HADOOP) COMPUTE CONTEXT ###
rxSetComputeContext(mySparkCC)    # or rxSetComputeContext(myHadoopCC)

### CREATE HDFS DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = hdfsFS)

Functional model R script - does not need to change to run in Spark:

### ANALYTICAL PROCESSING ###
### Statistical summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)

### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)

### Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
76
