
Using R for Scalable Data Analytics:

Single Machines to Spark Clusters


John-Mark Agosta, Hang Zhang, Robert Horton, Mario Inchiosa, Srini
Kumar*, Katherine Zhao, Vanja Paunic and Debraj GuhaThakurta

Microsoft
Acknowledgements: Gopi Kumar, Paul Shealy, Ali-Kazim Zaidi

* Currently at: LevaData


TUTORIAL MATERIAL & SLIDES:

tinyurl.com/Strata2017R
ROOM: LL21 C/D, San Jose Convention Center
TIME: 9:00am - 12:30pm, March 14th, 2017
1
Key learning objectives

How to scale R code with distributed, parallel, and off-memory processing

How to develop a scalable, end-to-end (E2E) R data-science process

How to easily operationalize code and models written in R

How to use cloud infrastructure (single nodes or clusters) to develop, scale, and operationalize

2
Tutorial Outline

Introduction & Orientation [15 mins]

Scaling R on Spark: hands-on tutorials w/ presentations [150 mins]
  SparkR & sparklyr [75 mins]
  RevoScaleR [75 mins]

Approaches not covered in hands-on [15 mins]

Wrap-up, summary, Q&A [15 mins]

15-minute break after ~1.5 hours (between the two hands-on parts)

3
Introduction - Scaling your R scripts

Katherine Zhao

4
Introduction

What is R?

What limits the scalability of R scripts?

What functions and techniques can be used to overcome those limits?

5
What is R?

Language: the most popular statistical programming language; also a data visualization tool
Platform: open source

Community:
  2.5+ million users
  Taught in most universities
  Many common use cases across industry
  Thriving user groups worldwide
  Ranked 5th in the 2016 IEEE Spectrum language ranking
  42% of professional analysts prefer R (highest among R, SAS, and Python)

Ecosystem:
  10,000+ contributed packages
  Rich application & platform integration
6
R adoption is on a tear, but there are several issues regarding scalability:

In-memory operation

Expensive data movement & duplication

Lack of parallelism

7
A couple of scalable R solutions

R packages for distributed computing [hands-on]:
  SparkR
  sparklyr
  RevoScaleR (Microsoft R Server)
  h2o
  and more!

R packages with big-data support on single machines:
  The bigmemory project
  ff and related packages
  foreach with doParallel, doSNOW, and doNWS backends

8
Hands-on Tutorials w/ Presentations

Part I: SparkR and sparklyr [75 mins]

Katherine Zhao
Debraj GuhaThakurta
Srini Kumar
Acknowledgements: Ali-Kazim Zaidi, Hang Zhang
9
Distributed computing on Spark
Brief intro to Spark, its APIs, and open-source R packages

10
Scale on Spark clusters

What is Spark?
A unified, open-source, parallel data-processing framework for big data analytics

11
SparkR 2.0: a Spark API

12
Data processing and modeling with SparkR

13
General analytical workflow in Spark
(across multiple toolkits)

1. Ingest data into a Spark DataFrame
2. Transformation, wrangling, and cleanup (Spark SQL, dplyr)
3. Exploration and visualization (for visualization, sampled data may need to be converted to an R data frame)
4. Featurization (Spark SQL + featurization functions)
5. Creation of ML models in Spark, followed by model evaluation
6. Save models for deployment

Spark dataframes used multiple times in the workflow should be cached in memory
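To make the workflow concrete, here is a minimal SparkR sketch of these steps, assuming an airline-style CSV; the file paths, column names (ArrDelay, DepDelay, DayOfWeek), and sampling fraction are illustrative placeholders, not the tutorial's exact code.

library(SparkR)
sparkR.session(appName = "workflow-sketch")            # connect to Spark

# 1. Ingest a CSV into a Spark DataFrame (path is a placeholder)
flights <- read.df("/data/airline.csv", source = "csv",
                   header = "true", inferSchema = "true")

# 2. Wrangling / cleanup with SparkR verbs (executed as Spark SQL)
flights <- filter(flights, isNotNull(flights$ArrDelay))
cache(flights)                                          # reused below, so cache it

# 3. Exploration: aggregate in Spark; collect only a small sample back to R
byDay <- summarize(groupBy(flights, flights$DayOfWeek),
                   meanDelay = mean(flights$ArrDelay))
localSample <- collect(sample(flights, withReplacement = FALSE, fraction = 0.01))

# 4-6. Featurize and model in Spark, evaluate, then persist the model
model <- spark.glm(flights, ArrDelay ~ DepDelay + DayOfWeek, family = "gaussian")
summary(model)
write.ml(model, "/models/arrdelay_glm")                 # saved for later deployment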

14
Platforms & Services for
Hands-on

15
Single-node Azure Linux DSVM w/ Spark (for hands-on)

Data Science Virtual Machine, including (among other tools):
  Microsoft R Server (developer edition)
  Spark 2.0.2 with local HDFS and YARN
  Vowpal Wabbit
  xgboost
  Rattle
  CNTK

http://aka.ms/dsvm
16
Spark clusters in Azure HDInsight

Provisions Azure compute resources with Spark 2.0.2 installed and configured.

Supports multiple versions (e.g. Spark 1.6).

Stores data in Azure Blob storage (WASB), Azure Data Lake Store, or local HDFS.
17
GitHub repository for all code and scripts

tinyurl.com/Strata2017R

18
SparkR Hands-on

Debraj GuhaThakurta
Srini Kumar
19
Model deployment using R Server operationalization services

Easy deployment: the data scientist publishes R code and models from Microsoft R Client (mrsdeploy package, publishService) to a Microsoft R Server configured for operationalizing R analytics.

Easy consumption: data scientists consume the published web services from R, again via Microsoft R Client (mrsdeploy package).

Easy setup: in-cloud or on-prem; add nodes to scale; high availability & load balancing; remote execution server.

Easy integration: developers can call the same web services from their applications.
20
Deployment
Turn R analytics into web services easily, and consume them in R

Build the model first, then deploy it as a web service instantly (package: mrsdeploy)

https://msdn.microsoft.com/en-us/microsoft-r/operationalize/configuration-initial
https://msdn.microsoft.com/en-us/microsoft-r/operationalize/admin-utility
21
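A minimal sketch of the publish-and-consume round trip with mrsdeploy; the endpoint URL, credentials, model, and service name are illustrative assumptions, not the tutorial's exact service.

library(mrsdeploy)

# Log in to the R Server operationalization endpoint (URL/credentials are placeholders)
remoteLogin("http://localhost:12800", username = "admin", password = "<password>",
            session = FALSE)

# A previously trained model plus a scoring function to expose (airlineXdf is assumed)
delayModel <- rxLinMod(ArrDelay ~ DepDelay + CRSDepTime, data = airlineXdf)
scoreDelay <- function(DepDelay, CRSDepTime) {
  newdata <- data.frame(DepDelay = DepDelay, CRSDepTime = CRSDepTime)
  rxPredict(delayModel, data = newdata)$ArrDelay_Pred
}

# Publish the function (and the model it uses) as a versioned web service
api <- publishService("delayService",
                      code = scoreDelay, model = delayModel,
                      inputs  = list(DepDelay = "numeric", CRSDepTime = "numeric"),
                      outputs = list(answer = "numeric"),
                      v = "v1.0.0")

# Consume the service from R; the returned object exposes the published function by name
result <- api$scoreDelay(DepDelay = 30, CRSDepTime = 9.5)
print(result$output("answer"))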
mrsdeploy Hands-on

Debraj GuhaThakurta

22
sparklyr: R interface for Apache Spark

Easy installation from CRAN

Connects to both local instances of Spark and remote Spark clusters

Loads data into Spark DataFrames from local R data frames, Hive tables, CSV, JSON, and Parquet files
Source: http://spark.rstudio.com/
23
dplyr and ML in sparklyr

Provides a complete dplyr back end for data manipulation, analysis, and visualization; %>% pipelines are translated to Spark SQL

Includes three families of functions for building machine-learning pipelines (see the sketch after this list):

ml_*: Machine learning algorithms for analyzing data, provided by the spark.ml package.
  K-Means, GLM, linear/logistic regression, survival regression, decision trees, random forests, gradient-boosted trees, PCA, Naive Bayes, multilayer perceptron, LDA

ft_*: Feature transformers for manipulating individual features.

sdf_*: Functions for manipulating SparkDataFrames.
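
A minimal sparklyr sketch combining the dplyr back end with an ml_* learner, assuming a local Spark instance and the nycflights13 sample data; exact ml_* signatures vary somewhat across sparklyr versions.

library(sparklyr)
library(dplyr)

# Connect to Spark (local here; e.g. master = "yarn-client" on a cluster)
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark; spark_read_csv() handles files in HDFS/WASB
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in the cluster
flights_tbl %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarise(n = n(), mean_dep_delay = mean(dep_delay)) %>%
  arrange(desc(mean_dep_delay)) %>%
  collect()                                   # bring only the small result back to R

# sdf_* functions prepare Spark DataFrames, ml_* functions wrap spark.ml estimators
flights_clean <- flights_tbl %>% filter(!is.na(arr_delay), !is.na(dep_delay))
partitions <- sdf_partition(flights_clean, training = 0.7, test = 0.3, seed = 42)
fit <- ml_linear_regression(partitions$training, arr_delay ~ dep_delay + distance)
summary(fit)

spark_disconnect(sc)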


24
h2o: prediction engine in R

Optimized for in-memory processing of distributed, parallel machine learning algorithms on clusters.

Sparkling Water = h2o + Spark

Data manipulation and modeling: R functions + h2o-prefixed functions.
  Transformations: h2o.group_by(), h2o.impute()
  Statistics: h2o.summary(), h2o.quantile(), h2o.mean()
  Algorithms: h2o.glm(), h2o.naiveBayes(), h2o.deeplearning(), h2o.kmeans()

http://www.h2o.ai/product/
25
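A minimal h2o sketch of the same load / wrangle / model pattern; the file path, column names, and delay threshold are placeholders (on Spark, Sparkling Water / rsparkling provides the bridge, but the h2o.* calls look the same).

library(h2o)

# Start or connect to an H2O cluster, using all available cores
h2o.init(nthreads = -1)

# Import data directly into a distributed H2OFrame (path is a placeholder)
airlines <- h2o.importFile("/data/airline.csv")
summary(airlines)
airlines <- airlines[!is.na(airlines$ArrDelay), ]   # drop rows with a missing response

# h2o-prefixed transformations and statistics run inside the cluster
byDay <- h2o.group_by(airlines, by = "DayOfWeek", mean("ArrDelay"), nrow("ArrDelay"))

# Distributed model training with an h2o algorithm
airlines$IsDelayed <- as.factor(airlines$ArrDelay > 15)
fit <- h2o.glm(x = c("DepDelay", "Distance", "DayOfWeek"), y = "IsDelayed",
               training_frame = airlines, family = "binomial")
h2o.performance(fit)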
sparklyr Hands-on

Debraj GuhaThakurta

26
15 min break

27
Hands-on Tutorials w/ Presentation

Part II: RevoScaleR [75 mins]

Mario Inchiosa
Robert Horton
Vanja Paunic
John-Mark Agosta
Katherine Zhao
28
Hands-on Tutorial:
Airline Arrival Delay Prediction using
R Server and SparkR

Mario Inchiosa
29
R Server 9.0: scale-out R, enterprise class!

100% compatible with open-source R
  Any code/package that works today with R will work in R Server.

Ability to parallelize any R function
  Ideal for parameter sweeps, simulation, and scoring.

Wide range of scalable and distributed rx-prefixed functions in the RevoScaleR package (see the sketch below):
  Transformations: rxDataStep()
  Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()
  Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()
  Parallelism: rxSetComputeContext()
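
A small RevoScaleR sketch tying these rx functions together; file paths, column names, and the delay threshold are illustrative assumptions. The same calls run unchanged on a cluster after switching the compute context.

library(RevoScaleR)   # ships with Microsoft R Server / Microsoft R Client

# Import a CSV into the chunked XDF format, adding a label column along the way
csvData <- RxTextData("airline.csv")
xdfData <- RxXdfData("airline.xdf")
rxDataStep(inData = csvData, outFile = xdfData, overwrite = TRUE,
           transforms = list(IsDelayed = as.numeric(ArrDelay > 15)))

# Scalable summary statistics, computed chunk by chunk over the whole file
rxSummary(~ ArrDelay + DayOfWeek, data = xdfData)

# Distributed model fit and scoring; rxSetComputeContext(RxSpark(...)) would
# push exactly the same calls out to a Spark cluster
fit <- rxLogit(IsDelayed ~ DepDelay + DayOfWeek, data = xdfData)
rxPredict(fit, data = xdfData, outData = RxXdfData("scored.xdf"), overwrite = TRUE)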
30
Azure HDInsight + R Server: managed Hadoop for advanced analytics in the cloud

Easy setup, elastic, SLA
Ubuntu Linux; Spark and Hadoop; cloud storage (Blob storage, Data Lake Store)
R Server on the cluster exposes both SparkR functions and RevoScaleR functions

Leverage R skills with massively scalable algorithms and statistical functions
Reuse existing R functions over multiple machines
R Server Hadoop architecture

[Diagram: a master R process on the edge node coordinates worker R processes on the data nodes via Apache YARN and Spark, with the data held in distributed storage]

R Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows, and terabytes of data
Logistic regression on the NYC Taxi dataset (2.2 TB)

[Chart: elapsed time vs. billions of rows]
Typical advanced analytics lifecycle

Prepare → Model → Operationalize

Airline Arrival Delay Prediction demo
  Clean/join: using SparkR from R Server
  Train/score/evaluate: scalable R Server functions
  Deploy/consume: using mrsdeploy from R Server

Airline data set
  Passenger flight on-time performance data from the US Department of Transportation's TranStats data collection
  20+ years of data
  300+ airports
  Every carrier, every commercial flight
  http://www.transtats.bts.gov

Weather data set
  Hourly land-based weather observations from NOAA
  2,000+ weather stations
  http://www.ncdc.noaa.gov/orders/qclcd/
Provisioning a cluster with R Server
Scaling a cluster
Clean and join using SparkR in R Server
Train, score, and evaluate using R Server
Publish a web service from R

Demo Technologies Review
  HDInsight Premium: Hadoop cluster
  Data Science Virtual Machine
  Spark on YARN: distributed computing
  R Server: R interpreter
  SparkR: data manipulation functions
  RevoScaleR: statistical & machine learning functions
  mrsdeploy: web service operationalization
Distributed model training and
parameter optimization:
Learning Curves on Big Data

Robert M. Horton, PhD MS


Senior Data Scientist
44
Learning Curve
Simulated Data
A B C D E F G H I J y
a00002 b00001 c00003 d00002 e00026 f00011 g00043 h00142 i00049 j00161 -19.4032
a00001 b00002 c00004 d00013 e00024 f00047 g00037 h00139 i00068 j00164 28.2963
a00002 b00002 c00002 d00004 e00017 f00002 g00086 h00141 i00059 j00447 -8.9377
a00001 b00002 c00001 d00003 e00012 f00004 g00066 h00050 i00163 j00714 -27.9605
a00001 b00003 c00001 d00002 e00004 f00016 g00011 h00097 i00163 j00246 27.3483
a00002 b00001 c00001 d00003 e00023 f00006 g00002 h00072 i00249 j00188 4.7853
a00001 b00003 c00007 d00010 e00002 f00006 g00036 h00031 i00250 j00179 25.9673
a00002 b00003 c00004 d00016 e00017 f00004 g00029 h00077 i00168 j00020 27.1069
a00001 b00001 c00002 d00011 e00003 f00033 g00047 h00115 i00310 j00280 9.5063
a00001 b00001 c00004 d00006 e00006 f00040 g00086 h00014 i00002 j00374 -19.5206
a00001 b00002 c00001 d00002 e00004 f00028 g00044 h00005 i00431 j00646 -4.0899
a00001 b00003 c00002 d00006 e00018 f00044 g00040 h00232 i00254 j00261 19.7420
a00002 b00002 c00007 d00003 e00011 f00012 g00081 h00071 i00291 j00023 7.9582
a00002 b00003 c00004 d00012 e00005 f00006 g00056 h00182 i00430 j00615 -37.2846
a00001 b00002 c00007 d00001 e00026 f00022 g00033 h00157 i00067 j00039 3.6434

(Factor cardinality increases from column A to column J; y is the numeric response.)
Parameter Table
model_class training_fraction with_formula test_set_kfold_id KFOLDS cube
rxLinMod 0.0150000 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0219736 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0321893 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0471543 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0690766 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.1011907 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.1482349 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.2171503 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.3181049 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.4659939 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.6826375 y ~ D+C+B+A 1 3 TRUE
rxLinMod 1.0000000 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0150000 y ~ E+D+C+B+A 1 3 TRUE
rxLinMod 0.0219736 y ~ E+D+C+B+A 1 3 TRUE
rxLinMod 0.0321893 y ~ E+D+C+B+A 1 3 TRUE

Dynamic Sampling
row_tagger (run on each chunk of data):

# Seed on the chunk number so per-chunk sampling is reproducible
set.seed(chunk_num + salt)
# Assign every row in the chunk to one of k folds
kfold <- sample(1:kfolds, size = num_rows, replace = TRUE)
in_test_set <- kfold == kfold_id
# Keep a fraction `prob` of the non-test rows as training data
num_training_candidates <- sum(!in_test_set)
keepers <- sample(rowNums[!in_test_set],
                  prob * num_training_candidates)
data_list$in_training_set <- rowNums %in% keepers
data_list$in_test_set <- in_test_set
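
As a hedged sketch of how such a row tagger can be applied chunk by chunk, the snippet below wraps the logic above as an rxDataStep transformFunc; the object names and the values passed via transformObjects are placeholders, not the exact demo settings.

# RevoScaleR supplies .rxChunkNum inside a transformFunc; parameters such as
# salt, kfolds, kfold_id and prob arrive via transformObjects
tagRows <- function(data_list) {
  chunk_num <- .rxChunkNum
  num_rows  <- length(data_list[[1]])
  rowNums   <- seq_len(num_rows)
  set.seed(chunk_num + salt)
  kfold <- sample(1:kfolds, size = num_rows, replace = TRUE)
  in_test_set <- kfold == kfold_id
  keepers <- sample(rowNums[!in_test_set], round(prob * sum(!in_test_set)))
  data_list$in_training_set <- rowNums %in% keepers
  data_list$in_test_set <- in_test_set
  data_list
}

rxDataStep(inData = airlineXdf, outFile = taggedXdf, overwrite = TRUE,
           transformFunc = tagRows,
           transformObjects = list(salt = 1, kfolds = 3, kfold_id = 1, prob = 0.05))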
Dynamic Scoring
On each chunk (test-set rows only):

residual <- rxPredict(model, <selected cases>)    # pseudocode: residuals for the held-out rows
SSE <- SSE + sum(residual^2, na.rm = TRUE)
rowCount <- rowCount + sum(!is.na(residual))

On the overall results:
sqrt(SSE / rowCount)    # root mean squared error
Demo
Running learning curves with R Server
Airline Flight Delay:
varying cardinality

Columns added:
Origin33, Dest33,
Origin50, Dest50,
Origin75, Dest75,
Origin100, Dest100,
Origin125, Dest125,
Origin150, Dest150,
Origin250, Dest250
Tuning Boosted Trees
Hierarchical Time Series

Vanja Paunic
57
Comparisons

Katherine Zhao
58
Base and scalable approaches comparison

Approach     Scalability                      Spark   Hadoop   SQL Server   Teradata   Support
CRAN R [1]   Single machines                  -       -        -            -          Community
SparkR       Single + distributed computing   X       -        -            -          Community
sparklyr     Single + distributed computing   X       -        -            -          Community
h2o          Single + distributed computing   X       X        -            -          Community
RevoScaleR   Single + distributed computing   X       X        X            X          Enterprise

[1] CRAN R indicates no additional R packages installed


59
R Server on Spark: faster and more scalable

E2E process:
  Load data from .csv
  Transform features
  Split data: train + test
  Fit model: logistic regression (no regularization)
  Predict and write outputs

Configuration:
  1 edge node: 16 cores, 112 GB
  4 worker nodes: 16 cores, 112 GB
  Dataset: duplicated Airlines data (.csv)
  Number of columns: 26

tinyurl.com/Strata2017R/Performance_Comparison
60
SparkR outperforms when loading data

Load Data:
MRS on Spark: XDF
SparkR: Spark DF
sparklyr: Spark DF
h2o: H2OFrame
CRAN R: DF

Configuration:
1 Edge Node: 16 cores,
112GB
4 Worker Nodes: 16 cores,
112GB
Dataset: Duplicated Airlines
data (.csv)
Number of columns: 26

61
MRS is faster when fitting big data

Configuration:
1 Edge Node: 16 cores,
112GB
4 Worker Nodes: 16 cores,
112GB
Dataset: Duplicated Airlines
data (.csv)
Number of columns: 26

62
MRS saves time when making predictions

Predict:
Outputs predictions
into files in HDFS

Configuration:
1 Edge Node: 16 cores,
112GB
4 Worker Nodes: 16 cores,
112GB
Dataset: Duplicated Airlines
data (.csv)
Number of columns: 26

63
Other Options for Scaling R Scripts

Katherine Zhao

64
The bigmemory project
Created by Michael Kane and John Emerson at Yale University

Works with massive matrix-like objects in R

Combines in-memory and file-backed data structures to analyze numerical data larger than RAM

The data structures may be allocated to shared memory
Source: The Bigmemory Project by Michael Kane and John Emerson: April 29, 2010.
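
A small sketch of a file-backed big.matrix; the file names and dimensions are arbitrary placeholders.

library(bigmemory)
library(biganalytics)

# A file-backed big.matrix keeps the data on disk; only the pages being touched
# are mapped into RAM, so the object can be larger than available memory
x <- filebacked.big.matrix(nrow = 1e7, ncol = 3, type = "double",
                           backingfile = "demo.bin", descriptorfile = "demo.desc")
x[, 1] <- rnorm(1e7)          # ordinary matrix indexing works
x[, 2] <- runif(1e7)
x[, 3] <- x[, 1] + x[, 2]

colmean(x)                    # biganalytics summary, computed on the big.matrix

# A second R process can attach the same data via the descriptor file:
# y <- attach.big.matrix("demo.desc")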
Sister packages and related work
biganalytics: provides exploratory data analysis functionality on
big.matrix
bigtabulate: adds table-, tapply-, and split-like behavior for
big.matrix
bigalgebra: performs linear algebra calculations on big.matrix and R
matrix
synchronicity: supports synchronization and may eventually
support interprocess communication (ipc) and message passing
biglm: provides linear and generalized linear models on big.matrix
Rdsm: enables shared-memory parallelism with big.matrix

Source: The Bigmemory Project by Michael Kane and John Emerson: April 29, 2010.
66
ff package
Provides data structures that are stored on disk but behave as if they were in RAM

Maps only a section of the data into main memory at a time, for efficient access

Accepts numeric and character data as input

Source: ff: memory-efficient storage of large data on disk and fast access functions.
67
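A minimal ff / ffbase sketch; the CSV path and column name are placeholders.

library(ff)
library(ffbase)

# An ff vector lives in a file on disk; only small sections are mapped into RAM
x <- ff(vmode = "double", length = 1e8)
x[1:1e6] <- rnorm(1e6)

# ffdf is the on-disk data.frame analogue; read.csv.ffdf ingests a CSV in chunks
flights <- read.csv.ffdf(file = "airline.csv", header = TRUE, next.rows = 500000)

mean(flights$ArrDelay, na.rm = TRUE)       # ffbase supplies mean.ff, sum.ff, etc.
delayed <- subset(flights, ArrDelay > 15)  # subset.ffdf keeps the result on disk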
ff related packages
ffbase: adds basic statistical functionality to ff. (Note: *.ff functions
apply to ff vectors, and *.ffdf functions apply to ffdf data frames.)
Coercions: as.character.ff(), as.Date_ff_vector(), as.ffdf.ffdf(), as.ram.ffdf()
Selections: subset.ffdf(), ffwhich(), transform.ffdf(), within.ffdf(),
with.ffdf()
Aggregations: quantile.ff(), hist.ff(), sum.ff(), mean.ff(), range.ff(),
tabulate.ff()
Algorithms: bigglm.ffdf()

biglars: provides least-angle regression, lasso and


stepwise regression on ff.
Source: CRAN Task View: High-Performance and Parallel Computing with R, by Dirk Eddelbuettel.
68
Parallel programming with foreach
Provides the foreach() function and two operators, %do% (sequential) and %dopar% (parallel), for running loop iterations as independent tasks
The %dopar% operator relies on a pre-registered parallel backend:
  doParallel (parallel), doSNOW (snow), doMC (multicore), doMPI (Rmpi), etc.

Source: foreach package.
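
A minimal foreach/doParallel sketch; the backend, worker count, and toy parameter sweep are illustrative.

library(foreach)
library(doParallel)

# Register a parallel backend; %dopar% dispatches iterations to these workers
cl <- makeCluster(4)
registerDoParallel(cl)

# A small parameter sweep run in parallel; per-iteration results combined with rbind
fits <- foreach(k = 2:6, .combine = rbind) %dopar% {
  km <- kmeans(iris[, 1:4], centers = k, nstart = 10)
  data.frame(k = k, tot_withinss = km$tot.withinss)
}
print(fits)

stopCluster(cl)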


69
Q&A

CONTACT INFORMATION
Vanja Paunic (vanja.paunic@microsoft.com)
Robert Horton (rhorton@microsoft.com)
Hang Zhang (hangzh@microsoft.com)
Srini Kumar (srini.kumar.private@gmail.com)
Mengyue (Katherine) Zhao (mez@microsoft.com)
John-Mark Agosta (joagosta@microsoft.com)
Mario Inchiosa (marioinc@yahoo.com)
Debraj GuhaThakurta (deguhath@microsoft.com)
70
THANK YOU
Backups
Parallelized & Distributed Analytics

ETL / data import:
  Data import: delimited, fixed-width, SAS, SPSS, ODBC
  Variable creation & transformation
  Recode variables
  Factor variables
  Missing value handling
  Sort, merge, split
  Aggregate by category (means, sums)

Descriptive statistics:
  Min / max, mean, median (approx.)
  Quantiles (approx.)
  Standard deviation, variance
  Correlation, covariance
  Sum of squares (cross-product matrix for a set of variables)
  Pairwise cross tabs
  Risk ratio & odds ratio
  Cross-tabulation of data (standard tables & long form)
  Marginal summaries of cross tabulations

Statistical tests:
  Chi-square test
  Kendall rank correlation
  Fisher's exact test
  Student's t-test

Predictive statistics:
  Sum of squares (cross-product matrix for a set of variables)
  Multiple linear regression
  Generalized linear models (GLM): exponential-family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
  Covariance & correlation matrices
  Logistic regression
  Predictions/scoring for models
  Residuals for all models

Variable selection:
  Stepwise regression

Machine learning:
  Decision trees
  Decision forests
  Gradient-boosted decision trees
  Naive Bayes

Clustering:
  K-Means

Sampling:
  Subsample (observations & variables)
  Random sampling

Simulation:
  Simulation (e.g. Monte Carlo)
  Parallel random number generation

Custom parallelization:
  rxDataStep
  rxExec
  PEMA-R API

73
Portable across multiple platforms

The R Server portfolio runs on: Cloud, Hadoop & Spark, EDW, RDBMS, Desktops & Servers

R Server technology:
  includes fully-parallelized analytics
  provides high-speed and direct connectors to HDFS, Teradata, SAS, etc.
  its distributed computing framework establishes cross-platform portability
  mrsdeploy establishes remote execution and publishes & manages web services
74
ScaleR: parallel + Big Data

Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed

Data is streamed into blocks from sources such as Hive tables, CSV, Parquet, XDF, ODBC, and SQL Server

Interim results are collected and combined analytically to produce the output on the entire data set

The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing
75
Write Once - Deploy Anywhere
ScaleR models can be deployed from a server or edge node to run in Spark/Hadoop without any functional R model re-coding.

Compute context R script - sets where the model will run:

# Local parallel processing - Linux or Windows
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"

### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)

### CREATE LINUX DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = linuxFS)

# In Spark/Hadoop
### SETUP SPARK/HADOOP ENVIRONMENT VARIABLES ###
mySparkCC <- RxSpark()
myHadoopCC <- RxHadoopMR()

### SPARK (OR HADOOP) COMPUTE CONTEXT ###
rxSetComputeContext(mySparkCC)    # or rxSetComputeContext(myHadoopCC)

### CREATE HDFS DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = hdfsFS)

Functional model R script - does not need to change to run in Spark:

### ANALYTICAL PROCESSING ###
### Statistical summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)

### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)

### Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
76
