Microsoft
Acknowledgements: Gopi Kumar, Paul Shealy, Ali-Kazim Zaidi
tinyurl.com/Strata2017R
ROOM: LL21 C/D, San Jose Convention Center
TIME: 9:00am - 12:30pm, March 14th, 2017
Key learning objectives
Tutorial Outline
Introduction - Scaling your R scripts
Katherine Zhao
Introduction
What is R?
Language: the most popular statistical programming language; a data visualization tool
Platform: open source
Community: 2.5M+ users; taught in most universities; many common use cases across industry; thriving user groups worldwide; 5th in the 2016 IEEE Spectrum language ranking; preferred by 42% of professional analysts (highest among R, SAS, and Python)
R adoption is on a tear
But there are scalability issues:
In-memory operation
Lack of parallelism
A few scalable R solutions
R packages for distributed computing [Hands-on]
SparkR
sparklyr
RevoScaleR (Microsoft R Server)
h2o
and more!
Hands-on Tutorials w/ Presentations
Katherine Zhao
Debraj GuhaThakurta
Srini Kumar
Acknowledgements: Ali-Kazim Zaidi, Hang Zhang
Distributed computing on Spark
A brief intro to Spark, its APIs, and open-source R packages
Scale on Spark clusters
What is Spark?
A unified, open-source, parallel data processing framework for Big Data analytics
SparkR 2.0: a Spark API
Data processing and modeling with SparkR
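A minimal SparkR sketch of this kind of processing-and-modeling pass (the CSV path, column names, and model formula are illustrative, not from the tutorial; assumes a Spark 2.0 session):

```r
library(SparkR)
sparkR.session()

# Ingest a CSV into a Spark DataFrame (path is a placeholder)
flights <- read.df("/data/airline.csv", source = "csv",
                   header = "true", inferSchema = "true")

# Wrangle with SparkR verbs; these execute in Spark, not in local R
delayed <- filter(flights, flights$ArrDelay > 15)

# Fit a generalized linear model distributed across the cluster
model <- spark.glm(flights, ArrDelay ~ DepDelay + Distance,
                   family = "gaussian")
summary(model)

# Persist the fitted model for later scoring or deployment
write.ml(model, "/models/arrdelay_glm")
```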
General analytical workflow in Spark (across multiple toolkits):
1. Ingest into a Spark DataFrame
2. Wrangling and cleanup (Spark SQL, dplyr)
3. Exploration and visualization (for visualization, sampled data may need to be converted to an R data frame)
4. Transformation and featurization (Spark SQL + featurization functions)
5. Creation of ML models in Spark
6. Model evaluation
7. Save models for deployment
Spark dataframes used multiple times in the workflow should be cached in memory.
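In SparkR, caching is a one-liner; a sketch, assuming a Spark DataFrame `flights` from an earlier ingest step:

```r
library(SparkR)

# Mark the DataFrame for caching; Spark materializes the cache on the
# first action that scans it (here, count)
flights <- cache(flights)
count(flights)

# ...exploration, featurization, and modeling reuse the cached data...

# Release the memory once the DataFrame is no longer needed
unpersist(flights)
```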
Platforms & Services for Hands-on
Single node Azure Linux DSVM w/ Spark (for Hands-On)
Vowpal Wabbit
xgboost
Rattle
Spark 2.0.2
HDFS (local)
Yarn
Developer edition
CNTK
http://aka.ms/dsvm
Spark clusters in Azure HDInsight
Provisions Azure compute resources with Spark 2.0.2 installed and configured.
Supports multiple versions (e.g. Spark 1.6).
tinyurl.com/Strata2017R
SparkR Hands-on
Debraj GuhaThakurta
Srini Kumar
Model deployment using R Server operationalization services
Microsoft R Server configured for operationalizing R analytics:
Easy setup: in-cloud or on-prem; adding nodes to scale; high availability & load balancing; remote execution server
Easy consumption: data scientists (mrsdeploy package)
Easy integration: developers
Deployment
Turn R scripts into web services easily, and consume them from R
https://msdn.microsoft.com/en-us/microsoft-r/operationalize/configuration-initial
Package: mrsdeploy
https://msdn.microsoft.com/en-us/microsoft-r/operationalize/admin-utility
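A hedged sketch of the publish-and-consume round trip with mrsdeploy (the endpoint URL, credentials, service name, and scoring function are all placeholders; `remoteLogin` and `publishService` are the package's documented entry points):

```r
library(mrsdeploy)

# Authenticate against the operationalization endpoint
# (URL and credentials are placeholders)
remoteLogin("http://localhost:12800",
            username = "admin",
            password = "<password>")

# An ordinary R function to expose as a web service
transmissionLikelihood <- function(hp, wt) {
  model <- glm(am ~ hp + wt, data = mtcars, family = binomial)
  predict(model, newdata = data.frame(hp = hp, wt = wt),
          type = "response")
}

# Publish it; R Server also generates a Swagger definition so
# non-R clients can call the same endpoint
api <- publishService(
  "transmission-svc",
  code    = transmissionLikelihood,
  inputs  = list(hp = "numeric", wt = "numeric"),
  outputs = list(answer = "numeric"),
  v       = "v1.0.0"
)

# Consume the published service from R
result <- api$transmissionLikelihood(hp = 120, wt = 2.8)
```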
mrsdeploy Hands-on
Debraj GuhaThakurta
sparklyr: R interface for Apache Spark
Easy installation from CRAN
MLlib algorithms available: k-means, GLM, logistic regression, survival regression, decision trees, random forests, gradient-boosted trees, PCA, naive Bayes, multilayer perceptron, LDA
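A short sparklyr sketch: connect, copy a data frame into Spark, manipulate it with dplyr verbs, and fit one of the bundled MLlib algorithms (local master and the k-means call shown here are illustrative; argument names have shifted across sparklyr versions):

```r
library(sparklyr)
library(dplyr)

# Connect to Spark (local mode here; point master at a cluster in practice)
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark; dplyr verbs translate to Spark SQL
# (note copy_to replaces '.' in column names with '_')
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
iris_tbl %>% group_by(Species) %>% summarise(n = n())

# Fit a bundled MLlib algorithm; exact signature varies by version
km <- ml_kmeans(iris_tbl, ~ Petal_Length + Petal_Width, k = 3)

spark_disconnect(sc)
```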
Debraj GuhaThakurta
15 min break
Hands-on Tutorials w/ Presentation
Mario Inchiosa
Robert Horton
Vanja Paunic
John-Mark Agosta
Katherine Zhao
Hands-on Tutorial:
Airline Arrival Delay Prediction using
R Server and SparkR
Mario Inchiosa
R Server 9.0: scale-out R, enterprise class!
100% compatible with open source R: any code/package that works today with R will work in R Server.
Worker R processes run on data nodes; R Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows, and terabytes of data.
[Chart: logistic regression on the NYC Taxi dataset (2.2 TB): elapsed time vs. billions of rows]
Typical advanced analytics lifecycle
Parameter Table
model_class training_fraction with_formula test_set_kfold_id KFOLDS cube
rxLinMod 0.0150000 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0219736 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0321893 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0471543 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0690766 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.1011907 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.1482349 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.2171503 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.3181049 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.4659939 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.6826375 y ~ D+C+B+A 1 3 TRUE
rxLinMod 1.0000000 y ~ D+C+B+A 1 3 TRUE
rxLinMod 0.0150000 y ~ E+D+C+B+A 1 3 TRUE
rxLinMod 0.0219736 y ~ E+D+C+B+A 1 3 TRUE
rxLinMod 0.0321893 y ~ E+D+C+B+A 1 3 TRUE
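The training fractions above grow geometrically from 0.015 to 1. A hedged sketch (illustrative only, not the presenters' code) of generating such a learning-curve grid in base R:

```r
# Geometric sequence of 12 training fractions between 0.015 and 1.0;
# this closely reproduces the fractions shown in the table
fractions <- exp(seq(log(0.015), log(1), length.out = 12))

# One row per (fraction, formula) combination, as in the table above
param_table <- expand.grid(
  model_class       = "rxLinMod",
  training_fraction = fractions,
  with_formula      = c("y ~ D+C+B+A", "y ~ E+D+C+B+A"),
  test_set_kfold_id = 1,
  KFOLDS            = 3,
  cube              = TRUE,
  stringsAsFactors  = FALSE
)
nrow(param_table)   # 24 model fits: 12 fractions x 2 formulas
```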
Dynamic Sampling
row_tagger (run on each chunk):
  # reproducible per-chunk randomization
  set.seed(chunk_num + salt)
  # assign each row to a fold; fold kfold_id is held out as the test set
  kfold <- sample(1:kfolds, size = num_rows, replace = TRUE)
  in_test_set <- kfold == kfold_id
  # sample a fraction `prob` of the remaining rows for training
  num_training_candidates <- sum(!in_test_set)
  keepers <- sample(rowNums[!in_test_set],
                    size = prob * num_training_candidates)
  data_list$in_training_set <- rowNums %in% keepers
  data_list$in_test_set <- in_test_set
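A hedged sketch of how such per-chunk tagging logic plugs into RevoScaleR as a transform function (the dataset names are placeholders; salt, kfolds, kfold_id, and prob are passed in via transformObjects, and .rxChunkNum is supplied by RevoScaleR's transform environment):

```r
row_tagger <- function(data_list) {
  num_rows <- length(data_list[[1]])
  rowNums  <- seq_len(num_rows)
  set.seed(.rxChunkNum + salt)                 # reproducible per chunk
  kfold <- sample(1:kfolds, size = num_rows, replace = TRUE)
  in_test_set <- kfold == kfold_id
  keepers <- sample(rowNums[!in_test_set],
                    size = round(prob * sum(!in_test_set)))
  data_list$in_training_set <- rowNums %in% keepers
  data_list$in_test_set     <- in_test_set
  data_list
}

# Apply the tagger chunk-by-chunk over an XDF file
rxDataStep(inData = airline_xdf, outFile = tagged_xdf,
           transformFunc = row_tagger,
           transformObjects = list(salt = 17, kfolds = 3,
                                   kfold_id = 1, prob = 0.05),
           overwrite = TRUE)
```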
Dynamic Scoring
On each chunk:
  residual <- rxPredict(model, <selected cases>)
  SSE <- SSE + sum(residual^2, na.rm = TRUE)
  rowCount <- rowCount + sum(!is.na(residual))
On overall results:
  sqrt(SSE / rowCount)  # root mean square error
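The same accumulate-then-combine pattern in plain, runnable R (synthetic data; shows that summing SSE and row counts chunk-by-chunk gives the one-pass RMSE):

```r
set.seed(42)
actual    <- rnorm(1000)
predicted <- actual + rnorm(1000, sd = 0.5)

SSE <- 0; rowCount <- 0
# Process the data in chunks of 100 rows, as a chunking engine would
for (idx in split(seq_along(actual), ceiling(seq_along(actual) / 100))) {
  residual <- actual[idx] - predicted[idx]
  SSE      <- SSE + sum(residual^2, na.rm = TRUE)
  rowCount <- rowCount + sum(!is.na(residual))
}

rmse <- sqrt(SSE / rowCount)
# agrees (up to floating point) with the single-pass computation
all.equal(rmse, sqrt(mean((actual - predicted)^2)))
```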
Demo
Running learning curves with R Server
Airline flight delay: varying cardinality
Columns added: Origin33, Dest33, Origin50, Dest50, Origin75, Dest75, Origin100, Dest100, Origin125, Dest125, Origin150, Dest150, Origin250, Dest250
Tuning Boosted Trees
Hierarchical Time Series
Vanja Paunic
Comparisons
Katherine Zhao
Base and scalable approaches comparison
SparkR: single-node + distributed computing; Community
sparklyr: single-node + distributed computing; Community
h2o: single-node + distributed computing; Community
RevoScaleR: single-node + distributed computing; Enterprise
Configuration: 1 edge node (16 cores, 112 GB), 4 worker nodes (16 cores, 112 GB each)
Dataset: duplicated Airlines data (.csv), 26 columns
tinyurl.com/Strata2017R/Performance_Comparison
SparkR outperforms when loading data
Load Data:
MRS on Spark: XDF
SparkR: Spark DF
sparklyr: Spark DF
h2o: H2OFrame
CRAN R: DF
MRS is faster when fitting big data
MRS saves time when making predictions
Predict: outputs predictions into files in HDFS
Other Options for Scaling R Scripts
Katherine Zhao
The bigmemory project
Created by Michael Kane and John Emerson at Yale University
Works with massive matrix-like objects in R
Combines in-memory and file-backed data structures to analyze numerical data larger than RAM
Source: The Bigmemory Project by Michael Kane and John Emerson, April 29, 2010.
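A small sketch of the core idea (file names are placeholders; `filebacked.big.matrix` is the package's documented constructor):

```r
library(bigmemory)

# A file-backed big.matrix lives on disk; only the pages being touched
# occupy RAM, so the matrix can exceed available memory
x <- filebacked.big.matrix(nrow = 1e6, ncol = 3, type = "double",
                           backingfile = "airline.bin",
                           descriptorfile = "airline.desc")
x[, 1] <- rnorm(1e6)

# Standard subscripting works; another R process could share the same
# matrix via attach.big.matrix("airline.desc")
mean(x[, 1])
```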
ff package
Provides data structures that are stored on disk but behave as if they were in RAM
Maps only a section into main memory for efficient access
Source: ff: memory-efficient storage of large data on disk and fast access functions.
ff related packages
ffbase: adds basic statistical functionality to ff. (Note: *.ff functions apply to ff vectors, and *.ffdf functions apply to ffdf objects.)
Coercions: as.character.ff(), as.Date_ff_vector(), as.ffdf.ffdf(), as.ram.ffdf()
Selections: subset.ffdf(), ffwhich(), transform.ffdf(), within.ffdf(), with.ffdf()
Aggregations: quantile.ff(), hist.ff(), sum.ff(), mean.ff(), range.ff(), tabulate.ff()
Algorithms: bigglm.ffdf()
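A brief sketch tying ff and ffbase together (synthetic data; the functions used are among those listed above):

```r
library(ff)
library(ffbase)

# An ff vector is file-backed; only a small window is mapped into RAM
x <- ff(rnorm(1e6))

# ffbase supplies chunk-wise versions of familiar functions
sum(x)
quantile(x)

# ffdf is the disk-backed analogue of data.frame
d <- ffdf(x = x)
tail_d <- subset(d, x > 2)   # subset.ffdf, evaluated in chunks
```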
CONTACT INFORMATION
Vanja Paunic (vanja.paunic@microsoft.com)
Robert Horton (rhorton@microsoft.com)
Hang Zhang (hangzh@microsoft.com)
Srini Kumar (srini.kumar.private@gmail.com)
Mengyue (Katherine) Zhao (mez@microsoft.com)
John-Mark Agosta (joagosta@microsoft.com)
Mario Inchiosa (marioinc@yahoo.com)
Debraj GuhaThakurta (deguhath@microsoft.com)
THANK YOU
Backups
Parallelized & Distributed Analytics

ETL: data import (delimited, fixed, SAS, SPSS, ODBC); variable creation & transformation; recode variables; factor variables; missing value handling; sort, merge, split; aggregate by category (means, sums)

Descriptive Statistics: min/max, mean, median (approx.), quantiles (approx.), standard deviation, variance, correlation, covariance, covariance & correlation matrices, sum of squares (cross-product matrix for set variables), pairwise cross tabs

Statistical Tests: chi-square test, Kendall rank correlation, Fisher's exact test, Student's t-test

Predictive Statistics: sum of squares (cross-product matrix for set variables); multiple linear regression; generalized linear models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), and user-defined distributions & link functions; logistic regression; predictions/scoring for models; residuals for all models

Machine Learning: decision trees, decision forests, gradient boosted decision trees, naive Bayes

Clustering: K-means

Sampling: subsample (observations & variables), random sampling

Simulation: simulation (e.g. Monte Carlo), parallel random number generation
Portable across multiple platforms
Provides fully-parallelized analytics and cross-platform portability
Includes a distributed computing framework
Includes high-speed, direct connectors to HDFS, Teradata, SAS, etc.
mrsdeploy establishes remote execution and publishes & manages web services
Runs in the cloud and against RDBMS platforms
ScaleR: parallel + Big Data
Our ScaleR algorithms work inside multiple cores/nodes in parallel at high speed.
Data is streamed into blocks from sources such as Hive tables, CSV, Parquet, XDF, ODBC, and SQL Server.
Interim results are collected and combined analytically to produce the output on the entire data set.
The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
Write Once - Deploy Anywhere
ScaleR models can be deployed from a server or edge node to run in Spark/Hadoop without any functional R model re-coding.
The compute context sets where the model will run.

Local parallel processing (Linux or Windows):

### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf",
                            fileSystem = linuxFS)

In Spark/Hadoop:

### SETUP SPARK/HADOOP ENVIRONMENT VARIABLES ###
mySparkCC <- RxSpark()
myHadoopCC <- RxHadoopMR()
### SPARK (OR HADOOP) COMPUTE CONTEXT ###
rxSetComputeContext(mySparkCC)   # or rxSetComputeContext(myHadoopCC)
### CREATE HDFS DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf",
                            fileSystem = hdfsFS)