Data Science with SAS and Cloudera
Josh Wills, Senior Director of Data Science
Cloudera

What is a Data Scientist?

One Definition

versus Another

What Do Data Scientists Do?

What I Think I Do

What Other People Think I Do

What I Actually Do

A Brief Introduction to Hadoop

Data Storage in 2001: Databases


Structured schemas
Intensive processing done where data is stored
Somewhat reliable
Expensive at scale

Data Storage in 2001: Filers


No schemas, stores any kind of file
No data processing capability
Reliable
Expensive at scale

And Then, This Happened

Data Economics: Return on Byte

Big Data Economics


No individual record is particularly valuable
Having every record is incredibly valuable

Web index
Recommendation systems
Sensor data
Market basket analysis
Online advertising

Enter Hadoop

The Hadoop Distributed File System

Based on the Google File System
Data stored in large files
Large block size: 64MB to 256MB per block
Blocks are replicated to multiple nodes in the cluster
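
To make the block-size and replication bullets concrete, here is a rough back-of-the-envelope sketch in Python. The function name hdfs_footprint is hypothetical, and the replication factor of 3 is an assumption (it is the usual HDFS default; the slide only says "multiple nodes").

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Estimate (block count, raw bytes stored across the cluster) for one file.

    replication=3 is an assumed value, not something stated on the slide.
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    # Every block is copied to `replication` nodes, so raw usage scales linearly.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


one_tb = 1024 * 1024 * MB
for block_mb in (64, 128, 256):
    blocks, raw_bytes = hdfs_footprint(one_tb, block_mb * MB)
    print(f"{block_mb}MB blocks: {blocks} blocks, {raw_bytes // one_tb} TB raw")
```

Larger blocks mean fewer blocks to track per file, which is one reason the block sizes are so much bigger than on a local file system.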

Reliable Distributed Processing: MapReduce

Map Stage

Embarrassingly parallel
Like a DATA Step

Shuffle Stage: Large-scale distributed sort

Like PROC SORT

Reduce Stage

Process all of the values that have the same key in a single step
Like PROC MEANS with a BY statement

Process the data where it is stored

Write once and you're done.
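
The three stages can be imitated on a single machine to show how they fit together. The following is a minimal in-memory sketch in Python, not Hadoop's actual API: the function names (map_stage, shuffle_stage, reduce_stage, parse_sale, mean_amount) and the sample records are all hypothetical. It computes a per-key mean, roughly what PROC MEANS with a BY statement would do.

```python
from itertools import groupby
from operator import itemgetter


def map_stage(records, mapper):
    """Apply the mapper to each record independently (embarrassingly parallel)."""
    for record in records:
        yield from mapper(record)


def shuffle_stage(pairs):
    """Sort the (key, value) pairs by key and group the values for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_stage(grouped, reducer):
    """Process all of the values that share a key in a single step."""
    return {key: reducer(key, values) for key, values in grouped}


def parse_sale(line):
    """Hypothetical mapper: turn 'region,amount' into a (region, amount) pair."""
    region, amount = line.split(",")
    yield region, float(amount)


def mean_amount(key, values):
    """Hypothetical reducer: mean of the values for one key."""
    return sum(values) / len(values)


lines = ["east,10.0", "west,4.0", "east,20.0", "west,6.0", "north,7.0"]
result = reduce_stage(shuffle_stage(map_stage(lines, parse_sale)), mean_amount)
print(result)  # {'east': 15.0, 'north': 7.0, 'west': 5.0}
```

On a real cluster the map and reduce calls run on many nodes and the shuffle moves data between them; here the sort and group-by simply happen in memory.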

Getting Started with Hadoop

Apache Hive

Data Warehouse System on top of Hadoop
SQL-based query language
SELECT, INSERT, CREATE TABLE
Includes some MapReduce-specific extensions

Thinking Like a Data Scientist

Solving The Right Problem

Scarcity vs. Abundance

The Star Schema

Going Supernova

Batch vs. Interactive Processing

Cloudera Impala

SAS LASR

Advanced Analytics on Hadoop

Data Science as ETL

Iterative Algorithms

Iterative Algorithms: Hadoop

Iterative Algorithms: SAS HPA

MapReduce and You

Iterative Algorithms: Getting Clever

Case Study: Rare Event Prediction

K-Means Clustering

K-Means Clustering: Lloyd's Algorithm
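
For readers who have not seen Lloyd's algorithm written out, here is a minimal pure-Python sketch of the standard textbook version: assign each point to its nearest center, recompute each center as the mean of its points, and repeat until nothing changes. The names lloyd and squared_distance, the sample points, and the starting centers are all hypothetical and not taken from the talk.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centers, max_iters=100):
    """Run Lloyd's iterations from the given starting centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                new_centers.append(old)  # keep an empty cluster's center as is
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers


points = [(0, 0), (0, 2), (2, 0), (10, 10), (10, 12), (12, 10)]
centers = lloyd(points, [(0, 0), (10, 10)])
print(centers)
```

Each iteration needs a full pass over the data, which is why the choice of starting centers matters so much at scale.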

K-Means++

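K-Means++ changes only how the starting centers are chosen: the first is picked uniformly at random, and each later one is sampled with probability proportional to its squared distance from the nearest center chosen so far. A minimal sketch of that seeding step follows. The function name kmeans_pp_seeds and the sample data are hypothetical, and the sketch assumes the input holds at least k distinct points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng):
    """Pick k starting centers by squared-distance-weighted sampling.

    Assumes `points` contains at least k distinct points; otherwise all
    weights can reach zero and rng.choices raises ValueError.
    """
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0, 0), (0, 2), (2, 0), (10, 10), (10, 12), (12, 10)]
seeds = kmeans_pp_seeds(points, 2, random.Random(0))
print(seeds)
```

Points that are already centers have weight zero, so they are never picked twice, and far-away points are favored, which tends to spread the seeds across the data.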

Scalable K-Means++ with Cloudera ML

Thinking About the Future

Data Science as Statistics

Data Science as Decision Engineering

Decisions Should Be Cheap.

Operational Analytics

Understanding Operational Analytics


Investigative Analytics

Question-driven
Interactive
Ad-hoc, post-hoc
Fixed data
Output is embedded into a report or in-database scoring engine

Operational Analytics

Metric-driven
Automated
Systematic
Fluid data
Output is a production system that makes customer-facing decisions

Building Data Products

Thank you!

Josh Wills, Director of Data Science, Cloudera

@josh_wills
