
Revolution Confidential

The Rise of Data Science in the Age of Big Data Analytics

Why Data Distillation and Machine Learning Aren't Enough

David M Smith, VP Marketing and Community
Revolution Analytics

Today, we'll discuss:

What is Data Science?
Why machine learning isn't enough
Why Data Science works
The Data Scientist's Toolkit
The Future of Big Data Analytics
Closing thoughts and resources


Dov Harrington, CC By-2.0 http://www.flickr.com/photos/idovermani/4110546683/

Where is it safe to fish near San Francisco?

San Francisco Estuary Institute http://www.sfei.org/tools/wqt

Hurricane Sandy

Bob Rudis http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/

Hurricane Sandy

Ed Chen http://blog.echen.me/hurricane-sandy-outages/

When did Michael Jackson have his biggest hits?

New York Times, June 25 2009 (3 hours after Michael Jackson's death) http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html

Three Essential Skills of Data Scientists

Data integration, mashups, applications
Models, visualization, predictions, uncertainty
Problems, data sources, credibility

Effective Data Applications

Drew Conway http://www.dataists.com/2010/09/the-data-science-venn-diagram/


Image Abode of Chaos, CC BY 2.0 http://www.flickr.com/photos/home_of_chaos/6418989233/

Machine learning (ML) for predictions

Building the model: responses and features go into ML, which produces scoring rules.
Scoring new data: the scoring rules are applied to new data to produce predictions (scores).
Validating the model: the scoring rules are applied to a validation set, and the predictions are compared with the known responses to measure accuracy.
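
The three steps on this slide (build, score, validate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the "model" is a single-feature cutoff chosen on the training rows, and the names `fit_threshold`, `score` and `accuracy` are hypothetical helpers, not part of any product or library mentioned in the talk.

```python
import random

def fit_threshold(features, responses):
    """Build the model: choose the cutoff that classifies the most training rows correctly."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(set(features)):
        correct = sum((x >= cutoff) == y for x, y in zip(features, responses))
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff

def score(cutoff, features):
    """Apply the scoring rule to rows the model has not seen."""
    return [x >= cutoff for x in features]

def accuracy(predictions, responses):
    """Share of predictions that match the known responses."""
    return sum(p == y for p, y in zip(predictions, responses)) / len(responses)

# Synthetic data (an assumption for illustration): one numeric feature, one boolean response.
rng = random.Random(0)
rows = []
for _ in range(200):
    response = rng.random() < 0.5
    feature = rng.gauss(2.0 if response else 0.0, 0.5)
    rows.append((feature, response))

# Hold out a validation set that plays no part in building the model.
train, validation = rows[:150], rows[150:]
cutoff = fit_threshold([f for f, _ in train], [r for _, r in train])
predictions = score(cutoff, [f for f, _ in validation])
val_accuracy = accuracy(predictions, [r for _, r in validation])
print(f"cutoff={cutoff:.2f} validation accuracy={val_accuracy:.2f}")
```

The point of the held-out validation set is that accuracy is measured on rows the scoring rule was not tuned on.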

Problem: A lack of perspective

Image 2010 David M Smith. Some rights reserved CC BY-2.0

Problem: Lack of credibility

Problem: Complexity

Data Science to the Rescue!


Answer Unasked Questions

Revolutions blog: The Uncanny Valley of Big Data http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html

Fill in knowledge gaps

"More data beats better algorithms, every time" -- Google

"Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue." -- Tim O'Reilly

Google Research, The Unreasonable Effectiveness of Data: http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html
Tim O'Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd
TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html

Avoid ineffective reactions

S&P 500

Stupid Data Miner Tricks http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf
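
One way to see why purely data-driven reactions can be ineffective is to search many unrelated series for the one that best "explains" a target. The sketch below is an illustration of that general pitfall, not a reproduction of the cited paper's analysis: the target and all candidates are random numbers, and the series length (10) and candidate count (1000) are arbitrary assumptions.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(42)

# Stand-in for ten yearly values of a market index (pure noise here).
target = [rng.gauss(0, 1) for _ in range(10)]

# One thousand candidate "predictors", each also pure noise.
candidates = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(1000)]

# Data dredging: keep whichever candidate correlates best with the target.
best = max(abs(pearson(c, target)) for c in candidates)
print(f"strongest correlation found among unrelated series: {best:.2f}")
```

Because every series is random, any strong correlation found this way reflects the size of the search, not a real relationship.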


Henricks Photos CC-BY-ND 2.0 http://www.flickr.com/photos/hendricksphotos/3240667626/


0. Data (Big & Messy)

1. A language for programming with data


Download the White Paper

R is Hot
bit.ly/r-is-hot


Data import and preprocessing


User-defined functions
Internet API interface
XML parsing
Iterative data processing
Custom graphics

Grant awards to homeless veterans FY09
Data: Data.gov
Analysis: Drew Conway

2. Speed. Lots and lots of speed.

Data
Variable Transformation
Feature Selection, Sampling, Aggregation
Model Estimation
Predictions
Model Comparison / Benchmarking
Model Refinement

Use all available computing cycles

Disk
Data
Shared Memory
Core 0 (Thread 0)
Core 1 (Thread 1)
Core 2 (Thread 2)
Core n (Thread n)
Multicore Processor (4, 8, 16+ cores)

3. Algorithms that don't choke on Big Data

BIG DATA is split into data partitions, one per compute node; a master node coordinates the compute nodes.

PEMAs: Parallel External-Memory Algorithms
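
The idea behind this layout can be sketched as follows: each compute node makes one pass over its own partition and returns a small summary, and the master node merges the summaries. This is a minimal sketch in Python, not Revolution's implementation; `PartialStats`, `summarize` and `combine` are hypothetical names, the partitions are in-memory lists rather than files, and the sum-of-squares variance formula is chosen for brevity rather than numerical robustness.

```python
from dataclasses import dataclass

@dataclass
class PartialStats:
    """Small, mergeable summary of one data partition."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

def summarize(partition):
    """Runs on a compute node: a single pass over its own partition only."""
    stats = PartialStats()
    for x in partition:
        stats.n += 1
        stats.total += x
        stats.total_sq += x * x
    return stats

def combine(partials):
    """Runs on the master node: merges the per-partition summaries."""
    merged = PartialStats()
    for p in partials:
        merged.n += p.n
        merged.total += p.total
        merged.total_sq += p.total_sq
    return merged

def mean_and_variance(stats):
    """Mean and sample variance recovered from the merged summary."""
    mean = stats.total / stats.n
    variance = (stats.total_sq - stats.n * mean * mean) / (stats.n - 1)
    return mean, variance

# Toy stand-in for a large dataset, split into four partitions.
data = [float(x) for x in range(1, 101)]
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]

# Each call to summarize is independent, so the calls could run in parallel.
partials = [summarize(p) for p in partitions]
mean, variance = mean_and_variance(combine(partials))
print(mean, variance)
```

No step needs the whole dataset in memory at once, which is what lets the same pattern scale to data read chunk by chunk from disk.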



Drink less coffee!

Single-threaded, non-optimized algorithms
Optimized, parallelized algorithms

4. Move code to data (not vice versa)

Map-Reduce

RHadoop: http://bit.ly/RHadoop
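
For readers new to the pattern, here is a toy, single-process sketch of the three Map-Reduce phases (map, shuffle, reduce) using a word count. It is an illustration only and does not use or mirror the RHadoop API; the function names and sample records are hypothetical. In a real cluster the map and reduce functions are shipped to the nodes that hold the data.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (key, value) pair for every word in one record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    """Run the three phases in order on a list of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

records = ["big data", "Big Data analytics", "data science"]
counts = map_reduce(records)
print(counts)
```

Because the map step looks at one record at a time and the reduce step at one key at a time, both can be distributed without moving the raw data.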


Big Data Appliances

More info: http://bit.ly/R-Netezza



Play Nice with Others

Presentation Layer: Business Intelligence Tools, Web-based data apps, Reporting / Spreadsheets
Analytics Layer: R
Data Layer: Relational datastores, Unstructured datastores

What every data scientist needs

Open-Source R
Interface with multiple data sources
Exploratory data analysis
Wide range of statistical methods
High-speed computation
Big Data support
Data/code locality (Hadoop, etc.)
Print-quality data visualization
Scheduled batch production
Works in a multi-tool ecosystem
Integration into Data Apps

Revolution R Enterprise

Revolution R Enterprise: Big-Data R

www.revolutionanalytics.com/products

Image www.tinyplanetphotography.com

And the future?
Even more data
Cloud computing
Demand for Data Scientists

Diverging paradigms for data analytics

http://www.indeed.com/jobtrends

Diverging data paradigms

[Diagram] Directions: "More data, better fault tolerance" and "Easier programming, better performance"
Technologies: Files, Clusters, Data Appliances, Hadoop, NoSQL
Tasks: Storage, Preprocessing, Exploration, Modeling, Production

Data Science in Production

Real-time Big Data Analytics: From Deployment to Production


Thursday, November 29, 2012 10:00AM - 11:00AM Pacific Time
www.revolutionanalytics.com/news-events/free-webinars/


Building Data Science Teams

DJ Patil in O'Reilly Radar: http://oreil.ly/I3H5fI
Statistics and Data Science graduates
Kaggle and Chorus
Revolution Analytics R Training: http://www.revolutionanalytics.com/services/training/

Closing Thoughts

The Data Science process leads to more powerful and more useful models
Data Scientists need a technology platform to think about, explore, and model data
Revolution R Enterprise is R for Big Data

Resources

Revolution R Enterprise: R for Big Data
www.revolutionanalytics.com/products

RHadoop: Connecting R and Hadoop
bit.ly/r-hadoop

Contact David Smith
david@revolutionanalytics.com
@revodavid
blog.revolutionanalytics.com

Thank you.

The leading commercial provider of software and support for the popular open source R statistics language.

www.revolutionanalytics.com

650.646.9545

Twitter: @RevolutionR
