Course goal
• Train students to be full-fledged data science practitioners who can execute end-to-end data science projects to achieve business results
Data Science
• The skill of extracting knowledge from data
• Using knowledge to predict the unknown
• Improve business outcomes with the power of data
• Employ techniques and theories drawn from broad areas of mathematics, statistics and information technology
Definitions
Across the web
Data Scientist
• A practitioner of data science
• Expertise in data engineering, analytics, statistics and the business domain
• Investigate complex business problems and use data to provide solutions
Data
The foundation of Data Science
Entity
• A thing that exists about which we research and predict in data science.
• An entity has a business context.
  • Customer of a business
  • Patient at a hospital. The same person can be a patient and a customer, but the business context is different.
  • Car. Entities can be non-living things.

Characteristics
• Every entity has a set of characteristics. These are unique properties.
• Properties too have a business context.
  • Customer: Age, income group, gender, education
  • Patient: Age, blood pressure, weight, family history
  • Car: Make, model, year, engine, VIN
Environment
• Environment points to the eco-system in which the entity exists or functions.
• Environment is shared among entities. Multiple entities belong to the same environment.
• Environment affects an entity's behavior.
  • Customer: Country, city, work place
  • Patient: City, climate
  • Car: Use (city/highway), climate

Event
• A significant business activity in which an entity participates.
• Events happen in a given environment.
  • Customer: Browsing, store visit, sales call
  • Patient: Doctor visit, blood test
  • Car: Smog test, comparison test
Behavior
• What an entity does during an event.
• Entities may have different behaviors in different environments.
  • Customer: Phone call vs. email, clickstream, response to offers
  • Patient: Nausea, light-headedness, cramps
  • Car: Skid, acceleration, stopping distances

Outcome
• The result of an activity deemed significant by the business.
• Outcome values can be
  • Boolean (Yes/No, Pass/Fail)
  • Continuous (a numeric value)
  • Class (identification of type)
• Examples
  • Customer: Sale (Boolean), sale value (continuous)
  • Patient: Blood pressure value (continuous), diabetes type (class)
  • Car: Smog levels (class), stopping distances (continuous), smog passed (Boolean), car type (class)
Observation
• A measurement of an event deemed significant by the business.
• Captures information about
  • Entities involved
  • Characteristics of the entities
  • Behavior
  • Environment in which the behavior happens
  • Outcomes
• An observation is also called a system of record.
  • Customer: A phone call record, a buying transaction, an email offer
  • Patient: A doctor visit record, a test result, a data capture from a monitoring device
  • Car: Service record, smog test result

Dataset
• A collection of observations
• Each observation is typically called a record
• Each record has a set of attributes that point to characteristics, behavior or outcomes
• A dataset can be
  • Structured (database records, spreadsheets)
  • Unstructured (Twitter feeds, newspaper articles)
  • Semi-structured (email)
• Data scientists collect and work on datasets to learn about entities and predict their future behavior/outcomes
Relationships
• Attributes in a dataset exhibit relationships
• Relationships "model" the real world and have a logical "explanation"
• For attributes A and B the relationships can be
  • When A occurs, B also occurs
  • When A occurs, B does not occur
  • When A increases, B also increases
  • When A increases, B decreases
• Relationships can involve multiple attributes too
  • When A is present and B increases, C will decrease

Learning
Discovering knowledge from Data
Fraud Detection
• Credit card frauds exhibit patterns in transactions
• Historical transaction data is used to identify fraudulent patterns and build models
• Each new transaction is then given a fraud score based on the model
• Action is taken for high-scoring transactions

Retailing
Sell more
Recommendations
• Items exhibit patterns in how they are bought together
  • Cell phones and accessories
  • Books
• Patterns are used to build affinity scores between items
• When one item is bought, items with high affinity scores to that item are recommended

Contact Centers
Improving efficiency
Goal Setting
• Every data science project will (and should) have a goal
• The team will focus their activities based on this goal
• Projects without goals are like cars without a driver
• Examples
  • Predict which customer(s) will churn in the next 3 months
Understanding Data
• Source of the data
• Processing/transformation steps performed
• Storage (enterprise databases, cloud, news feeds)
• Synchronization
• Relationships
• Ordering
• Understanding data helps the team identify possible sources of predictive patterns

Data Engineering
Get the data into the form you need
Data Persistence
• Processed data is stored in a reliable, retrievable data sink
• All relevant information is captured in a single local record as much as possible
  • Example: Retail store transaction (POS data + customer demographics + item characteristics + sales associate performance)
• Scaling and query performance are important factors when choosing a data sink
  • Flat files
  • Traditional SQL databases
  • Big Data technologies

Analytics
Learn and Predict
Modeling
• Use machine learning algorithms to build models
• Build multiple models based on different algorithms and different datasets
• Test models for accuracy
• Identify best performing models by
  • Accuracy
  • Response time
  • Resources
• A model can be as simple as an equation or a decision tree, or as complex as a neural network

Prediction
• Use the models built to predict outcomes for new data
• Keep validating model accuracy to make sure accuracy levels are consistent for different variations in data
• Response time and resource usage are critical when predictions need to happen in real time
• Measure improvements made to outcome predictions using the model
• Simulations might be performed to validate prediction benefits
Recommendations
• At the end of the project, recommendations need to be provided to the project owners on the algorithms to use and the expected benefits
• A data science project might have no recommendations to make if the dataset does not exhibit any exploitable patterns
  • This does not mean it's a failure
  • Sometimes unexpected patterns are discovered that might lead to other benefits
• A final presentation is made to stakeholders

Iterations
• Based on intermediate or at-the-end analysis and feedback, the analysis phase might be repeated with different objectives
• The project team "responds" to findings in the data, which might lead to multiple analysis paths

Production
Implement continuous processes
Summary
• Data Science projects follow a life cycle
• Data Science projects are research-type projects – there is a lot of experimentation and sometimes no end result
• Signals in the data drive results, not the algorithms
• Multiple iterations might be necessary before reasonable results are achieved

Statistics for Data Science
Goals
• Describe basic statistics for Data Science
• Explain the concepts
• Avoid formulae and mathematical representations as much as possible
Types of Data
What they are and what you can do with them.
Overview
• There are 4 types of data that you will deal with
• They differ in meaning and in what operations you can do on them
• Types
  • Categorical or nominal
  • Ordinal
  • Interval
  • Ratio

Categorical
• Represents categories or types
• Fixed list of values
• No implicit ordering or sequencing
• Examples
  • Fruits: apples, oranges, grapes
  • Players: defender, mid-fielder, forward
  • Cars: sedan, coupe, SUV
Ordinal
• Represents categories
• But there is ordering among the values
• Represents a scale
• Comparison possible (greater than, less than)
• Examples
  • Review rating: Outstanding, Very Good, Good, Fair, Bad
  • Pain levels: 1 to 10 (10 being the highest)
  • Student grades: A, B, C, D, F

Interval
• Numeric data
• Measurement where the difference is meaningful
• Represents time, distance, temperature, etc.
• Addition, subtraction possible
• Multiplication, division not possible
• Examples
  • Time of day
  • Dates
  • Distance between two points
  • Temperature
Ratio
• Numeric data
• All arithmetic operations possible
• True zero possible
• Examples
  • Weight
  • Speed
  • Amount

Comparison

  Operations               Nominal  Ordinal  Interval  Ratio
  Discrete values          Yes      Yes      Yes       Yes
  Continuous values        No       No       Yes       Yes
  Frequency distribution   Yes      Yes      Yes       Yes
  Median and percentiles   No       Yes      Yes       Yes
  Add / subtract           No       No       Yes       Yes
  Multiply / divide        No       No       No        Yes
  Mean, std. deviation     No       No       Yes       Yes
  Ratios                   No       No       No        Yes
  True zero                No       No       No        Yes
Summary Statistics
Describe data

Overview
• Describe a set of observations
• Observations have a number of data points; summary statistics are used to characterize them
• Describe
  • Central tendency: mean, median, mode
  • Variation: variance, standard deviation
  • Skew
  • Quartiles
Variance
• Describes how values are distributed around the mean
• If most values are close to the mean, low variance
• If there are significant differences in values, then high variance
• To compute
  • Find the mean
  • Subtract each value from the mean and square the difference
  • Variance is the average of the squared differences

  Value   Mean − Value   Square
  4        0             0
  6       −2             4
  3        1             1
  5       −1             1
  2        2             4

  (mean = 4; variance = (0 + 4 + 1 + 1 + 4) / 5 = 2)

Quartiles
• Describes the central tendency, distribution, range and skew in one set of measures
• Given a set of observations, we divide them into 4 equal sets
• The boundaries form the quartiles
• Example: the sorted values 1 2 4 4 5 7 8 8 9 9, split into four groups of 25% each
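As a quick illustration, the same numbers in R (a minimal sketch; note that R's var() uses the sample formula, dividing by n − 1 rather than n):

values <- c(4, 6, 3, 5, 2)
mean(values)                                      # 4
sum((values - mean(values))^2) / length(values)   # population variance: 2
var(values)                                       # sample variance: 2.5

scores <- c(1, 2, 4, 4, 5, 7, 8, 8, 9, 9)
quantile(scores)   # minimum, Q1, median, Q3, maximum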
Overview
• Distributions show how data values are spread in a given observation set
• Distributions contain a set of bins
• Data is grouped into bins based on counts

  Bin    Values    Count
  1-2    2, 2, 2   3
  3-4    4, 3      2
  5-6    6, 5      2
  7-8    7, 8      2
  9-10   9         1

[Figure: "Distribution of feedback scores", a bar chart of the counts per bin]
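A minimal sketch of producing the same chart in R, using the feedback values from the table above:

feedback <- c(2, 2, 2, 4, 3, 6, 5, 7, 8, 9)
hist(feedback, breaks = c(0, 2, 4, 6, 8, 10),
     main = "Distribution of feedback scores",
     xlab = "Bins", ylab = "Count")   # bar heights match the bin counts above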
Correlation
Relationships
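A small sketch of quantifying such relationships in R: cor() returns values near +1 when two attributes increase together and near −1 when one decreases as the other increases (the vectors below are made up for illustration):

a <- c(1, 2, 3, 4, 5)
b <- c(2, 3, 5, 9, 11)   # tends to increase with a
cor(a, b)                # close to +1: strong positive relationship
cor(a, rev(b))           # close to -1: strong negative relationship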
To Students
• Programming experience in another programming language is assumed; hence programming constructs are not explained in detail
• R offers an exhaustive list of functions; not all are covered. It is expected that the student uses other resources to search and master them
  • R Studio help
  • R website
  • The internet

Language Basics
Committed to Data Science
Variables
• Creating variables is easy and straightforward
• The assignment operator is "<-"; "=" can also be used
• Variable types
  • Character
  • Numeric
  • Boolean
• R is case sensitive, and so are variable names
• No explicit variable type declarations; a variable takes the type of the assigned value

var2 <- 3

Strings
• Strings are enclosed in single or double quotes
• Strings can be combined into a vector with c(); paste() joins strings together

con_string <- c("Programming", "Languages")

• Special characters can be included using the backslash "\" inside the string
• A number of string manipulation functions are available
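A minimal sketch of the basics above (variable names and values are illustrative):

lang <- "R"          # character
count <- 3           # numeric
ready <- TRUE        # Boolean (logical)
con_string <- c("Programming", "Languages")
paste(con_string[1], con_string[2])   # "Programming Languages"
nchar(con_string[1])                  # string length: 11
toupper(lang)                         # "R"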
Vectors

Overview
• Vectors are a key construct in R: the equivalent of one-dimensional arrays
• Most data handling happens through vectors
• A vector is a sequence/list of data elements of the SAME data type
• Vectors can hold any data type
• Vectors are created using c()

my_vector <- c(3, 5, 10)
char_vector <- c("apples", "oranges")
bool_vector <- c(FALSE, TRUE, FALSE)
• Negative index numbers are used to remove members from the output
• Out-of-range indexes result in NA
• Arithmetic between vectors is element-wise

vec1 <- c(1, 2, 4)   # inferred from the result shown below
vec2 <- c(4, 5, 6)
vec1 + vec2          # 5, 7, 10
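A short sketch of the indexing rules just described:

v <- c(10, 20, 30)
v[-2]    # negative index removes the second member: 10 30
v[5]     # out-of-range index: NA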
Lists

Overview
• Lists are similar to vectors
• A list can hold elements of different data types (a vector can only hold data of one type)
• Equivalent to "structures" in other languages

employee <- list(1001, "Joe Smith", FALSE)
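A quick sketch of accessing list elements; the element names added here are illustrative:

employee <- list(id = 1001, name = "Joe Smith", isManager = FALSE)
employee$name      # access by name: "Joe Smith"
employee[[1]]      # access by position: 1001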
Data Frames
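As a minimal sketch (column names assumed for illustration), a data frame is a table whose columns are vectors that may be of different types:

emp_df <- data.frame(empId = c(1, 2),
                     empName = c("John", "Mike"),
                     salary = c(1242.11, 3490.20))
emp_df$salary   # a single column, returned as a vector
emp_df[1, ]     # the first row
str(emp_df)     # column types at a glance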
Matrices

Overview
• Matrices are two-dimensional arrays, similar to data frames
• Only numeric values are allowed within a matrix
• Matrices can be converted to data frames and vice versa
• A number of operations on data frames apply to matrices also

matrix(c(2, 4, 3, 1, 5, 7),
       nrow = 3,
       ncol = 2)

     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
Factors

Overview
• Factor is a data type used to store categorical variables
• Any vector (including columns in data frames) can be converted to and from factors
• Factors permit special grouping operations on data, like "tables"
• A number of statistical and machine learning functions require factor data types
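A minimal sketch of creating a factor and grouping with it:

sizes <- c("small", "large", "medium", "small")
f <- factor(sizes, levels = c("small", "medium", "large"))
levels(f)   # "small" "medium" "large"
table(f)    # counts per category: small 2, medium 1, large 1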
Data Transformations

Sorting
• Any vector can be sorted using the sort() function
• A data frame can be sorted using the order() function
• Data frames can be sorted using multiple columns
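A short sketch of both functions (the example frame is made up):

v <- c(3, 1, 2)
sort(v)   # 1 2 3

df <- data.frame(dept = c("Sales", "Ops", "Sales"),
                 salary = c(900, 700, 400))
df[order(df$dept, -df$salary), ]   # by dept, then salary descending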
Merging
• Two data frames can be merged using a common column: merge()
• If the column names are the same, no specification is required
• Used for foreign-key operations (joins)

  empId  empName  isManager  salary   dept_id        dept_id  dept_name
  1      John     FALSE      1242.11  1              1        Sales
  2      Mike     TRUE       3490.20  1              2        Operations

  Merged:
  dept_id  empId  empName  isManager  salary   dept_name
  1        1      John     FALSE      1242.11  Sales
  1        2      Mike     TRUE       3490.20  Sales

Binning
• Continuous data can be converted to categorical data by binning
• Data is converted into pre-defined ranges, or bins
• The cut() function is used for binning

emp_df$sal_range <- cut(emp_df$salary, c(1, 2000, 5000))

  empName  salary   sal_range
  John     1242.11  (1,2e+03]
  Mike     3490.20  (2e+03,5e+03]
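A runnable sketch of the merge shown above:

employees <- data.frame(empId = c(1, 2),
                        empName = c("John", "Mike"),
                        isManager = c(FALSE, TRUE),
                        salary = c(1242.11, 3490.20),
                        dept_id = c(1, 1))
departments <- data.frame(dept_id = c(1, 2),
                          dept_name = c("Sales", "Operations"))
merge(employees, departments)   # joins on the shared dept_id column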
Input / Output Operations

Console Operations
• The scan() function can be used to read input from the console (and from files too)
• The print() function is used to print data to the console
• cat() also writes to the screen
• dir() / list.files() print a directory listing

Web Downloads
• Files can be downloaded directly from the web using download.file()
• Raw web pages can be scraped using getURL() (from the RCurl package)
Programming
Operators
• Comparison: <, <=, >, >=, ==, !=
• Logical: |, &
• Decimal places adjust based on operands and result

Control Structures
• For: for (var in seq) expr
• While: while (cond) expr
• Switch
• Ifelse: ifelse(test, yes, no)
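A small sketch of the looping and branching constructs listed above:

for (i in 1:3) print(i)   # for loop over a sequence

n <- 3
while (n > 0) {           # while loop
  print(n)
  n <- n - 1
}

x <- c(-2, 5)
ifelse(x > 0, "positive", "not positive")   # vectorized if/else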
Apply Functions

Overview
• A number of R programs require looping through a data frame and performing an operation on each record
• The "apply" class of functions provides a shortcut through which that entire operation can be performed in one function call
• The function applied to each row can be a user-defined function too
• Variants of the "apply" functions include apply, by, eapply, lapply, mapply, rapply, tapply

mat <- matrix(c(1:20), nrow = 10, ncol = 2)
apply(mat, 1, mean)   # one mean per row (margin 1)
apply(mat, 2, mean)   # one mean per column (margin 2)
Statistics in R

Overview
• R was originally built for statisticians, so there is no dearth of statistical functions and packages (two examples follow this list)
  • Descriptive statistics
  • Correlations
  • T-tests
  • Regression
  • ANOVA and its variants
  • Power analysis
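Two quick sketches of built-in statistical functions (the data is made up):

data <- c(12, 15, 9, 20, 14)
summary(data)           # descriptive statistics in one call
t.test(data, mu = 10)   # one-sample t-test against a mean of 10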
Graphics

Overview
• A picture is worth a thousand words
• With large amounts of data, graphical representations are the best way to understand trends
• Graphics play a huge role in all stages of data science
  • Identifying outliers during cleansing
  • Spotting good predictors during exploratory data analysis
  • Explaining results to project owners

Graphics in R
• Three systems
  • Base Plotting System
  • Lattice System
  • Grammar of Graphics (GG), implemented by the ggplot2 package
• Extensive set of functionality for type and control
• Use help files to explore the charting options available
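A minimal base-plotting sketch of the outlier-spotting use case above (made-up data):

set.seed(1)
values <- c(rnorm(50, mean = 10), 40)    # 50 typical points plus one outlier
boxplot(values, main = "Outlier check")  # the outlier shows up past the whisker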
Data Engineering

Overview
• Data Sources play a vital role in determining data and process architectures and workflows
• Data Quality and Reliability
• Network Planning
• Fault tolerance
• Security
Data Formats
• Tables – from RDBMS
• CSV – most common data exchange format
• XML – configuration / metadata
• JSON – new-age data exchange format
• Raw text
• Binary – images, voice streams
Data Acquisition Trends

Acquisition Intervals
• Types
  • Batch
  • Real time (push triggers)
  • Interval (e.g., every 30 minutes)
  • Hybrid
• Determined by
  • Analytics needs
  • Availability
  • Rate limits
  • Reliability
Data Imputation
• Any "value" present in a dataset is treated by machine learning algorithms as a valid value, including null, N/A, blank, etc.
• This makes populating missing data a key step that might affect prediction results
• Techniques for populating missing data
  • Mean, median, mode
  • Multiple imputation
  • Use regression to find values based on other values
  • Hybrid
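A minimal sketch of the simplest of these techniques, mean imputation:

salary <- c(1200, NA, 3400, 2800, NA)
salary[is.na(salary)] <- mean(salary, na.rm = TRUE)   # fill NAs with the mean
salary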
Data Transformations

Indicator Variables
• Categorical values are converted to indicator columns; each variable has 1/0 values
• The nth value is indicated by a 0 in all the indicator columns
• Indicator variables sometimes work better in predictions than their categorical counterparts

  Value    High  Medium
  High     1     0
  Medium   0     1
  Medium   0     1
  Low      0     0
  High     1     0

Centering and Scaling
• The values are "centered" by subtracting the mean value from them
• The values are "scaled" by dividing the above by the standard deviation
• ML algorithms give far better results with standardized values

  Raw      Raw      Scaled   Scaled
  11       161      -1.47    -0.78
  65       165       1.84    -0.47
  40       180       0.31     0.73
  51       169       0.98    -0.15
  20       176      -0.92     0.41
  (column means used: 35.00 and 170.86)
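A short sketch of both transformations in R (data made up; model.matrix drops one factor level, matching the "nth value is all zeros" scheme above):

x <- c(11, 65, 40, 51, 20)
scale(x)   # centered by the mean, divided by the standard deviation

level <- factor(c("High", "Medium", "Medium", "Low", "High"),
                levels = c("Low", "High", "Medium"))
model.matrix(~ level)[, -1]   # 1/0 indicator columns; "Low" rows are all zeros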
Types of Analytics

  Type of Analytics   Description
  Descriptive         Understand what happened
Machine Learning

Overview
• Data contains attributes
• Attributes show relationships (correlation) between entities
• Learning – understanding relationships between entities
• Machine Learning – a computer analyzing the data and learning about relationships
• Machine Learning results in a model built using the data
• Models can be used for grouping and prediction
Prediction Errors
• Predictions can be Boolean or classes
• True Positive (TP) and True Negative (TN) are the correct predictions
• False Negative (FN) can be critical in the medical field
• False Positive (FP) can be critical in the judicial field

Confusion matrix (prediction vs. actual):

                    Actual TRUE       Actual FALSE
  Predicted TRUE    True Positive     False Positive
  Predicted FALSE   False Negative    True Negative

Example counts:

                    Actual TRUE   Actual FALSE   Total
  Predicted TRUE    44            6              50
  Predicted FALSE   9             41             50
  Total             53            47             100

• Specificity
  • True negative rate
  • Specificity = TN / (TN + FP)
• Precision
  • Precision = TP / (TP + FP)
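Plugging in the example counts above:

TP <- 44; FP <- 6; FN <- 9; TN <- 41
specificity <- TN / (TN + FP)   # 41 / 47, about 0.872
precision   <- TP / (TP + FP)   # 44 / 50 = 0.88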
Regression Analysis
• Method of investigating the functional relationship between variables
• Estimate the value of dependent variables from the values of independent variables using a relationship equation
• Used when the dependent and independent variables are continuous and have some correlation
• Goodness of Fit analysis is important
Linear Regression
• Linear relationships take the form Y = αX + β
  • α = slope = ΔY/ΔX
  • β = intercept = the value of Y when X = 0
• A line can always be fitted for any set of points
• Best line = least residuals: the line for which the distance between the points and the line (called residuals) is minimized
• The equation of the line becomes the predictor for Y

[Figure: example scatter plots with a fitted line y = 1.5x + 2, R² = 1]
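A minimal sketch of fitting the example line with lm() (points generated to lie on y = 1.5x + 2):

x <- c(0, 2, 4, 6, 8)
y <- 1.5 * x + 2
fit <- lm(y ~ x)
coef(fit)                          # intercept 2, slope 1.5
predict(fit, data.frame(x = 10))   # the fitted line predicts Y = 17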
Decision Trees

Overview
• The simplest, easiest-to-understand and easiest-to-explain ML technique
• Predictor variables are used to build a tree that progressively predicts the target variable
• Trees start with a root node that starts the decision-making process
• Branch nodes refine the decision process
• Leaf nodes provide the decisions
• Training data is used to build a decision tree to predict the target
• The tree becomes the model that is used to predict on new data
36
6/28/2016
Predicting "Is Diabetic?"

Training data:

  Age  BMI  Diabetic
  24   22   N
  33   28   N
  41   36   Y
  48   24   N
  58   31   Y
  61   35   Y

Example tree: the root node asks "Is Age > 41?". On "No", the next node asks "BMI > 28?"; on "Yes", it asks "BMI > 24?". The leaf nodes predict N or Y.

• The depth of trees is highly influenced by the sequence in which the predictors are chosen for decisions
• Using predictors with high selectivity gives faster results
• ML implementations automatically make decisions on the sequence/preference of predictors
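A hedged sketch of building such a tree with the rpart package (one common choice; the control parameter is relaxed so this tiny dataset can split):

library(rpart)
patients <- data.frame(Age = c(24, 33, 41, 48, 58, 61),
                       BMI = c(22, 28, 36, 24, 31, 35),
                       Diabetic = factor(c("N", "N", "Y", "N", "Y", "Y")))
tree <- rpart(Diabetic ~ Age + BMI, data = patients,
              method = "class", control = rpart.control(minsplit = 2))
predict(tree, data.frame(Age = 50, BMI = 30), type = "class")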
Random Forests

Overview
• Random Forest is one of the most popular and accurate algorithms
• It is an ensemble method based on decision trees
• Builds multiple models, each model a decision tree
• For prediction, each tree is used to predict an individual result
• A vote is taken on all the results to find the best answer
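A sketch using the randomForest package (one common implementation), reusing the patients data from the decision-tree example:

library(randomForest)
rf <- randomForest(Diabetic ~ Age + BMI, data = patients, ntree = 100)
predict(rf, data.frame(Age = 50, BMI = 30))   # majority vote across the trees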
K-means Clustering

Overview
• Unsupervised learning technique
• Popular method for grouping data into subsets based on similarity
• Partitions n observations with m variables into k clusters, whereby each observation belongs to only one cluster
• How it works
  • An m-dimensional space is created
  • Each observation is plotted in this space based on its variable values
  • Clustering is done by measuring the distance between points and grouping them
• Multiple types of distance measures are available, like Euclidean distance and Manhattan distance

Example walkthrough (m = 2 variables, k = 2 clusters):
1. The dataset contains only m = 2 variables; plot the observations on a two-dimensional plot.
2. Choose k = 2 centroids at random.
3. Measure the distance between each observation and each centroid, and assign each observation to the nearest centroid. This forms the clusters for round 1.
4. Find the centroid of each cluster: the centroid is the point where the sum of distances between the centroid and each point is minimum.
5. Repeat the process of finding the distance between each observation and each (new) centroid, and reassign each point to the nearest one.
6. Repeat the process until the centroids don't move.
Association Rules Mining

Overview
• ARM shows how frequently sets of items occur together
  • Find items frequently bought together
  • Find fraudulent transactions
  • Frequent pattern mining / exploratory data analysis
  • Finding the next word
• One of the clustering techniques
• Assumes all data are categorical; not applicable to numeric data
• Helps generate association rules that can then be used for business purposes like stocking aisles
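A hedged sketch using the arules package (a common choice; the baskets are invented for illustration):

library(arules)
baskets <- list(c("phone", "case"),
                c("phone", "case", "charger"),
                c("book"),
                c("phone", "charger"))
trans <- as(baskets, "transactions")
rules <- apriori(trans, parameter = list(supp = 0.25, conf = 0.5))
inspect(rules)   # rules such as {case} => {phone}, with support/confidence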
Artificial Neural Networks

Overview
• Biologically inspired by how the human brain works
• A black-box algorithm (a full explanation would require a few hours and mathematical prerequisites)
• Used in the artificial intelligence domain and, of late, for machine learning
• Helps discover complex correlations hidden in the data, similar to the human brain
• Works well on noisy data and where the relationships between variables are vaguely understood
• Fast prediction
• Very slow training, and easy to overfit
Support Vector Machines: How it works
• Plot feature variables in a multi-dimensional plot
• Try to draw a hyperplane that separates similar groups of points
• Look for the maximum-margin hyperplane (the one that provides the most separation between the points)
• Support vectors are the data points that lie closest to the hyperplane
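A hedged sketch with e1071::svm, one common SVM implementation in R (toy data for two well-separated groups):

library(e1071)
df <- data.frame(x1 = c(1, 2, 1.5, 8, 9, 8.5),
                 x2 = c(1, 1.5, 2, 8, 8.5, 9),
                 grp = factor(c("A", "A", "A", "B", "B", "B")))
model <- svm(grp ~ x1 + x2, data = df, kernel = "linear")
model$SV                                     # the support vectors (scaled)
predict(model, data.frame(x1 = 5, x2 = 5))   # classify a new point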
Bagging
41
6/28/2016
Things to note
• May produce better results than the base classifier if the base classifier produces unstable results
• High resource requirements and takes longer to build models
• Various models available; the difference is the base classifier used (some examples)
  • adaBag
  • Bagged CART
  • Bagged Flexible Discriminant Analysis
  • Bagged Logic Regression
  • Model Averaged Neural Network

Boosting
Things to note
• High resource requirements and takes longer to build models
• Uses a set of weak learners to create a strong learner
• Reduces bias
• Different algorithms available
  • Boosted Classification Trees
  • Boosted Generalized Additive Model
  • Boosted Generalized Linear Model

Dimensionality Reduction
Principal Component Analysis
Best of luck!