
DSME2040

Business Analytics
Week 6

Keongtae Kim
Assistant Professor
Office: CYT 950
Email: keongkim@cuhk.edu.hk

Housekeeping Issues
• One-page proposal for the group project
  - Due this Friday (22nd) at 11:59pm
  - Clearly indicate your data source and business analytics objective
  - Submit via both Blackboard and VeriGuide
  - One submission per group
  - Please name your group if you have not already done so

Housekeeping Issues
• Midterm exam
  - 18th Mar during class (70 mins, from 2:30 to 3:40)
  - Covers concepts and interpretation of results
  - No need to use a laptop and no need to memorize Python code
  - Accounts for 20% of the total grade in this course

Regression

Linear Regression

Let's start from a simple linear model: one predictor

• Marketing problem: I know current egg prices on Monday in California. Can I predict egg sales for the rest of the week in California?
• I have: a two-year historical database
  - Target variable: weekly case sales of eggs
  - Predictor variable: egg prices
• Step one: look at the raw data…

Some of the historical data…

Cases Sold / Week    Egg.Prices
 96343                90.42
 96345                89.33
 96928                89.89
 93519                90.71
 99032                85.99
 91539                91.83
 89969                87.29
 90859                96.36
 99697                99.71
 88350                99.38
100383                97.53
 94415               100.77
 91813                98
100466                99.89
 96783               101.26
 91008               101.26
100324               100.28
106628               100.69
 98892               105.16
 98252               109.55

Let's use a graphical display for easier interpretation… (data understanding)
Scatterplot representation:
a graphical model that could be used for prediction

[Scatterplot: Cases (80000–180000) vs Egg.Pr (75–110)]

• Prediction: At $.80, how many cases would you predict would be sold? At $1.05?
• What cautions would you include with your predictions?
Scatterplot to Algebra:
figure out a simple (one-predictor) linear model, then extend to multiple predictors

• We want:
  - A formula that we can plug the prices into and that will give us the predicted case sales
  - And that will give us the error range in the predicted case sales
• First try: use the simplest formula around, a linear relation: Cases = a + b × Price
  - Use the data to calibrate (estimate a and b) this linear model

[Scatterplot: Cases (80000–180000) vs Egg.Pr (75–110), with the fitted line shown dashed]

Cases = 153414 − 554 × Price

How are these numbers found?

Minimize the sum of squared errors between the predicted cases (the dashed line) and the actual cases.
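A quick way to reproduce this calibration in Python is an ordinary least-squares line fit. This is a minimal sketch: the prices and cases arrays below hold only the first few rows of the historical data table above, so the fitted numbers will not exactly match the full-sample line.

```python
import numpy as np

# First few (price, cases) pairs from the historical egg data.
prices = np.array([90.42, 89.33, 89.89, 90.71, 85.99, 91.83, 87.29, 96.36])
cases = np.array([96343, 96345, 96928, 93519, 99032, 91539, 89969, 90859])

# np.polyfit with degree 1 performs an ordinary least-squares line fit
# and returns [slope, intercept].
slope, intercept = np.polyfit(prices, cases, 1)
print(f"Cases = {intercept:.0f} + ({slope:.0f}) x Price")
```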
Linear Regression

y = w_0 + w_1 x

[Diagram: the fitted line with intercept w0 and slope w1; for a given xi, the vertical gap ei between the observed value of y and the predicted value of y on the line is the error for that x value]
Linear Regression

• Method of least squares:
  - Estimates the best-fitting straight line
  - w0 and w1 are obtained by minimizing the sum of the squared errors (a.k.a. residuals)

  SSE = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - (w_0 + w_1 x_i))^2

• w1 can be obtained by setting the partial derivative of the SSE to 0 and solving for w1, ultimately resulting in:

  w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}
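These closed-form estimates are easy to check directly. A minimal sketch, assuming x and y are NumPy arrays of equal length:

```python
import numpy as np

def least_squares_line(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Closed-form least-squares estimates for y = w0 + w1*x."""
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the line through the point of means.
    w0 = y_bar - w1 * x_bar
    return w0, w1
```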
Goodness of Fit - RMSE
• RMSE: Root Mean Squared Error
• Square root of the error variance (the average squared difference between the true value and the estimated value)

True Value   Estimated Value   Error
127          132               −5
 78           76                2
120          122               −2
130          129                1
 95           91                4

RMSE = √((25 + 4 + 4 + 1 + 16) / 5) = √10 ≈ 3.16
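The same arithmetic in a few lines, using the two columns above:

```python
import numpy as np

true_vals = np.array([127, 78, 120, 130, 95])
est_vals = np.array([132, 76, 122, 129, 91])

# Square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((true_vals - est_vals) ** 2))
print(rmse)  # ~3.16
```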
Goodness of Fit - R2
• R² = (SST − SSE) / SST
  - SST: Sum of Squares Total
  - SSE: Sum of Squares Error

[Figure: the data shown with deviations measured against the fitted line (SSE) and against the mean (SST)]

R² = (4.907 − 0.86) / 4.907 ≈ 0.82
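The same ratio in code; a sketch where yhat stands in for any model's predicted values:

```python
import numpy as np

def r_squared(y: np.ndarray, yhat: np.ndarray) -> float:
    """R^2 = (SST - SSE) / SST: the share of total variation explained."""
    sst = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    sse = np.sum((y - yhat) ** 2)      # leftover variation around the fit
    return (sst - sse) / sst
```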
Interpreting the output
e.g. one predictor, Egg.Pr, of case sales

Coefficients:
            Estimate   Std. Error  t value  Pr(>|t|)
(Intercept) 153,414.5  15,992.9     9.593   5.96e-16 ***
Egg.Pr         -553.9     168.2    -3.293   0.00136  **

• Estimate: the coefficient of Egg.Pr, or the slope of the line in the plot of Cases vs Egg.Pr
• Std. Error: the estimated standard deviation of the slope. Roughly, there is a 95% chance that the true slope lies within 2 standard deviations, i.e., between −217 and −889.
• Pr(>|t|): the probability that the true slope is zero. More precisely, if the true slope were zero, the probability that we could have taken a sample that gave us an estimate of −553.9.
• What are the corresponding meanings for the intercept?
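The table above is R output; an equivalent table in Python comes from statsmodels. A minimal sketch, assuming the weekly egg data sits in a pandas DataFrame (the column is renamed because the dot in Egg.Pr clashes with formula syntax):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical DataFrame holding a few rows of the weekly egg data.
df = pd.DataFrame({
    "Cases": [96343, 96345, 96928, 93519, 99032, 91539, 89969, 90859],
    "Egg.Pr": [90.42, 89.33, 89.89, 90.71, 85.99, 91.83, 87.29, 96.36],
})
df = df.rename(columns={"Egg.Pr": "Egg_Pr"})

model = smf.ols("Cases ~ Egg_Pr", data=df).fit()
print(model.summary())  # Estimate, Std. Error, t value, Pr(>|t|) per coefficient
```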
Outliers
• A few sales data points are way outside the range of the majority of data points
• We could:
  - Drop them and re-calibrate
  - Explore to see if there is anything else that predicts them
• Check: egg sales increase at Easter, and drop immediately after Easter; these weeks are the outliers.
Predicting, knowing price and Easter season

• How much should we increase the sales prediction if it is Easter, or decrease it if just after Easter?
• Graphically, identify the Easter data and judge the increase and decrease in sales
• Go back…
Back to original Prices–Cases Scatterplot
Weeks 40 and 91 are the Easter weeks

[Scatterplot: Cases (80000–160000) vs Egg.Pr (75–110), with weeks 40 and 91 marked well above the other points]

Predict from the graphical model: It is next year, egg prices are 1.10, and it is Easter week. What do you expect sales to be?
Predicting, knowing price and Easter season

• Algebraically, perhaps add a variable…
  - Sales = a + b × Price + c × Easter ?
• Price represents a number that can be meaningfully multiplied and added. But Easter?
Some values of Cases, Egg.Pr, and the nominal Easter variable…

Easter takes the values: Non Easter, Pre Easter, Easter, and Post Easter.
That's a problem.
Create new indicator variables:

• IndEaster = 1 when Easter = Easter, IndEaster = 0 for all other values of Easter
• IndPreEaster = 1 when Easter = PreEaster, IndPreEaster = 0 for all other values of Easter
• IndPostEaster = 1 when Easter = PostEaster, IndPostEaster = 0 for all other values of Easter

Easter       IndEaster   IndPreEaster   IndPostEaster
NonEaster    0           0              0
PreEaster    0           1              0
Easter       1           0              0
PostEaster   0           0              1

• New model:
  Sales = a + b × Price + c × IndEaster + d × IndPreEaster + e × IndPostEaster
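pandas can build these indicator columns automatically. A small sketch (the DataFrame is hypothetical; listing NonEaster as the first category makes it the base level that gets dropped):

```python
import pandas as pd

df = pd.DataFrame({"Easter": ["NonEaster", "PreEaster", "Easter", "PostEaster"]})

# Order the levels so that NonEaster comes first...
df["Easter"] = pd.Categorical(
    df["Easter"], categories=["NonEaster", "Easter", "PreEaster", "PostEaster"]
)

# ...then drop_first=True removes NonEaster, leaving one indicator per other level.
indicators = pd.get_dummies(df["Easter"], prefix="Ind", drop_first=True)
print(indicators)  # columns: Ind_Easter, Ind_PreEaster, Ind_PostEaster
```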
Points to note with categorical variables and predictive models

• Indicator variables are usually called "dummy" variables
• Easter has 4 values; 3 indicator variables need to be created. The value without an indicator variable is the base level. The cases prediction for the base level is where all three indicators are zero.
• In this case, the base level was Non Easter, which is good for interpretation in this problem.
Interpreting the Output: The categorical variable Easter is shown as three indicator variables [level]

R CALIBRATION OUTPUT:
Coefficients:
                       Estimate    Pr(>|t|)
(Intercept)            115387.19   < 2e-16  ***
Egg.Pr                   -170.15   0.0813
Easter[T.Pre Easter]    32728.55   1.94e-08 ***
Easter[T.Easter]        76946.67   < 2e-16  ***
Easter[T.Post Easter]  -22096.43   8.25e-05 ***

SALES PREDICTION FORMULA:
Sales = 115387 − 170 × Price + 76946 × IndEaster + 32728 × IndPreEaster − 22096 × IndPostEaster
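The same calibration in Python, letting the formula interface expand Easter into indicators. A sketch on hypothetical rows (the Cases and Easter values below are made up for illustration; a real run would use the full two-year dataset):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "Cases": [96343, 96345, 160000, 93519, 99032, 120000, 89969, 90859],
    "Egg_Pr": [90.42, 89.33, 89.89, 90.71, 85.99, 91.83, 87.29, 96.36],
    "Easter": ["NonEaster", "NonEaster", "Easter", "NonEaster",
               "PreEaster", "PostEaster", "NonEaster", "NonEaster"],
})

# C(...) expands Easter into indicator variables;
# Treatment(reference=...) pins NonEaster as the base level.
model = smf.ols(
    "Cases ~ Egg_Pr + C(Easter, Treatment(reference='NonEaster'))", data=df
).fit()
print(model.params)  # intercept, price slope, one coefficient per Easter level
```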
Logistic Regression

Classification
• The response (dependent) variable is binary
  - "Hit" or "Flop" (e.g. a film or a song)
  - "Positive" or "Negative" (e.g. medical test results)
  - "Legitimate" or "Fraudulent" (e.g. credit card transaction)
  - "Payment" or "No Payment" (e.g. newspaper subscription)
• The goal is to predict binary outputs from one or more explanatory (independent) variables
  - Predict whether a film will be a hit or a flop using information about the cast, director, month of release, etc.
  - Predict whether a credit card transaction is fraudulent based on the location of the transaction
Logistic Regression or Logit Model

PHENOMENA we want to capture:
1. While the target data is binary, we will model the probability of choice
2. Probability is a continuous variable bounded by 0 and 1
3. The probability of choosing an option is related to the utility the individual would derive from the choice
4. Utilities are a linear function of customer characteristics (and possibly the attributes of the choice)
Logistic Regression
• Linear regression for the binary outcome?
• What are the limitations of using a linear regression?
Logistic Regression
• Regression model for modeling binary outcomes
• Logit function
  - Instead of using Y (or the probability p) as the dependent variable, we use a function of it, which is called the logit
  - Inverting the logit maps any value of the linear combination of predictors back into a probability in [0, 1]
The Logistic Regression Model
A nonlinear regression model:

logit = b_0 + b_1 Gender + b_2 Married + b_3 Income + b_4 Age + e

• Inserting logit = log(odds):

odds = \frac{p}{1-p} = \exp\{b_0 + b_1 Gender + b_2 Married + b_3 Income + b_4 Age + e\}

• Solving for p:

p = \frac{1}{1 + e^{-(b_0 + b_1 Gender + b_2 Married + b_3 Income + b_4 Age + e)}}

• Bottom line: logistic regression is a nonlinear function that maps any values of the input variables into a probability
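The bottom-line mapping is easy to see in code. A minimal sketch with illustrative coefficient values (the b's below are made up, not estimates from any real dataset):

```python
import math

def predict_prob(gender: int, married: int, income: float, age: float) -> float:
    """Map a linear score into a probability in (0, 1) via the logistic function."""
    b0, b1, b2, b3, b4 = -0.5, 0.3, 0.2, 0.0001, -0.02  # illustrative values only
    score = b0 + b1 * gender + b2 * married + b3 * income + b4 * age
    return 1.0 / (1.0 + math.exp(-score))

print(predict_prob(gender=1, married=0, income=30000, age=40))  # ~0.88
```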
The Coefficients of the Logistic Model

• β0 is called the intercept; β1, β2, … are the regression coefficients of the variables x1, x2, … respectively. The intercept is the value of Y when the value of all risk factors (variables) is zero.
• Each of the regression coefficients describes the size of the contribution of that predictor:
  - A positive regression coefficient means that an increase in the predictor increases the probability of the outcome
  - A negative regression coefficient means that an increase in the predictor decreases the probability of the outcome
  - A large (in absolute terms) regression coefficient means that the predictor strongly influences the probability of the outcome, provided the predictors are normalized; if not, we need to think about the scale of the independent variables ($1 vs. $1000)
  - A near-zero regression coefficient means that the predictor has little influence on the probability of the outcome (again, provided the predictors are normalized)
Logistic Regression for Classification

• Logistic regression outputs the probability of a categorical outcome
• It is most often used for classification
  - Example: logistic regression outputs the probability of a customer accepting the loan. A classification labels a customer as an accepter/nonaccepter.
• To end up with a classification we need a cut-off value c
  - Observations with probabilities above c are classified as belonging to class 1 (accepter)
  - Observations with probabilities below c are classified as belonging to class 0 (nonaccepter)
  - A popular value is 0.5
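In scikit-learn the probability-then-cutoff step looks like the sketch below; X and y are placeholder data standing in for real features and 0/1 accepter labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: two features (e.g. age, income), binary accepter labels.
X = np.array([[25, 30000], [40, 80000], [35, 50000], [50, 120000]])
y = np.array([0, 1, 0, 1])

clf = LogisticRegression().fit(X, y)

probs = clf.predict_proba(X)[:, 1]   # P(class 1) for each observation
labels = (probs >= 0.5).astype(int)  # apply the cut-off value c = 0.5
print(probs, labels)
```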
Interpreting the Coefficients:
The Odds Ratio (Optional)

• Odds:
  w(x_1, x_2, …, x_k) = exp(a + b_1 x_1 + b_2 x_2 + … + b_k x_k)

• Effect of increasing x_2 by one unit on the odds, holding all other explanatory variables constant:

Odds ratio = \frac{w(x_1, x_2 + 1, \ldots, x_k)}{w(x_1, x_2, \ldots, x_k)} = \frac{\exp(a + b_1 x_1 + b_2 (x_2 + 1) + \cdots + b_k x_k)}{\exp(a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k)} = \exp(b_2)

• exp(b_2) = the multiplicative factor by which the odds (of the event Y=1) increase when the value of x_2 is increased by 1 unit and everything else is held constant
Interpreting Coefficients of Continuous Predictors: Beer Preference Example (Optional)

Input variables   Coefficient    Std. Error   p-value      Odds
Constant term     -0.68189073    1.93081641   0.72396708   *
Gender            -0.77788508    0.71664554   0.27772108   0.45937654
Married            0.16966102    0.79447782   0.83089775   1.18490314
Income             0.00027846    0.00006335   0.00001103   1.00027847
Age               -0.22822094    0.05238947   0.00001323   0.79594839

• Estimated coefficient of Age: b_Age = −0.228, or, exp(b_Age) = 0.796.
• Implies that a 1-year increase in age decreases the odds of preferring light beer by a factor of 0.796, for those with the same gender, marital status & income
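The Odds column is just the exponential of the Coefficient column, which two lines confirm:

```python
import math

coef_age = -0.22822094
print(math.exp(coef_age))  # ~0.7959, matching the Odds column for Age

# Interpretation: each extra year of age multiplies the odds of preferring
# light beer by ~0.796, i.e. roughly a 20% drop, other variables held fixed.
```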
