
Hutcheson, G. D. (2011). Ordinary Least-Squares Regression. In L. Moutinho and G. D. Hutcheson, The SAGE Dictionary of Quantitative Management Research, pages 22*-22+.


Ordinary Least-Squares Regression
Introduction
Ordinary least-squares (OLS) regression is a generalized linear modelling technique that may be used to
model a single response variable which has been recorded on at least an interval scale. The technique may
be applied to single or multiple explanatory variables and also categorical explanatory variables that have
been appropriately coded.
Key Features
At a very basic level, the relationship between a continuous response variable (Y) and a continuous explanatory variable (X) may be represented using a line of best fit, where Y is predicted, at least to some extent, by X. If this relationship is linear, it may be appropriately represented mathematically using the straight-line equation 'Y = α + βX', as shown in Figure 1 (this line was computed using the least-squares procedure; see Ryan, 1997). The relationship between variables Y and X is described using the equation of the line of best fit, with α indicating the value of Y when X is equal to zero (also known as the intercept) and β indicating the slope of the line (also known as the regression coefficient). The regression coefficient β describes the change in Y that is associated with a unit change in X. As can be seen from Figure 1, β only provides an indication of the average expected change (the observed data are scattered around the line), making it important to also interpret the confidence intervals for the estimate (the large-sample 95% two-tailed approximation of the confidence intervals can be calculated as β ± 1.96 s.e. β).
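The mechanics can be illustrated with a short numerical sketch. Assuming nothing beyond NumPy, the code below fits the line Y = α + βX by least squares to a small set of made-up values and computes the large-sample 95% interval β ± 1.96 s.e.(β); the data are purely illustrative and not taken from this entry.

import numpy as np

# Made-up illustrative data (not taken from this entry)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])

n = len(x)
X = np.column_stack([np.ones(n), x])      # design matrix: a column of 1s and X

# Least-squares estimates of the intercept (alpha) and slope (beta)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta = coef

# Residual sum of squares and the usual residual variance estimate
residuals = y - X @ coef
rss = np.sum(residuals ** 2)
sigma2 = rss / (n - 2)                    # two estimated parameters

# Standard error of beta and the large-sample 95% interval: beta +/- 1.96 s.e.
se_beta = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
print(f"95% CI for beta: {beta - 1.96 * se_beta:.3f} to {beta + 1.96 * se_beta:.3f}")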
In addition to the model parameters and confidence intervals for β, it is useful to also have an indication of how well the model fits the data. Model fit can be determined by comparing the observed scores of Y (the values of Y from the sample of data) with the expected values of Y (the values of Y predicted by the regression equation). The difference between these two values (the deviation, or residual as it is also called) provides an indication of how well the model predicts each data point. Adding up the deviations for all the data points after they have been squared (this basically removes negative deviations) provides a simple measure of the degree to which the data deviate from the model overall. The sum of all the squared residuals is known as the residual sum of squares (RSS) and provides a measure of model fit for an OLS regression model. A poorly fitting model will deviate markedly from the data and will consequently have a relatively large RSS, whereas a good-fitting model will not deviate markedly from the data and will consequently have a relatively small RSS (a perfectly fitting model will have an RSS equal to zero, as there will be no deviation between observed and expected values of Y). It is important to understand how the RSS statistic (or the deviance, as it is also known; see Agresti, 1996, pages 96-97) operates, as it is used to determine the significance of individual variables and groups of variables in a regression model. A graphical illustration of the residuals for a simple regression model is provided in Figure 2. Detailed examples of calculating deviances from residuals for null and simple regression models can be found in Hutcheson and Moutinho, 2008.
The deviance is an important statistic as it enables the contribution made by explanatory variables to the prediction of the response variable to be determined. If, by adding a variable to the model, the deviance is greatly reduced, the added variable can be said to have had a large effect on the prediction of Y for that model. If, on the other hand, the deviance is not greatly reduced, the added variable can be said to have had a small effect on the prediction of Y for that model. The change in the deviance that results from the explanatory variable being added to the model is used to determine the significance of that variable's effect on the prediction of Y in that model. To assess the effect that a single explanatory variable has on the prediction of Y, one simply compares the deviance statistics before and after the variable has been added to the model. For a simple OLS regression model, the effect of the explanatory variable can be assessed by comparing the RSS statistic for the full regression model (Y = α + βX) with that for the null model (Y = α). The difference in deviance between the nested models can then be tested for significance using an F-test computed from the following equation:
F(df_p - df_p+q, df_p+q) = [(RSS_p - RSS_p+q) / (df_p - df_p+q)] / [RSS_p+q / df_p+q]
where p represents the null model, Y = α, p+q represents the model Y = α + βX, and df are the degrees of freedom associated with the designated model. It can be seen from this equation that the F-statistic is simply based on the difference in the deviances between the two models as a fraction of the deviance of the full model, whilst taking account of the number of parameters.
In addition to the model-fit statistics, the R-square statistic is also commonly quoted and provides a measure that indicates the percentage of variation in the response variable that is 'explained' by the model. R-square, which is also known as the coefficient of multiple determination, is defined as
R² = (total RSS - RSS after regression) / total RSS
and basically gives the percentage of the deviance in the response variable that can be accounted for by adding the explanatory variable into the model. Although R-square is widely used, it will always increase as variables are added to the model (the deviance can only go down when additional variables are added to a model). One solution to this problem is to calculate an adjusted R-square statistic (R²a), which takes into account the number of terms entered into the model and does not necessarily increase as more terms are added. Adjusted R-square can be derived using the following equation
R²a = R² - [k(1 - R²) / (n - k - 1)]
where n is the number of cases used to construct the model and k is the number of terms in the model (not
including the constant).
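Both statistics follow directly from the residual sums of squares. The sketch below assumes rss_null (the total RSS of the null model), rss_model (the RSS after regression), the number of cases n and the number of terms k are already available; the function names are ours.

def r_squared(rss_null, rss_model):
    # Proportion of the deviance in the response accounted for by the model
    return (rss_null - rss_model) / rss_null

def adjusted_r_squared(rss_null, rss_model, n, k):
    # Adjusted R-square: penalises R-square for the k terms in the model
    r2 = r_squared(rss_null, rss_model)
    return r2 - (k * (1 - r2)) / (n - k - 1)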
An example of simple OLS regression
A simple OLS regression model with a single explanatory variable can be illustrated using the example of predicting ice cream sales given outdoor temperature (Koteswara, 1970). The model for this relationship (calculated using software) is

Ice cream consumption = 0.207 + 0.003 temperature.

The parameter for α (0.207) indicates the predicted consumption when temperature is equal to zero. It should be noted that although the parameter α is required to make predictions of ice cream consumption at any given temperature, the prediction of consumption at a temperature of zero might be of limited usefulness, particularly when the observed data do not include a temperature of zero in their range (predictions should only be made within the limits of the sampled values). The parameter β indicates that for each unit increase in temperature, ice cream consumption increases by 0.003 units. The significance of the relationship between temperature and ice cream consumption can be estimated by comparing the deviance statistics for the two nested models in the table below: one that includes temperature and one that does not. This difference in deviance can be assessed for significance using the F-statistic.
Model                               deviance (RSS)   df   change in deviance   F-statistic   P-value
consumption = α                         0.1255       29
consumption = α + β temperature         0.0500       28        0.0755              42.28     <0.0001
On the basis of this analysis, outdoor temperature would appear to be significantly related to ice cream consumption, with each unit increase in temperature being associated with an increase of 0.003 units in ice cream consumption. Using these statistics it is a simple matter to also compute the R-square statistic for this model, which is 0.0755/0.1255, or 0.60. Temperature 'explains' 60% of the deviance in ice cream consumption (i.e., when temperature is added to the model, the deviance in the Y variable is reduced by 60%).
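The figures in the table can be checked with the helper functions sketched earlier (small rounding differences aside):

# Deviance (RSS) values taken from the table above
rss_null, df_null = 0.1255, 29   # consumption = alpha
rss_temp, df_temp = 0.0500, 28   # consumption = alpha + beta * temperature

f, p = nested_f_test(rss_null, df_null, rss_temp, df_temp)
print(f"F(1, 28) = {f:.2f}, p = {p:.4g}")                  # F is approximately 42.28
print(f"R-square = {r_squared(rss_null, rss_temp):.2f}")   # approximately 0.60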
OLS regression with multiple explanatory variables
The OLS regression model can be extended to include multiple explanatory variables by simply adding additional variables to the equation. The form of the model is the same as above with a single response variable (Y), but this time Y is predicted by multiple explanatory variables (X1 to X3).

Y = α + β1X1 + β2X2 + β3X3
The interpretation of the parameters (α and β) from the above model is basically the same as for the simple regression model above, but the relationship cannot now be graphed on a single scatter plot. α indicates the value of Y when all values of the explanatory variables are zero. Each β parameter indicates the average change in Y that is associated with a unit change in X, whilst controlling for the other explanatory variables in the model. Model fit can be assessed through comparing deviance measures of nested models. For example, the effect of variable X3 on Y in the model above can be calculated by comparing the nested models

Y = α + β1X1 + β2X2 + β3X3
Y = α + β1X1 + β2X2
The change in deviance between these models indicates the effect that X3 has on the prediction of Y when the effects of X1 and X2 have been accounted for (it is, therefore, the unique effect that X3 has on Y after taking into account X1 and X2). The overall effect of all three explanatory variables on Y can be assessed by comparing the models

Y = α + β1X1 + β2X2 + β3X3
Y = α.

The significance of the change in the deviance scores can be assessed through the calculation of the F-statistic using the equation provided above (these are, however, provided as a matter of course by most software packages). As with the simple OLS regression, it is a simple matter to compute the R-square statistics.
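The same deviance comparison extends directly to several explanatory variables. The sketch below uses NumPy with made-up data to fit the full model and a reduced model without X3, and then reuses the nested_f_test helper sketched earlier to assess the unique effect of X3; the data and the fit_rss helper are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

# Made-up illustrative data (not taken from this entry)
n = 30
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x3 + rng.normal(scale=0.5, size=n)

def fit_rss(columns, y):
    # Fit an OLS model with an intercept plus the given explanatory columns;
    # return its RSS and residual degrees of freedom.
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2), len(y) - X.shape[1]

rss_full, df_full = fit_rss([x1, x2, x3], y)   # Y = a + b1*X1 + b2*X2 + b3*X3
rss_red, df_red = fit_rss([x1, x2], y)         # Y = a + b1*X1 + b2*X2

f, p = nested_f_test(rss_red, df_red, rss_full, df_full)
print(f"Unique effect of X3: F(1, {df_full}) = {f:.2f}, p = {p:.4g}")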
An example of multiple OLS regression
A multiple OLS regression model with three explanatory variables can be illustrated using the example from the simple regression model given above. In this example, the price of the ice cream and the average income of the neighbourhood are also entered into the model. This model is calculated as

Ice cream consumption = 0.197 - 1.044 price + 0.033 income + 0.003 temperature.

The parameter for α (0.197) indicates the predicted consumption when all explanatory variables are equal to zero. The β parameters indicate the average change in consumption that is associated with each unit increase in the explanatory variable. For example, for each unit increase in price, consumption goes down by 1.044 units. The significance of the relationship between each explanatory variable and ice cream consumption can be estimated by comparing the deviance statistics for nested models. The table below shows the significance of each of the explanatory variables (shown by the change in deviance when that variable is removed from the model) in a form typically used by software (when only one parameter is assessed, the F-statistic is equivalent to the t-statistic, F = t², which is often quoted in statistical output).
coefficient      deviance change   df   F-value            P-value
price                 0.002         1   F(1,26) = 1.567     0.222
income                0.011         1   F(1,26) = 7.973     0.009
temperature           0.082         1   F(1,26) = 60.252   <0.0001
residuals             0.035        26
Within the range of the data collected in this study, temperature and income appear to be significantly related to ice cream consumption.
Conclusion
OLS regression is one of the major techniques used to analyse data and forms the basis of many other techniques (for example ANOVA and the generalised linear models; see Rutherford, 2001). The usefulness of the technique can be greatly extended with the use of dummy variable coding to include grouped explanatory variables (see Hutcheson and Moutinho, 2008, for a discussion of the analysis of experimental designs using regression) and data transformation methods (see, for example, Fox, 2002). OLS regression is particularly powerful as it is relatively easy to also check the model assumptions such as linearity, constant variance and the effect of outliers using simple graphical methods (see Hutcheson and Sofroniou, 1999).
Further Reading
!gresti" !. (-006). An Introduction to Categorical ata Anal!sis. Nohn Jiley and Sons" &nc.
,o" N. (8;;8). An R and S-"lus Co#panion to Applied Regression$ LondonO Sage Publications.
:utcheson" M. 9. and 7outinho" L. (8;;<). Statistical 7odeling %or 7anagement. Sage Publications.
:utcheson" M. 9. and So%roniou" K. (-000). %&e 'ulti(ariate Social Scientist$ LondonO Sage Publications.
?oteswara" /. ?. (-01;). Testing %or the &ndependence o% /egression 9isturbances. )cono#etrica" @<O"01-
--1.
/uther%ord" !. (8;;-). &ntroducing !KOL! and !KQOL!O a ML7 approach. LondonO Sage Publications.
/yan" T. P. (-001). 'odern Regression 'et&ods$ QhichesterO Nohn Jiley and Sons.
Graeme Hutcheson
Manchester University
