
Theory of Regression

The Course
16 (or so) lessons
Some flexibility
Depends how we feel
What we get through

Part 1: Theory of Regression

1. Models in statistics
2. Models with more than one parameter: regression
3. Why regression?
4. Samples to populations
5. Introducing multiple regression
6. More on multiple regression

Part 2: Application of Regression

7. Categorical predictor variables
8. Assumptions in regression analysis
9. Issues in regression analysis
10. Non-linear regression
11. Moderators (interactions) in regression
12. Mediation and path analysis

Part 3: Advanced Types of Regression

13. Logistic regression
14. Poisson regression
15. Introducing SEM
16. Introducing longitudinal multilevel models

House Rules

Jeremy must remember
Not to talk too fast

If you don't understand
Ask
Any time

If you think I'm wrong
Ask. (I'm not always right)

Learning New Techniques

The best kind of data for learning a new technique
Data that you know well, and understand

Your own data
In computer labs (esp. later on)
Use your own data if you like

My data
I'll provide you with
Simple examples, small sample sizes
Conceptually simple (even silly)

Computer Programs

SPSS
Mostly

Excel
For calculations

GPower
Stata (if you like)
R (because it's flexible and free)
Mplus (SEM, ML?)
AMOS (if you like)

Lesson 1: Models in Statistics
Models, parsimony, error, mean, OLS estimators

What is a Model?

What is a model?
A representation
Of reality
Not reality

A model aeroplane represents a real aeroplane
If model aeroplane = real aeroplane, it isn't a model

Statistics is about modelling
Representing and simplifying

Sifting
What is important from what is not important

Parsimony
In statistical models we seek parsimony
Parsimony = simplicity

Parsimony in Science
A model should:
1: explain a lot
2: use as few concepts as possible

The more it explains
The more you get

The fewer the concepts
The lower the price

Is it worth paying a higher price for a better model?

A Simple Model
Heights of five individuals:
1.40m
1.55m
1.80m
1.62m
1.63m

These are our DATA

A Little Notation

Y: the (vector of) data that we are modelling
Yi: the ith observation in our data

If Y = (4, 5, 6, 7, 8), then Y2 = 5

Greek letters represent the true values in the population.

β (beta): the parameters in our model (population values)
β0: the value of the first parameter of our model, in the population
βj: the value of the jth parameter of our model, in the population
ε (epsilon): the error in the population model

Normal letters represent the values in our sample. These are sample statistics, which are used to estimate population parameters.

b: the parameters in our model (sample statistics)
e: the error in our sample
Y: the data in our sample, which we are trying to model

Symbols on top change the meaning.

Y: the data in our sample which we are trying to model (repeated)
Ŷi: the estimated value of Y, for the ith case
Ȳ: the mean of Y

So b1 = β̂1
I will use b1 (because it is easier to type)

Not always that simple
Some texts and computer programs use:
b = the parameter estimate (as we have used)
β (beta) = the standardised parameter estimate
SPSS does this.

A capital letter is the set (vector) of parameters/statistics
B: the set of all parameters (b0, b1, b2, b3 ... bp)

These rules are not used very consistently (even by me).
Don't assume you know what someone means, without checking.

We want a model
To represent those data

Model 1:
1.40m, 1.55m, 1.80m, 1.62m, 1.63m
Not a model
A copy

VERY unparsimonious
Data: 5 statistics
Model: 5 statistics
No improvement

Model 2:
The mean (arithmetic mean)
A one parameter model

$\hat{Y}_i = b_0 = \bar{Y}$

$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$

Which, because we are lazy, can be written as

$\bar{Y} = \frac{\sum Y}{n}$

The Mean as a Model

The (Arithmetic) Mean

We all know the mean
The average
Learned about it at school
Forget (or didn't know) how clever the mean is

The mean is:
An Ordinary Least Squares (OLS) estimator
The Best Linear Unbiased Estimator (BLUE)

Mean as OLS Estimator

Going back a step or two
MODEL was a representation of DATA
We said we want a model that explains a lot
How much does a model explain?
DATA = MODEL + ERROR
ERROR = DATA − MODEL
We want a model with as little ERROR as possible

What is error?

Data (Y)   Model (b0 = mean)   Error (e)
1.40       1.60                -0.20
1.55       1.60                -0.05
1.80       1.60                 0.20
1.62       1.60                 0.02
1.63       1.60                 0.03

How can we calculate the amount of error?
Sum of errors

$\text{ERROR} = \sum e_i = \sum (Y_i - \bar{Y}) = \sum (Y_i - b_0)$
$= -0.20 - 0.05 + 0.20 + 0.02 + 0.03 = 0$

A sum of 0 implies no ERROR
Not the case

Knowledge about ERROR is useful
As we shall see later

Sum of absolute errors
Ignore the signs

$\text{ERROR} = \sum |e_i| = \sum |Y_i - \bar{Y}| = \sum |Y_i - b_0|$
$= 0.20 + 0.05 + 0.20 + 0.02 + 0.03 = 0.50$

Are small and large errors equivalent?
One error of 4
Four errors of 1
The same?

What happens with different data?

Y = (2, 2, 5)
b0 = 2
Not very representative

Y = (2, 2, 4, 4)
b0 = any value from 2 to 4
Indeterminate
There are an infinite number of solutions which would satisfy our criterion for minimum error

Sum of squared errors (SSE)

$\text{ERROR} = \sum e_i^2 = \sum (Y_i - \bar{Y})^2 = \sum (Y_i - b_0)^2$
$= (-0.20)^2 + (-0.05)^2 + 0.20^2 + 0.02^2 + 0.03^2 = 0.08$

Determinate
Always gives one answer

If we minimise SSE
We get the mean

Shown in the graph (below)
SSE plotted against b0
The minimum value of SSE occurs when b0 = mean

[Figure: SSE plotted against candidate values of b0 from 1.1 to 1.9; the curve reaches its minimum at b0 = 1.6, the mean.]
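This is easy to verify numerically. A minimal sketch in R (one of the programs listed earlier), using the five heights as the data:

    y  <- c(1.40, 1.55, 1.80, 1.62, 1.63)
    b0 <- seq(1.1, 1.9, by = 0.01)                 # candidate one-parameter models
    sse <- sapply(b0, function(b) sum((y - b)^2))  # SSE for each candidate
    b0[which.min(sse)]                             # 1.6 -- the minimum is at the mean
    mean(y)                                        # 1.6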

The Mean as an OLS Estimate

Mean as OLS Estimate

The mean is an Ordinary Least Squares (OLS) estimate
As are lots of other things

This is exciting because
OLS estimators are BLUE
Best Linear Unbiased Estimators
Proven with the Gauss-Markov theorem
Which we won't worry about

BLUE Estimators
Best
Minimum variance (of all possible unbiased estimators)
Narrower distribution than other estimators
e.g. median, mode

Linear
$\hat{Y} = \bar{Y}$
Linear predictions
For the mean: a linear (straight, flat) line

Unbiased
Centred around true (population) values
Expected value = population value
(The minimum, by contrast, is biased: the minimum in samples > the minimum in the population)

Estimators
Errrmm... they are estimators

Also consistent
As the sample size approaches infinity, estimates get closer to the population values
The variance shrinks

SSE and the Standard Deviation
Tying up a loose end

$SSE = \sum (Y_i - \bar{Y})^2$

$s^2 = \frac{\sum (Y_i - \bar{Y})^2}{N}$

$s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{N - 1}}$

SSE is closely related to the SD
The sample standard deviation s (with N in the denominator) is a biased estimator of the population SD (σ)

We need to know the mean to calculate the SD
This reduces N by 1
Hence divide by N − 1, not N
Like losing one df

Proof
That the mean minimises SSE
Not that difficult
As statistical proofs go

Available in
Maxwell and Delaney, Designing Experiments and Analyzing Data
Judd and McClelland, Data Analysis (out of print?)

What's a df?
The number of parameters free to vary
When one is fixed

The term comes from engineering
Movement available to structures

0 df
No variation available

1 df
Fix 1 corner, and the shape is fixed

Back to the Data

The mean has 5 (N) df
1st moment

The variance has N − 1 df
The mean has been fixed
2nd moment
Can think of it as the amount cases vary away from the mean

While we are at it
Skewness has N − 2 df
3rd moment

Kurtosis has N − 3 df
4th moment

Parsimony and df
The number of df remaining
Is a measure of parsimony

A model which contained all the data
Has 0 df
Not a parsimonious model

The normal distribution
Can be described in terms of mean and SD: 2 parameters
(z, with 0 parameters)

Summary of Lesson 1
Statistics is about modelling DATA
Models have parameters
Fewer parameters, more parsimony: better

Models need to minimise ERROR
Best model, least ERROR
Depends on how we define ERROR
If we define error as the sum of squared deviations from the predicted value
The mean is the best MODEL

Lesson 2: Models With One More Parameter: Regression

In Lesson 1 we said
Use a model to predict and describe data
The mean is a simple, one parameter model

$\hat{Y}_i = b_0 = \bar{Y}$

More Models
Slopes and Intercepts

More Models
The mean is OK
As far as it goes
It just doesn't go very far
Very simple prediction, which uses very little information

We often have more information than that
We want to use more information than that

House Prices
In the UK, two of the largest lenders (Halifax and Nationwide) compile house price indices
Predict the price of a house
Examine the effect of different circumstances

Look at change in prices
Guides legislation
E.g. interest rates, town planning

Predicting House Prices

Beds   Price (£000s)
1      77
2      74
1      88
3      62
5      90
5      136
2      35
5      134
4      138
1      55

One Parameter Model

The mean
$\bar{Y} = 88.9$
$\hat{Y} = b_0 = \bar{Y}$
$SSE = 11806.9$

How much is that house worth?
£88,900
We use 1 df to say that

Adding More Parameters

We have more information than this
We might as well use it
Add a linear function of the number of bedrooms (x1)

$\hat{Y} = b_0 + b_1 x_1$

Alternative Expression
Estimate of Y (expected value of Y):
$\hat{Y} = b_0 + b_1 x_1$
Value of Y:
$Y_i = b_0 + b_1 x_{i1} + e_i$

Estimating the Model

We can estimate this model in four different, equivalent ways
Provides more than one way of thinking about it

1. Estimating the slope which minimises SSE
2. Examining the proportional reduction in SSE
3. Calculating the covariance
4. Looking at the accuracy of the prediction

Estimate the Slope to Minimise SSE

Estimate the Slope

Stage 1
Draw a scatterplot
x-axis at the mean
Not at zero

Mark the errors on it
Called residuals
Sum and square these to find SSE

[Figure, shown on two slides: scatterplot of price (0–160, £000s) against bedrooms (1.5–5.5), with a horizontal line at the mean and the residuals marked.]

Add another slope to the chart
Redraw the residuals
Recalculate SSE
Move the line around to find the slope which minimises SSE

Find the slope

First attempt:
[Figure: the same scatterplot with a candidate regression line and its residuals.]

Any straight line can be defined with two parameters
The location (height) of the slope
b0
Sometimes called the constant

The gradient of the slope
b1

Gradient
[Figure: for every 1 unit moved along the x-axis, the line rises b1 units.]

Height
[Figure: the line crosses the y-axis at a height of b0 units.]

Height
If we fix the slope to zero
The height becomes the mean
Hence the mean is b0

Height is defined as the point where the slope hits the y-axis
The constant
The y-intercept

Why the constant?
b0 is really b0x0
Where x0 is 1.00 for every case
i.e. x0 is constant

Implicit in SPSS
Some packages force you to make it explicit
(Later on we'll need to make it explicit)

Beds (x1)   x0   Price (£000s)
1           1    77
2           1    74
1           1    88
3           1    62
5           1    90
5           1    136
2           1    35
5           1    134
4           1    138
1           1    55

Why the intercept?
It is where the regression line intercepts the y-axis
Sometimes called the y-intercept

Finding the Slope

How do we find the values of b0 and b1?
Start with: we jiggle the values, to find the best estimates which minimise SSE
An iterative approach
Computer intensive: it used to matter, it doesn't really any more
(With fast computers and sensible search algorithms; more on that later)

Start with
b0 = 88.9 (the mean)
b1 = 10 (a nice round number)
SSE = 14948: worse than it was

b0      b1    SSE
86.9    10    13828
66.9    10    7029
56.9    10    6628
46.9    10    8228
51.9    10    7178
51.9    12    6179
46.9    14    5957
...

Quite a long time later
b0 = 46.000372
b1 = 14.79182
SSE = 5921

This gives the position of the
Regression line, or
Line of best fit
Better than guessing

Not necessarily the only method
But it is OLS, so it is the best (it is BLUE)
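A minimal sketch of the same fit in R. The data are the bedrooms and prices from the table above (bedroom counts as reconstructed there); lm() solves the OLS problem directly, with no jiggling:

    beds  <- c(1, 2, 1, 3, 5, 5, 2, 5, 4, 1)
    price <- c(77, 74, 88, 62, 90, 136, 35, 134, 138, 55)
    fit <- lm(price ~ beds)   # ordinary least squares
    coef(fit)                 # intercept ~ 46.0, slope ~ 14.79
    sum(resid(fit)^2)         # SSE ~ 5921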

[Figure: scatterplot of actual price against number of bedrooms (0.5–5.5), with the predicted prices on the regression line.]

We now know
A house with no bedrooms is worth £46,000 (??!)
Adding a bedroom adds £15,000

This has told us two things
Don't extrapolate to meaningless values of the x-axis
The constant is not necessarily useful in itself
But it is necessary to estimate the equation

Standardised Regression Line
One big but:
It is scale dependent

Values change
£ to €, inflation

Scales change
£, £000s?

Need to deal with this

Don't express in raw units
Express in SD units
sd(x1) = 1.72
sd(y) = 36.21

b1 = 14.79
We increase x1 by 1, and the estimate of y increases by 14.79
14.79 units of y = (14.79 / 36.21) SDs = 0.408 SDs

Similarly, 1 unit of x1 = 1/1.72 SDs of x1
Increase x1 by 1 SD (= 1.72 units), and the estimate of y increases by 14.79 × 1.72 units

Put them both together:

$\beta = \frac{b_1 \sigma_{x_1}}{\sigma_y} = \frac{14.79 \times 1.72}{36.21} = 0.706$

The standardised regression line
The change (in SDs) in y associated with a change of 1 SD in x1

A different route to the same answer
Standardise both variables (subtract the mean, divide by the SD)
Find the line of best fit
The standardised regression line has a special name

The Correlation Coefficient (r)
(r stands for regression, but more on that later)

The correlation coefficient is a standardised regression slope
Relative change, in terms of SDs
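A quick numerical check of this idea, sketched in R with the bedrooms/price data from earlier (the variable names are mine):

    beds  <- c(1, 2, 1, 3, 5, 5, 2, 5, 4, 1)
    price <- c(77, 74, 88, 62, 90, 136, 35, 134, 138, 55)
    zx <- as.numeric(scale(beds))   # standardise: subtract the mean, divide by the SD
    zy <- as.numeric(scale(price))
    coef(lm(zy ~ zx))["zx"]         # ~ 0.706: the standardised slope
    cor(beds, price)                # ~ 0.706: the correlation -- the same number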

Proportional Reduction in Error

We might be interested in the level of improvement of the model
How much less error (as a proportion) do we have?
Proportional Reduction in Error (PRE)

Mean only:
Error(model 0) = 11806

Mean + slope:
Error(model 1) = 5921

$PRE = \frac{ERROR(0) - ERROR(1)}{ERROR(0)} = 1 - \frac{ERROR(1)}{ERROR(0)}$

$PRE = 1 - \frac{5921}{11806} = 0.4984$

But we squared all the errors in the first place
So we could take the square root
(It's a shoddy excuse, but it makes the point)

$\sqrt{0.4984} = 0.706$
This is the correlation coefficient
The correlation coefficient is the square root of the proportion of variance explained

Standardised Covariance

Standardised Covariance
We are still iterating
We need a closed form
An equation to solve to get the parameter estimates

The answer is a standardised covariance
A variable has variance
An amount of differentness

We have used SSE so far

SSE varies with N
Higher N, higher SSE

Divide by N
Gives SSE per person
(Actually N − 1; we have lost a df to the mean)

This is the variance
The same as SD²
We thought of SSE as a scattergram
Y plotted against X
(repeated in the figure below)

[Figure: the price-by-bedrooms scatterplot, repeated.]

Or we could plot Y against Y
Axes meet at the mean (88.9)
Draw a square for each point
Calculate the area of each square
Sum the areas

Sum of areas = SSE

Sum of areas divided by N (− 1) = variance

[Figure: plot of Y against Y, axes 0–180 meeting at the mean.]

Draw Squares
[Figure: the same plot with a square drawn for each point. E.g. the house at 138: 138 − 88.9 = 49.1 each way, giving area 49.1 × 49.1 = 2410.8. The house at 35: 35 − 88.9 = −53.9 each way, giving area −53.9 × −53.9 = 2905.21.]

What if we do the same procedure
Instead of Y against Y
Y against X

Draw rectangles (not squares)
Sum the areas
Divide by N − 1
This gives us the variance of x with y
The Covariance
Shortened to Cov(x, y)

[Figure: plot of Y against X with a rectangle for each point (the mean number of bedrooms is rounded to 3). E.g. the house at £55,000 with 1 bedroom: (55 − 88.9) × (1 − 3) = −33.9 × −2 = 67.8. The house at £138,000 with 4 bedrooms: (138 − 88.9) × (4 − 3) = 49.1 × 1 = 49.1.]

More formally (and easily)
We can state what we are doing as an equation
Where Cov(x, y) is the covariance

$Cov(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{N - 1}$

Cov(x, y) = 44.2
What do points in different sectors do to the covariance?

Problem with the covariance
It tells us about two things
The variances of X and Y
The covariance

Need to standardise it
Like the slope

Two ways to standardise the covariance
Standardise the variables first
Subtract the mean and divide by the SD
Or standardise the covariance afterwards

First approach
Much more computationally expensive
Too much like hard work to do by hand
Need to standardise every value

Second approach
Much easier
Standardise the final value only
Need the combined variance
Multiply the two variances
Find the square root (they were multiplied in the first place)

Standardised covariance

$\frac{Cov(x, y)}{\sqrt{Var(x) \times Var(y)}} = \frac{44.2}{\sqrt{2.9 \times 1311}} = 0.706$

The correlation coefficient
A standardised covariance is a correlation coefficient

$r = \frac{\text{Covariance}}{\sqrt{\text{variance} \times \text{variance}}}$

Expanded

$r = \frac{\sum (x - \bar{x})(y - \bar{y}) / (N - 1)}{\sqrt{\frac{\sum (x - \bar{x})^2}{N - 1} \times \frac{\sum (y - \bar{y})^2}{N - 1}}}$

This means
We now have a closed form equation to calculate the correlation
Which is the standardised slope
Which we can use to calculate the unstandardised slope

We know that:

$r = \beta = \frac{b_1 \sigma_{x_1}}{\sigma_y}$

Rearranging:

$b_1 = \frac{r \sigma_y}{\sigma_{x_1}} = \frac{0.706 \times 36.21}{1.72} = 14.79$

So the value of b1 is the same as from the iterative approach
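The whole closed-form chain, sketched in R (beds and price as defined in the earlier sketches):

    n <- length(beds)
    cov_xy <- sum((beds - mean(beds)) * (price - mean(price))) / (n - 1)  # 44.2
    r  <- cov_xy / sqrt(var(beds) * var(price))                           # 0.706
    b1 <- r * sd(price) / sd(beds)                                        # 14.79
    b0 <- mean(price) - b1 * mean(beds)                                   # 46.0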

The intercept
Just while we are at it

The variables are centred at zero
We subtracted the mean from both variables
The intercept is zero, because the axes cross at the means

Add the mean of y to the constant
Adjusts for centring y

Subtract the mean of x
But not the whole mean of x
Need to correct it for the slope

$c = \bar{y} - b_1 \bar{x}_1$
$c = 88.9 - 14.8 \times 2.9 = 46.0$

Naturally, the same

Accuracy of Prediction

One More (the Last One)

We have one more way to calculate the correlation
Looking at the accuracy of the prediction

Use the parameters
b0 and b1
To calculate a predicted value for each case

Beds   Actual Price   Predicted Price
1      77             60.80
2      74             75.59
1      88             60.80
3      62             90.38
5      90             119.96
5      136            119.96
2      35             75.59
5      134            119.96
4      138            105.17
1      55             60.80

Plot actual price against predicted price from the model

[Figure: scatterplot of predicted value (20–140) against actual value (20–160).]
The correlation between actual and predicted values: r = 0.706

Seems a futile thing to do
And at this stage, it is
But later on, we will see why it matters

Some More Formulae

For hand calculation (x and y as deviations from their means):

$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$

Point biserial:

$r = \frac{(M_{y1} - M_{y0})\sqrt{PQ}}{sd_y}$

Phi (φ)
Used for 2 dichotomous variables

                Vote P   Vote Q
Homeowner       A: 19    B: 54
Not homeowner   C: 60    D: 53

$r_\phi = \frac{BC - AD}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}$

Problem with the phi correlation
Unless Px = Py (or Px = 1 − Py)
The maximum (absolute) value is < 1.00
The tetrachoric correlation can be used instead

Rank (Spearman) correlation
Used where data are ranked

$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$

Summary
The mean is an OLS estimate
OLS estimates are BLUE

Regression line
Best prediction of DV from IV
An OLS estimate (like the mean)

Standardised regression line
A correlation

Four ways to think about a correlation
1. Standardised regression line
2. Proportional Reduction in Error (PRE)
3. Standardised covariance
4. Accuracy of prediction

Lesson 3: Why Regression?
A little aside, where we look at why regression has such a curious name.

Regression
"The or an act of regression; reversion; return towards the mean; return to an earlier stage of development, as in an adult's or an adolescent's behaving like a child" (from Latin gradi, to go)

So why give this name to a statistical technique which is about prediction and explanation?

Francis Galton
Charles Darwin's cousin
Studying heritability

Tall fathers have shorter sons
Short fathers have taller sons
"Filial regression toward mediocrity"
Regression to the mean

Galton thought this was biological fact
An evolutionary basis?

Then he did the analysis backward
Tall sons have shorter fathers
Short sons have taller fathers

Regression to the mean
Not biological fact: statistical artefact

Other Examples
Secrist (1933): The Triumph of Mediocrity in Business
Second albums often tend to not be as good as the first
The sequel to a film is not as good as the first one
The curse of Athletics Weekly
Parents think that punishing bad behaviour works, but rewarding good behaviour doesn't

Pair Link Diagram
An alternative to a scatterplot

[Figure: pair link diagrams. With r = 1.00 the lines linking each x to its y are parallel; with r = 0.00 the lines cross haphazardly.]

From Regression to Correlation
Where do we predict an individual's score on y will be, based on their score on x?
Depends on the correlation

r = 1.00: we know exactly where they will be
r = 0.00: we have no idea
r = 0.50: we have some idea

[Figure: pair link diagrams for an individual. r = 1.00: starts here, will end up here. r = 0.00: starts here, could end anywhere. r = 0.50: starts here, will probably end somewhere around here.]

Galton Squeeze Diagram
Don't show individuals
Show groups of individuals, from the same (or similar) starting point
Shows regression to the mean

[Figure: Galton squeeze diagrams. r = 0.00: groups starting at the extremes all end at the mean; r = 0.50: groups move halfway towards the mean; r = 1.00: groups stay where they started.]

[Figure: for every 1 unit a group starts from the mean, it ends r units from the mean.]

Correlation is the amount of regression that doesn't occur

No regression: r = 1.00
Some regression: r = 0.50
Lots of (maximum) regression: r = 0.00

Formula

$\hat{z}_y = r_{xy} z_x$

Conclusion
Regression towards the mean is a statistical necessity
Amount of regression = perfection (r = 1.00) − correlation
Very non-intuitive
Interest in regression and correlation came from examining the extent of regression towards the mean
By Pearson, who worked with Galton
We are stuck with the curious name

See also Paper B3

Lesson 4: Samples to Populations: Standard Errors and Statistical Significance

The Problem

In the social sciences
We investigate samples

Theoretically
Randomly taken from a specified population
Every member has an equal chance of being sampled
Sampling one member does not alter the chances of sampling another

Not the case in (say) physics, biology, etc.

Population
But it's the population that we are interested in
Not the sample
A population statistic is represented with a Greek letter, e.g. μx
A hat means an estimate, e.g. μ̂x

Sample statistics (e.g. the mean) estimate population parameters
We want to know
The likely size of the parameter
Whether it is > 0

Sampling Distribution
We need to know the sampling distribution of a parameter estimate
How much does it vary from sample to sample?

If we make some assumptions
We can know the sampling distribution of many statistics
Start with the mean

Sampling Distribution of the Mean
Given
Normal distribution
Random sample
Continuous data

The mean has a known sampling distribution
Repeated sampling will give a known distribution of means
Centred around the true (population) mean (μ)

Analysis Example: Memory

Difference in memory for different words
10 participants given a list of 30 words to learn, and then tested
Two types of word
Abstract: e.g. love, justice
Concrete: e.g. carrot, table

Concrete   Abstract   Diff (x)
12         4          8
11         7          4
4          6          -2
9          12         -3
8          6          2
12         10         2
9          8          1
8          5          3
12         10         2
8          4          4

$\bar{x} = 2.1$, $s_x = 3.11$, $N = 10$

Confidence Intervals
This means
If we know the mean in our sample
We can estimate where the mean in the population (μ) is likely to be

Using
The standard error (se) of the mean
Represents the standard deviation of the sampling distribution of the mean

[Figure: normal curve. 1 SD either side of the mean contains 68%; almost 2 SDs contain 95%.]

We know the sampling distribution of the mean
t distributed
Normal with large N (> 30)

We know the range within which means from other samples will fall
Therefore the likely range of μ

$se(\bar{x}) = \frac{s_x}{\sqrt{n}}$

Two implications of the equation
Increasing N decreases the SE
But only a bit (it is the square root)

Decreasing the SD decreases the SE

Calculate confidence intervals
From standard errors

95% is the standard level of CI
In 95% of samples, the CIs will contain the true mean
In large samples: 95% CI = 1.96 SE
In smaller samples: depends on the t distribution (df = N − 1 = 9)

$\bar{x} = 2.1$, $s_x = 3.11$, $N = 10$

$se(\bar{x}) = \frac{s_x}{\sqrt{n}} = \frac{3.11}{\sqrt{10}} = 0.98$

$95\% \ CI = 2.26 \times 0.98 = 2.22$

$\bar{x} - CI = -0.12$, $\bar{x} + CI = 4.32$
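A sketch of the same calculation in R, using the difference scores from the table above:

    d  <- c(8, 4, -2, -3, 2, 2, 1, 3, 2, 4)
    se <- sd(d) / sqrt(length(d))                  # 0.98
    mean(d) + c(-1, 1) * qt(0.975, df = 9) * se    # -0.12 to 4.32
    t.test(d)                                      # same CI, plus t = 2.14, p = 0.061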

What is a CI?
(For a 95% CI):
A 95% chance that the true (population) value lies within the confidence interval?
Or: in 95% of samples, the confidence interval will contain the true mean?

Significance Test
The probability that μ is a certain value
Almost always 0
Doesn't have to be, though

We want to test the hypothesis that the difference is equal to 0
i.e. find the probability of this difference occurring in our sample IF μ = 0
(Not the same as the probability that μ = 0)
Calculate SE, and then t


t has a known sampling distribution
Can test probability that a certain
value is included

x
t
se(x )

2.1
t
2.14
0.98
p 0.061
161

Other Parameter Estimates
Same approach for
Prediction, slope, intercept, predicted values
At this point, prediction and slope are the same
They won't be later on

We will look at one predictor only
More complicated with > 1

Testing the Degree of Prediction
Prediction is the correlation of Y with Ŷ
The correlation, when we have one IV

Use F, rather than t
Start with SSE for the mean only
This is SStotal
Divide this into SSresidual and SSregression

SStot = SSreg + SSres

$F = \frac{SS_{reg} / df_1}{SS_{res} / df_2}$

$df_1 = k$, $df_2 = N - k - 1$

Back to the house prices
Original SSE (SStotal) = 11806
SSresidual = 5921
What is left after our model

SSregression = 11806 − 5921 = 5885
What our model explains

Slope = 14.79
Intercept = 46.0
r = 0.706

$F = \frac{SS_{reg} / df_1}{SS_{res} / df_2} = \frac{5885 / 1}{5921 / (10 - 1 - 1)} = 7.95$

$df_1 = k = 1$, $df_2 = N - k - 1 = 8$
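The same F ratio, as a sketch in R from the sums of squares:

    ss_tot <- 11806; ss_res <- 5921
    ss_reg <- ss_tot - ss_res                      # 5885
    f <- (ss_reg / 1) / (ss_res / 8)               # 7.95
    pf(f, df1 = 1, df2 = 8, lower.tail = FALSE)    # ~ 0.02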

F = 7.95, df = 1, 8, p = 0.02
We can reject H0
H0: the prediction is not better than chance

A significant effect

Statistical Significance: What does a p-value (really) mean?

A Quiz
Six questions, each true or false
Write down your answers (if you like)
An experiment has been done. Carried out perfectly. All assumptions perfectly satisfied. Absolutely no problems.
p = 0.01
Which of the following can we say?

1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).

2. You have found the probability of the null hypothesis being true.

3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

4. You can deduce the probability of the experimental hypothesis being true.

5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

OK, What is a p-value?

Cohen (1994)
"[a p-value] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe it does" (p. 997).

OK, What is a p-value?

Sorry, didn't answer the question
It's: "The probability of obtaining a result as or more extreme than the result we have in the study, given that the null hypothesis is true"
Not the probability that the null hypothesis is true

A Bit of Notation
Not because we like notation
But with it we have to say a lot less

Probability: P
Null hypothesis is true: H
Result (data): D
Given: |

What's a P Value?
P(D|H)
The probability of the data occurring if the null hypothesis is true

Not
P(H|D)
The probability that the null hypothesis is true, given that we have the data

P(H|D) ≠ P(D|H)

What is the probability you are Prime Minister, given that you are British?
P(M|B)
Very low

What is the probability you are British, given you are Prime Minister?
P(B|M)
Very high

P(M|B) ≠ P(B|M)

There's been a murder
Someone bumped off a statto for talking too much

The police have DNA
The police have your DNA
They match(!)

The DNA matches 1 in 1,000,000 people
What's the probability you didn't do the murder, given the DNA match: P(H|D)?

The police say:
P(D|H) = 1/1,000,000

Luckily, you have Jeremy on your defence team
We say:
P(D|H) ≠ P(H|D)

The probability that someone matches the DNA, who didn't do the murder
Incredibly high

Back to the Questions

Haller and Kraus (2002)
Asked those questions of groups in Germany
Psychology students
Psychology lecturers and professors (who didn't teach stats)
Psychology lecturers and professors (who did teach stats)

1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).

Answered "true" by:
34% of students
15% of professors/lecturers
10% of professors/lecturers teaching statistics

False. We have found evidence against the null hypothesis.

2. You have found the probability of the null hypothesis being true.

32% of students
26% of professors/lecturers
17% of professors/lecturers teaching statistics

False. We don't know.

3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

20% of students
13% of professors/lecturers
10% of professors/lecturers teaching statistics

False.

4. You can deduce the probability of the experimental hypothesis being true.

59% of students
33% of professors/lecturers
33% of professors/lecturers teaching statistics

False.

5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

68% of students
67% of professors/lecturers
73% of professors/lecturers teaching statistics

False (though the probability of replication can be worked out)

6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

41% of students
49% of professors/lecturers
37% of professors/lecturers teaching statistics

False. Another tricky one
It can be worked out

One Last Quiz

I carry out a study
All assumptions perfectly satisfied
Random sample from the population
I find p = 0.05

You replicate the study exactly
What is the probability you find p < 0.05?

I carry out a study
All assumptions perfectly satisfied
Random sample from the population
I find p = 0.01

You replicate the study exactly
What is the probability you find p < 0.05?

Significance testing creates boundaries and gaps where none exist.
Significance testing means that we find it hard to build upon knowledge
We don't get an accumulation of knowledge

Yates (1951)
"the emphasis given to formal tests of significance ... has resulted in ... an undue concentration of effort by mathematical statisticians on investigations of tests of significance applicable to problems which are of little or no practical importance ... and ... it has caused scientific research workers to pay undue attention to the results of the tests of significance ... and too little to the estimates of the magnitude of the effects they are investigating"

Testing the Slope

Same idea as with the mean
Estimate the 95% CI of the slope
Estimate the significance of the difference from a value (usually 0)

Need to know the SD of the slope
Similar to the SD of the mean

$s_{y.x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - k - 1}} = \sqrt{\frac{SS_{res}}{N - k - 1}} = \sqrt{\frac{5921}{8}} = 27.2$

Similar to the equation for the SD of the mean
Then we need the standard error
Similar (ish)
When we have the standard error
We can go on to the 95% CI
And the significance of the difference

$se(b_{y.x}) = \frac{s_{y.x}}{\sqrt{\sum (x - \bar{x})^2}} = \frac{27.2}{\sqrt{26.9}} = 5.24$

Confidence Limits
95% CI
t dist with N − k − 1 = 8 df: t = 2.31
CI = 5.24 × 2.31 = 12.1

95% confidence limits:
14.8 − 12.1 = 2.7
14.8 + 12.1 = 26.9

Significance of the difference from zero
i.e. the probability of getting this result if β = 0
Not the probability that β = 0

$t = \frac{b}{se(b)} = \frac{14.8}{5.24} = 2.81$

$df = N - k - 1 = 8$, $p = 0.02$

This probability is (of course) the same as the value for the prediction

Testing the Standardised Slope (Correlation)
The correlation is bounded between −1 and +1
It does not have a symmetrical distribution, except around 0

Need to transform it
Fisher z transformation: approximately normal

$z' = 0.5[\ln(1 + r) - \ln(1 - r)]$

$SE_{z'} = \frac{1}{\sqrt{n - 3}}$

$z' = 0.5[\ln(1 + 0.706) - \ln(1 - 0.706)] = 0.879$

$SE_{z'} = \frac{1}{\sqrt{10 - 3}} = 0.38$

95% CIs:
0.879 − 1.96 × 0.38 = 0.13
0.879 + 1.96 × 0.38 = 1.62

Transform back to a correlation

$r = \frac{e^{2z'} - 1}{e^{2z'} + 1}$

95% CIs = 0.13 to 0.92
Very wide
Small sample size
Maybe that's why CIs are not reported?
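A sketch of the whole interval in R (atanh() and tanh() are the Fisher transformation and its inverse):

    r <- 0.706; n <- 10
    z  <- 0.5 * (log(1 + r) - log(1 - r))      # 0.879; identical to atanh(r)
    se <- 1 / sqrt(n - 3)                      # 0.378
    ci_z <- z + c(-1.96, 1.96) * se            # 0.14 to 1.62 on the z scale
    (exp(2 * ci_z) - 1) / (exp(2 * ci_z) + 1)  # back to r: ~ 0.13 to 0.92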

Using Excel
Functions in Excel:
Fisher() to carry out the Fisher transformation
Fisherinv() to transform back to a correlation

The Others
The same ideas apply to the calculation of CIs and SEs for
The predicted score
Gives the expected range of values given X

Same for the intercept
But we have probably had enough

Lesson 5: Introducing Multiple Regression

Residuals
We said
Ŷ = b0 + b1x1

We could have said
Yi = b0 + b1xi1 + ei

We ignored the i on the Y
And we ignored the ei
It's called error, after all

But it isn't just error
It is trying to tell us something

What Error Tells Us

Error tells us that a case has a different score for Y than we predict
There is something about that case

Called the residual
What is left over, after the model

It contains information
Something is making the residual ≠ 0
But what?

[Figure: the price-by-bedrooms scatterplot with actual and predicted prices. One house sits well above its predicted price ("swimming pool"); another sits well below ("unpleasant neighbours").]

The residual (+ the mean) is the value of Y
If all cases were equal on X
It is the value of Y, controlling for X
Other words:
Holding constant
Partialling
Residualising
Conditioned on

Beds   Price (£000s)
1      77
2      74
1      88
3      62
5      90
5      136
2      35
5      134
4      138
1      55

Sometimes adjustment is enough on its own
Measure performance against criteria

Teenage pregnancy rate
Measure pregnancy and abortion rates in areas
Control for socio-economic deprivation, and anything else important
See which areas have lower teenage pregnancy and abortion rates, given the same level of deprivation

Value added education tables
Measure school performance
Control for initial intake

Control?
In experimental research
Use experimental control
e.g. same conditions, materials, time of day, accurate measures, random assignment to conditions

In non-experimental research
Can't use experimental control
Use statistical control instead

Analysis of Residuals
What predicts differences in crime rate
After controlling for socio-economic deprivation?
Number of police?
Crime prevention schemes?
Rural/urban proportions?
Something else?

This is what regression is about

Exam performance
Consider the number of books a student read (books)
And the number of lectures (max 20) a student attended (attend)

Books and attend as IVs, grade as DV

Books   Attend   Grade
0       9        45
1       15       57
0       10       45
2       16       51
4       10       65
4       20       88
1       11       44
4       20       87
3       15       89
0       15       59

First 10 cases

Use books as the IV
R = 0.492, F = 12.1, df = 1, 28, p = 0.001
b0 = 52.1, b1 = 5.7
(The intercept makes sense)

Use attend as the IV
R = 0.482, F = 11.5, df = 1, 28, p = 0.002
b0 = 37.0, b1 = 1.9
(The intercept makes less sense)

[Figures: scatterplots of grade (30–100) against books, and grade against attend (5–21), each with its regression line.]

Problem
Use R² to give the proportion of shared variance
Books = 24%
Attend = 23%

So we have explained 24% + 23% = 47% of the variance?
NO!!!!!

Look at the correlation matrix

         BOOKS   ATTEND   GRADE
BOOKS    1.00
ATTEND   0.44    1.00
GRADE    0.49    0.48     1.00

The correlation of books and attend is (unsurprisingly) not zero
Some of the variance that books shares with grade is also shared by attend

I have access to 2 cars
My wife has access to 2 cars
Do we have access to four cars?
No. We need to know how many of my 2 cars are the same cars as her 2 cars

Similarly with regression
But we can do this with the residuals
Residuals are what is left after (say) books
See how much of the residual variance is explained by attend
Can use this new residual variance to calculate SSres, SStotal and SSreg

Well... almost.
This would give us correct values for the SS
But would not be correct for the slopes, etc.

It assumes that the variables have a causal priority
Why should attend have to take what is left from books?
Why should books have to take what is left by attend?

Use OLS again

Simultaneously estimate 2 parameters
b1 and b2
Ŷ = b0 + b1x1 + b2x2
x1 and x2 are IVs

Not trying to fit a line any more
Trying to fit a plane

Can solve iteratively
Closed form equations are better
But they are unwieldy

[Figures: a 3D scatterplot of y against x1 and x2 (2 points only), and the regression plane defined by b0, b1 and b2.]

(Really) Ridiculous Equations

$b_1 = \frac{\sum(y - \bar{y})(x_1 - \bar{x}_1)\sum(x_2 - \bar{x}_2)^2 - \sum(y - \bar{y})(x_2 - \bar{x}_2)\sum(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)}{\sum(x_1 - \bar{x}_1)^2 \sum(x_2 - \bar{x}_2)^2 - \left[\sum(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\right]^2}$

$b_2 = \frac{\sum(y - \bar{y})(x_2 - \bar{x}_2)\sum(x_1 - \bar{x}_1)^2 - \sum(y - \bar{y})(x_1 - \bar{x}_1)\sum(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)}{\sum(x_1 - \bar{x}_1)^2 \sum(x_2 - \bar{x}_2)^2 - \left[\sum(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\right]^2}$

$b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2$

The good news
There is an easier way

The bad news
It involves matrix algebra

The good news
We don't really need to know how to do it

The bad news
We need to know it exists

A Quick Guide to Matrix Algebra
(I will never make you do it again)

Very Quick Guide to Matrix Algebra
Why?
Matrices make life much easier in multivariate statistics
Some things simply cannot be done without them
Some things are much easier with them

If you can manipulate matrices
You can specify calculations very easily
e.g. A′A = the sum of squares of a column
It doesn't matter how long the column is
A scalar is a number
A scalar: 4

A vector is a row or column of numbers
A vector is described as rows × columns

A row vector: (4  8  7) — a 1 × 3 vector

A column vector: (4, 11)′ — a 2 × 1 vector

A number (scalar) is a 1 × 1 vector

A matrix is a rectangle of numbers, described as rows × columns

    2 6 5 7 8
    4 5 7 5 3
    1 5 2 7 8

Is a 3 × 5 matrix
Matrices are referred to with bold capitals

Correlation matrices and covariance matrices are special
They are square and symmetrical
Correlation matrix of books, attend and grade:

    1.00 0.44 0.49
    0.44 1.00 0.48
    0.49 0.48 1.00

Another special matrix is the identity matrix, I
A square matrix, with 1s on the diagonal and 0s off the diagonal

    I = 1 0 0 0
        0 1 0 0
        0 0 1 0
        0 0 0 1

Note that this is a correlation matrix with all correlations = 0

Matrix Operations
Transposition
A matrix is transposed by putting it on its side
The transpose of A is A′

    A = (7 5 6)        A′ = 7
                            5
                            6

Matrix multiplication
A matrix can be multiplied by a scalar, a vector or a matrix
Not commutative
AB ≠ BA
To multiply AB
The number of columns in A must equal the number of rows in B

Matrix by vector

    a d g   j     aj + dk + gl
    b e h   k  =  bj + ek + hl
    c f i   l     cj + fk + il

     2  3  5   2      4 +  9 + 20     33
     7 11 13   3  =  14 + 33 + 52  =  99
    17 19 23   4     34 + 57 + 92    183
Matrix by matrix

    a b   e f     ae + bg   af + bh
    c d   g h  =  ce + dg   cf + dh

    2 3   2 3      4 + 12    6 + 15     16 21
    5 7   4 5  =  10 + 28   15 + 35  =  38 50

Multiplying by the identity matrix
Has no effect
Like multiplying by 1

    AI = A

    2 3   1 0     2 3
    5 7   0 1  =  5 7

For a scalar J, the inverse is 1/J
J × 1/J = 1
Same with matrices
Matrices have an inverse
The inverse of A is A⁻¹
AA⁻¹ = I

Inverting matrices is dull
We will do it once
But first, we must calculate the determinant

The determinant of A is |A|
Determinants are important in statistics
(more so than the other matrix algebra)

We will do a 2 × 2
Much more difficult for larger matrices

    A = a b        |A| = ad − cb
        c d

    A = 1.0 0.3
        0.3 1.0

    |A| = 1 × 1 − 0.3 × 0.3 = 0.91

Determinants are important because
The determinant needs to be above zero for regression to work
A zero or negative determinant of a correlation/covariance matrix means something is wrong with the data
Linear redundancy

Described as:
"Not positive definite"
"Singular" (if the determinant is zero)
In different error messages

Next, the adjoint

    A = a b        adj A =  d −b
        c d                −c  a

Now

    A⁻¹ = (1/|A|) adj A

Find A⁻¹:

    A = 1.0 0.3        |A| = 0.91
        0.3 1.0

    A⁻¹ = (1/0.91)  1.0 −0.3  =   1.10 −0.33
                   −0.3  1.0     −0.33  1.10
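A sketch of the same operations in R, where solve() inverts a matrix and det() gives the determinant:

    A <- matrix(c(1.0, 0.3,
                  0.3, 1.0), nrow = 2, byrow = TRUE)
    det(A)           # 0.91
    solve(A)         # the inverse: entries 1.10 and -0.33
    A %*% solve(A)   # recovers the identity matrix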

Matrix Algebra with Correlation Matrices

Determinants
The determinant of a correlation matrix
The volume of space taken up by the (hyper)sphere that contains all of the points

    A = 1.0 0.0        |A| = 1.0
        0.0 1.0

[Figure: an uncorrelated cloud of points fills the whole space.]

    A = 1.0 1.0        |A| = 0.0
        1.0 1.0

[Figure: perfectly correlated points lie on a line, taking up no area.]

Negative Determinant
The points would take up less than no space
Such a correlation matrix cannot exist
A non-positive definite matrix

Sometimes Obvious

    A = 1.0 1.2        |A| = −0.44
        1.2 1.0

Sometimes Obvious (If You Think)

    A =  1    0.9  0.9
         0.9  1   −0.9
         0.9 −0.9  1

    |A| = −2.88

Sometimes No Idea

    A = 1.00  0.76  0.40
        0.76  1    −0.30
        0.40 −0.30  1

    |A| = −0.01

    A = 1.00  0.75  0.40
        0.75  1    −0.30
        0.40 −0.30  1

    |A| = 0.0075

Multiple R for Each Variable
The diagonal of the inverse of the correlation matrix
Is used to calculate multiple R
Call its elements aii

$R_{i.123...k} = \sqrt{1 - \frac{1}{a_{ii}}}$

Regression Weights
Where i is the DV and j is an IV:

$b_{i.j} = -\frac{a_{ij}}{a_{ii}}$

Back to the Good News

We can calculate the standardised parameters as
B = Rxx⁻¹ Rxy
Where
B is the vector of regression weights
Rxx⁻¹ is the inverse of the correlation matrix of the independent (x) variables
Rxy is the vector of correlations of the x variables with the y variable
Now do exercise 3.2
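A sketch of this in R, using the books/attend/grade correlations from earlier:

    Rxx <- matrix(c(1.00, 0.44,
                    0.44, 1.00), nrow = 2, byrow = TRUE)  # correlations among IVs
    Rxy <- c(0.49, 0.48)            # correlations of books and attend with grade
    B <- solve(Rxx) %*% Rxy         # B = Rxx^-1 Rxy
    B                               # ~ 0.35 (books), 0.33 (attend)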

One More Thing

The whole regression equation can be described with matrices, very simply:

Y = XB + E

Where
Y = the vector of the DV
X = the matrix of IVs
B = the vector of coefficients
E = the vector of errors

Go all the way back to our example

    Y    =    X     B   +   E

    45     1 0  9             e1
    57     1 1 15             e2
    45     1 0 10             e3
    51     1 2 16     b0      e4
    65  =  1 4 10  ×  b1  +   e5
    88     1 4 20     b2      e6
    44     1 1 11             e7
    87     1 4 20             e8
    89     1 3 15             e9
    59     1 0 15             e10

The same equation is repeated on the following slides, highlighting each piece in turn:

The first column of X is the constant: literally a constant. It could be any number, but it is most convenient to make it 1. It is used to capture the intercept.

The second and third columns of X are the matrix of values for the IVs (books and attend).

B holds the parameter estimates. We are trying to find the best values of these.

E is the error. We are trying to minimise this.

Y is the DV: grade.

Y = XB + E
A simple way of representing as many IVs as you like:

Yi = b0x0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + e

                                           b0
                                           b1
    y1     x01 x11 x21 x31 x41 x51         b2      e1
    y2  =  x02 x12 x22 x32 x42 x52    ×    b3  +   e2
                                           b4
                                           b5

$Y = b_0 x_0 + b_1 x_1 + ... + b_k x_k + e$

Generalises to the Multivariate Case
Y = XB + E
Where Y, B and E are matrices, not vectors

Goes beyond this course
(Do Jacques Tacq's course for more)
(Or read his book)

Lesson 6: More on Multiple Regression

Parameter Estimates
The parameter estimates (b1, b2 ... bk) were standardised
Because we analysed a correlation matrix

They represent the correlation of each IV with the DV
When all other IVs are held constant

They can also be unstandardised
Unstandardised estimates represent the unit change in the DV associated with a 1 unit change in the IV
When all the other variables are held constant

Parameters have standard errors associated with them
As with one IV
Hence a t-test, and its associated probability, can be calculated
Trickier than with one IV

Standard Error of a Regression Coefficient
The standardised one is easier:

$SE_i = \sqrt{\frac{1 - R_Y^2}{n - k - 1}} \times \sqrt{\frac{1}{1 - R_i^2}}$

R²i is the value of R² when all the other predictors are used as predictors of that variable
Note that if R²i = 0, the equation is the same as previously

Multiple R
The degree of prediction
R (or Multiple R)
No longer equal to b

R² might be equal to the sum of squares of the betas
But only if all the xs are uncorrelated

In Terms of Variance
Can also think of this in terms of variance explained
Each IV explains some variance in the DV
The IVs share some of their variance

We can't share the same variance twice

[Figure: Venn diagram. The total variance of Y = 1; x1 accounts for r²x1y = 0.36 of it, and x2 accounts for r²x2y = 0.36; the two circles do not overlap.]

In this model
R² = r²yx1 + r²yx2
R² = 0.36 + 0.36 = 0.72
R = √0.72 = 0.85

But
If x1 and x2 are correlated
This is no longer the case

[Figure: the same Venn diagram, but now the x1 and x2 circles overlap: some variance is shared between x1 and x2 (not equal to rx1x2).]

So
We can no longer sum the r²s
We need to sum them, and subtract the shared variance, i.e. the correlation

But
It's not simply the correlation between them
It's the correlation between them as a proportion of the variance of Y

Two different ways

Based on estimates:

$R^2 = \beta_1 r_{yx_1} + \beta_2 r_{yx_2}$

If rx1x2 = 0
βx1 = ryx1
Equivalent to r²yx1 + r²yx2

Based on correlations:

$R^2 = \frac{r_{yx_1}^2 + r_{yx_2}^2 - 2 r_{yx_1} r_{yx_2} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2}$

If rx1x2 = 0
Equivalent to r²yx1 + r²yx2
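A sketch in R showing that both routes agree, using the books/attend correlations:

    r_y1 <- 0.49; r_y2 <- 0.48; r_12 <- 0.44
    b <- solve(matrix(c(1, r_12, r_12, 1), 2)) %*% c(r_y1, r_y2)   # betas
    sum(b * c(r_y1, r_y2))                                         # R^2 ~ 0.327
    (r_y1^2 + r_y2^2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12^2)      # same ~ 0.327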

R² can also be calculated using methods we have seen
Based on PRE
Based on the correlation with the prediction

Same procedure with > 2 IVs

Adjusted R²
R² is an overestimate of the population value of R²
Any x will not correlate exactly 0 with Y
Any variation away from 0 increases R
Variation from 0 is more pronounced with lower N

Need to correct R²
Adjusted R²

Calculation of Adj. R²

$\text{Adj. } R^2 = 1 - (1 - R^2)\frac{N - 1}{N - k - 1}$

1 − R² is the proportion of unexplained variance
We multiply this by an adjustment
More variables: greater adjustment
More people: less adjustment

Shrunken R²
Some authors treat shrunken and adjusted R² as the same thing
Others don't

The adjustment factor: $\frac{N - 1}{N - k - 1}$

N = 20, k = 3: 19/16 = 1.1875

N = 10, k = 8: 9/1 = 9

N = 10, k = 3: 9/6 = 1.5
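As a one-line sketch in R (the example R² values are mine):

    adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)
    adj_r2(0.5, n = 20, k = 3)   # 0.41: modest shrinkage
    adj_r2(0.5, n = 10, k = 8)   # -3.5: massive shrinkage with tiny n and many IVs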

Extra Bits
Some stranger things that can happen
Counter-intuitive

Suppressor variables
Can be hard to understand
Very counter-intuitive

Definition
An independent variable which increases the size of the parameters associated with other independent variables above the size of their correlations

An example (based on Horst, 1941)
Success of trainee pilots
Mechanical ability (x1), verbal ability (x2), success (y)

Correlation matrix:

          Mech   Verb   Success
Mech      1      0.5    0.3
Verb      0.5    1      0
Success   0.3    0      1

Mechanical ability correlates 0.3 with success
Verbal ability correlates 0.0 with success
What will the parameter estimates be?
(Don't look ahead until you have had a guess)

Mechanical ability
b = 0.4
Larger than r!

Verbal ability
b = −0.2
Smaller than r!!

So what is happening?
You need verbal ability to do the test
But verbal ability is not related to mechanical ability
The measure of mechanical ability is contaminated by verbal ability

High mech, low verbal
High mech: this is positive
Low verbal: negative, because we are talking about standardised scores
Your mech is really high: you did well on the mechanical test without being good at the words

High mech, high verbal
Well, you had a head start on mech, because of verbal, and need to be brought down a bit

Another suppressor?

     x1    x2    y
x1   1     0.5   0.3
x2   0.5   1     0.2
y    0.3   0.2   1

b1 = ?
b2 = ?

Answer:
b1 = 0.27
b2 = 0.07

And another?

     x1    x2     y
x1   1     0.5    0.3
x2   0.5   1     −0.2
y    0.3  −0.2    1

b1 = ?
b2 = ?

Answer:
b1 = 0.53
b2 = −0.47

One more?

     x1     x2    y
x1   1     −0.5   0.3
x2  −0.5    1     0.2
y    0.3    0.2   1

b1 = ?
b2 = ?

Answer:
b1 = 0.53
b2 = 0.47
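All of these answers come from the correlations alone; a sketch in R using the two-predictor case of B = Rxx⁻¹Rxy:

    weights <- function(r_y1, r_y2, r_12) {
      Rxx <- matrix(c(1, r_12, r_12, 1), nrow = 2)
      drop(solve(Rxx) %*% c(r_y1, r_y2))       # standardised b1, b2
    }
    weights(0.3,  0.0,  0.5)   #  0.40 -0.20  (the Horst example)
    weights(0.3,  0.2,  0.5)   #  0.27  0.07
    weights(0.3, -0.2,  0.5)   #  0.53 -0.47
    weights(0.3,  0.2, -0.5)   #  0.53  0.47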

Suppression happens when two opposing forces are operating together
And have opposite effects

Don't throw away your IVs
Just because they are uncorrelated with the DV

Be careful in the interpretation of regression estimates
You really need the correlations too, to interpret what is going on
You cannot compare between studies with different IVs

Standardised Estimates > 1
Correlations are bounded
−1.00 ≤ r ≤ +1.00
We think of standardised regression estimates as being similarly bounded
But they are not

They can go > 1.00, < −1.00
R cannot, because that is a proportion of variance

Three measures of ability
Mechanical ability, verbal ability 1, verbal ability 2
Score on a science exam

          Mech   Verbal1   Verbal2   Scores
Mech      1      0.1       0.1       0.6
Verbal1   0.1    1         0.9       0.6
Verbal2   0.1    0.9       1         0.3
Scores    0.6    0.6       0.3       1

Before reading on, what are the parameter estimates?

Mech: 0.56
Verbal1: 1.71
Verbal2: −1.29

Mechanical
About where we expect

Verbal 1
Very high

Verbal 2
Very low

What is going on?
It's a suppressor again
"An independent variable which increases the size of the parameters associated with other independent variables above the size of their correlations"

Verbal 1 and verbal 2 are correlated so highly
They need to cancel each other out

Variable Selection
What are the appropriate independent variables to use in a model?
Depends what you are trying to do

Multiple regression has two separate uses
Prediction
Explanation

Prediction
What will happen in the future?
Emphasis on practical application
Variables selected (more) empirically
Value free

Explanation
Why did something happen?
Emphasis on understanding phenomena
Variables selected theoretically
Not value free

Visiting the doctor
Precedes suicide attempts
Predicts suicide
Does not explain suicide

More on causality later on
Which are the appropriate variables
To collect data on?
To include in the analysis?
The decision needs to be based on theoretical knowledge of the behaviour of those variables
Statistical analysis of those variables (later)
Unless you didn't collect the data
Common sense (not a useful thing to say)

Variable Entry Techniques

Entrywise
All variables entered simultaneously

Hierarchical
Variables entered in a predetermined order

Stepwise
Variables entered according to change in R²
Actually a family of techniques

Entrywise
All variables entered simultaneously
All treated equally

Hierarchical
Entered in a theoretically determined order
The change in R² is assessed, and tested for significance
e.g. sex and age
Should not be treated equally with other variables
Sex and age MUST come first

Not to be confused with hierarchical linear modelling
Stepwise
Variables entered empirically
The variable which increases R² the most goes first
Then the next

Variables which have no effect can be removed from the equation

Example
IVs: sex, age, extroversion
DV: car — how long someone spends looking after their car

Correlation Matrix

        SEX     AGE     EXTRO   CAR
SEX     1.00   -0.05    0.40    0.66
AGE    -0.05    1.00    0.40    0.23
EXTRO   0.40    0.40    1.00    0.67
CAR     0.66    0.23    0.67    1.00

Entrywise analysis
R² = 0.64

        b      p
SEX     0.49   <0.01
AGE     0.08   0.46
EXTRO   0.44   <0.01

Stepwise Analysis
The data determine the order
Model 1: Extroversion, R² = 0.450
Model 2: Extroversion + Sex, R² = 0.633

        b      p
EXTRO   0.48   <0.01
SEX     0.47   <0.01

Hierarchical analysis
Theory determines the order
Model 1: Sex + Age, R² = 0.510
Model 2: Sex, Age + Extroversion, R² = 0.638
Change in R² = 0.128, p = 0.001

        b      p
SEX     0.49   <0.01
AGE     0.08   0.46
EXTRO   0.44   <0.01

Which is the best model?
Entrywise: OK
Stepwise: excluded age
Which did have a (small) effect

Hierarchical
The change in R² gives the best estimate of the importance of extroversion

Other problems with stepwise
F and df are wrong (it cheats with df)
Unstable results
Small changes (sampling variance) lead to large differences in models

It uses a lot of paper
Don't use a stepwise procedure to pack your suitcase

Is Stepwise Always Evil?
Yes
All right, no. It may be defensible when:
The research goal is predictive (technological)
Not explanatory (scientific)
What happens, not why

N is large
40 people per predictor: Cohen, Cohen, Aiken, West (2003)

Cross-validation takes place

A quick note on R²
R² is sometimes regarded as the fit of a regression model
Bad idea

If good fit is required, we maximise R²
This leads to entering variables which do not make theoretical sense

Critique of Multiple Regression
Goertzel (2002)
"Myths of murder and multiple regression"
Skeptical Inquirer (Paper B1)

Econometrics and regression are "junk science"
Multiple regression models (in the US)
Used to guide social policy

More Guns, Less Crime (controlling for other factors)

Lott and Mustard: a 1% increase in gun ownership
→ a 3.3% decrease in murder rates

But:
More guns in the rural Southern US
More crime in the urban North (crack cocaine epidemic at the time of the data)

Executions Cut Crime
No difference between crime rates in US states with or without the death penalty
Ehrlich (1975) "controlled" all the variables that affect crime rates
The death penalty had an effect in reducing the crime rate

There is no statistical way to decide who's right

Legalised Abortion
Donohue and Levitt (1999)
Legalised abortion in the 1970s cut crime in the 1990s

Lott and Whitley (2001)
Legalising abortion increased murder rates by 0.5 to 7 per cent

It's impossible to model these data
Controlling for other historical events
Crack cocaine (again)

Another Critique
Berk (2003)
Regression Analysis: A Constructive Critique (Sage)

Three cheers for regression
As a descriptive technique

Two cheers for regression
As an inferential technique

One cheer for regression
As a causal analysis

Is Regression Useless?
Do regression carefully
Don't go beyond data which you have a strong theoretical understanding of

Validate models
Where possible, validate the predictive power of models in other areas, times, groups
Particularly important with stepwise

Lesson 7: Categorical Independent Variables

Introduction

Introduction
So far, we have just looked at continuous independent variables
It is also possible to use categorical (nominal, qualitative) independent variables
e.g. sex; job; religion; region; type (of anything)

Usually analysed with t-test/ANOVA

Historical Note
But these (t-test/ANOVA) are special cases of regression analysis
Aspects of General Linear Models (GLMs)

So why treat them differently?
Fisher's fault
Computers' fault

Regression, as we have seen, is computationally difficult
Matrix inversion and multiplication
Unfeasible without a computer

In the special cases where:
You have one categorical IV
Your IVs are uncorrelated

It is much easier to do it by partitioning sums of squares

These cases
Very rare in applied research
Very common in experimental research
Fisher worked at Rothamsted agricultural research station
You never have problems manipulating wheat, pigs, cabbages, etc.

In psychology
This led to a split between experimental psychologists and correlational psychologists
Experimental psychologists (until recently) would not think in terms of continuous variables

It is still (too) common to dichotomise a variable
"Too difficult" to analyse it properly
Equivalent to discarding 1/3 of your data

The Approach

The Approach
Recode the nominal variable
Into one, or more, variables which represent that variable

The names are slightly confusing
Some texts talk of "dummy coding" to refer to all of these techniques
Some (most) use "dummy coding" to refer to one of them
Most techniques have more than one name

If a variable has g possible categories, it is represented by g − 1 variables
Simplest case:
Smokes: yes or no
Variable 1 represents yes
Variable 2 would be redundant
If it isn't yes, it's no

The Techniques

We will examine two coding schemes
Dummy coding
For two groups
For > 2 groups

Effect coding
For > 2 groups

Then look at the analysis of change
Equivalent to ANCOVA
Pretest-posttest designs

Dummy Coding: 2 Groups
Also called simple coding by SPSS
A categorical variable with two groups
One group is chosen as a reference group
The other group is represented in a variable

e.g. 2 groups: Experimental (group 1) and Control (group 0)
Control is the reference group
The dummy variable represents the experimental group
Call this variable group1

For variable group1
1 = experimental, 0 = control

Original Category   New Variable (group1)
Exp                 1
Con                 0

Some data
Group is x, score is y

               Control Group   Experimental Group
Experiment 1   10              10
Experiment 2   10              20
Experiment 3   10              30

Control group = 0
Intercept = score on Y when x = 0
Intercept = mean of the control group

Experimental group = 1
b = change in Y when x increases 1 unit
b = difference between the experimental group and the control group

[Figure: for each experiment, the two group means joined by a line from x = 0 (control) to x = 1 (experimental); the gradient of the slope represents the difference between the means.]

Dummy Coding: 3+ Groups
With three groups the approach is similar
g = 3, therefore g − 1 = 2 variables are needed
3 groups:
Control
Experimental group 1
Experimental group 2

Recoded into two variables:

Original Category   Gp1   Gp2
Con                 0     0
Gp1                 1     0
Gp2                 0     1

Note: we do not need a 3rd variable
If we are not in group 1 or group 2, we MUST be in the control group
A 3rd variable would add no information
(What would happen to the determinant?)

F and its associated p
Tests the H0 that

$\mu_{g1} = \mu_{g2} = \mu_{g3}$

b1 and b2 and their associated p-values
Test the difference between each experimental group and the control group

To test the difference between the experimental groups
We need to rerun the analysis

One more complication
We have now run multiple comparisons
This increases α, i.e. the probability of a type I error

Need to correct for this
Bonferroni correction
Multiply the given p-values by two/three (depending on how many comparisons were made)

Effect Coding
Usually used for 3+ groups
Compares each group (except the reference group) to the mean of all groups
(Dummy coding compares each group to the reference group.)

Example with 5 groups
1 group selected as the reference group
Group 5

Each group (except the reference) has a variable
1 if the individual is in that group
0 if not
−1 if in the reference group

group   group_1   group_2   group_3   group_4
1        1         0         0         0
2        0         1         0         0
3        0         0         1         0
4        0         0         0         1
5       -1        -1        -1        -1

Examples
Dummy coding and effect coding
Group 1 chosen as the reference group each time

Data:
Group   Mean    SD
1       52.40   4.60
2       56.30   5.70
3       60.10   5.00
Total   56.27   5.88

Dummy:
Group   dummy2   dummy3
1       0        0
2       1        0
3       0        1

Effect:
Group   effect2   effect3
1       -1        -1
2        1         0
3        0         1

Dummy:
R = 0.543, F = 5.7, df = 2, 27, p = 0.009
b0 = 52.4
b1 = 3.9, p = 0.100
b2 = 7.7, p = 0.002

$b_0 = \bar{Y}_{g1}$, $b_1 = \bar{Y}_{g2} - \bar{Y}_{g1}$, $b_2 = \bar{Y}_{g3} - \bar{Y}_{g1}$

Effect:
R = 0.543, F = 5.7, df = 2, 27, p = 0.009
b0 = 56.27
b1 = 0.03, p = 0.980
b2 = 3.8, p = 0.007

$b_0 = \bar{G}$, $b_1 = \bar{Y}_{g2} - \bar{G}$, $b_2 = \bar{Y}_{g3} - \bar{G}$

(where Ḡ is the grand mean)
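A sketch of both schemes in R via contrasts. It assumes a data frame d with a 3-level factor 'group' and outcome 'y' (hypothetical names):

    d$group <- factor(d$group)
    dummy  <- lm(y ~ group, data = d)   # contr.treatment is R's default:
    coef(dummy)                         # intercept = mean(g1); slopes = g2-g1, g3-g1
    effect <- lm(y ~ group, data = d, contrasts = list(group = contr.sum))
    coef(effect)                        # intercept = unweighted grand mean;
                                        # slopes = deviations from it
    # Note: contr.sum codes the *last* level as -1, whereas the slides coded the
    # first group as -1, so the deviations shown are for different groups.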

In SPSS
SPSS provides two equivalent procedures for regression
Regression (which we have been using)
GLM (which we haven't)

GLM will:
Automatically code categorical variables
Automatically calculate interaction terms

GLM won't:
Give standardised effects
Give hierarchical R2 p-values
Allow you to not understand
350

ANCOVA and Regression

351

Test
(Which is a trick; but it's designed to make you think about it)

Use employee data.sav
Compare the pay rise (difference between salbegin and salary)
For ethnic minority and non-minority staff
What do you find?
352

ANCOVA and Regression
The dummy coding approach has one special use
In ANCOVA, for the analysis of change

Pre-test post-test experimental design
Control group and (one or more) experimental groups
Tempting to use a difference score + t-test / mixed design ANOVA
Inappropriate
353

Salivary cortisol levels
Used as a measure of stress
Not the absolute level, but the change in level over the day may be interesting

Test at: 9.00am, 9.00pm

Two groups
High stress group (cancer biopsy): Group 1
Low stress group (no biopsy): Group 0
354

        High Stress   Low Stress
AM      20.1          22.3
PM      6.8           11.8
Diff    13.3          10.5

Correlation of AM and PM = 0.493 (p = 0.008)
Has there been a significant difference in the rate of change of salivary cortisol?
3 different approaches
355

Approach 1: find the differences, do a t-test
t = 1.31, df = 26, p = 0.203

Approach 2: mixed ANOVA, look for the interaction effect
F = 1.71, df = 1, 26, p = 0.203
F = t2

Approach 3: regression (ANCOVA) based approach
356

IVs: AM and group
DV: PM
b1 (group) = 3.59, standardised b1 = 0.432, p = 0.01

Why is the regression approach better?
The other two approaches took the difference
This assumes that r = 1.00
Any difference from r = 1.00 and you add error variance
Subtracting error is the same as adding error
357
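A minimal sketch of the regression (ANCOVA) approach in R, with simulated cortisol-like data (all numbers made up):

set.seed(2)
n     <- 14
group <- rep(c(0, 1), each = n)            # 0 = low stress, 1 = high stress
am    <- rnorm(2 * n, mean = 21, sd = 3)   # 9am cortisol
pm    <- 10 - 4 * group + 0.5 * am + rnorm(2 * n, sd = 2)  # 9pm cortisol
summary(lm(pm ~ am + group))  # group slope = adjusted group difference at 9pm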

Using regression
Ensures that all the variance that is subtracted is true
Reduces the error variance

Two effects
Adjusts the means
Compensates for differences between groups
Removes error variance

358

In SPSS
SPSS automates all of this
But you have to understand it, to
know what it is doing

Use Analyse, GLM, Univariate


ANOVA

359

[Screenshot: GLM Univariate dialog. The outcome goes here; categorical predictors here; continuous predictors here. Click Options.]

360

[Screenshot: Options dialog. Select Parameter estimates.]

361

More on Change
If the difference score is correlated with either the pre-test or the post-test
Subtraction fails to remove the difference between the scores
If the two scores are uncorrelated
The difference will be correlated with both
Failure to control

Equal SDs, r = 0:
Correlation of change and pre-score = −0.707
362

Even More on Change
A topic of surprising complexity
What I said about difference scores isn't always true
Lord's paradox: it depends on the precise question you want to answer

Collins and Horn (1993). Best methods for the analysis of change.
Collins and Sayer (2001). New methods for the analysis of change.
363

Lesson 8: Assumptions in
Regression Analysis

364

The Assumptions
1. The distribution of residuals is normal (at each value of the dependent variable).
2. The variance of the residuals for every set of values for the independent variable is equal. Violation is called heteroscedasticity.
3. The error term is additive: no interactions.
4. At every value of the dependent variable the expected (mean) value of the residuals is zero.
365

5. The expected correlation between residuals, for any two cases, is 0.
The independence assumption (lack of autocorrelation).
6. All independent variables are uncorrelated with the error term.
7. No independent variables are a perfect linear function of other independent variables (no perfect multicollinearity).
8. The mean of the error term is zero.

366

What are we going to do
Deal with some of these assumptions in some detail
Deal with others in passing only
Look at them again later on

367

Assumption 1: The
Distribution of Residuals is
Normal at Every Value of
the Dependent Variable

368

Look at Normal Distributions
A normal distribution:
symmetrical, bell-shaped (so they say)

369

What can go wrong?
Skew
non-symmetry: one tail longer than the other

Kurtosis
too flat or too peaked: "kurtosed"

Outliers
individual cases which are far from the distribution
370

Effects on the Mean
Skew
biases the mean, in the direction of the skew

Kurtosis
mean not biased
standard deviation is
and hence standard errors, and significance tests

371

Examining Univariate
Distributions

Histograms
Boxplots
P-P plots
Calculation based methods

372

Histograms
[Histograms of distributions A and B]
373

[Histograms of distributions C and D]
374

[Histograms of distributions E and F]
375

Histograms can be tricky…
[Histograms illustrating the point]
376

Boxplots

377

P-P Plots
[P-P plots for distributions A and B]
378

[P-P plots for distributions C and D]
379

[P-P plots for distributions E and F]
380

Calculation Based
Skew and Kurtosis statistics
Outlier detection statistics

381

Skew and Kurtosis Statistics
Normal distribution
skew = 0
kurtosis = 0

Two methods for calculation
Fisher's and Pearson's
Very similar answers

Associated standard error
can be used to test the significance of the departure from normality
not actually very useful
above N = 400 or so, the test is almost never non-significant
382

     Skewness   SE Skew   Kurtosis   SE Kurt
A    -0.12      0.172     -0.084     0.342
B    0.271      0.172     0.265      0.342
C    0.454      0.172     1.885      0.342
D    0.117      0.172     -1.081     0.342
E    2.106      0.172     5.75       0.342
F    0.171      0.172     -0.21      0.342
383

Outlier Detection
Calculate the distance from the mean
z-score (number of standard deviations)
deleted z-score
that case biased the mean, so remove it first

Look up the expected distance from the mean
e.g. well under 1% of cases should lie 3+ SDs from the mean

Calculate influence
how much effect did that case have on the mean?
384

Non-Normality in
Regression

385

Effects on OLS Estimates
The mean is an OLS estimate
The regression line is an OLS estimate
Lack of normality
biases the position of the regression slope
makes the standard errors wrong
probability values attached to statistical significance wrong

Checks on Normality
Check the residuals are normally distributed
SPSS will draw a histogram and p-p plot of the residuals

Use regression diagnostics
Lots of them
Most aren't very interesting
387

Regression Diagnostics
Residuals
standardised, unstandardised, studentised, deleted, studentised-deleted
look for cases > |3| (?)

Influence statistics
Look at the effect a case has
If we remove that case, do we get a different answer?
DFBeta, Standardised DFBeta
changes in b
388

DfFit, Standardised DfFit
change in the predicted value

Covariance ratio
the ratio of the determinants of the covariance matrices, with and without the case

Distances
measures of distance from the centroid
some include the IV, some don't
389

More on Residuals
Residuals are trickier than you might have imagined
Raw residuals
OK

Standardised residuals
Residuals divided by the SD of the residuals:

s_e = √( Σe² / (n − k − 1) )
390

Leverage
But…
That SD is wrong
The variance of the residuals is not equal
Those further from the centroid on the predictors have higher variance
Need a measure of this

Distance from the centroid is leverage, or h (or sometimes hii)
One predictor
Easy:
391

h_i = 1/n + (x_i − x̄)² / Σ(x − x̄)²

The minimum h_i is 1/n; the maximum is 1
Except…
SPSS uses standardised leverage, h*
It doesn't tell you this, it just uses it

392

h*_i = h_i − 1/n = (x_i − x̄)² / Σ(x − x̄)²

Minimum 0, maximum (N − 1)/N

393

Multiple predictors
Calculate the hat matrix (H)
Leverage values are the diagonals of this matrix

H = X (X′X)⁻¹ X′

Where X is the augmented matrix of predictors (i.e. the matrix that includes the constant)
Hence leverage h_ii = element ii of H
394

Example of calculation of the hat matrix
X is the augmented matrix: a column of 1s (the constant) and a column of predictor values (15, 20, …, 65):

H = X (X′X)⁻¹ X′

The diagonal elements of H are the leverages, e.g. h₁₁ = 0.318, h₂₂ = 0.236, …; the off-diagonal elements (e.g. 0.273) are not needed.
395
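The same calculation sketched in R, with made-up predictor values (the hat matrix does not depend on y):

set.seed(3)
x <- c(15, 20, 25, 30, 35, 40, 45, 50, 55, 65)   # hypothetical predictor
y <- rnorm(length(x))                            # hypothetical outcome
X <- cbind(1, x)                                 # augmented predictor matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)            # H = X (X'X)^-1 X'
diag(H)                                          # leverages h_ii
hatvalues(lm(y ~ x))                             # the same values, from the model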

Standardised / Studentised
Now we can calculate the standardised residuals
SPSS calls them studentised residuals
Also called internally studentised residuals

e′_i = e_i / ( s_e √(1 − h_i) )
396

Deleted Studentised Residuals
Studentised residuals do not have a known distribution
Cannot use them for inference

Deleted studentised residuals
Externally studentised residuals
Jackknifed residuals
Distributed as t
With df = N − k − 1
397

Testing Significance
We can calculate the probability of a residual
Is it sampled from the same population?

BUT
Massive Type I error rate
Bonferroni correct it
Multiply the p-value by N
398
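A sketch of that test in R, on simulated data (the df convention follows the slides, counting all estimated coefficients):

set.seed(4)
x <- rnorm(100); y <- 0.5 * x + rnorm(100)
m  <- lm(y ~ x)
rs <- rstudent(m)                     # deleted (externally) studentised residuals
p  <- 2 * pt(abs(rs), df = nobs(m) - length(coef(m)) - 1, lower.tail = FALSE)
head(sort(pmin(p * nobs(m), 1)))      # Bonferroni: multiply p by N, cap at 1
# car::outlierTest(m) reports the same idea, if the car package is installed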

Bivariate Normality
We didn't just say residuals normally distributed
We said at every value of the dependent variables
Two variables can be normally distributed univariate, but not bivariate
399

Couples' IQs
male and female
[Histograms of FEMALE and MALE IQ, 60-140]

Seem reasonably normal
400

But wait!!
[Scatterplot of MALE against FEMALE IQ, 40-160: one case sits far from the rest]
401

When we look at bivariate normality
not normal: there is an outlier

So plot X against Y
OK for bivariate
but there may be a multivariate outlier
We would need to draw the graph in 3+ dimensions
and we can't draw a graph in more than 3 dimensions

But we can look at the residuals instead
402

IQ: histogram of residuals
[Histogram of the residuals]
403

Multivariate Outliers
Will be explored later in the
exercises
So we move on

404

What to do about Non-Normality
Skew and Kurtosis
Skew much easier to deal with
Kurtosis less serious anyway

Transform the data
removes skew
positive skew: log transform
negative skew: square
405

Transformation
May need to transform IV and/or DV
More often the DV
time, income, symptoms (e.g. depression) are all positively skewed

Can cause non-linear effects (more later) if only one is transformed
Alters the interpretation of the unstandardised parameter
May alter the meaning of the variable
May add / remove non-linear and moderator effects
406
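A minimal sketch in R of taming positive skew with a log transform (salary values made up):

set.seed(5)
salary <- rexp(200, rate = 1/20000)   # positively skewed, hypothetical
hist(salary)                          # long right tail
hist(log(salary))                     # much more symmetric
# m <- lm(log(salary) ~ educ)  # note: the slope is now in log-salary units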

Change measures
increase sensitivity at ranges
avoiding floor and ceiling effects

Outliers
Can be tricky
Why did the outlier occur?
Error? Delete them.
Weird person? Probably delete them.
Normal person? Tricky.
407

You are trying to model a process
is the data point outside the process?
e.g. lottery winners, when looking at salary
a yawn, when looking at reaction time

Which is better?
A good model, which explains 99% of your data?
A poor model, which explains all of it?

Pedhazur and Schmelkin (1991): analyse the data twice
408

We will spend much less time on the other 6 assumptions
Can do exercise 8.1.

409

Assumption 2: The variance of the residuals for every set of values for the independent variable is equal.

410

Heteroscedasticity
This assumption is about heteroscedasticity of the residuals
Hetero = different
Scedastic = scattered

We don't want heteroscedasticity
we want our data to be homoscedastic

Draw a scatterplot to investigate
411

[Scatterplot of MALE against FEMALE, 40-160]
412

Only works with one IV
would need every combination of IVs

Easy to get: use predicted values
and use the residuals there

Plot predicted values against residuals
or standardised residuals
or deleted residuals
or standardised deleted residuals
or studentised residuals

A bit like turning the scatterplot on its side
413

[Residuals against predicted values: good, no heteroscedasticity]
414

[Residuals against predicted values: bad, heteroscedasticity]
415

Testing Heteroscedasticity
White's test
Not automatic in SPSS (it is in SAS)
Luckily, not hard to do:
1. Do the regression, save the residuals
2. Square the residuals
3. Square the IVs
4. Calculate the interactions of the IVs, e.g. x1x2, x1x3, x2x3
5. Run a regression using the squared residuals as the DV; the IVs, squared IVs, and interactions as the IVs
6. Test statistic = N × R², distributed as χ², df = k (for the second regression)

Use education and salbegin to predict salary (employee data.sav)
R² = 0.113, N = 474, χ² = 53.5, df = 5, p < 0.0001
417
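Those six steps, sketched in R on simulated heteroscedastic data (all values made up):

set.seed(6)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- x1 + x2 + rnorm(100, sd = exp(x1 / 2))   # error variance grows with x1
m  <- lm(y ~ x1 + x2)
e2 <- residuals(m)^2                            # steps 1-2: squared residuals
aux  <- lm(e2 ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))  # steps 3-5
stat <- nobs(aux) * summary(aux)$r.squared      # step 6: N x R-squared
pchisq(stat, df = 5, lower.tail = FALSE)        # chi-squared, df = k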

Plot of Pred and Res
[Standardised residuals against standardised predicted values]
418

Magnitude of Heteroscedasticity
Chop the data into slices
5 slices, based on X (or the predicted score)
Done in SPSS

Calculate the variance of each slice
Check the ratio of smallest to largest
Less than 10:1: OK
419

The Visual Bander
New in SPSS 12
420

Variances of the 5 groups
1: 0.219
2: 0.336
3: 0.757
4: 0.751
5: 3.119

We have a problem
3 / 0.2 ≈ 15
421

Dealing with Heteroscedasticity
Use Huber-White estimates
Very easy in Stata
Fiddly in SPSS: a bit of a hack

Use Complex Samples:
1. Create a new variable where all cases are equal to 1; call it const
2. Use Complex Samples, Prepare for Analysis
3. Create a plan file
4. Sample weight is const
5. Finish
6. Use Complex Samples, GLM
7. Use the plan file created, and set up the model as in GLM
(More on complex samples later)

In Stata, do the regression as normal, and click robust.
423
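In R, a sketch of the Huber-White correction, assuming the sandwich and lmtest packages are installed (data simulated):

library(sandwich)
library(lmtest)
set.seed(7)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200, sd = 1 + abs(x))   # heteroscedastic errors
m <- lm(y ~ x)
coeftest(m, vcov = vcovHC(m, type = "HC1"))      # robust SEs and p-values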

Heteroscedasticity: Implications and Meanings
Implications
What happens as a result of heteroscedasticity?
Parameter estimates are correct
not biased

Standard errors (hence p-values) are incorrect
424

However
If there is no skew in the predicted scores
P-values a tiny bit wrong

If skewed
P-values very wrong

Can do exercise
425

Meaning
What is heteroscedasticity trying to tell us?
Our model is wrong: it is misspecified
Something important is happening that we have not accounted for

e.g. amount of money given to charity (given)
depends on:
earnings
degree of importance the person assigns to the charity (import)
426

Do the regression analysis
R² = 0.60, F = 31.4, df = 2, 37, p < 0.001
seems quite good

b0 = 0.24, p = 0.97
b1 = 0.71, p < 0.001
b2 = 0.23, p = 0.031

White's test
χ² = 18.6, df = 5, p = 0.002

The plot of predicted values against residuals…
427

The plot shows a heteroscedastic relationship
[Residuals against predicted values]
428

Which means…
the effects of the variables are not additive
If you think that what a charity does is important
you might give more money
how much more depends on how much money you have
429

[Scatterplot of GIVEN against IMPORT, with separate lines for high and low earnings]
430

One more thing about heteroscedasticity
it is the equivalent of homogeneity of variance in ANOVA/t-tests

431

Assumption 3: The Error Term is Additive

432

Additivity
What heteroscedasticity shows you
effects of variables need to be additive

Heteroscedasticity doesn't always show it to you
can test for it, but hard work
(same as the homogeneity of covariance assumption in ANCOVA)

Have to know it from your theory
A specification error
433

Additivity and Theory
Two IVs
Alcohol has a sedative effect
A bit makes you a bit tired
A lot makes you very tired

Some painkillers have a sedative effect
A bit makes you a bit tired
A lot makes you very tired

A bit of alcohol and a bit of painkiller doesn't make you a bit tired
The effects multiply together, they don't add together
434

If you don't test for it
It's very hard to know that it will happen

So many possible non-additive effects
Cannot test for all of them
Can test for the obvious ones

In medicine
Choose to test for salient non-additive effects
e.g. sex, race
435

Assumption 4: At every
value of the dependent
variable the expected
(mean) value of the
residuals is zero

436

Linearity
Relationships between variables should be linear
best represented by a straight line

Not a very common problem in the social sciences
except economics
measures are not sufficiently accurate to make a difference
R² too low
unlike, say, physics
437

Relationship between speed of travel and fuel used
[Plot of fuel used against speed]
438

R² = 0.938
looks pretty good
know speed, make a good prediction of fuel

BUT
look at the chart
if we know speed we can make a perfect prediction of fuel used
R² should be 1.00
439

Detecting Non-Linearity
Residual plot
just like heteroscedasticity

Using this example
very, very obvious
usually pretty obvious
440

Residual plot
[Residuals against predicted values for the fuel example]
441

Linearity: A Case of Additivity
Linearity = additivity along the range of the IV
Jeremy rides his bicycle harder
The increase in speed depends on the current speed
Not additive: multiplicative
MacCallum and Mar (1995). Distinguishing between moderator and quadratic effects in multiple regression. Psychological Bulletin.

442

Assumption 5: The expected correlation between residuals, for any two cases, is 0.
The independence assumption (lack of autocorrelation)

443

Independence Assumption
Also: lack of autocorrelation
A tricky one
often ignored
exists for almost all tests

All cases should be independent of one another
knowing the value of one case should not tell you anything about the value of other cases
444

How is it Detected?
Can be difficult
need some clever statistics (multilevel models)

Better off avoiding situations where it arises
Residual plots
Durbin-Watson test
445

Residual Plots
Were the data collected in time order?
If so, plot ID number against the residuals
Look for any pattern
Test for a linear relationship
Non-linear relationship
Heteroscedasticity
446

[Plot of residuals against participant number]
447

How does it arise?
Two main ways

Time-series analyses
When cases are time periods
weather on Tuesday and weather on Wednesday are correlated
inflation 1972 and inflation 1973 are correlated

Clusters of cases
patients treated by three doctors
children from different classes
people assessed in groups
448

Why does it matter?
Standard errors can be wrong
therefore significance tests can be wrong

Parameter estimates can be wrong
really, really wrong
from positive to negative

An example
students do an exam (on statistics)
choose one of three questions
IV: time
DV: grade
449

Result, with line of best fit
[Scatterplot of grade against time, with line of best fit]
450

The result shows that
people who spent longer in the exam achieve better grades

BUT
we haven't considered which question people answered
we might have violated the independence assumption
the DV will be autocorrelated

Look again
with questions marked
451

Now somewhat different
[Scatterplot of grade against time, marked by question]
452

Now, people that spent longer got lower grades
the questions differed in difficulty
do a hard one, get a better grade
if you can do it, you can do it quickly

Very difficult to analyse well
need multilevel models
453

Durbin-Watson Test
Not well implemented in SPSS
Depends on the order of the data
Reorder the data, get a different result

Doesn't give the statistical significance of the test
454

Assumption 6: All
independent variables are
uncorrelated with the
error term.

455

Uncorrelated with the Error Term
A curious assumption
by definition, the residuals are uncorrelated with the independent variables (try it and see, if you like)

It is about the error in the population model
whatever is left out of the model must have no effect (once the IVs have been removed) on the DV
456

A problem in economics
Demand increases supply
Supply increases wages
Higher wages increase demand

OLS estimates will be (badly) biased in this case
need a different estimation procedure
two-stage least squares
simultaneous equation modelling
457

Assumption 7: No
independent variables are
a perfect linear function
of other independent
variables
no perfect multicollinearity

458

No Perfect Multicollinearity
IVs must not be linear functions of one another
the matrix of correlations of the IVs is not positive definite
cannot be inverted
analysis cannot proceed

Have seen this with
age, age started, time working
also occurs with subscale and total
459

Large amounts of collinearity
a problem (as we shall see), sometimes
but not an assumption
460

Assumption 8: The mean of


the error term is zero.

You will like this one.

461

Mean of the Error Term = 0
Mean of the residuals = 0
That is what the constant is for
if the mean of the error term deviates from zero, the constant soaks it up:

Y = β0 + β1x1 + ε
Y = (β0 + 3) + β1x1 + (ε − 3)

Note: Greek letters because we are talking about population values
462

Can do regression without the constant
Usually a bad idea
e.g. R² = 0.995, p < 0.001
Looks good…
463

[Plot of y against x for the no-constant regression]
464
465

Lesson 9: Issues in Regression Analysis
Things that alter the interpretation of the regression equation
466

The Four Issues

Causality
Sample sizes
Collinearity
Measurement error

467

Causality

468

What is a Cause?
Debate about the definition of cause
some statistics (and philosophy) books try to avoid it completely
We are not going into depth
just going to show why it is hard

Two dimensions of cause
Ultimate versus proximal cause
Determinate versus probabilistic
469

Proximal versus Ultimate
Why am I here?
I walked here because…
This is the location of the class because…
Eric Tanenbaum asked me because… (I don't know)
because I was in my office when he rang because…
I am a lecturer at York because…
I saw an advert in the paper because…
470

I exist because…
My parents met because…
My father had a job…

Proximal cause
the direct and immediate cause of something

Ultimate cause
the thing that started the process off
I fell off my bicycle because of the bump
I fell off because I was going too fast
471

Determinate versus Probabilistic Cause
Why did I fall off my bicycle?
I was going too fast
But every time I ride too fast, I don't fall off
Probabilistic cause

Why did my tyre go flat?
A nail was stuck in my tyre
Every time a nail sticks in my tyre, the tyre goes flat
Deterministic cause
472

Can get into trouble by mixing them together
Eating deep fried Mars Bars and doing no exercise are causes of heart disease
My Grandad ate three deep fried Mars Bars every day, and the most exercise he ever got was when he walked to the shop next door to buy one
(Deliberately?) confusing deterministic and probabilistic causes
473

Criteria for Causation


Association
Direction of Influence
Isolation

474

Association
Correlation does not mean causation
we all know

But
Causation does mean correlation

Need to show that two things are related
may be correlation
may be regression when controlling for a third (or more) factor
475

Relationship between price and sales
suppliers may be cunning
when people want it more, stick the price up

         Price   Demand   Sales
Price    1       0.6      0
Demand   0.6     1        0.6
Sales    0       0.6      1

So no relationship between price and sales…
476

Until (of course) we control for demand
b1 (Price) = -0.56
b2 (Demand) = 0.94

But which variables do we enter?
477

Direction of Influence
Relationship between A and B
three possible processes:
A causes B
B causes A
C causes both A and B
478

How do we establish the direction of influence?
Longitudinally?
The barometer drops, then the storm comes

Now if we could just get that barometer needle to stay where it is…

Where the role of theory comes in (more on this later)
479

Isolation
Isolate the dependent variable from all other influences
as experimenters try to do

Cannot do this
but we can statistically isolate the effect, using multiple regression
480

Role of Theory
Strong theory is crucial to making causal statements
Fisher said: to make causal statements, make your theories elaborate
don't rely purely on statistical analysis

Need strong theory to guide analyses
481

What critics of non-experimental research say:

S.J. Gould, a critic
says: correlate the price of petrol and his age, for the last 10 years
find a correlation
Ha! (he says) that doesn't mean there is a causal link
Of course not! (we say)
No social scientist would do that analysis without first thinking (very hard) about the possible causal relations between the variables of interest
Would control for time, prices, etc.
482

Atkinson, et al. (1996)
relationship between college grades and number of hours worked
negative correlation
Need to control for other variables
ability, intelligence

Gould says "Most correlations are non-causal" (1982, p243)
Of course!!!!
483

"I drink a lot of beer" has 16 causal relations, and 120 non-causal correlations:
laugh, toilet, jokes (about statistics), vomit, karaoke, curtains closed, sleeping, headache, equations (beermat), thirsty, fried breakfast, no beer, curry, chips, falling over, lose keys
484

Abelson (1995) elaborates on this
the method of signatures

A collection of correlations relating to the process
the signature of the process

e.g. tobacco smoking and lung cancer
can we account for all of these findings with any other theory?
485

1. The longer a person has smoked cigarettes, the greater the risk of cancer.
2. The more cigarettes a person smokes over a given time period, the greater the risk of cancer.
3. People who stop smoking have lower cancer rates than do those who keep smoking.
4. Smokers' cancers tend to occur in the lungs, and be of a particular type.
5. Smokers have elevated rates of other diseases.
6. People who smoke cigars or pipes, and do not usually inhale, have abnormally high rates of lip cancer.
7. Smokers of filter-tipped cigarettes have lower cancer rates than other cigarette smokers.
8. Non-smokers who live with smokers have elevated cancer rates.
486
(Abelson, 1995: 183-184)

In addition, there should be no anomalous correlations
If smokers had more fallen arches than non-smokers, that would not be consistent with the theory

Failure to use theory to select appropriate variables
specification error
e.g. in the previous example
Predict wealth from price and sales
increase price, wealth increases
increase sales, wealth increases
487

Sometimes these are indicators of the process
e.g. the barometer: stopping the needle won't help
e.g. inflation? Indicator or cause?
488

No Causation without Experimentation
Blatantly untrue
I don't doubt that the sun shining makes us warm

Why the aversion?
Pearl (2000) says the problem is that there was no mathematical operator
No one realised that you needed one
Until you build a robot
489

AI and Causality
A robot needs to make judgements about causality
Needs to have a mathematical representation of causality
Suddenly, a problem!
It doesn't exist
Most operators are non-directional
Causality is directional
490

Sample Sizes
How many subjects does it
take to run a regression
analysis?
491

Introduction
Social scientists don't worry enough about the sample size required
"Why didn't you get a significant result?"
"I didn't have a large enough sample"
Not a common answer

More recently, awareness of sample size is increasing
use too few: no point doing the research
use too many: waste their time
492

Research funding bodies
Ethical review panels
both have become more interested in sample size calculations

We will look at two approaches
Rules of thumb (quite quickly)
Power analysis (more slowly)
493

Rules of Thumb
Lots of simple rules of thumb exist
10 cases per IV
>100 cases
Green (1991) is more sophisticated
To test the significance of R²: N = 50 + 8k
To test the significance of slopes: N = 104 + k

Rules of thumb don't take into account all the information that we have
Power analysis does
494

Power Analysis
Introducing Power Analysis
Hypothesis test
tells us the probability of a result of that magnitude occurring, if the null hypothesis is correct (i.e. there is no effect in the population)

Doesn't tell us
the probability of that result, if the null hypothesis is false

According to Cohen (1982) all null hypotheses are false
everything that might have an effect, does have an effect
it is just that the effect is often very tiny
496

Type I Errors
A Type I error is the false rejection of H0
Probability of making a Type I error: α
the significance value cut-off, usually 0.05 (by convention)

Always this value
Not affected by
sample size
type of test
497

Type II Errors
A Type II error is the false acceptance of the null hypothesis
Much, much trickier

We think we have some idea
we almost certainly don't

Example
I do an experiment (random sampling, all assumptions perfectly satisfied)
I find p = 0.05
498

You repeat the experiment exactly
a different random sample from the same population

What is the probability you will find p < 0.05?

Another experiment: I find p = 0.01
Probability you find p < 0.05?

Very hard to work out
not intuitive
need to understand non-central sampling distributions (more in a moment)
499

Probability of a Type II error = beta (β)
the same symbol as a population regression parameter (to be confusing)

Power = 1 − β
Probability of getting a significant result
500

State of the World

Research findings                H0 true                  H0 false
                                 (no effect to be found)  (effect to be found)
We find no effect (p > 0.05)     correct                  Type II error: p = β
We find an effect (p < 0.05)     Type I error: p = α      correct: power = 1 − β
501

Four parameters in power analysis
α: prob. of Type I error
β: prob. of Type II error (power = 1 − β)
Effect size: size of the effect in the population
N

Know any three, can calculate the fourth
Look at them one at a time
502

Probability of Type I error (α)
Usually set to 0.05
Somewhat arbitrary
sometimes adjusted because of circumstances
rarely because of power analysis

May want to adjust it, based on power analysis
503

Probability of Type II error (β)
Power (probability of finding a result) = 1 − β
Standard is 80%
Some argue for 90%

The implication is that a Type I error is 4 times more serious than a Type II error
adjust the ratio with a compromise power analysis
504

Effect size in the population
Most problematic to determine
Three ways
1. What effect size would be useful to find?
R² = 0.01: no use (probably)
2. Base it on previous research
what have other people found?
3. Use Cohen's conventions
small: R² = 0.02
medium: R² = 0.13
large: R² = 0.26
505

Effect size usually measured as f²
For R²:

f² = R² / (1 − R²)
506

For (standardised) slopes:

f² = sr²ᵢ / (1 − R²)

Where sr² is the contribution to the variance accounted for by the variable of interest
i.e. sr² = R² (with the variable) − R² (without)
the change in R² in hierarchical regression
507

N: the sample size
usually use the other three parameters to determine this
sometimes adjust the other parameters (α) based on it
e.g. "You can have 50 participants. No more."
508

Doing power analysis
With a power analysis program
SamplePower, GPower, Nquery

With SPSS MANOVA
using non-central distribution functions
Uses MANOVA syntax
Relies on the fact you can do anything with MANOVA
Paper B4
509
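In R, a sketch using the pwr package (assuming it is installed); the numbers use the medium-effect convention above:

library(pwr)
f2 <- 0.13 / (1 - 0.13)                  # medium effect: R-squared = 0.13
pwr.f2.test(u = 3, v = NULL, f2 = f2,    # u = number of predictors
            sig.level = 0.05, power = 0.80)
# v is the denominator df; the required N is roughly v + u + 1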

Underpowered Studies
Research in the social sciences is often underpowered
Why?
See Paper B11: the persistence of underpowered studies
510

Extra Reading
Power traditionally focuses on p values
What about CIs?
Paper B8: Obtaining regression coefficients that are accurate, not simply significant
511

Collinearity

512

Collinearity as Issue and Assumption
Collinearity (multicollinearity)
the extent to which the independent variables are (multiply) correlated

If R² for any IV, using the other IVs, = 1.00
perfect collinearity
the variable is a linear sum of other variables
regression will not proceed
(SPSS will arbitrarily throw out a variable)
513

R² < 1.00, but high
other problems may arise

Four things to look at in collinearity
meaning
implications
detection
actions
514

Meaning of Collinearity
Literally co-linearity
lying along the same line

Perfect collinearity
when some IVs predict another
Total = S1 + S2 + S3 + S4
S1 = Total − (S2 + S3 + S4)
rare
515

Less than perfect
when some IVs come close to predicting the others
correlations between IVs are high (usually, but not always)

516

Implications
Affects the stability of the parameter estimates
and so the standard errors of the parameter estimates
and so the significance

Because
there is shared variance, which the regression procedure doesn't know where to put

Red cars have more accidents than other coloured cars
because of the effect of being in a red car?
because of the kind of person that drives a red car?
we don't know

No way to distinguish between these three:
Accidents = 1 × colour + 0 × person
Accidents = 0 × colour + 1 × person
Accidents = 0.5 × colour + 0.5 × person
518

Sex differences
due to genetics?
due to upbringing?
(almost) perfect collinearity
statistically impossible to tell
519

When collinearity is less than perfect
it increases the variability of estimates between samples
estimates are unstable
reflected in the variances, and hence the standard errors
520

Detecting Collinearity
Look at the parameter estimates
large standardised parameter estimates (>0.3?), which are not significant
be suspicious

Run a series of regressions
each IV as DV
all other IVs as IVs
for each IV
521

Sounds like hard work?
SPSS does it for us!

Ask for collinearity diagnostics
Tolerance is calculated for every IV:

Tolerance = 1 − R²

Variance Inflation Factor
its square root is the amount the s.e. has been increased:

VIF = 1 / Tolerance
522
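The same quantities in R, with simulated collinear predictors (car::vif is used only if that package is installed):

set.seed(8)
x1 <- rnorm(100); x2 <- x1 + rnorm(100, sd = 0.3); x3 <- rnorm(100)
y  <- x1 + x3 + rnorm(100)
m  <- lm(y ~ x1 + x2 + x3)
tol <- 1 - summary(lm(x1 ~ x2 + x3))$r.squared   # tolerance for x1
1 / tol                                          # VIF for x1
# car::vif(m) gives the VIF for every predictor at once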

Actions
What you can do about collinearity
"no quick fix" (Fox, 1991)

1. Get new data
avoids the problem
address the question in a different way
e.g. find people who have been raised as the wrong gender
they exist, but are rare
Not a very useful suggestion
523

2. Collect more data
not different data, more data
collinearity increases the standard error (se)
se decreases as N increases
so get a bigger N

3. Remove / combine variables
If an IV correlates highly with the other IVs
it is not telling us much new
If you have two (or more) IVs which are very similar
e.g. 2 measures of depression, socio-economic status, achievement, etc.
sum them, average them, or remove one
Many measures
use principal components analysis to reduce them

4. Use stepwise regression (or some flavour of)
See previous comments
Can be useful in a theoretical vacuum

5. Ridge regression
not very useful
behaves weirdly
525

Measurement Error

526

What is Measurement Error
In social science, it is unlikely that we measure any variable perfectly
measurement error represents this imperfection

We assume that we have a true score: T
And a measure of that score: x
527

x = T + e
just like a regression equation
if we standardise the parameters, the parameter for T is the reliability
the amount of variance in x which comes from T

but, like a regression equation
we assume that e is random and has a mean of zero
more on that later
528

Simple Effects of Measurement Error
Lowers the measured correlation between two variables

Real correlation
between the true scores (x* and y*)

Measured correlation
between the measured scores (x and y)
529

True correlation of x* and y*: r(x*y*)
Reliability of x: r(xx)
Reliability of y: r(yy)
Measured correlation of x and y: r(xy)
530

Attenuation of correlation:

r(xy) = r(x*y*) × √( r(xx) × r(yy) )

Attenuation-corrected correlation:

r(x*y*) = r(xy) / √( r(xx) × r(yy) )
531

Example

r(xx) = 0.7
r(yy) = 0.8
r(xy) = 0.3

r(x*y*) = r(xy) / √( r(xx) × r(yy) )
r(x*y*) = 0.3 / √(0.7 × 0.8) = 0.40
532
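The same correction as a few lines of R:

rxx <- 0.7; ryy <- 0.8; rxy <- 0.3
rxy / sqrt(rxx * ryy)   # 0.40: estimated correlation between the true scores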

Complex Effects of Measurement Error
Really horribly complex
Measurement error reduces correlations
reduces the estimates of b
but reducing one estimate can increase others

because of the effects of control, combined with the effects of suppressor variables
there is an exercise to examine this
533

Dealing with Measurement Error
Attenuation correction
very dangerous
not recommended

Avoid it in the first place
use reliable measures
don't discard information
don't categorise
Age: 10-20, 21-30, 31-40
534

Complications
We assume measurement error is
additive
linear

Additive
e.g. weight: people may under-report / over-report at the extremes

Linear
particularly the case when using proxy variables

e.g. proxy measures
Want to know effort on childcare? Count the number of children
but the 1st child is more effort than the last

Want to know financial status? Count income
but the first £10 has a much greater effect on financial status than the 1000th
536

Lesson 10: Non-Linear Analysis in Regression
537

Introduction
A non-linear effect occurs
when the effect of one independent variable
is not consistent across the range of the IV

An assumption is violated
expected value of residuals = 0
no longer the case
538

Some Examples

539

A Learning Curve
[Plot of skill against experience]
540

Yerkes-Dodson Law of Arousal
[Plot of performance against arousal]
541

Enthusiasm Levels over a Lesson on Regression
[Plot of enthusiasm (from suicidal to enthusiastic) against time]
542

Learning
the line changed direction once

Yerkes-Dodson
the line changed direction once

Enthusiasm
the line changed direction twice
543

Everything is Non-Linear
Every relationship we look at is non-linear, for two reasons
Exam results cannot keep increasing with reading more books
Linear in the range we examine

For small departures from linearity
Cannot detect the difference
Non-parsimonious solution
544

Non-Linear
Transformations

545

Bending the Line
Non-linear regression is hard
We cheat, and linearise the data
Then do linear regression

Transformations
We need to transform the data
rather than estimating a curved line
which would be very difficult
and may not work with OLS

we can take a straight line, and bend it
or take a curved line, and straighten it
back to linear (OLS) regression
546

We still do linear regression
Linear in the parameters
Y = b1x + b2x² + …

Can do non-linear regression
Non-linear in the parameters
e.g. Y = b1x^b2 + …

Much trickier
Statistical theory either breaks down
OR becomes harder
547

Linear transformations
multiply by a constant
add a constant
change the slope and the intercept

[Plot: y = x, y = 2x, y = x + 3]
549

Linear transformations are no use
they alter the slope and intercept
they don't alter the standardised parameter estimate

A non-linear transformation
will bend the slope
quadratic transformation
y = x²
one change of direction
550

Cubic transformation
y = x² + x³
two changes of direction
551

Quadratic Transformation
[Plot of y = 0 + 0.1x + 1x²]
552

Square Root Transformation
[Plot of y = 20 − 3x + 5√x]
553

Cubic Transformation
[Plot of y = 3 − 4x + 2x² − 0.2x³]
554

Logarithmic Transformation
[Plot of y = 1 + 0.1x + 10log(x)]
555

Inverse Transformation
[Plot of y = 20 − 10x + 8(1/x)]
556

To estimate a non-linear regression
we don't actually estimate anything non-linear
we transform the x-variable to a non-linear version
we can estimate that straight line
which represents the curve
we don't bend the line; we stretch the space around the line, and make it flat
557

Detecting Non-linearity

558

Draw a Scatterplot
Draw a scatterplot of y plotted against x
see if it looks a bit non-linear
e.g. Anscombe's data
e.g. education and beginning salary from the bank data
drawn in SPSS
with line of best fit
559

Anscombe (1973)
constructed a set of datasets
to show the importance of graphs in regression/correlation

For each dataset:
N = 11
Mean of x = 9
Mean of y = 7.5
Equation of regression line: y = 3 + 0.5x
Sum of squares (X − mean) = 110
Correlation coefficient = 0.82
R² = 0.67
560

[The four Anscombe datasets, plotted (slides 561-564)]

A Real Example
Starting salary and years of
education
From employee data.sav

565

Beginning Salary against Educational Level (years)
[Scatterplot with line of best fit: at some education levels the expected value of the error (residual) is > 0, at others it is < 0]
566

Use Residual Plot
A scatterplot is only good for one variable
use the residual plot (that we used for heteroscedasticity)
Good for many variables
567

We want
points to lie in a nice straight sausage
[Residual plot: a straight band]
568

We don't want
a nasty bent sausage
[Residual plot: a curved band]
569

Educational level and starting salary
[Residual plot for the education and salary regression]
570

Carrying Out Non-Linear


Regression

571

Linear Transformation
A linear transformation doesn't change
the interpretation of the slope
the standardised slope
the se, t, or p of the slope
R²

It can change
the effect of a transformation
572

Actually it's more complex
with some transformations you can add a constant with no effect (e.g. quadratic)

With others it does have an effect
inverse, log

Sometimes it is necessary to add a constant
negative numbers have no square root
0 has no log
573

Education and Salary: Linear Regression
Saw previously that the assumption of expected errors = 0 was violated
Anyway…
R² = 0.401, F = 315, df = 1, 472, p < 0.001
salbegin = -6290 + 1727 × educ
Standardised: b1 (educ) = 0.633

Both parameters make sense
574

Non-linear Effect
Compute a new variable
quadratic
educ2 = educ²

Add this variable to the equation
R² = 0.585, p < 0.001
salbegin = 46263 − 6542 × educ + 310 × educ2
slightly curious

Standardised:
b1 (educ) = -2.4
b2 (educ2) = 3.1

What is going on?
575
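A sketch of the hierarchical quadratic test in R, with simulated education/salary data (the real analysis would use employee data.sav):

set.seed(9)
educ <- sample(8:21, 200, replace = TRUE)
salbegin <- 20000 - 1500 * educ + 80 * educ^2 + rnorm(200, sd = 3000)
m1 <- lm(salbegin ~ educ)
m2 <- lm(salbegin ~ educ + I(educ^2))
anova(m1, m2)                                   # test of the change in R-squared
summary(m2)$r.squared - summary(m1)$r.squared   # the change itself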

Collinearity
is what is going on
Correlation of educ and educ2: r = 0.990

The regression equation becomes difficult (impossible?) to interpret

Need hierarchical regression
what is the change in R²?
is that change significant?
R² (change) = 0.184, p < 0.001
576

Cubic Effect
While we are at it, let's look at the cubic effect
R² (change) = 0.004, p = 0.045
salbegin = 19138 + 103e − 206e² + 12e³
Standardised:
b1 (e) = 0.04
b2 (e²) = -2.04
b3 (e³) = 2.71
577

Fourth Power
Keep going while we are ahead?
It won't run
???

Collinearity is the culprit
Tolerance (educ4) = 0.000005
VIF = 215555

The matrix of correlations of the IVs is not positive definite
cannot be inverted
578

Interpretation
Tricky, given that the parameter estimates are a bit nonsensical
Two methods
1: Use R² change
2: Save predicted values
or calculate predicted values to plot the line of best fit
Save them from the equation
Plot against the IV
579

[Plot of predicted beginning salary (0-50000) against education (8-22 years), with the linear, quadratic, and cubic fitted lines]
580

Differentiate with respect to e
We said:
s = 19138 + 103e − 206e² + 12e³
but first we will simplify it to the quadratic:
s = 46263 − 6542e + 310e²

dy/dx = −6542 + 310 × 2 × e
581

Education   Slope
9           -962
10          -342
11          278
12          898
13          1518
14          2138
15          2758
16          3378
17          3998
18          4618
19          5238
20          5858

1 year of education at the higher end of the scale is worth more than 1 year at the lower end of the scale.
MBA versus GCSE
582

Differentiate Cubic
s = 19138 + 103e − 206e² + 12e³
dy/dx = 103 − 206 × 2 × e + 12 × 3 × e²
Can calculate slopes for the quadratic and the cubic at different values
583

Education   Slope (Quad)   Slope (Cub)
9           -962           -689
10          -342           -417
11          278            -73
12          898            343
13          1518           831
14          2138           1391
15          2758           2023
16          3378           2727
17          3998           3503
18          4618           4351
19          5238           5271
20          5858           6263
584

A Quick Note on Differentiation
For y = x^p
dy/dx = p × x^(p−1)

For equations such as
y = b1x + b2x^p
dy/dx = b1 + b2 × p × x^(p−1)
y = 3x + 4x²
dy/dx = 3 + 4 × 2x
585

y = b1x + b2x² + b3x³
dy/dx = b1 + b2 × 2x + b3 × 3x²

y = 4x + 5x² + 6x³
dy/dx = 4 + 5 × 2x + 6 × 3x²
Many functions are simple to differentiate
Not all, though
586

Automatic Differentiation
If you
don't know how to differentiate
can't be bothered to look up the function

You can use automatic differentiation software
e.g. GRAD (freeware)
587

588

Lesson 11: Logistic Regression
Dichotomous/Nominal Dependent Variables
589

Introduction
Often in the social sciences, we have a dichotomous/nominal DV
we will look at dichotomous first, then take a quick look at multinomial

Dichotomous DV, e.g.
guilty/not guilty
pass/fail
won/lost
alive/dead (used in medicine)
590

Why Wont OLS Do?

591

Example: Passing a Test
A test for bus drivers
pass/fail
we might be interested in degrees of pass/fail
a company which trains them will not be
a fail means paying for them to take it again

Develop a selection procedure
Two predictor variables
Score: score on an aptitude test
Exp: relevant prior experience (months)
592

1st ten cases:

Score   Exp   Pass
5       6     0
1       15    0
1       12    0
4       6     0
1       15    1
1       6     0
4       16    1
1       10    1
3       12    0
4       26    1
593

DV: pass (1 = Yes, 0 = No)

Just consider score first
Carry out the regression
Score as IV, Pass as DV
R² = 0.097, F = 4.1, df = 1, 48, p = 0.028
b0 = 0.190
b1 = 0.110, p = 0.028
Seems OK
594

Or does it?
1st problem: the p-p plot of the residuals
[P-P plot of the residuals]
595

2nd problem: the residual plot
[Residual plot]
596

Problems 1 and 2
strange distributions of residuals
parameter estimates may be wrong
standard errors will certainly be wrong
597

3rd problem: interpretation
I score 2 on aptitude:
Pass = 0.190 + 0.110 × 2 = 0.41
I score 8 on the test:
Pass = 0.190 + 0.110 × 8 = 1.07

Seems OK, but…
What does it mean?
Cannot score 0.41 or 1.07
can only score 0 or 1

Cannot be interpreted
need a different approach
598

A Different Approach
Logistic Regression

599

Logit Transformation
In lesson 10, we transformed the IVs
now we transform the DV

Need a transformation which gives us
graduated scores (between 0 and 1)
No upper limit
we can't predict that someone will pass twice
No lower limit
you can't do worse than fail
600

Step 1: Convert to Probability
First, stop talking about values
talk about probability
for each value of score, calculate the probability of a pass

Solves the problem of graduated scales
601

Score        1     2     3     4     5
Fail   N     7     5     6     4     2
       P     0.7   0.5   0.6   0.4   0.2
Pass   N     3     5     4     6     8
       P     0.3   0.5   0.4   0.6   0.8

e.g. the probability of failure given a score of 1 is 0.7; the probability of passing given a score of 5 is 0.8
602

This is better
Now a score of 0.41 has a meaning
a 0.41 probability of a pass

But a score of 1.07 has no meaning
cannot have a probability > 1 (or < 0)
Need another transformation
603

Step 2: Convert to Odds Ratio
Need to remove the upper limit
Convert to odds
Odds, as used by betting shops
5:1, 1:2

Slightly different from odds in speech
"a 1 in 2 chance"
odds are 1:1 (evens)
50%
604

Odds ratio = (number of times it happened) / (number of times it didn't happen)

odds ratio = p(event) / p(not event) = p(event) / (1 − p(event))
605

p = 0.8: odds = 0.8/0.2 = 4
equivalent to 4:1 (odds on)
4 times out of five

p = 0.2: odds = 0.2/0.8 = 0.25
equivalent to 1:4 (4:1 against)
1 time out of five
606

Now we have solved the upper bound problem
we can interpret 1.07, 2.07, 1000000.07

But we still have the zero problem
we cannot interpret predicted scores less than zero
607

Step 3: The Log
Log10 of a number x is the power you raise 10 to, to get x:
10^log(x) = x

log(10) = 1
log(100) = 2
log(1000) = 3
608

log(1) = 0
log(0.1) = -1
log(0.00001) = -5
609

Natural Logs and e
Don't use log10
Use loge

The natural log, ln
Has some desirable properties that log10 doesn't

For us:
If y = ln(x) + c
dy/dx = 1/x
Not true for any other logarithm
610

Be careful: calculators and stats packages are not consistent when they use "log"
Sometimes log10, sometimes loge
Can prove embarrassing (a friend told me)
611

Take the natural log of the odds ratio
Goes from −∞ to +∞
can interpret any predicted value
612

Putting them all together
The logit transformation
the log-odds ratio
not bounded at zero or one
613

Score            1      2      3      4      5
Fail    N        7      5      6      4      2
        P        0.7    0.5    0.6    0.4    0.2
Pass    N        3      5      4      6      8
        P        0.3    0.5    0.4    0.6    0.8
Odds (fail)      2.33   1.00   1.50   0.67   0.25
log(odds) fail   0.85   0.00   0.41   -0.41  -1.39
614

[Plot of probability (0-1) against logit (−3.5 to +3.5): probability gets closer to zero, but never reaches it, as the logit goes down]
615

Hooray! Problem solved, lesson over
errrmmm… almost

Because we are now using the log-odds ratio, we can't use OLS
we need a new technique, called Maximum Likelihood (ML), to estimate the parameters
616

Parameter Estimation using ML
ML tries to find estimates of the model parameters that are most likely to give rise to the pattern of observations in the sample data
It all gets a bit complicated
OLS is a special case of ML
the mean is an ML estimator
617

There are no closed-form equations
must be solved iteratively
estimates the parameters that are most likely to give rise to the patterns observed in the data
by maximising the likelihood function (LF)

We aren't going to worry about this
except to note that sometimes the estimates do not converge
ML cannot find a solution
618

Interpreting Output
Using SPSS
Overall fit for:
step (only used for stepwise)
block (for hierarchical)
model (always)
in our model, all are the same
χ² = 4.9, df = 1, p = 0.025
(the analogue of the F test)
619

Omnibus Tests of Model Coefficients

Step 1    Chi-square   df   Sig.
  Step    4.990        1    .025
  Block   4.990        1    .025
  Model   4.990        1    .025
620

Model summary
-2LL (the deviance)
Cox & Snell R²
Nagelkerke R²
Different versions of R²
There is no real R² in logistic regression
they should be considered pseudo-R²
621

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      64.245              .095                   .127
622

Classification Table
the predictions of the model
based on a cut-off of 0.5 (by default)
predicted values × actual values
623

Classification Table (the cut value is .500)

                     Predicted
Observed             Fail   Pass   Percentage correct
Fail                 18     8      69.2
Pass                 12     12     50.0
Overall percentage                 60.0
624

Model parameters
B
the change in the logged odds associated with a change of 1 unit in the IV
just like OLS regression
difficult to interpret

SE (B)
standard error
multiply by 1.96 to get the 95% CIs
625

Variables in the Equation

Step 1     B       S.E.   Wald
SCORE      -.467   .219   4.566
Constant   1.314   .714   3.390
Variable(s) entered on step 1: SCORE.

[A second output table adds Sig., Exp(B), and the 95% CI for Exp(B)]
626

Constant
i.e. score = 0
B = 1.314
Exp(B) = e^B = e^1.314 = 3.720
odds = 3.720; p = 1 − (1 / (odds + 1))
= 1 − (1 / (3.720 + 1))
p = 0.788
627

Score = 1
Constant b = 1.314
Score B = -0.467
exp(1.314 − 0.467) = exp(0.847) = 2.332
odds = 2.332
p = 1 − (1 / (2.332 + 1))
= 0.699
628
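The same model sketched in R with simulated pass/score data (the coefficients quoted in the slides come from the real data):

set.seed(10)
score <- sample(1:5, 50, replace = TRUE)
pass  <- rbinom(50, 1, plogis(1.3 - 0.47 * score))  # made-up generating model
m <- glm(pass ~ score, family = binomial)
coef(m)                     # cf. b0 = 1.314, b1 = -0.467 in the slides
exp(coef(m))                # odds ratios, e.g. exp(-0.467) = 0.627
plogis(1.314 - 0.467 * 1)   # predicted probability at score = 1: 0.699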

Standard Errors and CIs
SPSS gives
B, SE B, exp(B) by default
Can work out the 95% CI from the standard error
B ± 1.96 × SE(B)
Or ask for it in the options

Symmetrical in B
Non-symmetrical (sometimes very) in exp(B)
629

Variables in the Equation

           B       S.E.   Exp(B)   95% CI for Exp(B)
SCORE      -.467   .219   .627     Lower .408, Upper .962
Constant   1.314   .714   3.720
Variable(s) entered on step 1: SCORE.
630

The odds of passing the test are multiplied by 0.63 (95% CI = 0.408 to 0.962, p = 0.033) for every additional point on the aptitude test.
631

More on Standard Errors
In OLS regression
If a variable is added in a hierarchical fashion
the p-value associated with the change in R² is the same as the p-value of the variable
Not the case in logistic regression
In our data: 0.025 and 0.033

Wald standard errors
make the p-values in the estimates wrong: too high
(the CIs are still correct)
632

The two estimates use slightly different information
The p-value asks "what if there is no effect?"
The CI asks "what if there is this effect?"
The variance depends on the hypothesised ratio of the number of people in the two groups

Can calculate likelihood-ratio based p-values
if you can be bothered
Some packages provide them automatically
633

Probit Regression
Very similar to logistic
a much more complex initial transformation (to the normal distribution)
Very similar results to logistic (multiplied by 1.7)

In SPSS:
A bit weird
Probit regression is available through the menus

But it requires the data structured differently

However
Ordinal logistic regression is equivalent to binary logistic
if the outcome is binary
and SPSS gives the option of probit
635

Results

                     Estimate   SE      p
Logistic (binary)
  Score              0.288      0.301   0.339
  Exp                0.147      0.073   0.043
Logistic (ordinal)
  Score              0.288      0.301   0.339
  Exp                0.147      0.073   0.043
Probit
  Score              0.191      0.178   0.282
  Exp                0.090      0.042   0.033
636

Differentiating Between Probit and Logistic
Depends on the shape of the error term
Normal or logistic
The graphs are very similar to each other
Could distinguish on quality of fit
given an enormous sample size

Logistic = probit × 1.7
Actually 1.6998

Probit advantage
We understand the (normal) distribution

Logistic advantage
Much simpler to get back to the probability

638
[Plot comparing the normal (probit) and logistic curves: almost indistinguishable]

Infinite Parameters
Non-convergence can happen because of infinite parameters
An insoluble model

Three kinds:
Complete separation
The groups are completely distinct
Pass group all score more than 10
Fail group all score less than 10
639

Quasi-complete separation
Separation with some overlap
Pass group all score 10 or more
Fail group all score 10 or less

In both cases:
No convergence

Close to this:
Curious estimates
Curious standard errors
640

Categorical Predictors
Can cause separation
esp. if correlated
Need people in every cell
[Table: cells formed by sex (male/female) × ethnicity (white/non-white) × poverty (below/above the poverty line)]
641

Logistic Regression and Diagnosis
Logistic regression can be used for diagnostic tests
For every score
calculate the probability that the result is positive
calculate the proportion of people with that score (or lower) who have a positive result

Calculate the c statistic
A measure of discriminative power
the percentage of all possible pairs of cases where the model gives a higher probability to the correct case than to the incorrect case
642

Perfect c-statistic = 1.0
Random c-statistic = 0.5

SPSS doesn't do it automatically
But it is easy to do:
Save the probabilities
Use Graphs, ROC Curve
Test variable: predicted probability
State variable: outcome
643
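In R, a sketch assuming the pROC package is installed (data simulated to mimic the bank example):

library(pROC)
set.seed(11)
salary   <- runif(200, 10, 90)
minority <- rbinom(200, 1, plogis(0.039 - 0.044 * salary))  # made-up model
m <- glm(minority ~ salary, family = binomial)
roc_obj <- roc(minority, fitted(m))   # ROC from the saved probabilities
auc(roc_obj)                          # area under the curve = c-statistic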

Sensitivity and Specificity
Sensitivity:
the probability of saying someone has a positive result
if they do: p(test positive | positive)

Specificity:
the probability of saying someone has a negative result
if they do: p(test negative | negative)
644

Calculating Sens and Spec
For each value
Calculate
the proportion of minority staff earning less: p(m)
the proportion of non-minority staff earning less: p(w)

Sensitivity (value) = p(m)
645

Salary   P(minority)
10       .39
20       .31
30       .23
40       .17
50       .12
60       .09
70       .06
80       .04
90       .03
646

Using Bank Data
Predict minority group, using salary (000s)
Logit(minority) = 0.039 − 0.044 × salary

Find the actual proportions
647

ROC Curve
[ROC curve: sensitivity against 1 − specificity. Diagonal segments are produced by ties. The area under the curve is the c-statistic.]
648

More Advanced Techniques
Multinomial logistic regression
more than two categories in the DV
same procedure
one category chosen as the reference group
odds of being in each category other than the reference

Polytomous Logit Universal Models (PLUM)
Ordinal multinomial logistic regression
For ordinal outcome variables
649

Final Thoughts
Logistic regression can be extended
dummy variables
non-linear effects
interactions (even though we don't cover them until the next lesson)

Same issues as OLS
collinearity
outliers


Lesson 12: Mediation and Path Analysis

653

Introduction
Moderator
the level of one variable influences the effect of another variable

Mediator
one variable influences another via a third variable

All relationships are really mediated
are we interested in the mediators?
can we make the process more explicit?
654

In the examples with the bank:
education → beginning salary

Why? What is the process?
Are we making assumptions about the process?
Should we test those assumptions?
655

[Path diagram: education → job skills, expectations, negotiating skills, kudos for the bank → beginning salary]
656

Direct and Indirect Influences
X may affect Y in two ways
Directly: X has a direct (causal) influence on Y
(or maybe mediated by other variables)

Indirectly: X affects Y via a mediating variable, M
657

e.g. how does going to the pub affect comprehension on a summer school course
on, say, regression

[Path diagram: having fun in the pub in the evening → not reading books on regression → less knowledge. Anything here (a direct path)?]
658

[Path diagram: the same model with fatigue added as a second mediator, having fun in the pub → fatigue → less knowledge. Is the direct path still needed?]
659

Mediators are needed
to cope with more sophisticated theory in the social sciences
to make explicit the assumptions made about processes
to examine direct and indirect influences
660

Detecting Mediation

661

4 Steps
From Baron and Kenny (1986)
To establish that the effect of X on Y is mediated by M:
1. Show that X predicts Y
2. Show that X predicts M
3. Show that M predicts Y, controlling for X
4. If the effect of X controlling for M is zero, M is a complete mediator of the relationship
662

Example: Book habits
[Path diagram: Enjoy books → Buy books → Read books]
663

Three Variables
Enjoy
how much an individual enjoys books
Buy
how many books an individual buys (in a year)
Read
how many books an individual reads (in a year)
664

        ENJOY   BUY    READ
ENJOY   1.00    0.64   0.73
BUY     0.64    1.00   0.75
READ    0.73    0.75   1.00
665

The Theory
[Path diagram: enjoy → buy → read]
666

Step 1
1. Show that X (enjoy) predicts Y (read)
b1 = 0.487, p < 0.001
standardised b1 = 0.732
OK
667

2. Show that X (enjoy) predicts M (buy)
b1 = 0.974, p < 0.001
standardised b1 = 0.643
OK
668

3. Show that M (buy) predicts Y (read), controlling for X (enjoy)
b1 = 0.206, p < 0.001
standardised b1 = 0.469
OK
669

4. If the effect of X controlling for M is zero, M is a complete mediator of the relationship
(Same analysis as for step 3.)

b2 = 0.287, p = 0.001
standardised b2 = 0.431
Hmmmm…

Significant, therefore not a complete mediator
670

[Path diagram: enjoy → read, 0.287 (step 4); enjoy → buy, 0.974 (from step 2); buy → read, 0.206 (from step 3)]
671

The Mediation Coefficient
Amount of mediation =
Step 1 − Step 4
= 0.487 − 0.287 = 0.200
OR
Step 2 × Step 3
= 0.974 × 0.206 = 0.200

SE of Mediator
[Path diagram: enjoy → buy, coefficient a (from step 2); buy → read, coefficient b (from step 3)]

sa = se(a)
sb = se(b)
673

Sobel test
The standard error of the mediation coefficient can be calculated:

se(ab) = √( b²sa² + a²sb² − sa²sb² )

a = 0.974, sa = 0.189
b = 0.206, sb = 0.054
674

Indirect effect = 0.200
se = 0.056
t = 3.52, p = 0.001

Online Sobel test:
http://www.unc.edu/~preacher/sobel/sobel.htm
(Won't be there for long; it will probably be somewhere else)
675
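The same calculation by hand in R, using the slides' numbers and formula (small differences from the quoted se = 0.056 may come from rounding of a, b, and their SEs):

a <- 0.974; sa <- 0.189   # step 2: enjoy -> buy
b <- 0.206; sb <- 0.054   # step 3: buy -> read (controlling for enjoy)
ab <- a * b                                    # indirect effect = 0.200
se <- sqrt(b^2 * sa^2 + a^2 * sb^2 - sa^2 * sb^2)
z  <- ab / se
2 * pnorm(abs(z), lower.tail = FALSE)          # two-tailed p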

A Note on Power
Recently
a move in the methodological literature away from this conventional approach
Problems of power:
several tests, all of which must be significant
Type I error rate = 0.05 × 0.05 = 0.0025
must affect power

Bootstrapping suggested as an alternative
See Papers B7, A4, B9
B21 for SPSS syntax
676


Lesson 13: Moderators in Regression
"different slopes for different folks"
679

Introduction
Moderator relationships have many different names
interactions (from ANOVA)
multiplicative
non-linear (just confusing)
non-additive

All talking about the same thing
680

A moderated relationship occurs when the effect of one variable depends upon the level of another variable

681

Hang on
That seems very like a nonlinear
relationship
Moderator
Effect of one variable depends on level of another

Non-linear
Effect of one variable depends on level of itself

Where there is collinearity
it can be hard to distinguish between them
Paper in handbook (B5)
Should (usually) compare effect sizes

682

e.g. How much it hurts when I drop a computer on my foot depends on
x1: how much alcohol I have drunk
x2: how high the computer was dropped from
but if x1 is high enough, x2 will have no effect
683

e.g. Likelihood of injury in a car accident depends on
x1: speed of car
x2: if I was wearing a seatbelt
but if x1 is low enough, x2 will have no effect

684

685

e.g. number of words (from a list) I can remember depends on
x1: type of words (abstract, e.g. justice, or concrete, e.g. carrot)
x2: method of testing (recognition, i.e. multiple choice, or free recall)
but if using recognition, x1 will not make a difference

686

We looked at three kinds of moderator
alcohol x height = pain
continuous x continuous

speed x seatbelt = injury
continuous x categorical

word type x test type
categorical x categorical

We will look at them in reverse order
687

How do we know to look for moderators?
Theoretical rationale
Often the most powerful
Many theories predict
additive/linear effects
Fewer predict moderator effects

Presence of heteroscedasticity
Clue there may be a moderated
relationship missing
688

Two Categorical Predictors

689

2 IVs

Data
word type (concrete [1], abstract [2])
test method (recog [1], recall [2])

20 participants in one of four groups
(1,1), (1,2), (2,1), (2,2)
5 per group
lesson12.1.sav
690

                 Concrete  Abstract  Total
Recog   Mean     15.40     15.20     15.30
        SD       2.19      2.59      2.26
Recall  Mean     15.60     6.60      11.10
        SD       1.67      7.44      6.95
Total   Mean     15.50     10.90     13.20
        SD       1.84      6.94      5.47

691

Graph of means
[Line graph: mean score by TEST (1.00, 2.00), separate lines for WORDS (1.00, 2.00)]
692

ANOVA Results
Standard way to analyse these
data would be to use ANOVA
Words: F=6.1, df=1, 16, p=0.025
Test: F=5.1, df=1, 16, p=0.039
Words x Test: F=5.6, df=1, 16,
p=0.031

693

Procedure for Testing


1: Convert to effect coding
(can use dummy coding; collinearity is less of an issue with effect coding)
doesn't make any difference to substantive interpretation
2: Calculate interaction term
In ANOVA interaction is automatic
In regression we create an
interaction variable
694

Interaction term (wxt)
multiply effect coded variables together

word   test   wxt
-1     -1     1
1      -1     -1
-1     1      -1
1      1      1
695

3: Carry out regression
Hierarchical
linear effects first
interaction effect in next block

696
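A sketch of the whole procedure in R, assuming a data frame mem with score, and words and test coded 1/2 as above (hypothetical names):

mem$w   <- ifelse(mem$words == 1, -1, 1)   # effect coding
mem$t   <- ifelse(mem$test  == 1, -1, 1)
mem$wxt <- mem$w * mem$t                   # interaction term

m1 <- lm(score ~ w + t, data = mem)        # block 1: linear effects
m2 <- lm(score ~ w + t + wxt, data = mem)  # block 2: add interaction
anova(m1, m2)                              # change-in-R2 test of the interaction
summary(m2)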

b0=13.2
b1 (words) = -2.3, p=0.025
b2 (test) = -2.1, p=0.039
b3 (words x test) = -2.2, p=0.031

Might need to use change in R² to test sig of interaction, because of collinearity
What do these mean?
b0 (intercept) = predicted value of
Y (score) when all X = 0
i.e. the central point

697

b0 = 13.2
grand mean

b1 = -2.3
distance from grand mean to mean for the two word types
13.2 − (−2.3) = 15.5
13.2 + (−2.3) = 10.9

        Concrete  Abstract  Total
Recog   15.40     15.20     15.30
Recall  15.60     6.60      11.10
Total   15.50     10.90     13.20
698

b2 = -2.1
distance from grand mean to recog
and recall means

b3 = -2.2
to understand b3 we need to look at
predictions from the equation without
this term

Score = 13.2 + (-2.3) w + (-2.1) t

699

Score = 13.2 + (−2.3)w + (−2.1)t
So for each group we can calculate an expected value

700

b1 = -2.3, b2 = -2.1

Word       Test        Expected Value
-1 (Conc)  -1 (Recog)  13.2 + (−2.3)×(−1) + (−2.1)×(−1)
-1 (Conc)  1 (Recall)  13.2 + (−2.3)×(−1) + (−2.1)×1
1 (Abs)    -1 (Recog)  13.2 + (−2.3)×1 + (−2.1)×(−1)
1 (Abs)    1 (Recall)  13.2 + (−2.3)×1 + (−2.1)×1

701

Word       Test        Expected  Actual
-1 (Conc)  -1 (Recog)  17.6      15.4
-1 (Conc)  1 (Recall)  13.4      15.6
1 (Abs)    -1 (Recog)  13.0      15.2
1 (Abs)    1 (Recall)  8.8       6.6

The exciting part comes when we look at the differences between the actual value and the value in the 2 IV model
702

Each difference = 2.2 (or −2.2)

The value of b3 was −2.2
the interaction term is the correction required to the slope when the second IV is included

703

Examine the slope for word type
[Graph: mean score against Test Type, Recog (−1) to Recall (1)]
Gradient = (11.1 − 15.3) / 2 = −2.1
704

Add the slopes for two test groups
[Graph: score against Test Type, Recog (−1) to Recall (1)]
Both word groups: −2.1
Concrete: (15.6 − 15.4) / 2 = 0.1
Abstract: (6.6 − 15.2) / 2 = −4.3
705

b associated with interaction
the change in slope, away from the average, associated with a 1 unit change in the moderating variable
OR
half the difference in the slopes
706

Another way to look at it
Y = 13.2 + (−2.3)w + (−2.1)t + (−2.2)wt
Examine concrete words group (w = −1): substitute values into the equation

Y(concrete) = 13.2 + (−2.3)×(−1) + (−2.1)t + (−2.2)×(−1)×t
Y(concrete) = 13.2 + 2.3 + (−2.1)t + 2.2t
Y(concrete) = 15.5 + 0.1t
The effect of changing test type for concrete words (the slope, which is half the actual difference)
707

Why go to all that effort? Why not do ANOVA in the first place?
1. That is what ANOVA actually does
if it can handle an unbalanced design (i.e. different numbers of people in each group)
Helps to understand what can be done with ANOVA
SPSS uses regression to do ANOVA

2. Helps to clarify more complex cases
as we shall see
708

Categorical x Continuous

709

Note on Dichotomisation
Very common to see people
dichotomise a variable
Makes the analysis easier
Very bad idea
Paper B6

710

Data
A chain of 60 supermarkets
examining the relationship
between profitability, shop size,
and local competition
2 IVs
shop size
comp (local competition, 0=no,
1=yes)

DV
profit

711

Data, lesson 12.2.sav

Shopsize  Comp  Profit
4         1     23
10        1     25
7         0     19
10        0     9
10        1     18
29        1     33
12        0     17
6         1     20
14        0     21
62        0     8
712

1st Analysis
Two IVs
R2=0.367, df=2, 57, p < 0.001
Unstandardised estimates
b1 (shopsize) = 0.083 (p=0.001)
b2 (comp) = 5.883 (p<0.001)

Standardised estimates
b1 (shopsize) = 0.356
b2 (comp) = 0.448

713

Suspicions
Presence of competition is likely to have an effect
Residual plot shows a little heteroscedasticity
[Scatterplot of standardised residuals against standardised predicted values]
714

Procedure for Testing


Very similar to last time
convert comp to effect coding
-1 = No competition
1 = competition
Compute interaction term
comp (effect coded) x size

Hierarchical regression
715
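A sketch of this procedure in R, assuming a data frame shops with profit, shopsize and comp coded 0/1 (hypothetical names):

shops$c   <- ifelse(shops$comp == 0, -1, 1)   # effect coding
shops$sxc <- shops$shopsize * shops$c         # interaction term

m1 <- lm(profit ~ shopsize + c, data = shops)
m2 <- lm(profit ~ shopsize + c + sxc, data = shops)
anova(m1, m2)                                 # test of the moderator

b <- coef(m2)
b["shopsize"] + b["sxc"] * c(1, -1)   # simple slopes of size: approx 0.021 and 0.121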

Result
Unstandardised estimates
b1 (shopsize) = 0.071 (p=0.006)
b2 (comp) = -1.67 (p = 0.506)
b3 (sxc) = -0.050 (p=0.050)

Standardised estimates
b1 (shopsize) = 0.306
b2 (comp) = -0.127
b3 (sxc) = -0.389

716

comp now non-significant
shows the importance of hierarchical entry: comp obviously is important

717

Interpretation
Draw graph with lines of best fit
drawn automatically by SPSS

Interpret equation by substitution of values
evaluate effects of
size
competition

718

[Scatterplot: Profit against Shopsize (0-100), with lines of best fit for Competition, No competition, and All Shops]
719

Effects of size
in presence and absence of competition (can ignore the constant)
Y = 0.071·x1 + (−1.67)·x2 + (−0.050)·x1·x2

Competition present (x2 = 1):
Y = 0.071·x1 + (−1.67) + (−0.050)·x1
Y = 0.021·x1 (− 1.67)
720

Y = 0.071·x1 + (−1.67)·x2 + (−0.050)·x1·x2

Competition absent (x2 = −1):
Y = 0.071·x1 + 1.67 + 0.050·x1
Y = 0.121·x1 (+ 1.67)

721

Two Continuous Variables

722

Data
Bank Employees
only using clerical staff
363 cases
predicting starting salary
previous experience
age
age x experience
723

Correlation matrix
only one significant

724

Initial Estimates (no moderator)
(standardised)
R² = 0.061, p<0.001
Age at start = -0.37, p<0.001
Previous experience = 0.36, p<0.001

Suppressing each other
Age and experience compensate for one another
Older, with no experience: bad
Younger, with experience: good
725

The Procedure
Very similar to previous
create multiplicative interaction term
BUT
need to eliminate effects of means
(cause massive collinearity)
and SDs
(cause one variable to dominate the interaction term)
by standardising
726

To standardise x:
subtract mean, and divide by SD
re-expresses x in terms of distance from the mean, in SDs
i.e. z-scores

Hint: automatic in SPSS in Descriptives
Create interaction term of age and exp:
axe = z(age) × z(exp)
727

Hierarchical regression
two linear effects first
moderator effect in second
hint: it is often easier to interpret if
standardised versions of all variables
are used

728
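A sketch in R, assuming a data frame bank with salbegin, agestart and prevexp (hypothetical names for the bank employee variables):

bank$zage <- as.numeric(scale(bank$agestart))   # z-scores: subtract mean, divide by SD
bank$zexp <- as.numeric(scale(bank$prevexp))
bank$axe  <- bank$zage * bank$zexp              # interaction of z-scores

m1 <- lm(scale(salbegin) ~ zage + zexp, data = bank)        # block 1: linear effects
m2 <- lm(scale(salbegin) ~ zage + zexp + axe, data = bank)  # block 2: moderator
anova(m1, m2)                                               # change in R2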

Change in R²
0.085, p<0.001

Estimates (standardised)
b1 (exp) = 0.104
b2 (agestart) = -0.54
b3 (age x exp) = -0.54

729

Interpretation 1: Pick-a-Point
Graph is tricky
can't have two continuous variables
Choose specific points (pick-a-point)
Graph the line of best fit of one variable at values of the other

Two ways to pick a point
1: Choose high (z = +1), medium (z = 0) and low (z = -1)
2: Choose sensible values: age 20, 50, 80?
730

We know:
Y = 0.10·e + (−0.54)·a + (−0.54)·a·e
where a = agestart, and e = experience

We can rewrite this as:
Y = (0.10·e) + (−0.54 + (−0.54)·e)·a

Bracketed terms are the simple intercept and simple slope:
b0(simple) = 0.10·e
b1(simple) = −0.54 + (−0.54)·e
Y = b0(simple) + b1(simple)·a

731

Pick any value of e, and we know the slope for a
Standardised, so it's easy

e = −1:
b0(simple) = −1 × 0.10 = −0.10
b1(simple) = −0.54 + (−1) × (−0.54) = 0.00

e = 0:
b0(simple) = 0 × 0.10 = 0.00
b1(simple) = −0.54 + 0 × (−0.54) = −0.54

e = 1:
b0(simple) = 1 × 0.10 = 0.10
b1(simple) = −0.54 + 1 × (−0.54) = −1.08
732
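The same arithmetic in R, using the standardised estimates from the slides:

b_e <- 0.10; b_a <- -0.54; b_ae <- -0.54   # estimates from the model above
e <- c(low = -1, medium = 0, high = 1)     # picked points for experience
simple_intercept <- b_e * e
simple_slope     <- b_a + b_ae * e
cbind(e, simple_intercept, simple_slope)   # slopes: 0.00, -0.54, -1.08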

Graph the Three Lines

733

Interpretation 2: P-Values and CIs
Second way
Newer, rarely done

Calculate CIs of the slope
at any point
Calculate p-value
at any point
Give ranges of significance


734

What do you need?
The variances and covariances of the estimates
SPSS doesn't provide estimates for the intercept
Need to do it manually:
in options, exclude the intercept
create intercept c = 1
use it in the regression

735

Enter information into web page:
www.unc.edu/~preacher/interact/acov.htm
(Again, may not be around for long)

Get results
Calculations in Bauer and Curran (in press: Multivariate Behavioral Research)
Paper B13

736

[MLR 2-way interaction plot: predicted values against the moderator (−1.0 to 1.0), with lines for CVz1(1), CVz1(2), CVz1(3)]

737

Areas of Significance
[Plot: simple slope against Experience, with confidence bands]

738

2 complications
1: Constant differed
2: DV was logged, hence non-linear: the effect of 1 unit depends on where the unit is

Can use SPSS to do graphs showing lines of best fit for different groups
See paper A2

739

Finally

740

Unlimited Moderators
Moderator effects are not limited
to
2 variables
linear effects

741

Three Interacting Variables


Age, Sex, Exp
Block 1
Age, Sex, Exp

Block 2
Age x Sex, Age x Exp, Sex x Exp

Block 3
Age x Sex x Exp
742
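A sketch of the three blocks in R, assuming a data frame bank with salbegin, age, sex (effect coded) and exp (hypothetical names):

b1 <- lm(salbegin ~ age + sex + exp, data = bank)      # block 1: main effects
b2 <- lm(salbegin ~ (age + sex + exp)^2, data = bank)  # block 2: adds all two-way interactions
b3 <- lm(salbegin ~ age * sex * exp, data = bank)      # block 3: adds the three-way interaction
anova(b1, b2, b3)                                      # test each block in turn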

Results
All two way interactions significant
Three way not significant
Effect of Age depends on sex
Effect of experience depends on sex
Size of the age x experience
interaction does not depend on sex
(phew!)

743

Moderated Non-Linear
Relationships
Enter non-linear effect
Enter non-linear effect x moderator
if significant, indicates that the degree of non-linearity differs by moderator

744

745

Modelling Counts: Poisson Regression
Lesson 14

746

Counts and the Poisson Distribution
Von Bortkiewicz (1898)
Numbers of Prussian soldiers kicked to death by horses

Deaths  Frequency
0       109
1       65
2       22
3       3
4       1
5       0
747

The data fitted a Poisson probability distribution
When counts of events occur, a Poisson distribution is common
e.g. papers published by researchers, police arrests, numbers of murders, ship accidents

Common approach
Log transform and treat as normal

Problems
Censored at 0
Integers only allowed
Heteroscedasticity
748

The Poisson Distribution

749

p(y | μ) = exp(−μ) · μ^y / y!
750

p(y | μ) = exp(−μ) · μ^y / y!
Where:
y is the count
μ is the mean of the Poisson distribution

In a Poisson distribution
the mean = the variance (hence the heteroscedasticity issue)

751
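A quick check of this function in R; dpois() implements it (μ = 2 here is an arbitrary example value):

mu <- 2
y  <- 0:5
dpois(y, lambda = mu)            # Poisson probabilities
exp(-mu) * mu^y / factorial(y)   # same numbers, from the formula above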

Poisson Regression in SPSS
Not directly available
SPSS can be tweaked to do it in two ways:
General loglinear model (genlog)
Non-linear regression (CNLR)
bootstrapped p-values only
Both are quite tricky
SPSS 15,
752

Example Using Genlog
Number of shark bites on different colour surfboards
100 surfboards: 50 red, 50 blue

Weight cases by bites
Analyse, Loglinear, General
Colour is factor

753

Results
Correspondence Between Parameters and Terms of the Design
Parameter  Aliased  Term
1                   Constant
2                   [COLOUR = 1]
3          x        [COLOUR = 2]
Note: 'x' indicates an aliased (or redundant) parameter. These parameters are set to zero.
754

Asymptotic estimates:

Param  Est.    SE     Z-value  95% CI Lower  Upper
1      4.1190  .1275  32.30    3.87          4.37
2      -.5495  .2108  -2.61    -.96          -.14
3      .0000   .      .        .             .

Note: Intercept (param 1) is curious
Param 2 is the difference in the means

755

SPSS: Continuous Predictors
Bleedin' nightmare
http://www.spss.com/tech/answer/details.cfm?tech_tan_id=100006204

756

Poisson Regression in
Stata
SPSS will save a Stata file
Open it in Stata
Statistics, Count outcomes, Poisson
regression

757

Poisson Regression in R
R is a freeware program
Similar to S-Plus
www.r-project.org

Steep learning curve to start with
Much nicer to do Poisson (and other) regression analysis
http://www.stat.lsa.umich.edu/~faraway/book/
http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/
758

Commands in R
Stage 1: enter data
colour <- c(1, 0, 1, 0, 1, 0, 1)  # etc.
bites  <- c(3, 1, 0, 0)           # etc.

Run analysis
p1 <- glm(bites ~ colour, family = poisson)

Get results
summary.glm(p1)

759

R Results
Coefficients:
            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -0.3567   0.1686      -2.115   0.03441 *
colour      0.5555    0.2116      2.625    0.00866 **

Results for colour
same as SPSS
For intercept: different (weird SPSS)
760

Predicted Values
Need to get exponential of parameter estimates
Like logistic regression

exp(0.5555) = 1.74
You are likely to be bitten by a shark 1.74 times more often with a red surfboard
761
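In R, exponentiating the coefficients of the p1 model from above gives the rate ratios directly; a small sketch:

exp(coef(p1))             # intercept: mean bites when colour = 0; colour: rate ratio 1.74
exp(confint.default(p1))  # Wald confidence intervals on the rate-ratio scale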

Checking Assumptions
Was it really Poisson distributed?
For a Poisson distribution, μ = σ²
As the mean increases, the variance should also increase
Residuals should be random

Overdispersion is a common problem
Too many zeroes

For blue: σ² = μ = exp(−0.3567) = 0.70
For red: σ² = μ = exp(−0.3567 + 0.5555) = 1.22
762

p(y | x) = exp(−μ) · μ^y / y!

Strictly:
p(yᵢ | xᵢ) = exp(−μᵢ) · μᵢ^yᵢ / yᵢ!

763

Compare Predicted with Actual Distributions
[Chart comparing observed counts with counts predicted from the Poisson model]

764

Overdispersion
Problem in Poisson regression
Too many zeroes

Causes
χ² inflation
Standard error deflation
Hence p-values too low
Higher Type I error rate

Solution
Negative binomial regression
765
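A sketch of the negative binomial alternative in R, using glm.nb() from the MASS package (same shark-bite variables as before):

library(MASS)                  # MASS ships with R
nb1 <- glm.nb(bites ~ colour)  # negative binomial regression
summary(nb1)                   # theta indexes the overdispersion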

Using R
R can read an SPSS file
But you have to ask it nicely

Click Packages menu, Load package, choose "foreign"
Click File, Change Dir
Change to the folder that contains your data
766

More on R
R uses objects
To place something into an object use <-
X <- Y
puts Y into X

Function is read.spss()
Mydata <- read.spss("spssfilename.sav")

Variables are then referred to as Mydata$VAR1
Note 1: R is case sensitive
Note 2: SPSS variable names are in capitals
767

GLM in R
Command
glm(outcome ~ pred1 + pred2 + … + predk [, family = familyname])
If no familyname, default is OLS
Use binomial for logistic, poisson for Poisson

Output is a GLM object
You need to give this a name
my1stglm <- glm(outcome ~ pred1 + pred2 + … + predk [, family = familyname])
768

Then need to explore the result
summary(my1stglm)

To explore what it means
need to plot regressions
easiest is to use Excel

769

770

Introducing Structural Equation Modelling
Lesson 15

771

Introduction
Related to regression analysis
All (OLS) regression can be considered as a special case of SEM

Power comes from adding restrictions to the model
SEM is a system of equations
Estimate those equations
772

Regression as SEM
Grades example
Grade = constant + books + attend +
error
Looks like a regression equation

Also
Books correlated with attend
Explicit modelling of error

773

Path Diagram
System of equations are usefully represented in a path diagram
[Legend: symbols for a measured variable, an unmeasured variable, a regression path, and a correlation]

774

Path Diagram for Regression
[Diagram: Books and Attend (correlated) → Grade, with error → Grade]
Must usually explicitly model error
Must explicitly model correlation

775

Results
Unstandardised
[Path diagram: BOOKS → GRADE = 4.04; ATTEND → GRADE = 1.28; error → GRADE = 13.52; BOOKS ↔ ATTEND = 2.65; variances 2.00, 1.00, 17.84]

776

Standardised
[Path diagram: BOOKS → GRADE = .35; ATTEND → GRADE = .33; e → GRADE = .82; BOOKS ↔ ATTEND = .44]

777

Table

                   Estimate  S.E.  C.R.  P    St. Est.
GRADE <-- BOOKS    4.04      1.71  2.36  .02  0.35
GRADE <-- ATTEND   1.28      0.57  2.25  .03  0.33
GRADE <-- e        13.52     1.53  8.83  .00  0.82
(Constant)         37.38     7.54  4.96  .00

SPSS Coefficients (a. Dependent Variable: GRADE)
             Unstandardised          Standardised
             B        Std. Error     Beta    Sig.
(Constant)   37.38    7.74                   .00
BOOKS        4.04     1.75           .35     .03
ATTEND       1.28     .59            .33     .04

778

So What Was the Point?
Regression is a special case
Lots of other cases
Power of SEM
power to add restrictions to the model

Restrict parameters
to zero
to the value of other parameters
to 1
779

Restrictions
Questions
Is a parameter really necessary?
Are a set of parameters necessary?
Are parameters equal?

Each restriction adds 1 df
Test of model with χ²

780

The χ² Test
Can the model proposed have generated the data?
Test of significance of difference of model and data
Statistically significant result: bad

Theoretically driven
Start with model
Don't start with data
781

Regression Again
[Diagram: BOOKS → GRADE and ATTEND → GRADE; both estimates restricted to zero (error fixed at 0, 1)]

782

Two restrictions
2 df for χ² test
χ² = 15.9, p = 0.0003

This test is (asymptotically) equivalent to the F test in regression
We still haven't got any further

783

Multivariate Regression
[Diagram: x1 and x2 → y1, y2, y3]

784

Test of all xs on all ys (6 restrictions = 6 df)
[Diagram: all six x → y paths removed]

785

Test of x1 on all ys (3 restrictions)
[Diagram: the three x1 → y paths removed]

786

Test of all x1 on all y1 (3 restrictions)
[Diagram: x1, x2 → y1, y2, y3, with the tested paths removed]

787

Test of all 3 partial correlations between ys, controlling for xs (3 restrictions)
[Diagram: the correlations among the y residuals removed]

788

Path Analysis and SEM
More complex models can add more restrictions

E.g. mediator model
1 restriction
no path from enjoy → read

[Diagram: ENJOY → BUY (e_buy) → READ (e_read)]

789

Result
χ² = 10.9, 1 df, p = 0.001
Not a complete mediator
Additional path is required

790
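For reference, a sketch of the same restricted model in R with the lavaan package (not one of the programs used in these slides), assuming a data frame books with enjoy, buy and read:

library(lavaan)
model <- '
  buy  ~ a * enjoy
  read ~ b * buy      # restriction: no direct enjoy -> read path
  ind := a * b        # the indirect (mediated) effect
'
fit <- sem(model, data = books)
summary(fit)          # the chi-square tests the restriction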

Multiple Groups
Same model
Different people

Equality constraints between groups
Means, correlations, variances, regression estimates
E.g. males and females
791

Multiple Groups Example
Age
Severity of psoriasis
SEVE in emotional areas
(hands, face, forearm)
SEVNONE in non-emotional areas
Anxiety
Depression
792

Correlations (SEX = f, N = 110), Pearson r (2-tailed sig):

          SEVE          SEVNONE       GHQ_A         GHQ_D
AGE       -.270 (.004)  -.248 (.009)  .017 (.859)   .035 (.717)
SEVE                    .665 (.000)   .045 (.639)   .075 (.436)
SEVNONE                               .109 (.255)   .096 (.316)
GHQ_A                                               .782 (.000)

793

Correlations (SEX = m, N = 79), Pearson r (2-tailed sig):

          SEVE          SEVNONE       GHQ_A         GHQ_D
AGE       -.243 (.031)  -.116 (.310)  -.195 (.085)  -.190 (.094)
SEVE                    .671 (.000)   .456 (.000)   .453 (.000)
SEVNONE                               .210 (.063)   .232 (.040)
GHQ_A                                               .800 (.000)
794

Model
[Path diagram: AGE → SEVE (e_s) and SEVNONE (e_sn); SEVE and SEVNONE → Dep (e_d) and Anx (e_a)]

795

Females
[Path diagram: AGE → SEVE = −.27, AGE → SEVNONE = −.25; residual SEVE ↔ SEVNONE = .64; paths to Anx/Dep all small (.07, .04, .03, .09, −.04, .15); residual Dep ↔ Anx = .78; error paths .96, .97, .99, .99]

796

Males
[Path diagram: AGE → SEVE = −.24, AGE → SEVNONE = −.12; residual SEVE ↔ SEVNONE = .67; paths to Anx/Dep: .52, .55, −.12, −.17, −.08, −.08; residual Dep ↔ Anx = .74; error paths .97, .99, .88, .88]

797

Constraint
sevnone → dep constrained to be equal for males and females
1 restriction, 1 df
χ² = 1.3, not significant

4 restrictions
2 severity measures → anx & dep
798

4 restrictions, 4 df
χ² = 1.3, p = 0.014

Parameters are not equal

799

Missing Data: The Big Advantage
SEM programs tend to deal with missing data
Multiple imputation
Full Information (Direct) Maximum Likelihood
Asymptotically equivalent

Data can be MAR, not just MCAR


800

Power: A Smaller Advantage
Power for regression gets tricky with large models
With SEM power is (relatively) easy
It's all based on chi-square
Paper B14

801

Lesson 16: Dealing with clustered data & longitudinal models

802

The Independence Assumption
In Lesson 8 we talked about independence
The residual of any one case should not tell you about the residual of any other case

Particularly problematic when:
Data are clustered on the predictor variable
e.g. predictor is household size, cases are members of family
e.g. predictor is doctor training, outcome is patients of doctor

Data are longitudinal
Have people measured over time
It's the same person!

803

Clusters of Cases
Problem with cluster (group) randomised studies
Or group effects

Use Huber-White sandwich estimator
Tell it about the groups
Correction is made
Use complex samples in SPSS
804

Complex Samples
As with Huber-White for
heteroscedasticity
Add a variable that tells it about the clusters
Put it into clusters

Run GLM
As before

Warning:
Need about 20 clusters for solutions to be
stable
805

Example
People randomised by week to one of two forms of triage
Compare the total cost of treating each

Ignore clustering
Difference is 2.40 per person, with 95% confidence intervals 0.58 to 4.22, p = 0.010

Include clustering
Difference is still 2.40, with 95% CIs −0.85 to 5.65, and p = 0.141

Ignoring clustering would have led to a Type I error
806

Longitudinal Research
For comparing repeated measures
Clusters are people
Can model the repeated measures over time

Data are usually short and fat:
ID  V1  V2  V3  V4
807

Converting Data
Change data to tall and thin
Use Data, Restructure in SPSS
Clusters are ID
808

(Simple) Example
Use employee data.sav
Compare beginning salary and salary
Would normally use paired samples t-test

Difference = $17,403, 95% CIs $16,427.407, $18,379.555
809

Restructure the Data
Do it again, with data tall and thin

Complex GLM with
Time as factor
ID as cluster

Difference = $17,403, 95% CIs = $16,427.407, $18,379.555

[Data fragment: ID, Time, Cash — $18,750, $21,450, $12,000, $21,900, $13,200, $45,000]
810

Interesting
That wasn't very interesting
What is more interesting is when we have multiple measurements of the same people

Can plot and assess trajectories over time
811

Single Person Trajectory
[Plot: one person's scores against time]
812

Multiple Trajectories: What's the Mean and SD?
[Plot: several trajectories against time]
813

Complex Trajectories
An event occurs
Can have two effects:
A jump in the value
A change in the slope

Event doesn't have to happen at the same time for each person
Doesn't have to happen at all
814

[Plot: slope 1, then a jump and slope 2 when the event occurs]
815

Parameterising
Time:     1   2   3   4   5   6   7   8   9
Event:    0   0   0   0   0   1   1   1   1
Time2:    0   0   0   0   0   0   1   2   3
Outcome:  12  13  14  15  16  10  9   8   7
816

Draw the Line

What are the parameter estimates?

817
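A sketch in R using the nine observations above; because the points lie exactly on the two line segments, the estimates come out exactly:

time    <- 1:9
event   <- c(0, 0, 0, 0, 0, 1, 1, 1, 1)
time2   <- c(0, 0, 0, 0, 0, 0, 1, 2, 3)
outcome <- c(12, 13, 14, 15, 16, 10, 9, 8, 7)

m <- lm(outcome ~ time + event + time2)
coef(m)  # intercept 11, time 1; event -7 (the jump); time2 -2 (so slope after the event = 1 - 2 = -1)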

Main Effects and Interactions
Main effects
intercept differences

Moderator effects
slope differences
818

Multilevel Models
Fixed versus random effects
Fixed effects are fixed across
individuals (or clusters)
Random effects have variance

Levels
Level 1 individual measurement
occasions
Level 2 higher order clusters
819

More on Levels
NHS direct study
Level 1 units: …
Level 2 units: …

Widowhood food study
Level 1 units: …
Level 2 units: …
820

More Flexibility
Three levels:
Level 1: measurements
Level 2: people
Level 3: schools

821

More Effects
Variances and covariances of effects
Level 1 and level 2 residuals
Makes R² difficult to talk about

Outcome variable
Yij
the score of the ith person in the jth group
822

Y    i  j
2.3  1  1
3.2  2  1
4.5  3  1
4.8  1  2
7.2  2  2
3.1  3  2
1.6  4  2
823

Notation
Notation gets a bit horrid
Varies a lot between books and programs

We used to have b0 and b1
If fixed, that's fine
If random, each person has their own intercept and slope
824

Standard Errors
Intercept has standard errors
Slopes have standard errors
Random effects have variances
Those variances have standard errors
Is there statistically significant variation
between higher level units (people)?
OR
Is everyone the same?
825

Programs
Since version 12, can do this in SPSS
Can't do anything really clever

Menus
completely unusable
have to use syntax
826

SPSS Syntax
MIXED
relfd with time
/fixed = time
/random = intercept time |
subject (id) covtype(un)
/print = solution.
827

SPSS Syntax
MIXED
relfd with time
Outcome

Continuous
predictor

828

SPSS Syntax
MIXED
relfd with time
/fixed = time
Must specify effect as
fixed first

829

SPSS Syntax
MIXED
relfd with time
/fixed = time
/random = intercept time |
Intercept and
subject (id) covtype(un)
time are random

Specify random
effects

SPSS assumes that your


level 2 units are subjects,
and needs to know the id
variable
830

SPSS Syntax
MIXED
relfd with time
/fixed = time
/random = intercept time |
subject (id) covtype(un)
Covariance matrix of random effects is unstructured.
(Alternatives: id = identity, or vc = variance components.)
831

SPSS Syntax
MIXED
relfd with time
/fixed = time
/random = intercept time |
subject (id) covtype(un)
/print = solution.
Print the answer
832
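Roughly the same model in R with the lme4 package (a sketch; mydata and the variable names are assumed to match the SPSS file):

library(lme4)
# random intercept and slope for time within id, with unstructured covariance
m <- lmer(relfd ~ time + (time | id), data = mydata)
summary(m)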

The Output
Information criteria
We'll come back to these

Information Criteria (smaller is better; dependent variable: relfd)
-2 Restricted Log Likelihood          64899.758
Akaike's Information Criterion (AIC)  64907.758
Hurvich and Tsai's Criterion (AICC)   64907.763
Bozdogan's Criterion (CAIC)           64940.134
Schwarz's Bayesian Criterion (BIC)    64936.134
833

Fixed Effects
Not useful here; useful for interactions

Type III Tests of Fixed Effects (dependent variable: relfd)
Source     Numerator df  Denominator df  F         Sig.
Intercept  1             741             3251.877  .000
time       1             741.000         2.550     .111
834

Estimates of Fixed Effects
Interpreted as regression equation

Estimates of Fixed Effects (dependent variable: relfd)
Parameter  Estimate  Std. Error  df   t       Sig.  95% CI Lower  Upper
Intercept  21.90     .38         741  57.025  .000  21.15         22.66
time       -.06      .04         741  -1.597  .111  -.14          .01
835

Covariance Parameters
Estimates of Covariance Parameters (dependent variable: relfd)
Parameter                          Estimate   Std. Error
Residual                           64.11577   1.0526353
Intercept + time [subject = id]
  UN (1,1)                         85.16791   5.7003732
  UN (2,1)                         -4.53179   .5067146
  UN (2,2)                         .7678319   .0636116
836

Change Covtype to VC
We know that this is wrong
The covariance of the effects was
statistically significant
Can also see if it was wrong by
comparing information criteria

We have removed a parameter from the model
Model is worse
Model is more parsimonious
Is it much worse, given the increase in
parsimony?
837

Criterion                              UN Model   VC Model
-2 Restricted Log Likelihood           64899.758  65041.891
Akaike's Information Criterion (AIC)   64907.758  65047.891
Hurvich and Tsai's Criterion (AICC)    64907.763  65047.894
Bozdogan's Criterion (CAIC)            64940.134  65072.173
Schwarz's Bayesian Criterion (BIC)     64936.134  65069.173

The information criteria are displayed in smaller-is-better form (dependent variable: relfd). Lower is better.
838

Adding Bits
So far, all a bit dull
We want some more predictors, to make it more exciting
E.g. female
Add:
relfd with time female
/fixed = time female time * female

What does the interaction term represent?
839

Extending Models
Models can be extended
Any kind of regression can be used
logistic, multinomial, Poisson, etc.

More levels
children within classes within schools
measures within people within classes within prisons

Multiple membership / cross-classified models
children within households and classes, but households not nested within class

Need a different program
e.g. MLwiN

840

MLwiN Example (very quickly)

841

Books
Singer, J. D. and Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. Oxford: Oxford University Press.
Examples at:
http://www.ats.ucla.edu/stat/SPSS/examples/alda/default.htm

842

The End

843
