
Linear regression with gradient descent


Ingmar Schuster
Patrick Jähnichen
using slides by Andrew Ng

Institut für Informatik

This lecture covers

Linear Regression

Hypothesis formulation, hypothesis space

Optimizing Cost with Gradient Descent

Using multiple input features with Linear Regression

Feature Scaling

Nonlinear Regression

Optimizing Cost using derivatives

Linear regression w. gradient descent

Linear Regression


Price for buying a flat in Berlin

Supervised learning problem

Expected answer available for each example in data

Regression Problem

Prediction of continuous output



Training data of flat prices

Notation

m: number of training examples

x: input (predictor) variable, called "feature" in ML-speak

y: output (response) variable

Square meters (x)    Price in 1000 € (y)
73                   174
146                  367
38                   69
124                  257
...                  ...


Learning procedure

Training data -> Learning Algorithm -> hypothesis h (mapping between input and output)

Size of flat -> h -> Estimated price

Linear regression with one input variable (univariate):

$h_\theta(x) = \theta_0 + \theta_1 x$

Hypothesis parameters: $\theta_0, \theta_1$

How to choose the parameters?



Optimization objective

Purpose of the learning algorithm is expressed in an optimization objective and a cost function (often called J)

Fit data well

Few false negatives

Few false positives

...

Fitting data well: least squares cost function

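In regression we almost always want to fit the data well

Smallest average distance to the points in the training data ($h_\theta(x)$ close to $y$ for all $(x, y)$ in the training data)

Cost function, often named J:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

where $m$ is the number of training instances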

Squaring

Penalty for positive and negative deviations the same

Penalty for large deviations stronger
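The least squares cost can be sketched in a few lines of Python. This is a minimal illustration, assuming the common form J = 1/(2m) * sum of squared deviations and a univariate hypothesis h(x) = theta0 + theta1 * x; the function names `hypothesis` and `cost` are chosen here for illustration only.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Least squares cost J = 1/(2m) * sum of squared deviations."""
    m = len(xs)
    squared = ((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared) / (2 * m)


# Four flats from the training data table (square meters, price in 1000).
sizes = [73, 146, 38, 124]
prices = [174, 367, 69, 257]

print(cost(0.0, 0.0, sizes, prices))  # predicting 0 for every flat
print(cost(0.0, 2.5, sizes, prices))  # guessing 2.5 per square meter fits far better
```

Because the deviations are squared, a prediction that is too high costs as much as one that is equally too low, and a single large deviation dominates several small ones.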



Optimizing Cost
with Gradient Descent


Gradient Descent Outline

Want to minimize $J(\theta_0, \theta_1)$

Start with random $\theta_0, \theta_1$

Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we end up at a minimum


3D plots and contour plots

Stepwise descent towards the minimum

Derivatives work only for few parameters

[plot by Andrew Ng]

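Gradient descent

Repeat until convergence:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ for $j = 0$ and $j = 1$

$\alpha$: learning rate; $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$: partial derivative

Beware: an incremental update is incorrect! Compute both partial derivatives from the old parameter values, then update $\theta_0$ and $\theta_1$ simultaneously.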
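The update rule, including the simultaneous update, might look as follows in Python. This is a sketch under assumptions: the cost is taken to be the least squares cost with factor 1/(2m), the toy data and the learning rate 0.05 are made up for illustration, and `gradient_step` is a hypothetical helper name.

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One gradient descent step for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    # Simultaneous update: both gradients were computed from the OLD parameters.
    return theta0 - alpha * grad0, theta1 - alpha * grad1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # toy data generated by y = 1 + 2x

theta0, theta1 = 0.0, 0.0
for _ in range(5000):
    theta0, theta1 = gradient_step(theta0, theta1, xs, ys, alpha=0.05)

print(theta0, theta1)  # approaches the generating parameters 1 and 2
```

An incremental version would overwrite theta0 first and then use the new theta0 while computing the gradient for theta1, which is no longer a step along the gradient of J.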


Learning Rate considerations

A small learning rate leads to slow convergence

An overly large learning rate may prevent convergence or even lead to divergence

Steps become smaller near the minimum without changing the learning rate, because the partial derivative shrinks
Often
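The effect of the learning rate can be tried out on a deliberately simple cost. The sketch below assumes J(theta) = theta**2 (derivative 2 * theta), which is not the regression cost from the slides but shows the same behaviour; `descend` and the three rates are illustrative choices.

```python
def descend(alpha, steps=20, theta=1.0):
    """Run gradient descent on J(theta) = theta**2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta


print(abs(descend(0.01)))  # small rate: after 20 steps still far from the minimum at 0
print(abs(descend(0.4)))   # moderate rate: very close to 0
print(abs(descend(1.1)))   # too large: every step overshoots and |theta| grows
```

With a fixed rate, the step alpha * 2 * theta shrinks automatically as theta approaches 0, so the rate does not have to be reduced by hand.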


Checking convergence

Gradient descent works correctly if $J(\theta)$ decreases with every step

Possible convergence criterion: converged if $J(\theta)$ decreases by less than some small constant in one step
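Both checks can be built into the descent loop. The following is a minimal sketch, assuming a single parameter and a threshold named `epsilon` (the name and the default 1e-9 are assumptions); it raises an error if the cost ever increases and stops once the decrease falls below the threshold.

```python
def minimize(cost, gradient, theta, alpha=0.1, epsilon=1e-9, max_steps=10_000):
    """Gradient descent that stops once the cost decreases by less than epsilon."""
    previous = cost(theta)
    for step in range(1, max_steps + 1):
        theta = theta - alpha * gradient(theta)
        current = cost(theta)
        if current > previous:
            raise RuntimeError("cost increased; the learning rate is probably too large")
        if previous - current < epsilon:
            return theta, step
        previous = current
    return theta, max_steps


# Example cost J(t) = (t - 3)**2 with its minimum at t = 3.
theta, steps = minimize(lambda t: (t - 3) ** 2, lambda t: 2 * (t - 3), theta=0.0)
print(theta, steps)
```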


Local Minima

Gradient descent can get stuck at local minima (possible when J is not the squared-error cost of linear regression, which has a single global minimum)


Remedy: random restart with different initial parameter(s)
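Random restarts can be sketched on a small non-convex cost. The function J(theta) = theta**4 - 3 * theta**2 + theta is an assumption made for this example (it has a local minimum near 1.13 and a global one near -1.30); the helper names, the seed and the number of restarts are likewise illustrative.

```python
import random


def cost(theta):
    """A non-convex example cost with one local and one global minimum."""
    return theta**4 - 3 * theta**2 + theta


def gradient(theta):
    return 4 * theta**3 - 6 * theta + 1


def descend(theta, alpha=0.01, steps=2000):
    """Plain gradient descent from a given starting value."""
    for _ in range(steps):
        theta -= alpha * gradient(theta)
    return theta


def best_of_restarts(starts):
    """Run gradient descent from every start and keep the lowest-cost result."""
    return min((descend(s) for s in starts), key=cost)


rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
print(best_of_restarts(starts))
```

A single run started at 1.5 ends in the local minimum, while a run started at -1.5 reaches the global one; keeping the best of several starts makes the second outcome likely, though not guaranteed.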


Variants of Gradient Descent

Using multiple input features


Multiple features
Square meters   Bedrooms   Floors   Age of building (years)   Price in 1000 €
x1              x2         x3       x4                        y
200                                 45                        460
131                                 40                        232
142                                 30                        315
756                                 36                        178

Notation


Hypothesis representation

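$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

More compact

$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$

with definition $x_0 = 1$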
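The compact form is just a dot product once the constant feature x0 = 1 is prepended. A minimal sketch follows; the parameter values in `theta` and the example flat are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def predict(theta, features):
    """Compute h(x) = theta^T x after prepending the constant feature x0 = 1."""
    x = [1.0] + list(features)
    return sum(t * xj for t, xj in zip(theta, x))


# Hypothetical parameter values, chosen only for illustration.
theta = [50.0, 2.0, 10.0, 5.0, -1.0]
flat = [100, 3, 2, 40]  # square meters, bedrooms, floors, age in years

print(predict(theta, flat))  # 50 + 2*100 + 10*3 + 5*2 - 1*40 = 250.0
```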


Gradient descent for multiple variables

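Generalized cost function

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Generalized gradient descent

Repeat until convergence: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$, simultaneously for all $j = 0, \dots, n$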


Partial derivative of cost function for multiple variables

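Calculating the partial derivative

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$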
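The individual steps of this calculation can be written out with the chain rule. The derivation below is a sketch that assumes the least squares cost with factor 1/(2m) and the linear hypothesis $h_\theta(x) = \sum_k \theta_k x_k$ with $x_0 = 1$; here $x_j^{(i)}$ denotes feature j of training example i.

```latex
\begin{align*}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \, \frac{1}{2m} \sum_{i=1}^{m}
     \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right)
     \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) \\
  &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{align*}
```

The last step uses $\frac{\partial}{\partial \theta_j} \sum_k \theta_k x_k^{(i)} = x_j^{(i)}$; the factor 2 from the square cancels the 1/2 in the cost, which is the usual reason for including the 1/2.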


Gradient descent for multiple variables

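Simplified gradient descent

Repeat until convergence: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$, simultaneously for all $j = 0, \dots, n$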
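For several features the same loop works on lists of parameters. The sketch below assumes the least squares cost with factor 1/(2m) and rows that already contain x0 = 1; the toy data, the learning rate and the name `gradient_descent` are assumptions for illustration.

```python
def gradient_descent(X, y, alpha, steps):
    """Batch gradient descent; every row of X already starts with x0 = 1."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(steps):
        # Prediction error of the current hypothesis for every training example.
        errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        # Partial derivative of J with respect to each theta_j.
        gradient = [sum(e * row[j] for e, row in zip(errors, X)) / m
                    for j in range(n)]
        # Simultaneous update of all parameters.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Toy data generated by y = 1 + 2*x1 + 3*x2
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]

theta = gradient_descent(X, y, alpha=0.1, steps=5000)
print(theta)  # approaches [1, 2, 3]
```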


Conversion considerations for multiple variables

With multiple variables, the variance of different features is no longer comparable (scales can vary strongly)

Feature          Range
Square meters    30 - 400
Bedrooms         1 - 10
Price            80 000 - 2 000 000

Gradient descent converges faster for features on similar scale


Feature Scaling


Feature scaling

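Different approaches for converting features to a comparable scale

Min-Max-Scaling makes all data fall into the range [0, 1]

$x_j' = \frac{x_j - \min_j}{\max_j - \min_j}$ (for a single data point $x_j$ of feature j, where $\min_j$ and $\max_j$ are the smallest and largest value of feature j in the data)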

Z-score conversion
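Min-max scaling of one feature takes only a few lines. This sketch assumes the usual formula (value - min) / (max - min) and a feature that takes at least two different values; `min_max_scale` is a name chosen for this example.

```python
def min_max_scale(values):
    """Map every value of one feature into the range [0, 1]."""
    lo, hi = min(values), max(values)
    # Assumes hi != lo; a constant feature would cause a division by zero.
    return [(v - lo) / (hi - lo) for v in values]


sizes = [73, 146, 38, 124]  # square meters from the training data table
print(min_max_scale(sizes))
```

The smallest flat is mapped to 0 and the largest to 1; all others fall in between in proportion to their size.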


Z-Score conversion

Center data on 0

Scale data so the majority falls into the range [-1, 1]

$\mu_j$: mean / empirical expected value (mu) of feature j

$\sigma_j$: empirical standard deviation (sigma) of feature j

Z-score conversion of a single data point for feature j:

$x_j' = \frac{x_j - \mu_j}{\sigma_j}$
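A Z-score conversion can be sketched with Python's standard `statistics` module. One assumption to note: the code uses the population standard deviation (`statistics.pstdev`); the slides do not say whether the population or the sample version (`statistics.stdev`) is meant.

```python
import statistics


def z_scores(values):
    """Center one feature on 0 and scale it by its standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]


sizes = [73, 146, 38, 124]  # square meters from the training data table
print(z_scores(sizes))
```

After the conversion the feature has mean 0 and standard deviation 1, so most values lie between -1 and 1.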


Visualizing standard deviation


Nonlinear Regression
(by cheap trickery)


Nonlinear Regression Problems


Nonlinear Regression Problems (linear approximation)


Nonlinear Regression Problems (nonlinear hypothesis)


Nonlinear Regression with cheap trickery

Linear Regression can be used for Nonlinear Problems

Choose nonlinear hypothesis space
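One common way to obtain a nonlinear hypothesis space is to add powers of the input as extra features. The sketch below assumes a polynomial hypothesis such as h(x) = theta0 + theta1 * x + theta2 * x**2, which may differ from the example on the original slide; the parameter values are hypothetical.

```python
def polynomial_features(x, degree):
    """Turn one input x into the features [x, x**2, ..., x**degree]."""
    return [x**d for d in range(1, degree + 1)]


def predict(theta, x):
    """Hypothesis that is nonlinear in x but still linear in the parameters theta."""
    features = [1.0] + polynomial_features(x, len(theta) - 1)
    return sum(t * f for t, f in zip(theta, features))


theta = [1.0, 0.0, 2.0]  # hypothetical parameters: h(x) = 1 + 2 * x**2
print([predict(theta, x) for x in [0, 1, 2, 3]])  # [1.0, 3.0, 9.0, 19.0]
```

Since the hypothesis remains linear in theta, the cost function and gradient descent for multiple features apply unchanged; only the feature list is different. Powers of x grow quickly, so feature scaling matters even more here.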


Optimizing cost
using derivatives


Comparison Gradient Descent vs. Setting derivative = 0

Instead of gradient descent, solve $\frac{\partial}{\partial \theta_i} J(\theta) = 0$ for all i
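For one input variable, setting both partial derivatives of the least squares cost to zero gives a well-known closed-form solution. The sketch below assumes the hypothesis h(x) = theta0 + theta1 * x; `fit_closed_form` is a name chosen for this example.

```python
def fit_closed_form(xs, ys):
    """Solve dJ/dtheta0 = 0 and dJ/dtheta1 = 0 for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx                  # slope
    theta0 = mean_y - theta1 * mean_x   # intercept
    return theta0, theta1


# Four flats from the training data table (square meters, price in 1000).
sizes = [73, 146, 38, 124]
prices = [174, 367, 69, 257]
print(fit_closed_form(sizes, prices))
```

No learning rate and no iterations are needed. With many features the corresponding system of linear equations grows, which is where solving it directly becomes slow.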


Comparison Gradient Descent vs. Setting derivative = 0


Gradient Descent

Need to choose the learning rate $\alpha$
Needs many iterations, random restarts etc.
Works well for many features

Derivation (setting derivative = 0)

No need to choose a learning rate
No iterations
Slow for many features


This lecture covers

Linear Regression

Hypothesis formulation, hypothesis space

Optimizing Cost with Gradient Descent

Using multiple input features with Linear Regression

Feature Scaling

Nonlinear Regression

Optimizing Cost using derivatives


Pictures

Some public domain plots from en.wikipedia.org and de.wikipedia.org
