
1 Chapter 2: Supervised Learning

1.1 Least Squares and Nearest Neighbors
The most common use of linear regression is to predict the value of a single quantitative output. However, linear regression can be used both for classification (more than one class) and for model fitting and prediction (a single output). For classification, the fitted linear surface defines the decision boundary, e.g. \{x : x^T\hat\beta = 0.5\}.

Sometimes the linear model is too rigid, depending on the scenario by which the real data were generated. Consider two scenarios (with two input variables):

Scenario 1: Each class is generated from a bivariate Gaussian distribution with uncorrelated components and different means. (A linear decision boundary is essentially optimal, since a region of overlap between the classes is inevitable.)

Scenario 2: Each class comes from a mixture of 10 low-variance Gaussian distributions, with the individual means themselves distributed as Gaussian. (The optimal boundary is nonlinear and disjoint, so nearest-neighbor methods are a better choice.)

Note that the number of parameters in the least-squares fit is p, while for k-nearest neighbors it is N/k (the real effective number of parameters, not the single parameter k). The sum of squared errors on the training set is not applicable for picking k, since we would always pick k = 1: the error on the training data is approximately an increasing function of k, and is always 0 for k = 1. The sum of squared errors is
RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 = (y - X\beta)^T (y - X\beta)    (1)
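As a concrete illustration of (1), here is a minimal NumPy sketch (the two-class toy data, the random seed, and the variable names are illustrative assumptions, not the data from the text): it minimizes RSS(β) by least squares and applies the \{x : x^T\hat\beta = 0.5\} classification rule mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data (illustrative stand-in, not the text's simulated data):
# class 0 centered at (-1, -1), class 1 centered at (+1, +1).
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.repeat([0.0, 1.0], 100)

# Minimize RSS(beta) = (y - X beta)^T (y - X beta); lstsq solves the
# least-squares problem in a numerically stable way. An intercept column is added.
Xd = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Linear decision boundary {x : x^T beta_hat = 0.5}:
# predict class 1 wherever the fitted value exceeds 0.5.
y_hat = (Xd @ beta_hat > 0.5).astype(float)
print("training error rate:", np.mean(y_hat != y))
```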

The k-nearest-neighbor fit for Y is

\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i    (2)
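Equation (2) translates almost directly into code. The sketch below (the function name and toy data are my own assumptions) also shows why training error cannot be used to choose k: with k = 1 each training point is its own nearest neighbor, so the training error is exactly 0.

```python
import numpy as np

def knn_fit(x, X_train, y_train, k):
    """k-nearest-neighbor fit: average of y over the k closest training points N_k(x)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(0)
X = np.repeat([[-1.0, -1.0], [1.0, 1.0]], 100, axis=0) + rng.normal(size=(200, 2))
y = np.repeat([0.0, 1.0], 100)

# Training error as a function of k: exactly 0 at k = 1 and roughly
# increasing in k, so training RSS would always select k = 1.
for k in (1, 15):
    preds = np.array([knn_fit(x, X, y, k) for x in X])
    err = np.mean((preds > 0.5).astype(float) != y)
    print(f"k = {k:2d}: training error rate = {err:.3f}")
```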

Statistical studies are more powerful when the underlying data-generating process is known. Otherwise, several methods may have to be tried in order to find the (possibly) right one. Table 1 summarizes the differences between the two methods.

Table 1. A summary of differences between the least-squares method and the nearest-neighbor method.

Item         Least squares (LS)        Nearest neighbors (NN)
DOF          p                         N/k
Fit          linear boundary           no assumption on the boundary
Objective    SSE on training data      error on a test dataset

k-nearest neighbors can be used for low-dimensional problems. Some ways to enhance these two simple procedures:

- Kernel methods: use weights that decrease smoothly to zero with distance from the target point, rather than the 0/1 weights of k-nearest-neighbor methods (see the sketch after this list).
- In high-dimensional spaces, distance kernels can be modified to emphasize some variables more than others.
- Local regression: fit linear models by locally weighted least squares.
- Basis expansions applied to the linear model fit.
- Neural network models: sums of non-linearly transformed linear models.
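As a sketch of the first idea (a hedged illustration; the Gaussian kernel and the bandwidth h = 0.5 are arbitrary choices, not values from the text), the hard 0/1 neighbor weights are replaced by weights that decay smoothly with distance from the target point:

```python
import numpy as np

def kernel_fit(x, X_train, y_train, h=0.5):
    """Locally weighted average with Gaussian kernel weights that decrease
    smoothly to zero with distance, instead of 0/1 nearest-neighbor weights."""
    dist = np.linalg.norm(X_train - x, axis=1)
    w = np.exp(-0.5 * (dist / h) ** 2)      # smooth weights
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# The estimate at x = 0 is a smooth local average dominated by nearby points.
print(kernel_fit(np.array([0.0]), X, y))
```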

1.2 Statistical Decision Theory

This theory requires a loss function L(Y, f(X)) for penalizing errors in prediction. If we use squared-error loss, the expected prediction error is

EPE(f) = E(Y - f(X))^2    (3)
       = \int [y - f(x)]^2 \Pr(dx, dy)    (4)

The solution that minimizes this expectation, when the loss function is squared error, is f(x) = E[Y | X = x], the conditional expectation. The nearest-neighbor methods attempt to implement this recipe directly, approximating the expectation by a sample mean and the conditioning point by a region close enough to the target point (a locally constant function). The least-squares method is model-based: it uses an assumed functional relationship to pool over values of X rather than conditioning on X (a globally linear function). For a categorical output variable G, the Bayes classifier is used: classify to the most probable class given X.
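A short, standard derivation (not spelled out in these notes) of why the conditional mean solves (3)-(4): condition on X and minimize pointwise,

EPE(f) = E_X \, E_{Y|X}\big( [Y - f(X)]^2 \mid X \big),

so it suffices to minimize at each point x separately,

f(x) = \mathrm{argmin}_c \; E_{Y|X}\big( [Y - c]^2 \mid X = x \big) = E[Y \mid X = x].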

1.3 High Dimensionality

The curse of dimensionality shows up in many ways. If we want a neighborhood of a target point to capture the same fraction of the data, in high dimensions we must cover a much larger range of each input variable, which makes such a neighborhood no longer local. The distance from a target point to its nearest neighbor is also larger than in low dimensions, and most points are closer to the boundary/edge of the sample than to any other point. Moreover, the sampling density is proportional to N^{1/p}, so matching a given density requires exponentially more samples as p grows. If a special structure or model is known to exist, it can be used to reduce both the bias and the variance of the estimates. The curse of dimensionality also affects methods that attempt to produce locally varying functions in a small isotropic neighborhood; the methods that overcome the dimensionality problem usually do not allow the neighborhood to be simultaneously small in all directions. Model complexity: overfitting vs. underfitting.
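To make the first claim concrete, here is a tiny sketch (the particular values of p and r are illustrative): to capture a fraction r of data distributed uniformly in the p-dimensional unit hypercube, a sub-cube neighborhood needs edge length r^{1/p}, which rapidly approaches the full range of each input as p grows.

```python
# Edge length e_p(r) = r**(1/p) of a hypercubical neighborhood capturing a
# fraction r of uniform data in the p-dimensional unit cube.
for p in (1, 2, 10, 100):
    for r in (0.01, 0.10):
        print(f"p = {p:3d}, fraction r = {r:.2f}: edge length = {r ** (1 / p):.2f}")
```

For example, capturing just 1% of the data in p = 10 dimensions already requires covering about 63% of the range of each input variable, so the neighborhood is no longer local.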
