
Unsupervised Learning (§14.4 to §14.7)

14.4 Self-Organizing Maps

14.5 Principal Components


Principal Curves
Principal Surfaces

14.6 Factor Analysis


Independent Component Analysis
Exploratory Projection Pursuit

14.7 Multidimensional Scaling


14.4 Self-Organizing Maps (SOMs)
Constrained version of K-means (VQ)
with prototypes on a topological map (grid)

For K prototypes m_j ∈ ℝ^p placed on a given map


- Find the prototype m_j closest to the considered x_i
- Move m_j and its map neighbors m_k towards x_i :
  m_k ← m_k + α (x_i − m_k)

Running the algorithm moves the prototypes


inside the input distribution w.r.t. the map constraints
SOMs can be extended

Presented version is the ‘basic online’ algorithm

Enhanced version : m_k ← m_k + α h(‖l_j − l_k‖)(x_i − m_k)

Batch version : m_j = ∑_k w_k x_k / ∑_k w_k

Further developments : WEBSOM, …
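A minimal NumPy sketch of the online update above (illustrative only; function and parameter names are my own, and in practice α and the neighborhood width σ shrink over the iterations):

```python
import numpy as np

def som_online_step(M, grid, x, alpha=0.05, sigma=1.0):
    """One online SOM update: move the winner and its map neighbors towards x.

    M    : (K, p) prototypes m_k in R^p
    grid : (K, 2) prototype locations l_k on the 2-D map
    x    : (p,)  input sample x_i
    """
    # Winner: prototype closest to x in input space.
    j = np.argmin(((M - x) ** 2).sum(axis=1))
    # Gaussian neighborhood h(||l_j - l_k||), measured on the map, not in input space.
    h = np.exp(-((grid - grid[j]) ** 2).sum(axis=1) / (2 * sigma ** 2))
    # Enhanced update: m_k <- m_k + alpha * h * (x - m_k).
    M += alpha * h[:, None] * (x - M)
    return M

# Example: a 5x5 map of prototypes in R^3, updated with one random sample.
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
M = som_online_step(rng.standard_normal((25, 3)), grid, rng.standard_normal(3))
```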


Can SOMs separate sphere data ?

Well … and so what ?


VQ aims to reduce the inputs (K ≤ N)

Vector Quantization (VQ) from §14.3.9 :

We aim to : represent the N inputs by K prototypes (a codebook)

We use : K-means to place the prototypes

Neural VQ algorithms are


an adaptive (iterated) version of K-means
SOMs are a constrained VQ algorithm

Intuition
Input space (concrete) Predefined map (abstract)

On ‘simple drawings’ and ‘long speeches’ … ;)


14.5 Principal Components (PCA)

Set of q directions onto which


p-dimensional data are linearly projected
→ dimension reduction since q ≤ p

Data : x1, x2, …, xn


Model : f(λ) = µ + Vq λ

Model fitted by least squares
w.r.t. the reconstruction error

min_{µ, λ_i, V_q} ∑_{i=1}^{n} ‖ x_i − µ − V_q λ_i ‖²
Principal Components are computed
from matrix operations
Considering µ̂ = x̄ and λ̂_i = V_q^T (x_i − x̄), solve

min_{V_q} ∑_{i=1}^{n} ‖ (x_i − x̄) − V_q V_q^T (x_i − x̄) ‖²

Data are usually centered,
and V_q V_q^T = H_q is the projection matrix

Solution : V_q is given by the first q columns of V,
obtained from the SVD
X = U D V^T
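A short NumPy sketch of this solution (variable names are illustrative): the q directions are the first q right singular vectors of the centered data matrix.

```python
import numpy as np

def pca_project(X, q=2):
    """PCA via the SVD of the centered data X_c = U D V^T."""
    xbar = X.mean(axis=0)
    Xc = X - xbar                              # data are usually centered
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vq = Vt[:q].T                              # first q columns of V
    lam = Xc @ Vq                              # scores lambda_i = V_q^T (x_i - xbar)
    return lam, Vq, xbar

# Reconstruction with the rank-q model f(lambda) = mu + V_q lambda:
X = np.random.default_rng(0).standard_normal((100, 5))
lam, Vq, xbar = pca_project(X, q=2)
X_hat = xbar + lam @ Vq.T
```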
PCA can project data
PCA can separate sphere data
PCA can be used for handwritten digits

Data in ℝ^256, 2-D projections

f̂(λ) = x̄ + λ_1 v_1 + λ_2 v_2
14.5 Principal Curves (PCurves)

Let f(λ) be a parametrized smooth curve in ℝ^p

Each coordinate of p-dimensional f(λ)


is a smooth function of λ (e.g. arc-length)

Then f(λ) = E( X | λ_f(X) = λ ) is a principal curve

Related concepts
- responsibility
- principal points
PCurves are a smooth
generalization of (linear) PCA
PCurves can be obtained automatically

Algorithm principle :
Consider the PCurve coordinates one by one

f̂_j(λ) ← E( X_j | λ_f(X) = λ ),  j = 1, …, p

λ̂_f(x) ← argmin_{λ'} ‖ x − f̂(λ') ‖

Those two steps are repeated until convergence
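A rough NumPy sketch of these two alternating steps (a toy version under simplifying assumptions: a running-mean smoother stands in for the conditional expectation, and the projection step searches only over the fitted curve points; real implementations use scatterplot smoothers or splines):

```python
import numpy as np

def principal_curve(X, n_iter=20, span=0.2):
    """Alternate smoothing (step 1) and projection (step 2) until convergence."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Initialize lambda with the first principal component scores.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]
    f = np.empty_like(X)
    k = max(int(span * n), 3)                  # running-mean window size
    for _ in range(n_iter):
        order = np.argsort(lam)
        # Step 1: f_j(lambda) <- E[X_j | lambda_f(X) = lambda], coordinate by coordinate.
        for j in range(p):
            f[order, j] = np.convolve(X[order, j], np.ones(k) / k, mode="same")
        # Step 2: lambda_f(x) <- argmin_l ||x - f(l)||, over the fitted curve points.
        d2 = ((X[:, None, :] - f[None, order, :]) ** 2).sum(axis=-1)
        lam = lam[order][d2.argmin(axis=1)]
    return f, lam
```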


14.5 Principal Surfaces (PSurfaces)

Generalization to more than one parameter


(but rarely more than two)
f(λ_1, λ_2) = [ f_1(λ_1, λ_2), f_2(λ_1, λ_2), …, f_p(λ_1, λ_2) ]

Algorithm for PCurves can be applied


for PSurfaces

Links with the SOMs


PSurfaces can separate sphere data
14.6 Factor Analysis (FA)

Goal : find the latent variables


within the observed ones

Latent representation of data (using the SVD)


X = U D V^T = S A^T
with S = √N U and A^T = D V^T / √N
PCA estimates a latent variable model
X_1 = a_11 S_1 + a_12 S_2 + ⋯ + a_1p S_p
X_2 = a_21 S_1 + a_22 S_2 + ⋯ + a_2p S_p
⋮
X_p = a_p1 S_1 + a_p2 S_2 + ⋯ + a_pp S_p
or X = A S

Due to the hypotheses on X there are many solutions


X = A S = A R^T R S = A* S*
as
Cov(S*) = R Cov(S) R^T = I
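A quick NumPy check of this non-identifiability (purely illustrative sizes): rotating unit-variance factors S by any orthogonal R leaves both Cov(S) and the product A S unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10_000, 3
S = rng.standard_normal((N, p))                   # factors with Cov(S) ~ I (rows = samples)
A = rng.standard_normal((p, p))                   # arbitrary loadings

R, _ = np.linalg.qr(rng.standard_normal((p, p)))  # a random orthogonal matrix
S_star = S @ R.T                                  # S* = R S
A_star = A @ R.T                                  # A* = A R^T

print(np.allclose(np.cov(S_star.T), np.eye(p), atol=0.05))  # Cov(S*) ~ I
print(np.allclose(S @ A.T, S_star @ A_star.T))               # same X = A S
```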
Factor Analysis avoids
the multi-solution problem
Idea : use q < p
X_1 = a_11 S_1 + a_12 S_2 + ⋯ + a_1q S_q + ε_1
X_2 = a_21 S_1 + a_22 S_2 + ⋯ + a_2q S_q + ε_2
⋮
X_p = a_p1 S_1 + a_p2 S_2 + ⋯ + a_pq S_q + ε_p
or X = A S + ε, with ε_j the information unique to X_j

Parameters are given by the covariance matrix


Σ = A A^T + D_ε
with
D_ε = diag[ Var(ε_1), …, Var(ε_p) ]
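A small NumPy illustration of this covariance structure (dimensions and distributions are made up for the example): data simulated from X = A S + ε have sample covariance close to A A^T + D_ε.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, q = 50_000, 5, 2
A = rng.standard_normal((p, q))                  # loadings, q < p
d = rng.uniform(0.5, 1.5, size=p)                # Var(eps_j), the unique variances

S = rng.standard_normal((N, q))                  # independent common factors
eps = rng.standard_normal((N, p)) * np.sqrt(d)   # unique parts
X = S @ A.T + eps                                # X = A S + eps (rows = samples)

Sigma_model = A @ A.T + np.diag(d)               # Sigma = A A^T + D_eps
print(np.abs(np.cov(X.T) - Sigma_model).max())   # small for large N
```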
Factor Analysis only partially avoids
the multi-solution problem
Still many possible solutions
since A and A R^T are equivalent in
Σ = A A^T + D_ε
Presented as an ‘advantage’ of the method :
allows rotation of the selected factors
(for interpretation purposes)
PCA and FA are related
FA can be generalized
X_1 = a_11 F_1 + a_12 F_2 + ⋯ + a_1q F_q + ε_1
⋮
X_p = a_p1 F_1 + a_p2 F_2 + ⋯ + a_pq F_q + ε_p

PCA : all variability in an item should be used


FA : only the variability in common is used

In most cases, the methods yield similar results

PCA is preferred for data reduction


FA is preferred for structure detection
14.6 Independent Component Analysis (ICA)
Goal : find source signals S from mixed ones X
Blind Source Separation

Assumption: Mutual Independency of Sources


ICA Applications
• Adaptive Filtering
• Speech Signal Processing
• Wireless Communication
• Biomedical Signal Processing
• Computational Neuroscience
• Image Coding
• Text data modeling
• Financial data analysis
ICA for MEG/EEG Analysis
(Figure : MEG or EEG signals, event-related brain sources, dipolar potentials)

• EEG systems record the electrical potentials on the scalp, which


are the mixture of a huge number of the action potentials of
neurons inside the brain.

Our objective:
- to extract the evoked potential from the measurements
- to find the location of the action potentials
- to find how the action potentials travel (their dynamics)
Speech Separation
Mathematical Formulation

x(k) = A s(k) + v(k)

• s(k) = (s_1(k), …, s_n(k))^T : the vector of n source signals
• x(k) = (x_1(k), …, x_m(k))^T : the vector of m sensor signals
• v(k) : the vector of sensor noise
• A is the mixing matrix
Demixing Model

Mixing model : x(k) = A s(k) + v(k)

Problem: to estimate the source signals (or event-related potentials)


by using the sensor signals

y(k)=W x(k)
• y(k)= (y1(k),…,ym(k))T: the vector of recovered signals
• W is the demixing matrix.
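A minimal blind-source-separation sketch with scikit-learn's FastICA (assuming scikit-learn is available; the signals and mixing matrix are made up): mix two non-Gaussian sources, then estimate the demixing matrix W and the recovered signals y(k).

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]     # two non-Gaussian sources s(k)

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                           # mixing matrix
X = S @ A.T + 0.05 * rng.standard_normal((2000, 2))  # x(k) = A s(k) + v(k)

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                             # recovered signals y(k)
W = ica.components_                                  # estimated demixing matrix W
```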
Basic Models for BSS
• Assumption :
– The source signals are mutually independent
• Model :
– Linear instantaneous mixture
– Linear convolutive mixture
• Principles :
– Maximum Entropy
– Minimum Mutual Information
– Joint Diagonalization of Cross-correlations
– Linear Predictability
– Sparseness Maximization
The key to ICA model :
maximize nongaussianity
Let X be a mixture of independent components

A single component can be found as


Y = B^T X = ∑_i B_i x_i

Let z = A^T B
Y = B^T X = B^T A s = z^T s

Then, by a broad interpretation of the CLT, z^T s (a sum of independent components) is more Gaussian than any single source, so maximizing the nongaussianity of Y isolates one component.


Kurtosis is a measure of nongaussianity

Simple but not robust to outliers
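For reference, (excess) kurtosis is kurt(Y) = E[Y⁴] − 3(E[Y²])², which is zero for Gaussian Y; a small NumPy check (illustrative sample sizes):

```python
import numpy as np

def kurtosis(y):
    """Excess kurtosis kurt(y) = E[y^4] - 3 (E[y^2])^2 of a centered signal."""
    y = np.asarray(y) - np.mean(y)
    return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

rng = np.random.default_rng(0)
print(kurtosis(rng.standard_normal(100_000)))   # ~0 for a Gaussian signal
print(kurtosis(rng.laplace(size=100_000)))      # ~3 for a Laplace (super-Gaussian) source
```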


Negentropy is a measure of nongaussianity
The differential entropy is
H(Y) = − ∫ g(y) log g(y) dy

From Information Theory :


entropy is maximal for Gaussian signals (at fixed variance)

Use negentropy instead


J(Y) = H(Y_gauss) − H(Y)
but it needs to be estimated …
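One common estimate (Hyvärinen's approximation, used by FastICA) replaces the entropies with a smooth contrast function such as G(u) = log cosh(u): J(Y) ∝ (E[G(Y)] − E[G(ν)])² with ν standard Gaussian. A sketch under that assumption (sample sizes are arbitrary):

```python
import numpy as np

def negentropy_approx(y, n_mc=100_000, seed=0):
    """Approximate J(y) via (E[G(y)] - E[G(nu)])^2 with G(u) = log cosh(u)."""
    y = (y - np.mean(y)) / np.std(y)                    # unit variance, like nu
    nu = np.random.default_rng(seed).standard_normal(n_mc)
    G = lambda u: np.log(np.cosh(u))
    return (np.mean(G(y)) - np.mean(G(nu))) ** 2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.standard_normal(50_000)))   # ~0 for a Gaussian signal
print(negentropy_approx(rng.laplace(size=50_000)))      # > 0 for a non-Gaussian signal
```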
Mutual Information
can also measure nongaussianity
Mutual Information

I(Y) = ∑_{j=1}^{p} H(Y_j) − H(Y)

a.k.a. the Kullback-Leibler distance


between g(y) and the product of its marginal densities

Finding B that minimizes mutual information


is equivalent to finding directions
in which negentropy is maximized
Preprocessing improves ICA

Centering : X has zero mean

Whitening : X′ s.t. E[X′ X′^T] = I

Others : application dependent
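A possible whitening step (a sketch via the eigendecomposition of the covariance; other choices such as PCA or ZCA scaling are equally valid): transform the centered data so that E[X′ X′^T] = I.

```python
import numpy as np

def whiten(X):
    """Center X and return X' whose sample covariance is (numerically) the identity."""
    Xc = X - X.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(Xc.T))         # Cov = E diag(d) E^T
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T     # whitening matrix (assumes d > 0)
    return Xc @ W.T

X = np.random.default_rng(0).standard_normal((1000, 3)) @ np.diag([1.0, 2.0, 0.5])
Xw = whiten(X)
print(np.allclose(np.cov(Xw.T), np.eye(3)))     # True
```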


ICA works better than PCA
ICA can be used for handwritten digits
14.6 Exploratory Projection Pursuit

Chapter 11
Find ‘interesting’ directions onto which
multidimensional data can be projected

‘Interesting’ means displays some structure


of the data distribution

Gaussian distributions are the least interesting


(from a structure point of view)
thus look for nongaussian projections (as in ICA)
14.6 A different approach for ICA

The data are assigned to a class G = 1 and


compared to a ‘reference signal’ of class G = 0

Kind of generalized logistic regression


log [ Pr(G = 1) / (1 − Pr(G = 1)) ] = f_1(a_1^T X) + f_2(a_2^T X)
Projection Pursuit Regression techniques
can therefore be applied
14.7 Multidimensional Scaling (MDS)
Aims to project data, as SOMs, PCurves, …
but tries to preserve pairwise distances

Based on distance or dissimilarity (only)


d_ij = ‖ x_i − x_j ‖

Goal : find values minimizing the stress function

S_D(z_1, z_2, …, z_n) = ∑_{i≠j} ( d_ij − ‖z_i − z_j‖ )²

Least squares or Kruskal-Shephard scaling
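A minimal sketch of Kruskal-Shephard scaling with a generic optimizer (SciPy assumed available; the optimizer choice and starting point are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def kruskal_shephard_mds(D, k=2, seed=0):
    """Minimize the stress sum over pairs of (d_ij - ||z_i - z_j||)^2 with z_i in R^k.

    (Each pair is counted once; this differs from the i != j sum only by a factor of 2.)
    """
    n = D.shape[0]
    d = squareform(D, checks=False)             # condensed vector of the d_ij
    def stress(z_flat):
        return np.sum((d - pdist(z_flat.reshape(n, k))) ** 2)
    z0 = np.random.default_rng(seed).standard_normal(n * k)
    return minimize(stress, z0, method="L-BFGS-B").x.reshape(n, k)

# Usage: for data X, take D = squareform(pdist(X)), then Z = kruskal_shephard_mds(D, k=2).
```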


MDS can be used with other scalings

Classical scaling minimizes


∑_{i≠j} ( s_ij − ⟨ z_i − z̄ , z_j − z̄ ⟩ )²

Sammon’s mapping minimizes


∑_{i≠j} ( d_ij − ‖z_i − z_j‖ )² / d_ij

Shephard-Kruskal nonmetric scaling minimizes


∑_{i≠j} ( θ(‖z_i − z_j‖) − d_ij )² / ∑_{i≠j} d_ij²
MDS can separate sphere data
Unsupervised Learning (§14.4 to §14.7)
SOMs : a constrained VQ algorithm

PCA : finds (linearly) orthogonal components


PCurves : a smooth generalization of PCA
PSurfaces : a 2(+)-D generalization of PCurves

FA : finds common (correlated) component parts


ICA : finds independent components
EPP : can be related to ICA

MDS : tries to preserve pairwise distances
