
Unsupervised Learning (§14.4 to §14.7)

14.4 Self-Organizing Maps

14.5 Principal Components


Principal Curves
Principal Surfaces

14.6 Factor Analysis


Independent Component Analysis
Exploratory Projection Pursuit

14.7 Multidimensional Scaling


14.4 Self-Organizing Maps (SOMs)
Constrained version of K-means (VQ)
with prototypes on a topological map (grid)

For K prototypes m_j ∈ ℝ^p placed on a given map


- Find the prototype m_j closest to the considered x_i
- Move m_j and its map neighbors m_k towards x_i :
  m_k ← m_k + α (x_i − m_k)

Running the algorithm moves the prototypes


inside the input distribution w.r.t. the map constraints
SOMs can be extended

Presented version is the ‘basic online’ algorithm

Enhanced version : m_k ← m_k + α h(‖l_j − l_k‖)(x_i − m_k)

Batch version : m_j = ∑_k w_k x_k / ∑_k w_k

Further developments : WEBSOM, …
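A minimal NumPy sketch of the online update above (illustrative only; function and parameter names are my own, and in practice α and the neighborhood width σ shrink over the iterations):

```python
import numpy as np

def som_online_step(M, grid, x, alpha=0.05, sigma=1.0):
    """One online SOM update: move the winner and its map neighbors towards x.

    M    : (K, p) prototypes m_k in R^p
    grid : (K, 2) prototype locations l_k on the 2-D map
    x    : (p,)  input sample x_i
    """
    # Winner: prototype closest to x in input space.
    j = np.argmin(((M - x) ** 2).sum(axis=1))
    # Gaussian neighborhood h(||l_j - l_k||), measured on the map, not in input space.
    h = np.exp(-((grid - grid[j]) ** 2).sum(axis=1) / (2 * sigma ** 2))
    # Enhanced update: m_k <- m_k + alpha * h * (x - m_k).
    M += alpha * h[:, None] * (x - M)
    return M

# Example: a 5x5 map of prototypes in R^3, updated with one random sample.
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
M = som_online_step(rng.standard_normal((25, 3)), grid, rng.standard_normal(3))
```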


Can SOMs separate sphere data ?

Well … and so what ?


VQ aims to reduce the inputs (K ≤ N)

Vector Quantization (VQ) from §14.3.9 :

We aim to : represent the N inputs by K prototypes (a codebook)

We use : K-means to place the prototypes

Neural VQ algorithms are


an adaptive (iterated) version of K-means
SOMs are a constrained VQ algorithm

Intuition
Input space (concrete) Predefined map (abstract)

On ‘simple drawings’ and ‘long speeches’ … ;)


14.5 Principal Components (PCA)

Set of q directions onto which


p-dimensional data are linearly projected
→ dimension reduction since q ≤ p

Data : x1, x2, …, xn


Model : f(λ) = µ + Vq λ

Model fitted by least squares
w.r.t. the reconstruction error

min_{µ, λ_i, V_q} ∑_{i=1}^{n} ‖ x_i − µ − V_q λ_i ‖²
Principal Components are computed
from matrix operations
Considering µ̂ = x̄ and λ̂_i = V_q^T (x_i − x̄), solve

min_{V_q} ∑_{i=1}^{n} ‖ (x_i − x̄) − V_q V_q^T (x_i − x̄) ‖²

Data are usually centered,
and V_q V_q^T = H_q is the projection matrix

Solution : V_q is given by the first q columns of V,
obtained from the SVD
X = U D V^T
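A short NumPy sketch of this solution (variable names are illustrative): the q directions are the first q right singular vectors of the centered data matrix.

```python
import numpy as np

def pca_project(X, q=2):
    """PCA via the SVD of the centered data X_c = U D V^T."""
    xbar = X.mean(axis=0)
    Xc = X - xbar                              # data are usually centered
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vq = Vt[:q].T                              # first q columns of V
    lam = Xc @ Vq                              # scores lambda_i = V_q^T (x_i - xbar)
    return lam, Vq, xbar

# Reconstruction with the rank-q model f(lambda) = mu + V_q lambda:
X = np.random.default_rng(0).standard_normal((100, 5))
lam, Vq, xbar = pca_project(X, q=2)
X_hat = xbar + lam @ Vq.T
```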
PCA can project data
PCA can separate sphere data
PCA can be used for handwritten digits

Data in ℝ^256, 2-D projections

f̂(λ) = x̄ + λ_1 v_1 + λ_2 v_2
14.5 Principal Curves (PCurves)

Let f(λ) be a parametrized smooth curve in ℝ^p

Each coordinate of p-dimensional f(λ)


is a smooth function of λ (e.g. arc-length)

Then f(λ) = E( X | λ_f(X) = λ ) is a principal curve

Related concepts
- responsibility
- principal points
PCurves are a smooth
generalization of (linear) PCA
PCurves can be obtained automatically

Algorithm principle :
Consider the PCurve coordinates one by one

f̂_j(λ) ← E( X_j | λ_f(X) = λ ),  j = 1, …, p

λ̂_f(x) ← argmin_{λ'} ‖ x − f̂(λ') ‖

Those two steps are repeated until convergence
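A rough NumPy sketch of these two alternating steps (a toy version under simplifying assumptions: a running-mean smoother stands in for the conditional expectation, and the projection step searches only over the fitted curve points; real implementations use scatterplot smoothers or splines):

```python
import numpy as np

def principal_curve(X, n_iter=20, span=0.2):
    """Alternate smoothing (step 1) and projection (step 2) until convergence."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Initialize lambda with the first principal component scores.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]
    f = np.empty_like(X)
    k = max(int(span * n), 3)                  # running-mean window size
    for _ in range(n_iter):
        order = np.argsort(lam)
        # Step 1: f_j(lambda) <- E[X_j | lambda_f(X) = lambda], coordinate by coordinate.
        for j in range(p):
            f[order, j] = np.convolve(X[order, j], np.ones(k) / k, mode="same")
        # Step 2: lambda_f(x) <- argmin_l ||x - f(l)||, over the fitted curve points.
        d2 = ((X[:, None, :] - f[None, order, :]) ** 2).sum(axis=-1)
        lam = lam[order][d2.argmin(axis=1)]
    return f, lam
```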


14.5 Principal Surfaces (PSurfaces)

Generalization to more than one parameter


(but rarely more than two)
f(λ_1, λ_2) = [ f_1(λ_1, λ_2), f_2(λ_1, λ_2), …, f_p(λ_1, λ_2) ]

Algorithm for PCurves can be applied


for PSurfaces

Links with the SOMs


PSurfaces can separate sphere data
14.6 Factor Analysis (FA)

Goal : find the latent variables


within the observed ones

Latent representation of data (using the SVD)


X = U D V^T = S A^T
with S = √N U and A^T = D V^T / √N
PCA estimates a latent variable model
X_1 = a_11 S_1 + a_12 S_2 + ⋯ + a_1p S_p
X_2 = a_21 S_1 + a_22 S_2 + ⋯ + a_2p S_p
⋮
X_p = a_p1 S_1 + a_p2 S_2 + ⋯ + a_pp S_p
or X = A S

Due to the hypotheses on X there are many solutions


X = A S = A R^T R S = A* S*
as
Cov(S*) = R Cov(S) R^T = I
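A quick NumPy check of this non-identifiability (purely illustrative sizes): rotating unit-variance factors S by any orthogonal R leaves both Cov(S) and the product A S unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10_000, 3
S = rng.standard_normal((N, p))                   # factors with Cov(S) ~ I (rows = samples)
A = rng.standard_normal((p, p))                   # arbitrary loadings

R, _ = np.linalg.qr(rng.standard_normal((p, p)))  # a random orthogonal matrix
S_star = S @ R.T                                  # S* = R S
A_star = A @ R.T                                  # A* = A R^T

print(np.allclose(np.cov(S_star.T), np.eye(p), atol=0.05))  # Cov(S*) ~ I
print(np.allclose(S @ A.T, S_star @ A_star.T))               # same X = A S
```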
Factor Analysis avoids
the multi-solution problem
Idea : use q < p
X_1 = a_11 S_1 + a_12 S_2 + ⋯ + a_1q S_q + ε_1
X_2 = a_21 S_1 + a_22 S_2 + ⋯ + a_2q S_q + ε_2
⋮
X_p = a_p1 S_1 + a_p2 S_2 + ⋯ + a_pq S_q + ε_p
or X = A S + ε, with ε_j the information unique to X_j

Parameters are given by the covariance matrix


Σ = A A^T + D_ε
with
D_ε = diag[ Var(ε_1), …, Var(ε_p) ]
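A small NumPy illustration of this covariance structure (dimensions and distributions are made up for the example): data simulated from X = A S + ε have sample covariance close to A A^T + D_ε.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, q = 50_000, 5, 2
A = rng.standard_normal((p, q))                  # loadings, q < p
d = rng.uniform(0.5, 1.5, size=p)                # Var(eps_j), the unique variances

S = rng.standard_normal((N, q))                  # independent common factors
eps = rng.standard_normal((N, p)) * np.sqrt(d)   # unique parts
X = S @ A.T + eps                                # X = A S + eps (rows = samples)

Sigma_model = A @ A.T + np.diag(d)               # Sigma = A A^T + D_eps
print(np.abs(np.cov(X.T) - Sigma_model).max())   # small for large N
```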
Factor Analysis only partially avoids
the multi-solution problem
Still many possible solutions
since A and A R^T are equivalent in
Σ = A A^T + D_ε
Presented as an ‘advantage’ of the method :
allows rotation of the selected factors
(for interpretation purposes)
PCA and FA are related
FA can be generalized
X_1 = a_11 F_1 + a_12 F_2 + ⋯ + a_1q F_q + ε_1
⋮
X_p = a_p1 F_1 + a_p2 F_2 + ⋯ + a_pq F_q + ε_p

PCA : all variability in an item should be used


FA : only the variability in common is used

In most cases, the methods yield similar results

PCA is preferred for data reduction


FA is preferred for structure detection
14.6 Independent Component Analysis (ICA)
Goal : find source signals S from mixed ones X
Blind Source Separation

Assumption: Mutual Independency of Sources


ICA Applications
• Adaptive Filtering
• Speech Signal Processing
• Wireless Communication
• Biomedical Signal Processing
• Computational Neuroscience
• Image Coding
• Text data modeling
• Financial data analysis
ICA for MEG/EEG Analysis
(Figure : MEG or EEG signals, event-related brain sources, dipolar potentials)

• EEG systems record the electrical potentials on the scalp, which


are the mixture of a huge number of the action potentials of
neurons inside the brain.

Our objective:
- to extract the evoked potential from the measurements
- to find the location of the action potentials
- to find how the action potentials travel (their dynamics)
Speech Separation
Mathematical Formulation

x(k) = A s(k) + v(k)

• s(k) = (s_1(k), …, s_n(k))^T : the vector of n source signals
• x(k) = (x_1(k), …, x_m(k))^T : the vector of m sensor signals
• v(k) : the vector of sensor noise
• A is the mixing matrix
Demixing Model

Mixing model : x(k) = A s(k) + v(k)

Problem: to estimate the source signals (or event-related potentials)


by using the sensor signals

y(k)=W x(k)
• y(k)= (y1(k),…,ym(k))T: the vector of recovered signals
• W is the demixing matrix.
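A minimal blind-source-separation sketch with scikit-learn's FastICA (assuming scikit-learn is available; the signals and mixing matrix are made up): mix two non-Gaussian sources, then estimate the demixing matrix W and the recovered signals y(k).

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]     # two non-Gaussian sources s(k)

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                           # mixing matrix
X = S @ A.T + 0.05 * rng.standard_normal((2000, 2))  # x(k) = A s(k) + v(k)

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                             # recovered signals y(k)
W = ica.components_                                  # estimated demixing matrix W
```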
Basic Models for BSS
• Assumption :
– The source signals are mutually independent
• Model :
– Linear instantaneous mixture
– Linear convolutive mixture
• Principles :
– Maximum Entropy
– Minimum Mutual Information
– Joint Diagonalization of Cross-correlations
– Linear Predictability
– Sparseness Maximization
The key to ICA model :
maximize nongaussianity
Let X be a mixture of independent components

A single component can be found as


Y = B^T X = ∑_i B_i x_i

Let z = A^T B
Y = B^T X = B^T A s = z^T s

Then, by a broad interpretation of the CLT, z^T s (a sum of independent components) is more Gaussian than any single source, so maximizing the nongaussianity of Y isolates one component.


Kurtosis is a measure of nongaussianity

Simple but not robust to outliers
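For reference, (excess) kurtosis is kurt(Y) = E[Y⁴] − 3(E[Y²])², which is zero for Gaussian Y; a small NumPy check (illustrative sample sizes):

```python
import numpy as np

def kurtosis(y):
    """Excess kurtosis kurt(y) = E[y^4] - 3 (E[y^2])^2 of a centered signal."""
    y = np.asarray(y) - np.mean(y)
    return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

rng = np.random.default_rng(0)
print(kurtosis(rng.standard_normal(100_000)))   # ~0 for a Gaussian signal
print(kurtosis(rng.laplace(size=100_000)))      # ~3 for a Laplace (super-Gaussian) source
```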


Negentropy is a measure of nongaussianity
The differential entropy is
H(Y) = − ∫ g(y) log g(y) dy

From Information Theory :


entropy is maximal for Gaussian signals (at fixed variance)

Use negentropy instead


J(Y) = H(Y_gauss) − H(Y)
but it needs to be estimated …
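One common estimate (Hyvärinen's approximation, used by FastICA) replaces the entropies with a smooth contrast function such as G(u) = log cosh(u): J(Y) ∝ (E[G(Y)] − E[G(ν)])² with ν standard Gaussian. A sketch under that assumption (sample sizes are arbitrary):

```python
import numpy as np

def negentropy_approx(y, n_mc=100_000, seed=0):
    """Approximate J(y) via (E[G(y)] - E[G(nu)])^2 with G(u) = log cosh(u)."""
    y = (y - np.mean(y)) / np.std(y)                    # unit variance, like nu
    nu = np.random.default_rng(seed).standard_normal(n_mc)
    G = lambda u: np.log(np.cosh(u))
    return (np.mean(G(y)) - np.mean(G(nu))) ** 2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.standard_normal(50_000)))   # ~0 for a Gaussian signal
print(negentropy_approx(rng.laplace(size=50_000)))      # > 0 for a non-Gaussian signal
```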
Mutual Information
can also measure nongaussianity
Mutual Information

I(Y) = ∑_{j=1}^{p} H(Y_j) − H(Y)

a.k.a. the Kullback-Leibler distance


between g(y) and the product of its marginal densities

Finding B that minimizes mutual information


is equivalent to finding directions
in which negentropy is maximized
Preprocessing improves ICA

Centering : X has zero mean

Whitening : X′ s.t. E[X′ X′^T] = I

Others : application dependent
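A possible whitening step (a sketch via the eigendecomposition of the covariance; other choices such as PCA or ZCA scaling are equally valid): transform the centered data so that E[X′ X′^T] = I.

```python
import numpy as np

def whiten(X):
    """Center X and return X' whose sample covariance is (numerically) the identity."""
    Xc = X - X.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(Xc.T))         # Cov = E diag(d) E^T
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T     # whitening matrix (assumes d > 0)
    return Xc @ W.T

X = np.random.default_rng(0).standard_normal((1000, 3)) @ np.diag([1.0, 2.0, 0.5])
Xw = whiten(X)
print(np.allclose(np.cov(Xw.T), np.eye(3)))     # True
```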


ICA works better than PCA
ICA can be used for handwritten digits
14.6 Exploratory Projection Pursuit

Chapter 11
Find ‘interesting’ directions onto which
multidimensional data can be projected

‘Interesting’ means displays some structure


of the data distribution

Gaussian distributions are the least interesting


(from a structure point of view)
thus look for nongaussian projections (as in ICA)
14.6 A different approach for ICA

The data are assigned to a class G = 1 and


compared to a ‘reference signal’ of class G = 0

Kind of generalized logistic regression


log [ Pr(G = 1) / (1 − Pr(G = 1)) ] = f_1(a_1^T X) + f_2(a_2^T X)
Projection Pursuit Regression techniques
can therefore be applied
14.7 Multidimensional Scaling (MDS)
Aims to project data, as SOMs, PCurves, …
but tries to preserve pairwise distances

Based on distance or dissimilarity (only)


d_ij = ‖ x_i − x_j ‖

Goal : find values minimizing the stress function

S_D(z_1, z_2, …, z_n) = ∑_{i≠j} ( d_ij − ‖z_i − z_j‖ )²

Least squares or Kruskal-Shephard scaling
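A minimal sketch of Kruskal-Shephard scaling with a generic optimizer (SciPy assumed available; the optimizer choice and starting point are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def kruskal_shephard_mds(D, k=2, seed=0):
    """Minimize the stress sum over pairs of (d_ij - ||z_i - z_j||)^2 with z_i in R^k.

    (Each pair is counted once; this differs from the i != j sum only by a factor of 2.)
    """
    n = D.shape[0]
    d = squareform(D, checks=False)             # condensed vector of the d_ij
    def stress(z_flat):
        return np.sum((d - pdist(z_flat.reshape(n, k))) ** 2)
    z0 = np.random.default_rng(seed).standard_normal(n * k)
    return minimize(stress, z0, method="L-BFGS-B").x.reshape(n, k)

# Usage: for data X, take D = squareform(pdist(X)), then Z = kruskal_shephard_mds(D, k=2).
```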


MDS can be used with other scalings

Classical scaling minimizes


∑_{i≠j} ( s_ij − ⟨ z_i − z̄ , z_j − z̄ ⟩ )²

Sammon’s mapping minimizes


∑_{i≠j} ( d_ij − ‖z_i − z_j‖ )² / d_ij

Shephard-Kruskal nonmetric scaling minimizes


∑_{i≠j} ( θ(‖z_i − z_j‖) − d_ij )² / ∑_{i≠j} d_ij²
MDS can separate sphere data
Unsupervised Learning (§14.4 to §14.7)
SOMs : a constrained VQ algorithm

PCA : finds (linearly) orthogonal components


PCurves : a smooth generalization of PCA
PSurfaces : a 2(+)-D generalization of PCurves

FA : finds common (correlated) component parts


ICA : finds independent components
EPP : can be related to ICA

MDS : tries to preserve pairwise distances
