
Principal Component Analysis

1st TUTORIAL
There are two commands for performing principal component analysis (PCA) in the R
environment: princomp and prcomp. We will focus on using the first one, but the two are very
similar (princomp uses eigen to compute the PCA, while prcomp uses svd, which can be
more stable in some cases).
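
As a quick sketch on simulated data (the objects X, p1 and p2 are illustrative, not part of the tutorial), the two functions give the same components up to sign, differing only in the divisor used for the variances:

> set.seed(1)
> X<-matrix(rnorm(100),20,5)
> p1<-princomp(X)        # eigendecomposition of the divisor-n covariance matrix
> p2<-prcomp(X)          # SVD of the centred data matrix
> p1$sdev^2/p2$sdev^2    # constant ratio (n-1)/n = 19/20 for every component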

The main goal of PCA in the tasks today will be dimension reduction. We wish to produce a smaller
set of uncorrelated variables from the larger set of correlated variables.

Job performance
Let's consider a data set job_perf that you can find in the Multivariate course folder. The data contain
observations on fifty police officers who were rated by their supervisors in 6 categories as part of a
standard police departmental administrative procedure:
Column 1 (commun): Communication Skills
Column 2 (probl_solv): Problem Solving
Column 3 (logical): Logical Ability
Column 4 (learn): Learning Ability
Column 5 (physical): Physical Ability
Column 6 (appearance): Appearance

1. Read the data into R and print out the variable names.

> job<-read.table("c:\\temp\\job_perf.txt",header=TRUE,sep="")
> job
commun probl_solv logical learn physical appearance
1 12 52 20 44 48 16
2 12 57 25 45 50 16
3 12 54 21 45 50 16
4 13 52 21 46 51 17
5 14 54 24 46 51 17
... ... ... ... ... ... ...
47 22 52 25 54 58 26
48 22 56 26 55 58 27
49 22 57 25 55 59 28
50 24 55 24 56 59 28

> names(job)
[1] "commun" "probl_solv" "logical" "learn" "physical" "appearance"

2. Look at the summary statistics and compute the covariance and correlation matrices.

> summary(job)
commun probl_solv logical learn physical appearance
Min. :12.00 Min. :48.00 Min. :20.00 Min. :44.00 Min. :48.00 Min. :16.00
1st Qu.:16.00 1st Qu.:52.25 1st Qu.:22.00 1st Qu.:48.00 1st Qu.:52.25 1st Qu.:19.00
Median :18.00 Median :54.00 Median :24.00 Median :50.00 Median :54.00 Median :21.00
Mean :17.68 Mean :54.16 Mean :24.02 Mean :50.28 Mean :54.16 Mean :21.06
3rd Qu.:19.75 3rd Qu.:56.00 3rd Qu.:26.00 3rd Qu.:52.00 3rd Qu.:56.00 3rd Qu.:23.00
Max. :24.00 Max. :59.00 Max. :31.00 Max. :56.00 Max. :59.00 Max. :28.00

> S<-cov(job)
> S
commun probl_solv logical learn physical appearance
commun 7.5281633 0.8685714 1.537143 7.6424490 6.4604082 7.9583673
probl_solv 0.8685714 5.8106122 2.180408 0.9746939 0.9534694 0.9289796
logical 1.5371429 2.1804082 6.183265 1.6269388 1.5477551 1.6110204
learn 7.6424490 0.9746939 1.626939 8.0424490 6.7093878 8.2481633
physical 6.4604082 0.9534694 1.547755 6.7093878 5.8106122 6.9902041
appearance 7.9583673 0.9289796 1.611020 8.2481633 6.9902041 8.9555102
>
> sd.job<-apply(job,2,sd)
> sd.job
commun probl_solv logical learn physical appearance
2.743750 2.410521 2.486617 2.835921 2.410521 2.992576
>

> R<-cor(job)
> R
commun probl_solv logical learn physical appearance
commun 1.0000000 0.1313258 0.2252998 0.9821863 0.9767974 0.9692466
probl_solv 0.1313258 1.0000000 0.3637625 0.1425815 0.1640910 0.1287805
logical 0.2252998 0.3637625 1.0000000 0.2307109 0.2582155 0.2164945
learn 0.9821863 0.1425815 0.2307109 1.0000000 0.9814717 0.9718918
physical 0.9767974 0.1640910 0.2582155 0.9814717 1.0000000 0.9690222
appearance 0.9692466 0.1287805 0.2164945 0.9718918 0.9690222 1.0000000
>

It is possible to reduce dimensionality without losing large amounts of information when the
original variables are highly correlated. When the correlation between variables is weak, a larger
number of components is needed in order to capture enough variability. The correlation values
here look quite large, so in this context PCA seems useful for reducing dimensionality.

Should the PCA be carried out with the covariance or the correlation matrix?
If there are large differences between the variances of the elements of X, then those variables
whose variances are largest will tend to dominate the first few PCs. Since in this example the
standard deviations (and thus the variances) are quite similar, it is possible to use the covariance
matrix to carry out the PCA.

3. Perform a PCA of the data using the covariance matrix. View the components of your PCA
object and obtain the definitions of the principal components (PCs) and the corresponding
proportions of variation explained by each. Produce a scree plot.

The first argument of princomp is the n × p data matrix. The second argument is cor, which is set
to TRUE if the sample correlation matrix is to be used in the PCA and FALSE if the sample
covariance matrix is to be used. The outcome object is a list with various entries. The entry sdev
gives the standard deviations (the square roots of the eigenvalues) of the components, the
loadings entry is a matrix with the eigenvectors in the columns, and center gives the average
vector. If the argument scores=TRUE then the output will also include an n × p matrix of
principal component scores. Using plot on the fitted object will produce a scree plot.

> eigen(S)
$values
[1] 30.3655113 7.6195995 3.7833151 0.2926963 0.1497802 0.1197098
$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.49155807 0.08748571 0.01214618 -0.380639251 0.75573635 -0.185864772
[2,] -0.08675503 -0.69046582 0.71781729 0.008879732 0.01827673 0.007480667
[3,] -0.13662756 -0.70496706 -0.69539380 0.015845890 0.01597971 0.016594522
[4,] -0.50947274 0.08031174 0.02069595 -0.329958740 -0.33064205 0.717887386
[5,] -0.43261611 0.04023081 0.01036964 -0.217845082 -0.56076816 -0.670223722
[6,] -0.53428266 0.10274312 0.02196412 0.835735944 0.06699316 0.023681484

The first entry of the list contains the eigenvalues of the covariance matrix S, i.e. the variances of
the principal components. The second entry is a matrix whose columns are the eigenvectors
of the matrix S.
The same result can be obtained by using the function princomp:
> pca_job<-princomp(job,cor=FALSE)

> names(pca_job)
[1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"

> loadings(pca_job)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
commun 0.492 -0.381 0.756 -0.186
probl_solv -0.690 0.718
logical 0.137 -0.705 -0.695
learn 0.509 -0.330 -0.331 0.718
physical 0.433 -0.218 -0.561 -0.670
appearance 0.534 0.103 0.836

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.167  0.167  0.167  0.167  0.167  0.167
Cumulative Var  0.167  0.333  0.500  0.667  0.833  1.000

> summary(pca_job)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Standard deviation 5.4551078 2.7326192 1.92552560 0.535576640 0.383124752 0.342513705
Proportion of Variance 0.7173417 0.1800021 0.08937539 0.006914529 0.003538342 0.002827973
Cumulative Proportion 0.7173417 0.8973438 0.98671916 0.993633685 0.997172027 1.000000000

This yields the standard deviation of each component and the proportion of variance explained by
each component. The total variance explained by the components is the sum of the variances of
the components:
> sum((pca_job$sdev)^2)
[1] 41.484
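
Note that princomp computes the component variances with the divisor n rather than n-1, so this total is (n-1)/n times the trace of S; a small check:

> n<-nrow(job)
> sum(diag(S))*(n-1)/n    # trace of S rescaled to the divisor-n convention
[1] 41.484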

The total variance is equal to the sum of the variances of the individual variables (up to the
divisor n used by princomp).
We find that the first PC accounts for 71.7% of the total variation, whereas the first two
components cover almost 90% of the total variability.
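
These proportions can also be recovered directly from the sdev entry, for instance:

> cumsum(pca_job$sdev^2)/sum(pca_job$sdev^2)
[1] 0.7173417 0.8973438 0.9867192 0.9936337 0.9971720 1.0000000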

How many principal components should be retained?

According to the cumulative proportion of explained variance criterion, which suggests retaining as
many PCs as are needed in order to explain approximately 80-90% of the total variance, the
number of PCs needed is 2.
Kaiser's rule suggests retaining the PCs whose variance is larger than the average variance:
> mean(eigen(S)$values)
[1] 7.055102

According to this criterion, the number of PCs that have a variance larger than 7.06 is again 2.
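
In R the rule can be checked compactly, e.g.:

> sum(eigen(S)$values>mean(eigen(S)$values))
[1] 2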
The"screeplot"isaplotofthelkagainstk(k=1,...,p).Ingeneral,afterasharpdeclinethecurve
tends to become flat. The flat portion corresponds to noise components, unable to capture the
leadingvariability;therulethereforesuggeststhatmshouldcorrespondtothevalueofkatwhich
theelbowofthescreeplotoccurs:

> plot(pca_job,type="lines")
>

[Scree plot of pca_job]
According to this last criterion, the number of PCs to be selected would be 3.

4. What is the interpretation of the selected PCs, i.e. the interpretation of the loadings from the
first and the second eigenvector?

> eigen(S)$vectors[,1:2]
[,1] [,2]
[1,] -0.49155807 0.08748571
[2,] -0.08675503 -0.69046582
[3,] -0.13662756 -0.70496706
[4,] -0.50947274 0.08031174
[5,] -0.43261611 0.04023081
[6,] -0.53428266 0.10274312
>
In the first PC (which accounts for about 72% of the total variability) all the loadings except the
probl_solv and logical ones (which are close to zero) have the same sign and almost the
same magnitude. Hence, this component appears to be a sort of average of the measurements,
i.e.:
Comp.1 = -0.492 commun - 0.509 learn - 0.433 physical - 0.534 appearance
       ≈ -0.5 × (sum of test results)
(the overall sign of an eigenvector is arbitrary, so up to a sign change this is essentially an average
of the test results).

The second PC has two loadings that are not close to zero, those related to logical ability and
problem solving. It separates police officers with marked logical skills (who get a small score on
this component, as the corresponding loadings are negative) from the others.
We have interpreted the first PCs by examining the size and the sign of the different coefficients.
Some authors suggest using the correlations between each principal component and all the
observed variables instead:

> sqrt(eigen(S)$values[1])*t(eigen(S)$vectors[,1])%*%solve(sqrt(diag(diag(S))))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.9872352 -0.1983234 -0.3027748 -0.9899587 -0.9889676 -0.9838213

And for the second PC:
> sqrt(eigen(S)$values[2])*t(eigen(S)$vectors[,2])%*%solve(sqrt(diag(diag(S))))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.08801541 -0.7906737 -0.782575 0.07817195 0.04606954 0.09477061
>
The first principal component is highly negatively correlated with each of the observed variables
except probl_solv and logical. All variables contribute more or less in the same way to the
first principal component, confirming the interpretation given above. The second principal
component is highly negatively correlated with probl_solv and logical, separating police
officers with marked logical skills (who get a small score on this component, as the correlations are
negative) from the others.
Alternatively, it is possible to apply the cor function to the PC scores (obtained as Y = X A_m, where
A_m contains the first m eigenvectors and m = 2) and each of the observed variables Xj, j = 1, ..., p.

> job.m<-as.matrix(job)
> scores<-job.m%*%(eigen(S)$vector[,1:2])
> cor(scores[,1],job)
commun probl_solv logical learn physical appearance
[1,] -0.9872352 -0.1983234 -0.3027748 -0.9899587 -0.9889676 -0.9838213

> cor(scores[,2],job)
commun probl_solv logical learn physical appearance
[1,] 0.08801541 -0.7906737 -0.782575 0.07817195 0.04606954 0.09477061
>

5. Produce a plot of the first versus the second PC

The object scores is a 50 × 2 matrix containing the scores of the units with respect to the first two
principal components. It is obtained by multiplying the observed matrix X by the first two eigenvectors
of the S matrix:

> scores<-job.m%*%(eigen(S)$vector[,1:2])
> plot(scores[,1], scores[,2], main="PC score plot of the Job Performance data")

6. Are all the units well represented in a bidimensional plot?
As this bidimensional plot is an approximation of the structure in the original p-dimensional space,
it may happen that not all of the n statistical units are well represented in the bidimensional
projection. A way to check the adequacy of the approximation for each statistical unit is an
inspection of the scores on the last (discarded) PCs. If a unit has high scores on the last PCs, this
means that it is largely displaced when projected on the PC plane and is therefore
misrepresented. One could consider, for each statistical unit, the sum of the squared scores on the
last p - m PCs. The position on the PC plane of those units having a large value of such a sum should
be carefully considered.

> scores_last<-job.m%*%(eigen(S)$vector[,-(1:2)])
> apply(scores_last^2,1,sum)
[1] 1784.433 1893.018 1936.171 1877.315 1804.755 1746.998 1684.826 1743.272
[9] 1880.757 1915.520 1880.757 1681.001 1854.007 1668.917 1964.539 1859.880
[17] 1727.699 1865.102 1759.574 2104.255 1863.209 1873.058 2184.831 1703.492
[25] 1707.550 1831.796 1938.200 1732.942 1830.631 1912.833 1741.321 1780.767
[33] 1745.660 1921.742 1795.591 1862.057 1854.136 1722.972 1894.969 1891.101
[41] 1756.409 1970.561 1901.513 1819.324 1868.010 1812.201 1717.760 1811.706
[49] 1900.057 1828.348
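
To flag the units whose position on the PC plane deserves a closer look, one can, for instance, sort these sums (ssq is an illustrative name):

> ssq<-apply(scores_last^2,1,sum)
> order(ssq,decreasing=TRUE)[1:3]    # indices of the least well represented units
[1] 23 20 42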

Sparrows

In February 1898, there was a severe winter storm with rain, sleet, and snow near Providence, RI (Rhode
Island). One hundred thirty-six English sparrows were found freezing and brought to Dr. Bumpus' laboratory
at Brown University. Of those, 72 survived and 64 died. Bumpus took advantage of the opportunity to study
an episode of natural selection. He measured a number of characteristics of the birds and analyzed them to
find differences between the survivors and those that perished.

Here we consider a sample of 49 sparrows. Read the data file sparrows.dat, which contains the following
variables:
Column 1 (totL): Total Length
Column 2 (AlarE): Alar Extent
Column 3 (bhL): Length of Beak and Head
Column 4 (hL): Length of Humerus
Column 5 (kL): Length of Keel of Sternum

1. Read the data into R and print out the variable names.

> sparrows<-read.table("c:\\temp\\sparrows.dat",header=TRUE)
> sparrows
totL AlarE bhL hL kL
1 156 245 31.6 18.5 20.5
2 154 240 30.4 17.9 19.6
3 153 240 31.0 18.4 20.6
... ... ... ... ... ...
47 153 237 30.6 18.6 20.4
48 162 245 32.5 18.5 21.1
49 164 248 32.3 18.8 20.9
> names(sparrows)
[1] "totL" "AlarE" "bhL" "hL" "kL"
>

2. Look at the summary statistics and calculate the mean vector:

> sp.mean<-colMeans(sparrows)
> sp.mean
totL AlarE bhL hL kL
157.97959 241.32653 31.45918 18.46939 20.82653

Now calculate the centering matrix A:


> n<-nrow(sparrows)
> A<-diag(n)-(1/n)*matrix(1,n,1)%*%matrix(1,1,n)
> centred.data<-A%*%as.matrix(sparrows)

Alternatively:
> centred.data<-scale(sparrows,T,F)

Now compute the sample variance matrix:
> S<-t(centred.data)%*%centred.data/(n-1)

Or, alternatively:
> S<-var(sparrows)
> S
totL AlarE bhL hL kL
totL 13.353741 13.610969 1.9220663 1.3306122 2.1922194
AlarE 13.610969 25.682823 2.7136054 2.1977041 2.6578231
bhL 1.922066 2.713605 0.6316327 0.3422662 0.4146471
hL 1.330612 2.197704 0.3422662 0.3184184 0.3393707
kL 2.192219 2.657823 0.4146471 0.3393707 0.9828231
>
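
As a sanity check (using centred.data from above), the two computations coincide:

> all.equal(t(centred.data)%*%centred.data/(n-1),var(sparrows),check.attributes=FALSE)
[1] TRUE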

Now determine the standardized data:


> D<-diag(diag(S))
> Z<-centred.data%*%solve(D^0.5)

Or, alternatively:
> Z<-scale(sparrows,T,T)

Calculate now the correlation matrix R:


> R<-(solve(D^0.5))%*%S%*%(solve(D^0.5))

Or:
> R<-t(Z)%*%Z/(n-1)

Or:
> R<-cor(sparrows)
> R
totL AlarE bhL hL kL
totL 1.0000000 0.7349642 0.6618119 0.6452841 0.6051247
AlarE 0.7349642 1.0000000 0.6737411 0.7685087 0.5290138
bhL 0.6618119 0.6737411 1.0000000 0.7631899 0.5262701
hL 0.6452841 0.7685087 0.7631899 1.0000000 0.6066493
kL 0.6051247 0.5290138 0.5262701 0.6066493 1.0000000
>
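
Again as a check (with Z as defined above), the computations agree:

> all.equal(t(Z)%*%Z/(n-1),cor(sparrows),check.attributes=FALSE)
[1] TRUE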

Should the PCA be carried out with the covariance or the correlation matrix?
If there are large differences between the variances of the elements of X, then those variables
whose variances are largest will tend to dominate the first few PCs. Since in this example the
variances are very different, it is advisable to use the correlation matrix to carry out the PCA. This is
equivalent to performing the PCA on the covariance matrix of the standardized data.
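
This equivalence can be verified directly, e.g.:

> all.equal(eigen(var(scale(sparrows,T,T)))$values,eigen(R)$values)
[1] TRUE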

3. Perform a PCA of the data using the correlation matrix. View the components of your PCA
object and obtain the definitions of the principal components (PCs) and the corresponding
proportions of variation explained by each. Produce a scree plot.

> eigen(R)
$values
[1] 3.6159783 0.5315041 0.3864245 0.3015655 0.1645275

$vectors
[,1] [,2] [,3] [,4] [,5]
[1,] -0.4517989 0.05072137 0.6904702 0.42041399 -0.3739091
[2,] -0.4616809 -0.29956355 0.3405484 -0.54786307 0.5300805
[3,] -0.4505416 -0.32457242 -0.4544927 0.60629605 0.3427923
[4,] -0.4707389 -0.18468403 -0.4109350 -0.38827811 -0.6516665
[5,] -0.3976754 0.87648935 -0.1784558 -0.06887199 0.1924341

The first entry of the list contains the eigenvalues of the correlation matrix R, i.e. the variances of
the principal components. The second entry is a matrix whose columns are the eigenvectors
of the matrix R.
The same result can be obtained by using the function princomp:

> pca_sparrow<-princomp(sparrows,cor=TRUE)

> names(pca_sparrow)
[1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"

> loadings(pca_sparrow)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
totL -0.452 0.690 0.420 -0.374
AlarE -0.462 -0.300 0.341 -0.548 0.530
bhL -0.451 -0.325 -0.454 0.606 0.343
hL -0.471 -0.185 -0.411 -0.388 -0.652
kL -0.398 0.876 -0.178 0.192

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0
>
> summary(pca_sparrow)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 1.9015726 0.7290433 0.62163056 0.5491498 0.4056199
Proportion of Variance 0.7231957 0.1063008 0.07728491 0.0603131 0.0329055
Cumulative Proportion 0.7231957 0.8294965 0.90678139 0.9670945 1.0000000

This yields the standard deviation of each component and the proportion of variance explained by
each component. The total variance explained by the components is the sum of the variances of
the components:
> sum((pca_sparrow$sdev)^2)
[1] 5

The total variance is equal to the sum of the variances of the individual variables. In this case we
see that the total variance is 5, which is equal to the number of standardized variables. This is
because for standardized data the variance of each variable is 1.

We find that the first PC accounts for 72.3% of the total variation, whereas the first two
components cover about 83% of the total variability.

How many principal components should be retained?
According to the cumulative proportion of explained variance criterion, which suggests retaining as
many PCs as are needed in order to explain approximately 80-90% of the total variance, the
number of PCs needed is 2.
Kaiser's rule for PCs derived from a correlation matrix suggests retaining as many principal
components as there are eigenvalues of R larger than 1. The motivation underlying this rule is that
we want to retain all the principal components that have a variance larger than that of the
original variables (equal to 1 for standardized data). In this case one would choose only
the first principal component.
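
For instance:

> sum(eigen(R)$values>1)
[1] 1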
The"screeplot"isaplotofthelkagainstk(k=1,...,p).Ingeneral,afterasharpdeclinethecurve
tends to become flat. The flat portion corresponds to noise components, unable to capture the
leadingvariability;therulethereforesuggeststhatmshouldcorrespondtothevalueofkatwhich
theelbowofthescreeplotoccurs:

> plot(pca_sparrow,type="lines")
>

[Scree plot of pca_sparrow]
According to this last criterion, the number of PCs to be selected is again 2.

4. What is the interpretation of the selected PCs, i.e. the interpretation of the loadings from the
first and the second eigenvector?

> eigen(R)$vectors[,1:2]
[,1] [,2]
[1,] -0.4517989 0.05072137
[2,] -0.4616809 -0.29956355
[3,] -0.4505416 -0.32457242
[4,] -0.4707389 -0.18468403
[5,] -0.3976754 0.87648935

In the first PC (which accounts for about 72% of the total variability) all the loadings have the
same sign and almost the same magnitude. Hence, this component appears to be a sort of average
size of the sparrows, i.e.:
Comp.1 = -0.452 totL - 0.461 AlarE - 0.451 bhL - 0.471 hL - 0.398 kL
       ≈ -0.446 × (size)
(again, the overall sign of an eigenvector is arbitrary).

Even if the second PC is not important in recovering a reduced-dimension representation of our
data, we can try to give it an empirical meaning. It is a linear contrast, i.e. some variables have a
positive coefficient and some have a negative one. It separates sparrows having a small alar extent
and short humerus, with a small beak and head but a large keel of sternum, from the others. It can be
thought of as a measure of shape. The first loading (totL) is close to zero, so it does not contribute
to the interpretation.

We have interpreted the first PCs by examining the size and the sign of the different coefficients.
Some authors suggest using the correlations between each principal component and all the
observed variables instead:

> sqrt(eigen(R)$values[1])*eigen(R)$vectors[,1]
[1] -0.8591285 -0.8779197 -0.8567376 -0.8951441 -0.7562086

And for the second PC:
> sqrt(eigen(R)$values[2])*eigen(R)$vectors[,2]
[1] 0.03697807 -0.21839478 -0.23662734 -0.13464265 0.63899865

The first principal component is highly negatively correlated with each of the observed variables.
All variables contribute more or less in the same way to the first principal component, confirming
the interpretation given above. The second principal component is negatively correlated with
AlarE, bhL, hL and positively correlated with kL, separating sparrows with a small alar extent
and short humerus, with a small beak and head but a large keel of sternum, from the others.
Alternatively, it is possible to apply the cor function to the PC scores (obtained as Y = X A_m, where
A_m contains the first m eigenvectors and m = 2) and each of the observed variables Xj, j = 1, ..., p.

> sparrows.m<-as.matrix(scale(sparrows,T,T))
> scores<- sparrows.m %*%(eigen(R)$vector[,1:2])

> cor(scores[,1],sparrows.m[,1:5])


totL AlarE bhL hL kL
[1,] -0.8591285 -0.8779197 -0.8567376 -0.8951441 -0.7562086

> cor(scores[,2],sparrows.m[,1:5])


totL AlarE bhL hL kL
[1,] 0.03697807 -0.2183948 -0.2366273 -0.1346426 0.6389987

5. Produce a plot of the first versus the second PC
The object scores is a 49 × 2 matrix containing the scores of the units with respect to the first two
principal components. It is obtained by multiplying the standardized data matrix by the first
two eigenvectors of the R matrix:

> scores<-sparrows.m%*%(eigen(R)$vector[,1:2])
> plot(scores[,1], scores[,2], main="PC score plot of the Sparrow data")


6. Are all the units well represented in a bidimensional plot?
As this bidimensional plot is an approximation of the structure in the original p-dimensional space,
it may happen that not all of the n statistical units are well represented in the bidimensional
projection. A way to check the adequacy of the approximation for each statistical unit is an
inspection of the scores on the last (discarded) PCs. If a unit has high scores on the last PCs, this
means that it is largely displaced when projected on the PC plane and is therefore
misrepresented. One could consider, for each statistical unit, the sum of the squared scores on the
last p - m PCs. The position on the PC plane of those units having a large value of such a sum should
be carefully considered.

> scores_last<-sparrows.m%*%(eigen(R)$vector[,-(1:2)])
> apply(scores_last^2,1,sum)
[1] 0.597 0.630 1.014 0.344 0.583 0.467 0.295 3.698 0.355 0.749 0.130 0.780
[13] 0.218 0.565 0.993 0.041 0.303 0.778 1.271 0.868 1.003 0.505 0.206 0.959
[25] 0.308 1.388 0.545 2.035 1.470 0.178 1.466 0.450 0.329 2.620 1.152 1.239
[37] 0.487 0.824 0.146 0.813 2.629 0.682 0.568 0.146 0.195 0.497 1.549 0.980
[49] 0.875

White leghorn fowls
The file cor.matr.txt contains the correlation matrix of the variables:

X1: length of the cranium
X2: width of the cranium
X3: length of the humerus
X4: length of the ulna
X5: length of the femur
X6: length of the tibia

measured on 275 white leghorn fowls. The six original measures can be naturally grouped into three
sets of two measures each: cranium measures (length, width), wing measures (humerus, ulna) and leg
measures (femur, tibia).

Since the variables are measured on very different scales, it is better to work with standardized data,
hence using the correlation matrix as the basis for the analysis.

> R<-read.table("c:\\temp\\cor.matr.txt")
> eigen(R)
$values
[1] 4.56757080 0.71412326 0.41212898 0.17318890 0.07585872 0.05712934

$vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.3474394 0.5369741 0.76667320 0.049098693 0.027212370 0.002372446
[2,] -0.3263730 0.6964675 -0.63630517 0.002032525 0.008043773 0.058826974
[3,] -0.4434189 -0.1873007 -0.04007060 -0.524077391 0.168396957 -0.680938905
[4,] -0.4399832 -0.2513820 0.01119563 -0.488770969 -0.151152669 0.693795633
[5,] -0.4345445 -0.2781684 -0.05920540 0.514258649 0.669482574 0.132737714
[6,] -0.4401501 -0.2256982 -0.04573469 0.468581964 -0.706953466 -0.184076846

> tot.var<-sum(eigen(R)$values)
> eigen.values<-eigen(R)$values
> eigen.values/tot.var
[1] 0.761261799 0.119020544 0.068688163 0.028864817 0.012643121 0.009521557
> prop.var<-round(eigen.values/tot.var,2)*100
> prop.var
[1] 76 12 7 3 1 1
> cum.var<-c(prop.var[1],sum(prop.var[1:2]),sum(prop.var[1:3]),sum(prop.var[1:4]),sum(prop.var[1:5]),sum(prop.var[1:6]))
>
> cum.var
[1] 76 88 95 98 99 100
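
The same cumulative percentages can be obtained more compactly with cumsum:

> cumsum(prop.var)
[1]  76  88  95  98  99 100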

The first component alone explains about three quarters (76%) of the total variability and so is the principal
source of variation between chickens. The coefficients of each Xi in this component are approximately equal
in magnitude (about 0.4; the sign of an eigenvector is arbitrary, so we may take them all positive). It
characterizes chickens with respect to their dimension: big chickens tend to have large values of Xi
(i = 1, ..., 6) and so of Y1 too, whereas small chickens will have low values of Xi and so a small Y1 too.

The second component, even if it is not so relevant since its variance is smaller than one, can have a
specific meaning. Positive and similar coefficients are those associated with X1 and X2, whereas the
coefficients associated with the other four variables are negative and similar. Large positive values of this
component correspond to units having large values of X1 and X2 (and so a big head) and small values
on the other variables (small wings and short legs); on the other hand, large negative values of Y2
correspond to individuals with a small head, big wings and long legs. Hence Y2 is an indicator of the shape of
the chicken, contrasting the size of the head with the size of the body.

