Practical Guide to Principal Component Analysis (PCA) in R & Python

MANISH SARASWAT, MARCH 21, 2016
Introduction

Too much of anything is good for nothing!

What happens when a data set has too many variables? Here are a few possible situations which you might come across:

1. You find that most of the variables are correlated.
2. You lose patience and decide to run a model on the whole data. This returns poor accuracy and you feel terrible.
3. You become indecisive about what to do.
4. You start thinking of some strategic method to find the few important variables.
Trust me, dealing with such situations isn't as difficult as it sounds. Statistical techniques such as factor analysis and principal component analysis help to overcome such difficulties.

In this post, I've explained the concept of principal component analysis in detail. I've kept the explanation simple and informative. For practical understanding, I've also demonstrated this technique in R with interpretations. Note: Understanding this concept requires prior knowledge of statistics.
What is Principal Component Analysis?

In simple words, principal component analysis is a method of extracting important variables from a large set of variables available in a data set. It extracts a low dimensional set of features from a high dimensional data set with a motive to capture as much information as possible. With fewer variables, visualization also becomes much more meaningful. PCA is more useful when dealing with 3 or higher dimensional data. It is always performed on a symmetric correlation or covariance matrix. This means the matrix should be numeric and have standardized data.

Let's understand it using an example: say we have a data set of dimension 300 (n) × 50 (p). n represents the number of observations and p represents the number of predictors. Since we have a large p = 50, there can be p(p-1)/2 scatter plots, i.e. 50 × 49 / 2 = 1225 plots (more than 1000) to analyze the variable relationships. Wouldn't it be a tedious job to perform exploratory analysis on this data?

In this case, it would be a lucid approach to select a subset of p (p << 50) predictors which captures as much information as possible, followed by plotting the observations in the resultant low dimensional space. The image below shows the transformation of high dimensional data (3 dimensions) to low dimensional data (2 dimensions) using PCA. Not to forget, each resultant dimension is a linear combination of the p features.
Source: nlpca
What are principal components?

A principal component is a normalized linear combination of the original predictors in a data set. In the image above, PC1 and PC2 are the principal components. Let's say we have a set of predictors $X_1, X_2, \ldots, X_p$.

The first principal component can be written as:

$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \phi_{31}X_3 + \cdots + \phi_{p1}X_p$
here" •
?@ is -irst principal component
•
p) is the loadin. *ector comprisin. o- loadin.s ( )* +..) o- -irst principal
component0 !he loadin.s are constrained to a sum o- s5uare e5uals to $0 !his is +ecause lar.e ma.nitude o- loadin.s may lead to lar.e *ariance0 It also de-ines the direction o- the principal component (?@) alon. hich data *aries the most0 It results in a line in p dimensional space hich is closest to the n o+ser*ations0 Closeness is measured usin. a*era.e s5uared euclidean distance0 •
()..(p are normali8ed predictors0 Normali8ed predictors ha*e mean e5uals to 8ero
and standard de*iation e5uals to one0
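As a quick illustration of these two properties, here is a minimal R sketch (my own, not from the article) using the built-in USArrests data as a stand-in: the loadings returned by prcomp() have unit sum of squares, and the first score vector is exactly the stated linear combination of the normalized predictors.

#minimal sketch; USArrests is an illustrative stand-in data set
pc <- prcomp(USArrests, scale. = TRUE)

#the loading vector of PC1 has sum of squared loadings equal to 1
sum(pc$rotation[, 1]^2)
# [1] 1

#Z1 is the linear combination of the normalized predictors with those loadings
Z1 <- scale(USArrests) %*% pc$rotation[, 1]
all.equal(as.numeric(Z1), as.numeric(pc$x[, 1]))
# [1] TRUE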
!here-ore" First principal component is a linear com+ination o- ori.inal predictor *aria+les
hich captures the ma6imum *ariance in the data set0 It determines the direction o- hi.hest *aria+ility in the data0 9ar.er the *aria+ility captured in -irst component" lar.er the in-ormation captured +y component0 No other component can ha*e *aria+ility hi.her than -irst principal component0 !he -irst principal component results in a line hich is closest to the data i0e0 it minimi8es the sum o- s5uared distance +eteen a data point and the line0 Similarly" e can compute the second principal component also0
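To see the maximal variance claim concretely, here is a short sketch (again on the stand-in USArrests data, not the article's data): the component variances come out in decreasing order, and a projection of the scaled data onto any other unit-norm direction never has higher variance than PC1.

pc <- prcomp(USArrests, scale. = TRUE)

#component variances are in decreasing order; PC1's is the largest
apply(pc$x, 2, var)

#projecting the scaled data onto a random unit-norm direction
#always gives variance <= variance of PC1
set.seed(42)
w <- rnorm(ncol(USArrests))
w <- w / sqrt(sum(w^2))
var(scale(USArrests) %*% w)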
The second principal component ($Z_2$) is also a linear combination of the original predictors; it captures the remaining variance in the data set and is uncorrelated with $Z_1$. In other words, the correlation between the first and second components should be zero. It can be represented as:

$Z_2 = \phi_{12}X_1 + \phi_{22}X_2 + \phi_{32}X_3 + \cdots + \phi_{p2}X_p$
If the two components are uncorrelated, their directions should be orthogonal (image below). This image is based on simulated data with 2 predictors. Notice the direction of the components; as expected, they are orthogonal. This suggests the correlation between these components is zero.
All succeeding principal components follow a similar concept, i.e. they capture the remaining variation without being correlated with the previous components. In general, for n × p dimensional data, min(n-1, p) principal components can be constructed. The directions of these components are identified in an unsupervised way, i.e. the response variable (Y) is not used to determine the component directions. Therefore, it is an unsupervised approach. (A numeric check of both facts follows below.) Note: Partial least squares (PLS) is a supervised alternative to PCA. PLS assigns higher weight to variables which are strongly related to the response variable when determining principal components.
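Both facts are easy to check in R; a minimal sketch (on the stand-in USArrests data, where n = 50 and p = 4, so min(n-1, p) = 4):

pc <- prcomp(USArrests, scale. = TRUE)

#number of components equals min(n-1, p) = min(49, 4) = 4
ncol(pc$x)

#score vectors are pairwise uncorrelated: off-diagonals are (numerically) zero
round(cor(pc$x), 10)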
Why is normalization of variables necessary?

The principal components are computed from the normalized versions of the original predictors. This is because the original predictors may have different scales. For example, imagine a data set whose variables are measured in units such as gallons, kilometers, light years, etc. The variances of these variables will then be on wildly different scales. Performing PCA on unnormalized variables will lead to insanely large loadings for variables with high variance. In turn, this will make a principal component depend mostly on the variable with high variance, which is undesirable.

As shown in the image below, PCA was run on a data set twice (first with unscaled, then with scaled predictors). This data set has ~40 variables. You can see that the first principal component is dominated by the variable Item_MRP, and the second principal component is dominated by the variable Item_Weight. This domination prevails due to the high variance associated with those variables. When the variables are scaled, we get a much better representation of the variables in 2D space.
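The effect is easy to reproduce; a minimal sketch on synthetic data (my own toy example, not the article's Big Mart set):

#one variable on a much larger scale dominates the unscaled loadings
set.seed(1)
df <- data.frame(small = rnorm(100),             # variance ~ 1
                 big   = rnorm(100, sd = 1000))  # variance ~ 1e6

round(prcomp(df)$rotation, 3)                 # PC1 loads almost entirely on 'big'
round(prcomp(df, scale. = TRUE)$rotation, 3)  # balanced loadings after scaling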
Implement PCA in R & Python (with interpretation)
How many principal components to choose? I could dive deep into theory, but it would be better to answer this question practically. For this demonstration, I'll be using the data set from the Big Mart Prediction Challenge.

Remember, PCA can be applied only on numerical data. Therefore, if the data has categorical variables, they must be converted to numerical. Also, make sure you have done the basic data cleaning prior to implementing this technique. Let's quickly finish the initial data loading and cleaning steps:

#directory path
path <- ".../Data/Big_Mart_Sales"

#set working directory
setwd(path)

#load train and test file
train <- read.csv("train_Big.csv")
test <- read.csv("test_Big.csv")

#add a column
test$Item_Outlet_Sales <- 1

#combine the data set
combi <- rbind(train, test)

#impute missing values with median
combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)

#impute 0 with median
combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility),
                                combi$Item_Visibility)
!ill here" e4*e imputed missin. *alues0 No e are le-t ith remo*in. the dependent (response) *aria+le and other identi-ier *aria+les( i- any)0 As e said
a+o*e" e are practicin. an unsuper*ised learnin. techni5ue" hence response *aria+le must +e remo*ed0 remove the dependent and identi;er variables my7data - subset&combi* select -c&=tem7"utlet79ales* =tem7=denti;er* "utlet7=denti;er''
Let's check the available variables (a.k.a predictors) in the data set.

#check available variables
colnames(my_data)
Since PCA works on numeric variables, let's see if we have any variables other than numeric.

#check variable class
str(my_data)
'data.frame': 14204 obs. of 9 variables:
$ Item_Weight              : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
$ Item_Visibility          : num 0.016 0.0193 0.0168 0.054 0.054 ...
$ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
$ Item_MRP                 : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size              : Factor w/ 4 levels "Other","High",..: 3 3 3 1 2 3 2 3 1 1 ...
$ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
$ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
Sadly" out o- F *aria+les are cate.orical in nature0 e ha*e some additional or3 to do no0 e4ll con*ert these cate.orical *aria+les into numeric usin. one hot encodin.0 load library library&dummies'
create a dummy data Drame new7my7data - dummy.data.Drame&my7data* names c&4=tem7Lat7Montent4*4=tem7ype4*
4"utlet7Bstablishment7Year4*4"utlet79iEe4*
4"utlet7Nocation7ype4*4"utlet7ype4''
To check if we now have a data set of integer values, simply write:

#check the data set
str(new_my_data)
And" e no ha*e all the numerical *alues0 e can no .o ahead ith PCA0 !he +ase R -unction prcomp() is used to per-orm PCA0 y de-ault" it centers the *aria+le to ha*e mean e5uals to 8ero0 ith parameter scale. " e normali8e the *aria+les to ha*e standard de*iation e5uals to $0 principal component analysis prin7comp - prcomp&new7my7data* scale. ' names&prin7comp' ?1@ 4sdev4
4rotation4 4center4 4scale4
44
The prcomp() function results in 5 useful measures:

1. center and scale refer to the respective mean and standard deviation of the variables that are used for normalization prior to implementing PCA.

#outputs the mean of variables
prin_comp$center

#outputs the standard deviation of variables
prin_comp$scale
2. The rotation measure provides the principal component loadings. Each column of the rotation matrix contains a principal component loading vector. This is the most important measure we should be interested in.

prin_comp$rotation
This returns 44 principal component loadings. Is that correct? Absolutely. In a data set, the maximum number of principal component loadings is min(n-1, p). Let's look at the first 4 principal components and first 5 rows.

prin_comp$rotation[1:5,1:4]
                                 PC1          PC2          PC3          PC4
Item_Weight              0.0054429225 -0.001285666  0.011246194  0.011887106
Item_Fat_ContentLF      -0.0021983314  0.003768557 -0.009790094 -0.016789483
Item_Fat_Contentlow fat -0.0019042710  0.001866905 -0.003066415 -0.018396143
Item_Fat_ContentLow Fat  0.0027936467 -0.002234328  0.028309811  0.056822747
Item_Fat_Contentreg      0.0002936319  0.001120931  0.009033254 -0.001026615
3. In order to compute the principal component score vectors, we don't need to multiply the loadings with the data. Rather, the matrix x holds the principal component score vectors in a 14204 × 44 dimension.

dim(prin_comp$x)
[1] 14204    44
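As a quick sanity check (a sketch reusing the objects created above), the score matrix x is exactly the centered-and-scaled data multiplied by the rotation matrix:

#scores = scaled data %*% loadings
scores <- scale(new_my_data) %*% prin_comp$rotation
all.equal(as.numeric(scores), as.numeric(prin_comp$x))
# [1] TRUE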
Let's plot the resultant principal components.

biplot(prin_comp, scale = 0)
The parameter scale = 0 ensures that the arrows are scaled to represent the loadings. To make inferences from the image above, focus on the extreme ends (top, bottom, left, right) of this graph.

We infer that the first principal component corresponds to a measure of Outlet_TypeSupermarket and Outlet_Establishment_Year2007. Similarly, it can be said that the second component corresponds to a measure of Outlet_Location_TypeTier1 and Outlet_Sizeother. For the exact measure of a variable in a component, you should look at the rotation matrix (above) again.
4. The prcomp() function also provides the facility to compute the standard deviation of each principal component. sdev refers to the standard deviation of the principal components.

#compute standard deviation of each principal component
std_dev <- prin_comp$sdev

#compute variance
pr_var <- std_dev^2
We aim to find the components which explain the maximum variance. This is because we want to retain as much information as possible using these components. So, the higher the explained variance, the higher the information contained in those components.

To compute the proportion of variance explained by each component, we simply divide each component's variance by the sum of total variance. This results in:

#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]
 [1] 0.10371853 0.07312958 0.06238014 0.05775207 0.04995800 0.04580274
 [7] 0.04391081 0.02856433 0.02735888 0.02654774 0.02559876 0.02556797
[13] 0.02549516 0.02508831 0.02493932 0.02490938 0.02468313 0.02446016
[19] 0.02390367 0.02371118
This shows that the first principal component explains 10.3% of the variance. The second component explains 7.3%, the third explains 6.2%, and so on. So, how do we decide how many components to select for the modeling stage?
The answer to this question is provided by a scree plot. A scree plot is used to assess the components or factors which explain most of the variability in the data. It represents values in descending order.

#scree plot
plot(prop_varex, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     type = "b")
The plot above shows that ~30 components explain around 98.4% of the variance in the data set. In other words, using PCA we have reduced 44 predictors to 30 without compromising on explained variance. This is the power of PCA. Let's do a confirmation check by plotting a cumulative variance plot. This will give us a clear picture of the number of components.

#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b")
This plot shows that 30 components give a cumulative variance close to ~98%. Therefore, in this case, we'll select the number of components as 30 [PC1 to PC30] and proceed to the modeling stage. This completes the steps to implement PCA in R. For modeling, we'll use these 30 components as predictor variables and follow the normal procedures.
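As a sketch of what that modeling step could look like (it reuses the Big Mart objects created earlier; the decision tree is just one illustrative model choice, not one prescribed by this guide):

#build a training set from the first 30 principal component scores
#(the first nrow(train) rows of prin_comp$x correspond to the train data,
# since combi was built with rbind(train, test))
train.data <- data.frame(Item_Outlet_Sales = train$Item_Outlet_Sales,
                         prin_comp$x[1:nrow(train), 1:30])

#fit any regression model on the components, e.g. a decision tree
library(rpart)
rpart.model <- rpart(Item_Outlet_Sales ~ ., data = train.data, method = "anova")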
For Python Users: To implement PCA in Python, simply import PCA from the sklearn library. The interpretation remains the same as explained for R users above. Of course, the result is the same as derived after using R. The data set used for Python is a cleaned version where missing values have been imputed and categorical variables have been converted into numeric.

import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
%matplotlib inline

#Load data set
data = pd.read_csv('Big_Mart_PCA.csv')

#convert it to numpy arrays
X = data.values

#Scaling the values
X = scale(X)

pca = PCA(n_components=44)
pca.fit(X)

#The amount of variance that each PC explains
var = pca.explained_variance_ratio_

#Cumulative variance explained
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print(var1)
[ 10.37 17.68 23.92 29.7 …
For more information on PCA in Python, visit the scikit-learn documentation.
Points to Remember

1. PCA is used to extract important features from a data set.
2. These features are low dimensional in nature.
3. These features, a.k.a components, are a resultant of normalized linear combinations of the original predictor variables.
4. These components aim to capture as much information as possible with high explained variance.
5. The first component has the highest variance, followed by the second, third, and so on.
6. The components must be uncorrelated (remember orthogonal directions?). See above.
7. Normalizing data becomes extremely important when the predictors are measured in different units.
8. PCA works best on data sets having 3 or higher dimensions, because with higher dimensions it becomes increasingly difficult to make interpretations from the resultant cloud of data.
9. PCA is applied on a data set with numeric variables.
10. PCA is a tool which helps to produce better visualizations of high dimensional data.