
Principal Component Analysis: Understanding Data Variation and Dimensionality Reduction, Study notes of Mathematical Statistics

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. These notes cover the ideas, steps, and applications of PCA, including gene expression analysis and data visualization.

Typology: Study notes

2023/2024

Uploaded on 12/26/2023

abhinandan-uk 🇮🇳


Principal Component Analysis
by ABHINANDAN
USN: 22BTRCB002


Principal Component Analysis (PCA): Ideas

  • Does the data set 'span' the whole of d-dimensional space?
  • For a matrix of m samples x n genes, create a new covariance matrix of size n x n.
  • Transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
  • Developed to capture as much of the variation in the data as possible.

[Figure: Principal Component Analysis. A scatter of data points in the x-y plane. Y1 is the first eigenvector, Y2 the second; Y2 is ignorable. Key observation: the variance along Y1 is largest.]

Principal Component Analysis: one attribute first

  • Question: how much spread is in the data along the axis? (distance to the mean)
  • Variance = (standard deviation)²
  • Example attribute, Temperature: 30, 40, 30, 35, 30, 15, 30, 15, 18, 15, 30, 24, 40, 42
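A minimal sketch of this one-attribute case, assuming NumPy; the temperature values are the ones listed above:

```python
import numpy as np

# One attribute: the temperature readings from the example above.
temperature = np.array([30, 40, 30, 35, 30, 15, 30, 15, 18, 15, 30, 24, 40, 42])

mean = temperature.mean()
# Spread along the axis: the average squared distance to the mean.
# ddof=1 gives the sample variance; the standard deviation is its square root.
variance = temperature.var(ddof=1)
std_dev = np.sqrt(variance)
```

As stated above, the variance is the square of the standard deviation.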

More than two attributes: covariance matrix

  • Contains covariance values between all possible dimensions (= attributes).
  • Example for three attributes (x, y, z):

        C = | var(x)    cov(x,y)  cov(x,z) |
            | cov(y,x)  var(y)    cov(y,z) |
            | cov(z,x)  cov(z,y)  var(z)   |
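A minimal sketch of building such a covariance matrix for three attributes, assuming NumPy and illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: 100 samples of three attributes x, y, z.
X = rng.normal(size=(100, 3))

# np.cov expects one variable per row, so transpose the
# samples-by-attributes matrix before passing it in.
C = np.cov(X.T)

# C[i, j] = cov(attribute_i, attribute_j); the diagonal holds the
# variances, and the matrix is symmetric.
```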

Eigenvalues & eigenvectors

  • Vectors x having the same direction as A x are called eigenvectors of A (A is an n by n matrix).
  • In the equation A x = λ x, λ is called an eigenvalue of A.
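The defining relation A x = λ x can be checked numerically; a small sketch assuming NumPy, with an illustrative symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# For a symmetric matrix, np.linalg.eigh returns eigenvalues in
# ascending order and the corresponding eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Each pair satisfies A x = lambda x: x keeps its direction under A.
for lam, x in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ x, lam * x)
```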

Principal components

  • First principal component (PC1)
    • The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector: the direction along which there is the greatest variation.
  • Second principal component (PC2)
    • The direction with the maximum variation left in the data, orthogonal to PC1.
  • In general, only a few directions capture most of the variability in the data.

Steps of PCA

  • Let μ be the mean vector (taking the mean of all rows).
  • Adjust the original data by the mean: X' = X − μ.
  • Compute the covariance matrix C of the adjusted X.
  • Find the eigenvectors and eigenvalues of C:
    • For matrix C, the vectors e (column vectors) having the same direction as C e are the eigenvectors of C, i.e. C e = λ e, where λ is called an eigenvalue of C.
    • C e = λ e ⇔ (C − λI) e = 0.
    • Most data mining packages do this for you.
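The steps above can be sketched end to end; a minimal version assuming NumPy, where the function name and data are illustrative:

```python
import numpy as np

def pca(X, p):
    mu = X.mean(axis=0)                    # mean vector (mean of all rows)
    X_adj = X - mu                         # adjust the original data by the mean
    C = np.cov(X_adj.T)                    # covariance matrix of adjusted X
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues/eigenvectors of C
    order = np.argsort(eigvals)[::-1]      # sort by eigenvalue, descending
    components = eigvecs[:, order[:p]]     # first p eigenvectors
    return X_adj @ components, eigvals[order[:p]]

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))               # illustrative data: 50 samples, 4 variables
Y, top_eigvals = pca(X, 2)
```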

Principal components - Variance

Transformed Data

  • Eigenvalue λj corresponds to the variance on each component j.
  • Thus, sort the components by λj.
  • Take the first p eigenvectors ei, where p is the number of top eigenvalues.
  • These are the directions with the largest variances.
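That λj equals the variance on component j can be verified directly; a sketch assuming NumPy, with illustrative correlated data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative correlated 2-D data, centered so projections have zero mean.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

C = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(C)

# Projecting the data onto an eigenvector gives coordinates whose
# sample variance equals the corresponding eigenvalue.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.isclose((X @ v).var(ddof=1), lam)
```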

Covariance Matrix

  • C =
        |  75  106 |
        | 106  482 |
  • Using MATLAB, we find out:
    • Eigenvectors:
    • e1 = (-0.98, -0.21), λ1 ≈ 51
    • e2 = (0.21, -0.98), λ2 ≈ 560
    • Thus the second eigenvector is more important!

Principal components

  • General about principal components:
    • summary variables
    • linear combinations of the original variables
    • uncorrelated with each other
    • capture as much of the original variance as possible
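These properties (linear combinations of the originals, mutually uncorrelated) can be checked on a small example, assuming NumPy and illustrative data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Illustrative data where the second variable correlates strongly with the first.
x = rng.normal(size=200)
X = np.column_stack([x,
                     2 * x + rng.normal(scale=0.3, size=200),
                     rng.normal(size=200)])

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
Y = Xc @ eigvecs   # principal components: linear combinations of the originals

# The components are uncorrelated: their covariance matrix is diagonal
# (up to floating-point error), with the eigenvalues on the diagonal.
C_Y = np.cov(Y.T)
off_diag = C_Y - np.diag(np.diag(C_Y))
assert np.allclose(off_diag, 0, atol=1e-8)
```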

Two-Way (Angle) Data Analysis

[Figure: A gene expression matrix with 10³–10⁴ genes as rows and 10¹–10² samples/conditions as columns can be analyzed in sample space (comparing samples) or in gene space (comparing genes).]

PCA - example