Principal component analysis (PCA) is a powerful linear algebra-based statistical method used to reduce the dimensionality of datasets while retaining important information. It simplifies complex datasets, making them easier to analyze and visualize.
Suppose we have n individuals and measure m variables for each. Each individual’s measurements form an m-dimensional vector. For example, data collected from five individuals might look like this:
| Variable | A | B | C | D | E |
|---|---|---|---|---|---|
| Age | 24 | 50 | 17 | 35 | 65 |
| Height (cm) | 152 | 175 | 160 | 170 | 155 |
| IQ | 108 | 102 | 95 | 97 | 87 |
Here, n = 5 and m = 3. Each individual's data can be written as a vector, for instance: x₁ = [24, 152, 108]ᵗ.
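As a quick illustration, the table above can be stored as a 3 × 5 array. The use of NumPy and the variable names below are my own choices, not part of the original description:

```python
import numpy as np

# Rows are the m = 3 variables (Age, Height, IQ); columns are the
# n = 5 individuals A..E from the table above.
X = np.array([
    [24,  50,  17,  35,  65],   # Age
    [152, 175, 160, 170, 155],  # Height (cm)
    [108, 102,  95,  97,  87],  # IQ
], dtype=float)

x1 = X[:, 0]    # individual A's vector: [24, 152, 108]
print(X.shape)  # (3, 5), i.e. m x n
```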
PCA helps answer questions such as:
- Which variables are correlated?
- Can we visualize this high-dimensional data more easily?
- Which variables contribute most to the variation in the dataset?
Linear Transformations
Multiplying a vector by a matrix applies a linear transformation to that vector. This operation is central to PCA and is written Av = w.
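A minimal sketch of this product (the matrix and vector here are arbitrary illustrative values, not taken from the dataset above):

```python
import numpy as np

# Applying the linear transformation defined by A to the vector v.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
v = np.array([1.0, 2.0])
w = A @ v   # Av = w
print(w)    # [4. 6.]
```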
Eigenvectors and Eigenvalues
An eigenvector v of a matrix A satisfies Av = λv, where λ is the corresponding eigenvalue. Such a vector keeps its direction under transformation by A; it is only scaled by λ. In PCA, the eigenvectors of the covariance matrix point along the directions in which the data varies.
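A small numerical check, assuming NumPy and a symmetric 2 × 2 matrix of my own choosing:

```python
import numpy as np

# np.linalg.eig returns eigenvalues and eigenvectors (one per column of V);
# each pair satisfies A @ v = lam * v up to floating-point error.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, V = np.linalg.eig(A)
for lam, v in zip(eigenvalues, V.T):
    print(np.allclose(A @ v, lam * v))  # True, True
```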
Spectral Theorem
For a real symmetric matrix, the spectral theorem guarantees real eigenvalues and an orthonormal set of eigenvectors. This property is fundamental to PCA, since the covariance matrix is symmetric.
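A quick way to see this numerically (the 3 × 3 symmetric matrix below is an arbitrary example of mine):

```python
import numpy as np

# np.linalg.eigh is intended for symmetric matrices: it returns real
# eigenvalues and orthonormal eigenvectors, as the spectral theorem promises.
S = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 5.0]])
eigenvalues, Q = np.linalg.eigh(S)
print(np.isrealobj(eigenvalues))        # True: real eigenvalues
print(np.allclose(Q.T @ Q, np.eye(3)))  # True: orthonormal eigenvectors
```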
Covariance Matrix
The covariance matrix captures the variance of each variable and the covariances between variables. Its entry Sₖₗ is the covariance between variables k and l: the diagonal entries are variances, and the off-diagonal entries are covariances.
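Continuing the earlier sketch with the hypothetical matrix X (rows are variables, columns are samples), the covariance matrix can be computed by mean-centering the rows; NumPy's np.cov gives the same result:

```python
import numpy as np

# X is the 3 x 5 data matrix from the earlier sketch.
n = X.shape[1]
B = X - X.mean(axis=1, keepdims=True)  # mean-center each variable (row)
S = (B @ B.T) / (n - 1)                # m x m covariance matrix
print(np.allclose(S, np.cov(X)))       # True: matches np.cov (rows = variables)
```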
Steps in PCA
- Organize the dataset into an m × n matrix where each column is a sample.
- Subtract the mean of each variable from the dataset (mean centering).
- Compute the covariance matrix S = (1/(n−1)) · BBᵗ, where B is the mean-centered matrix.
- Apply the spectral theorem to get the eigenvalues and eigenvectors of S.
- Select the top k eigenvectors with the largest eigenvalues. These are the principal components (see the sketch below).
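Putting these steps together on the hypothetical matrix X from the first sketch (the variable names here are mine, not prescribed by the method):

```python
import numpy as np

# X is the 3 x 5 data matrix from the first sketch (rows = variables).
B = X - X.mean(axis=1, keepdims=True)          # mean centering
S = (B @ B.T) / (X.shape[1] - 1)               # covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(S)  # spectral decomposition
order = np.argsort(eigenvalues)[::-1]          # sort by decreasing eigenvalue
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
k = 2
W = eigenvectors[:, :k]                        # top-k principal components
```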
Dimensionality Reduction
By projecting the data onto the first k principal components, we reduce the dimensions while retaining most of the dataset's variance. This simplifies analysis and visualization.
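Continuing from the sketch above, the projection itself is a single matrix product:

```python
# Project the mean-centered data onto the top-k principal components;
# each column of Y describes one individual with k numbers instead of m.
Y = W.T @ B
print(Y.shape)  # (2, 5): k x n
```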
Interpreting Eigenvalues
- Each eigenvalue indicates the variance captured by its corresponding eigenvector.
- The sum of all eigenvalues is the total variance of the dataset.
- The ratio λᵢ / (λ₁ + λ₂ + ... + λₘ) gives the proportion of variance explained by the i-th component (computed in the snippet below).
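Continuing the same sketch, the explained-variance ratios follow directly from the sorted eigenvalues:

```python
# Fraction of the total variance captured by each principal component.
explained = eigenvalues / eigenvalues.sum()
print(explained)            # λᵢ / (λ₁ + ... + λₘ) for each component
print(explained[:k].sum())  # variance retained by the first k components
```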
Applications of PCA
- Data visualization
- Noise reduction
- Face recognition (eigenfaces)
- Genomics and bioinformatics
- Market segmentation
In face recognition, PCA can reduce high-dimensional image data to a small number of significant components (eigenfaces), allowing for efficient and accurate identification based on stored components.
In the next article, we will explore gradient descent, an optimization technique commonly used in machine learning.