Principal component analysis with linear algebra
Principal component analysis, or PCA, is a powerful tool for analyzing data sets, formulated in the language of linear algebra and statistics. It is an eigenvalue method used to reduce the dimension of a data set while preserving important information.
Suppose we have \(n\) individuals and we measure the same \(m\) variables for each of them. In this case, we say that we have \(n\) samples of \(m\)-dimensional data. We record the \(m\) measurements for the \(i^{th}\) individual as a vector \(\vec{x}_i\).
For instance, suppose we talk to 5 individuals and obtain the following data:
Name          A    B    C    D    E
Age           24   50   17   35   65
Height (cm)   152  175  160  170  155
IQ            108  102  95   97   87
In this case, we have \(n=5\) and \(m=3\). So, the measurement for an individual would look like this:
\(\vec{x}_1=\begin{bmatrix}24\\152\\108\end{bmatrix}\)
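The table above can be stored in numpy as a matrix whose columns are the measurement vectors; the layout below is just one convenient choice for a sketch.

```python
import numpy as np

# Each column is one individual's measurements (age, height in cm, IQ),
# matching the table above: n = 5 samples of m = 3 variables.
X = np.array([
    [24,  50,  17,  35,  65],   # age
    [152, 175, 160, 170, 155],  # height (cm)
    [108, 102, 95,  97,  87],   # IQ
])

x1 = X[:, 0]  # measurement vector for individual A
print(x1)     # [ 24 152 108]
```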
You can visualize each sample as a point in space and the entire data set as a plot of these individual points.
Here are a few questions that we aim to answer through this technique:
 Which variables are correlated? For instance, in the example we would probably expect age and height to be somewhat correlated and age and IQ to be a little less correlated. This could be established through PCA.
 Is there an easier way of visualizing the data?
 Which variables play the most significant role in portraying the full data set?
Through this article, we will learn about linear transformations, eigenvectors and eigenvalues, and the spectral theorem, look at a little statistics, and finally cover principal component analysis.
Linear Transformations
In the previous blog, Prerequisites of linear algebra for machine learning, we saw how to multiply two matrices and two vectors. What happens if we multiply a matrix by a vector? Suppose we have an \(m\times n\) matrix, \(A\), and an \(n\times 1\) vector, \(v\). If we multiply \(v\) by \(A\), we get another vector, and we say that the matrix \(A\) has performed a linear transformation on the input vector \(v\).
\(Av=w\)
For example,
\(\begin{bmatrix}2&3\\1&2\end{bmatrix}\begin{bmatrix}2\\5\end{bmatrix}=\begin{bmatrix}19\\12\end{bmatrix}\)
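The example above is easy to check in numpy with the matrix-vector product:

```python
import numpy as np

# The linear transformation Av = w from the example above.
A = np.array([[2, 3],
              [1, 2]])
v = np.array([2, 5])

w = A @ v  # matrix-vector multiplication
print(w)   # [19 12]
```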
Eigenvectors and Eigenvalues
An eigenvector of a matrix \(A\) is a vector that when multiplied by \(A\) returns a vector which is a scalar multiple of itself, i.e.,
\(Av=\lambda v\)
Here, \(v\) is called the eigenvector of \(A\), and \(\lambda\) is the scalar coefficient, which is called the eigenvalue.
You can imagine a matrix like a breeze blowing in a certain direction, an invisible driving force that produces a visible result. So, out of all the vectors affected by a matrix, which ones are the eigenvectors? The vectors that do not change direction, as they already point in the direction in which the matrix is driving other vectors. An \(n\times n\) square matrix has at most \(n\) linearly independent eigenvectors.
>>> import numpy as np
>>> A = np.array([[1, 1], [1, 1]])
>>> eigenvalues, eigenvectors = np.linalg.eig(A)
>>> print(np.sort(eigenvalues))
[0. 2.]
>>> v = eigenvectors[:, np.argmax(eigenvalues)]  # eigenvector for eigenvalue 2
>>> print(np.allclose(A @ v, 2 * v))
True
Spectral Theorem
Background

Symmetric Matrices: A square matrix \(A\) is said to be symmetric if \(A^T=A\). For example,
\(\begin{bmatrix}1&2&3\\2&4&5\\3&5&6\end{bmatrix}\)

Orthogonal Vectors: Two vectors \(\vec{u}\) and \(\vec{v}\) are said to be orthogonal when their dot product is zero:
\(\vec{u}\cdot\vec{v}=0\)
Theorem
Let \(A\) be an \(n\times n\) symmetric matrix. Then, there exist real eigenvalues \(\lambda_1, \lambda_2, \ldots , \lambda_n\) and nonzero orthogonal eigenvectors \(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n\) such that
\(A\vec{v}_i=\lambda_i\vec{v}_i \mbox{ for } i=1, 2,\ldots, n\)
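The theorem can be checked numerically with numpy's `eigh` routine for symmetric matrices, here applied to the symmetric matrix from the example above:

```python
import numpy as np

# Spectral theorem check: eigh returns real eigenvalues and
# orthonormal eigenvectors for a symmetric matrix.
A = np.array([[1, 2, 3],
              [2, 4, 5],
              [3, 5, 6]], dtype=float)

eigenvalues, V = np.linalg.eigh(A)  # columns of V are eigenvectors

# The eigenvectors are mutually orthogonal: V^T V is the identity.
print(np.allclose(V.T @ V, np.eye(3)))  # True

# Each column satisfies A v_i = lambda_i v_i.
print(np.allclose(A @ V, V * eigenvalues))  # True
```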
We will not state the proof of the spectral theorem; rather, we will use it directly later. The following observations will also be useful later:

Let \(A\) be any \(m\times n\) matrix of real numbers. Then both \(AA^T\) and \(A^TA\) are symmetric.
Proof
\((AA^T)^T=(A^T)^T A^T=AA^T\)
\((A^TA)^T=A^T(A^T)^T=A^TA\)

Let \(A\) be an \(m\times n\) matrix. Then the matrices \(AA^T\) and \(A^TA\) have the same nonzero eigenvalues.
Proof
Let \(v\) be a nonzero eigenvector of \(A^TA\) with eigenvalue \(\lambda\neq 0\), i.e.,
\(\displaystyle (A^TA)v=\lambda v\)
Now, multiply both sides of the above equation by \(A\) and rearrange to obtain \((AA^T)(Av)=\lambda(Av)\), which implies that \(Av\) is an eigenvector of \(AA^T\) with eigenvalue \(\lambda\), with only one detail left to check: that \(Av\) is not the zero vector. Indeed, from the first equation, if \(Av\) were zero, then \(\lambda v\) would be zero as well. However, we have specifically assumed that \(v\neq 0\) and \(\lambda\neq 0\), so this can’t happen. Therefore, the nonzero eigenvalues of \(A^TA\) are the same as those of \(AA^T\).
This proposition is very useful when \(n\) and \(m\) are hugely different. Suppose \(A\) is a \(500\times 3\) matrix. Then \(AA^T\) is a \(500\times 500\) matrix with 500 eigenvalues, which would be very difficult to find directly. But \(A^TA\) is just a \(3\times 3\) matrix, and it is easy to find its eigenvalues. By the proposition, these are the nonzero eigenvalues of \(AA^T\), so the remaining 497 eigenvalues of \(AA^T\) must all be 0!
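A quick numerical check of this proposition, using a small random matrix as a stand-in for real data:

```python
import numpy as np

# Check that A A^T and A^T A share their nonzero eigenvalues.
# A is a 5x3 matrix, so A A^T is 5x5 and A^T A is 3x3.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

big = np.linalg.eigvalsh(A @ A.T)    # 5 eigenvalues (symmetric matrix)
small = np.linalg.eigvalsh(A.T @ A)  # 3 eigenvalues

# The 5x5 matrix has the same 3 nonzero eigenvalues, plus 2 zeros.
print(np.round(np.sort(big), 6))
print(np.round(np.sort(small), 6))
```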

The eigenvalues of \(A^TA\) and \(AA^T\) are nonnegative numbers.
Statistics Background

Suppose we have a single variable \(X\) (say age) and \(n\) measurements denoted by \(a_1,a_2,\ldots, a_n\) , then the mean of \(X\) is
\(\displaystyle\mu_X=\frac{1}{n}(a_1+a_2+\ldots+a_n)\)
numpy.mean(a, axis=None, dtype=None)
a : array_like (Array that contains numbers whose mean is desired)
axis : int, optional (Axis along which the means are computed; the default is to compute the mean of the flattened array)
dtype : data-type, optional (Data type used in computing the mean; the default is float64 for integer inputs, and the same as the input dtype for floating point inputs)
>>> import numpy as np
>>> A = np.array([[1, 4], [3, 4]])
>>> print(np.mean(A))
3.0
>>> print(np.mean(A, axis=0))
[2. 4.]
>>> print(np.mean(A, axis=1))
[2.5 3.5]

After calculating the average of the variable \(X\), you would like to know how spread out the measurements are. That is quantified using the variance of \(X\):
\(\displaystyle Var(X)=\frac{1}{n-1}((a_1-\mu_X)^2+(a_2-\mu_X)^2+\ldots+(a_n-\mu_X)^2)\)
numpy.var(a, axis=None, dtype=None, ddof=0)
Parameters are the same as numpy.mean except
ddof : int, optional (ddof stands for delta degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements. The default value of ddof is 0)
Since we deal with sample variance, where we divide by \(n-1\), we will always set ddof = 1.
>>> import numpy as np
>>> A = np.array([[1, 4], [3, 4]])
>>> print(np.var(A, ddof=1))
2.0
>>> print(np.var(A, axis=0, ddof=1))
[2. 0.]
>>> print(np.var(A, axis=1, ddof=1))
[4.5 0.5]
Standard deviation of a variable \(X\) is the square root of \(Var(X)\). There is a function in the numpy library to calculate standard deviation.
numpy.std(a, axis=None, dtype=None, out=None, ddof=0)
>>> import numpy as np
>>> A = np.array([[1, 4], [3, 4]])
>>> print(np.std(A, ddof=1))
1.4142135623730951
>>> print(np.std(A, axis=0, ddof=1))
[1.41421356 0.        ]
If we are measuring two variables \(X\) and \(Y\) in a population, with measurements \(a_1,\ldots,a_n\) of \(X\) and \(b_1,\ldots,b_n\) of \(Y\), it’s natural to ask if there is some relation between \(X\) and \(Y\). A way to capture this is with the covariance of \(X\) and \(Y\), defined as:
\(\displaystyle Cov(X,Y)=\frac{1}{n-1}((a_1-\mu_X)(b_1-\mu_Y)+(a_2-\mu_X)(b_2-\mu_Y)+\ldots+(a_n-\mu_X)(b_n-\mu_Y))\)
If the covariance is positive, then both variables \(X\) and \(Y\) increase or decrease together. If the covariance is negative, it means that when one variable increases, the other decreases, and vice versa.
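A small sketch of this formula, with made-up measurements, checked against numpy's `np.cov`:

```python
import numpy as np

# Sample covariance of two variables, computed by the formula above
# and checked against np.cov (which also divides by n - 1 by default).
a = np.array([1.0, 2.0, 3.0])  # made-up measurements of X
b = np.array([2.0, 4.0, 6.0])  # made-up measurements of Y

n = len(a)
cov_manual = np.sum((a - a.mean()) * (b - b.mean())) / (n - 1)
print(cov_manual)          # 2.0
print(np.cov(a, b)[0, 1])  # 2.0
```

The positive value reflects that these \(X\) and \(Y\) measurements increase together.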
Principal Component Analysis
Let’s begin by understanding the technique first, and then we will look at an example. We would like to store the mean of all our \(m\) variables as a single vector in \(m\)-dimensional real space \(\mathbb{R}^m\):
\(\displaystyle\vec{\mu}=\frac{1}{n}(\vec{x}_1+\vec{x}_2+\ldots+\vec{x}_n)\)
Also, we would like to recenter all our data so that the mean becomes zero, which can be achieved by subtracting the mean vector from every measurement. So, we obtain a recentered matrix
\(B=\begin{bmatrix}\vec{x}_1-\vec{\mu}&\ldots&\vec{x}_n-\vec{\mu}\end{bmatrix}\)
Next, we compute the covariance matrix \(S\) which is defined as
\(\displaystyle S=\frac{1}{n-1}BB^T\)
From the previously stated observations, we can see that \(S \) is an \(m\times m\) symmetric matrix.
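These two steps can be sketched directly in numpy on a tiny made-up data set:

```python
import numpy as np

# Building the covariance matrix S = B B^T / (n - 1) from the steps above.
# Columns of X are the n measurement vectors (m = 2 variables, n = 3 samples).
X = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0]])
n = X.shape[1]

mu = X.mean(axis=1, keepdims=True)  # mean vector
B = X - mu                          # recentered data

S = (B @ B.T) / (n - 1)
print(S)
print(np.allclose(S, np.cov(X)))  # True: np.cov does the same computation
```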
Let us consider an example.
\(\vec{x}_1=\begin{bmatrix}a_1\\a_2\\a_3\end{bmatrix}, \vec{x}_2=\begin{bmatrix}b_1\\b_2\\b_3\end{bmatrix}, \vec{x}_3=\begin{bmatrix}c_1\\c_2\\c_3\end{bmatrix}, \vec{\mu}=\begin{bmatrix}\mu_1\\\mu_2\\\mu_3\end{bmatrix}\)
So,
\(B=\begin{bmatrix}a_1-\mu_1&b_1-\mu_1&c_1-\mu_1\\a_2-\mu_2&b_2-\mu_2&c_2-\mu_2\\a_3-\mu_3&b_3-\mu_3&c_3-\mu_3\end{bmatrix}\)
Now, let us look at the entries of \( S \).
\(\displaystyle S_{11}=\frac{1}{n-1}((a_1-\mu_1)^2+(b_1-\mu_1)^2+(c_1-\mu_1)^2)=\mbox{ Variance for the first variable }\)
Similarly, the diagonal entries \(S_{22} \mbox{ and } S_{33}\) are precisely the variance for the second and third variable, respectively.
Next, we have
\(\displaystyle S_{12}=\frac{1}{n-1}((a_1-\mu_1)(a_2-\mu_2)+(b_1-\mu_1)(b_2-\mu_2)+(c_1-\mu_1)(c_2-\mu_2))=S_{21}\)
which is nothing but the covariance between the first and second variables. To generalize these entries, for \(1\leq i,j\leq m\):
 \(S_{ii}\) is the variance for the \(i^{th}\) variable
 \(S_{ij} \mbox{ for }i\neq j\) is the covariance of \(i^{th}\) and \(j^{th}\) variable
One can also calculate the covariance matrix for a given array using numpy.
numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None)
Parameters are as follows:
m : array_like (An array containing variables and observations)
y : array_like, optional (An additional set of variables and observations)
rowvar : bool, optional (If rowvar is True (default), then each row represents a variable, with observations in the columns: otherwise, the relationship is transposed)
bias : bool, optional (Default normalization (False) is by (N – 1), where N is the number of observations given (unbiased estimate). If bias is True, then normalization is by N. These values can be overridden by using the keyword ddof in numpy versions >= 1.5)
ddof : int, optional (If not None, the default value implied by bias is overridden. Note that ddof=1 will return the unbiased estimate, and ddof=0 will return the simple average. The default value is None)
>>> import numpy as np
>>> A = np.array([[1, 3], [2, 2], [3, 1]]).T
>>> print(A)
[[1 2 3]
 [3 2 1]]
>>> print(np.cov(A))
[[ 1. -1.]
 [-1.  1.]]
Let’s look at a covariance matrix and understand how to interpret it. For example,
\( S=\begin{bmatrix}94&0.8\\0.8&6\end{bmatrix}\)
The covariance matrix \(S\) suggests that we have two variables to look at, say \(X\) and \(Y\). \(S_{11}=94\) is quite large, so we expect the plot of \(X\) to be quite spread out. Also, \(S_{22}=6\) is a small number, so we can expect the variable \(Y\) to be restricted to a small range. Similarly, \(S_{12}=S_{21}=0.8\) is a small positive number, suggesting that there is a little positive correlation between \(X\) and \(Y\).
The following is a possible plot of the data:
Now if we have
\( S=\begin{bmatrix}40&-20\\-20&50\end{bmatrix}\)
For this, \(S_{11}=40\) and \(S_{22}=50\), which are both large numbers, suggesting that both variables are quite spread out. But here \(S_{12}=S_{21}=-20\), which shows that there is a correlation between \(X\) and \(Y\), and the negative sign suggests that they are inversely related: as \(X\) increases, \(Y\) decreases.
The plot of the data represented by the covariance matrix above is as follows:
The next step after calculating the covariance matrix \(S\) is to apply the spectral theorem to \(S\). Since \(S\) is symmetric, it will have real eigenvalues \(\lambda_1\geq \lambda_2\geq \ldots\geq\lambda_m\geq 0\) and orthogonal eigenvectors \(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_m\). These eigenvectors are called the principal components of the data set.
Now, we would like to calculate the total variance of the data, T, which is equal to the trace of the covariance matrix, i.e., sum of the diagonal entries of S. Also, the trace of a matrix is equal to the sum of its eigenvalues. Therefore,
\(T=\lambda_1+\lambda_2+\ldots+\lambda_m\)
How do we interpret the results from PCA?
 The direction given by \(\vec{v}_1\) (called the first principal direction) in \(\mathbb{R}^m\) accounts for an amount \(\lambda_1\) of the total variance, T. It’s a \(\displaystyle\frac{\lambda_1}{T}\) fraction of the total variance. Similarly, we can do the same for the second principal direction, \(\vec{v}_2\), and the fraction would be \(\displaystyle\frac{\lambda_2}{T}\).
 The vector \(\vec{v}_1\in \mathbb{R}^m\) is the most significant/important direction of the data set.
 Among the directions that are orthogonal to \(\vec{v}_1\), \(\vec{v}_2\) is the most significant direction and similarly, among the directions orthogonal to both \(\vec{v}_1\) and \(\vec{v}_2\), \(\vec{v}_3\) is the most significant direction and so on.
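The eigenvalue bookkeeping above can be sketched in numpy; the data matrix here is random and purely illustrative.

```python
import numpy as np

# Explained-variance fractions lambda_i / T for a sample covariance matrix.
# Rows of X are variables, columns are samples (made-up data).
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 50))

S = np.cov(X)                              # m x m covariance matrix
eigenvalues = np.linalg.eigvalsh(S)[::-1]  # sorted largest first

T = np.trace(S)                            # total variance = sum of eigenvalues
fractions = eigenvalues / T
print(fractions)                           # fraction of T per principal direction
print(np.isclose(fractions.sum(), 1.0))    # True
```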
What exactly has PCA done for the data set?
It has reduced the dimension of the data set. It is often the case that the largest few eigenvalues of \(S\) are much greater than the rest. Suppose we have \(m=15\) and total variance \(T=110\), with \(\lambda_1=100.1\), \(\lambda_2=9.5\), and \(\lambda_3, \lambda_4, \ldots, \lambda_{15}\) all less than 0.1. So even though our data points form some impossible-to-imagine figure in \(\mathbb{R}^{15}\), PCA tells us that these points cluster near a 2D plane: more precisely, the plane passing through \(\vec{\mu}\) and spanned by the orthogonal vectors \(\vec{v}_1\) and \(\vec{v}_2\). Within that plane, the data will look like a rectangular strip. Therefore, we have reduced the problem from 15 dimensions to 2, which is in some sense much easier to visualize.
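This reduction can be sketched with made-up data: project each recentered measurement onto the plane spanned by the top two principal directions.

```python
import numpy as np

# Reducing m-dimensional data to 2 dimensions by projecting onto the
# two most significant principal directions (data is made up).
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 40))   # m = 5 variables, n = 40 samples

mu = X.mean(axis=1, keepdims=True)
B = X - mu                         # recentered data

S = np.cov(X)
eigenvalues, V = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
V2 = V[:, -2:]                      # eigenvectors for the 2 largest eigenvalues

Y = V2.T @ B                        # 2 x n: coordinates in the v1-v2 plane
print(Y.shape)                      # (2, 40)
```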
Applications of principal component analysis
We already saw that there is a need for principal component analysis in the case of large data sets. A very exciting application of PCA is in the field of human face recognition by computers. This idea was first considered by Sirovich and Kirby and implemented by Turk and Pentland. The core idea is to collect a database of images of faces of a whole lot of people, say \(n\) images. The data from each image is then represented as an \(m\)-dimensional vector (which would be huge) by stacking the second column of pixels below the first, the third below that, and so on. Entries in this vector represent the brightness level of each pixel, between 0 and 1 (0.0 being black and 1.0 being white).
So each face can be viewed as a vector in \(\mathbb{R}^m\). Now we have a data set on which we can perform PCA, which gives us the principal components that can then be converted back into faces. These faces are called “eigenfaces” and their linear combinations can pretty much represent (up to a good approximation) any face in the data set. It is also found that generally a small number of eigenfaces are needed to give a good basis depicting that only a small number of variances were much larger as compared to others.
But, how is this useful in face recognition? Suppose there is a group of people banned from a place and the owner wants to ensure that nobody from this group enters. So, he feeds the images of the banned people and approximates each image as a combination of eigenfaces. Now when a person enters the place, the security system takes a picture and breaks it down into the eigenface components. This then is compared to all the images in the database and if it is close enough to one in the database, then that person is allowed entry.
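The pipeline described above can be sketched with random vectors standing in for flattened face images; all sizes and names here are made up for illustration.

```python
import numpy as np

# Toy eigenface sketch: random vectors stand in for flattened images.
rng = np.random.default_rng(3)
m, n = 64, 10               # m pixels per image, n images in the database
faces = rng.random((m, n))  # columns are flattened images, brightness in [0, 1]

mu = faces.mean(axis=1, keepdims=True)
B = faces - mu              # recentered image data

# Principal components ("eigenfaces") of the image data.
S = (B @ B.T) / (n - 1)
_, V = np.linalg.eigh(S)
eigenfaces = V[:, -5:]      # keep the 5 most significant directions

# Recognize an incoming image by comparing its eigenface coordinates
# with those of the stored images (nearest neighbour by distance).
weights = eigenfaces.T @ B          # coordinates of the database images
new_face = faces[:, [0]]            # pretend individual 0 shows up again
w = eigenfaces.T @ (new_face - mu)
match = np.argmin(np.linalg.norm(weights - w, axis=0))
print(match)                        # 0: closest stored image is individual 0
```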
Next, we will look at gradient descent, an optimization algorithm, and its applications in machine learning.