Find out how HackerEarth can boost your tech recruiting

Learn more

Gradient descent algorithm for linear regression

Rashmi Jain

Author

10 mins read

November 23, 2016

You are probably using machine learning a number of times in a day without even noticing. For instance, whenever you check your mail box, a spam filter saves you from going through spam messages. And that is because it has learned how to distinguish between spam and non-spam mails. That is Machine Learning (ML). To explicitly define the term, machine learning is the science of getting a machine trained, without being explicitly programmed, from past experiences that come through the data collected.

There are two types of ML techniques:

Supervised Machine Learning: It is the most common type of ML in which the system is trained on a set of predefined training data which then reaches an accurate result when given a new set of data.
Unsupervised Learning: In this type of learning, the system is given a bunch of data, and it must find patterns and relationships within. For instance, identifying a group of close friends on Facebook

We will begin by talking about an example of supervised linear regression problem and then understand how to apply a gradient descent algorithm on it.

Supervised Learning

Suppose we have the following dataset which gives the prices of houses in the city of Bengaluru(India).

Living area(in square feet)(x)	Price(in USD)(y)
820	30105
1050	58448
1550	85911
1200	87967
1600	73722
1117	54630
550	42441
1162	79596

How can we predict prices of the houses in Bengaluru as a function of the size of their living areas, given this data?

Let us establish notation for future use. We will denote the input variables, also called features(living area in the example) by \(x^{(i)}\) and the output or the target variable (price) by \(y^{(i)}\). A pair \(\left(x^{\left(i\right)},y^{\left(i\right)}\right)\) is called a training example. We will be using \(m\) such training examples. The list of \(m\) training examples is called a training set. When we’re trying to predict a target variable which is continuous, we say that the learning problem is a regression problem like our housing example.

To define the supervised learning problem more formally, given a training set, the aim is to learn a function \(h\) so that \(h(x)\) is a predictor for the corresponding value of \(y\). This function \(h\) is called a hypothesis.

Next, we need to decide while designing a learning algorithm is the representation if the hypothesis function \(h\) as a function of \(x\).

Let us initially assume that the hypothesis function looks like this:

\(h_{\theta}(x)=\theta_{0}+\theta_{1}x\)

Here, \(\theta_{0}\mbox{ and }\theta_{1}\) are called parameters.

How do we go about choosing these two parameter values, \(\theta_0 \mbox{ and } \theta_1\) so that it best fits our training set?

We will get different hypothesis functions with different choices of parameters.

In linear regression, we have a training set and we want to come up with values for the parameters so that the straight line we get out of \(h(x)\) somehow fits the data well.
Let’s try to choose values for the parameters so that given the \(x\) in the training set, we make reasonable predictions for the \(y\) values. Formally, we want to solve a minimization problem, that is, we want to minimize the difference between \(h_{\theta}(x) \mbox{ and } y\). To achieve that, we solve the following equation:

\(\displaystyle \min_{\theta_0,\theta_1}\frac{1}{2m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 \)

Here, \(m\) is the number of training examples. To make the math a little bit easier, we put a factor of \(\displaystyle\frac{1}{2m}\), and it gives us the same value of the process.

By convention, we define a cost function:

\(\displaystyle J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 \)

This cost function is also called the squared error function.
The expression \( \min\limits_{\theta_0,\theta_1}\) means that we want to find the values of \(\theta_{0}\mbox{ and }\theta_{1}\) so that the cost function \( J(\theta_0,\theta_1) \) is minimized.
To understand cost function more intuitively, let us consider a simple hypothesis function with a single parameter.
Let the training set be \(S=\{(1,1),(2,2),(3,3)\}\).

Say, \(\theta_{0}=0\), then our hypothesis function is \(h_{\theta}(x)=\theta_1x\) and cost function becomes

\(\displaystyle J(\theta_1)= \frac{1}{6}\sum_{i=1}^3\left(\theta_1x^{(i)}-y^{(i)}\right)^2 \).

This is a plot showing how \(J(\theta_1)\) varies with \(\theta_1\).

This shows that the cost function for \(S\) is minimum at \(\theta_1=1\) and that is what we expect as it will give \(y=x\), which perfectly matches our training set.

Now, if we assume that \(h_{\theta}(x)=\theta_0+\theta_1x\) and plot \(J(\theta_0,\theta_1)\) versus \(\theta_0\mbox{ and }\theta_1\), we get the following contour plot:

contour1 — Contour plot for [latex]\displaystyle J(\theta_0,\theta_1)=\frac{1}{6}\sum_{i=1}^{3} \left(\left(\theta_0+\theta_1x^{(i)}\right)-y^{(i)}\right)^2 [/latex] and training set [latex]\{(1,1),(2,2),(3,3)\}[/latex]

Here x-axis represents \(\theta_1\), y-axis represents \(\theta_0\) and z-axis represents \(J(\theta_0,\theta_1)\). So, the point where the height of the plot is least represents the point where the cost function is minimized. In this case, we expect it to be at \(y=0, x=1\) and that is what the contour plot shows.

Contour lines for cost function — Contour lines for above contour plot

Our objective is to minimize the cost function.

Gradient Descent

Gradient descent is an algorithm that is used to minimize a function. Gradient descent is used not only in linear regression; it is a more general algorithm.

We will now learn how gradient descent algorithm is used to minimize some arbitrary function f and, later on, we will apply it to a cost function to determine its minimum.

We will start off by some initial guesses for the values of \(\theta_0 \mbox{ and }\theta_1\) and then keep on changing the values according to the formula:

\(\displaystyle \theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}f(\theta_0,\theta_1)\mbox{ for j=0,1}\)

Here, \(\alpha\) is called the learning rate, and it determines how big a step needs to be taken when updating the parameters. The learning rate is always a positive number.

We want to simultaneously update \(\theta_0 \mbox{ and }\theta_1\), that is, calculate the right-hand-side of the above equation for both \( j=0 \mbox{ as well as }j=1\) and then update the values of the parameters to the newly calculated ones. This process is repeated till convergence is achieved.

Let us understand the role of learning rate. If \(\alpha\) is too small, then we will end up taking tiny baby steps, which means a lot of steps before we get anywhere near the global minimum. Now, if \(\alpha\) is too large, then there is a possibility that we miss the minimum entirely. It may fail to converge or it can even diverge.

How will one really know that he has already reached the minimum? It turns out that at the local minimum, derivative will be equal to zero because the slope of the tangent line at this point will be equal to zero. So, if the parameters are already at a local minimum then one step with gradient descent does absolutely nothing and that is what we are looking for.

Also, gradient descent converges to the local minimum even when learning rate \(\alpha\) is fixed. Because as the minimum is approached, derivative gets closer and closer to zero. Therefore, after one step of descent, the new derivative is smaller than the previous one. For another step in gradient descent, one will take a somewhat smaller step from the previous. Now, the derivative term is even smaller and so the magnitude of the update to \(\theta_1\) is even smaller and as gradient descent runs, one will automatically end up taking smaller and smaller steps, so there is no need to decrease \(\alpha\) every time.

To summarize our gradient descent algorithm:

\(\displaystyle\mbox{ repeat until convergence }\left\{\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}f(\theta_0,\theta_1)\mbox{ for j=0,1} \right\}\)

Gradient descent for univariate linear regression

For the linear regression model that we have discussed so far, we have the following hypothesis and cost function:

\(\displaystyle h_{\theta}(x)=\theta_0+\theta_1x \mbox{ and } J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2 \)

Now, we are going to apply the gradient descent algorithm to minimize the cost function for the linear regression model. So we have:

\(\displaystyle\theta_j:=\theta_j-\alpha \frac{\partial}{\partial\theta_j}\left(\frac{1}{2m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right)^2 \right)\)

The most important term in the equation above is the derivative term, so let’s calculate it for both \(j=0\) and \(j=1\).

\(\displaystyle \frac{\partial}{\partial\theta_0}\left(\frac{1}{2m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right)^2 \right) = \frac{1}{m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right) = \frac{1}{m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)\)
\(\displaystyle \frac{\partial}{\partial\theta_1}\left(\frac{1}{2m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right)^2 \right) = \frac{1}{m}\sum_{i=1}^{m} \left(\theta_0+\theta_1x^{(i)}-y^{(i)}\right).x^{(i)} = \frac{1}{m}\sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right).x^{(i)} \)

Now, we will look at gradient descent in action. We know that as we go toward the center of the contour plot, we get those values of \(\theta_0,\theta_1\) which yield the minimum value of the cost function. Let’s now approach the center of the contour plot and see how the hypothesis function fits the training set.

Every data point on the contour plot corresponds to \((\theta_1,\theta_0)\), and we have plotted the hypothesis function corresponding to every point. The one that is closest to the training data set is the center of the contour plot.

Gradient Descent with Python

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> x = np.random.uniform(-4,4,500)       
>>> y = x + np.random.standard_normal(500)+2.5
>>> plt.plot(x, y, 'o')
>>> plt.show()
>>> def cost(X, Y, theta):
>>>     J=np.dot((np.dot(X,theta) - Y).T,(np.dot(X,theta) - Y))/(2*len(Y))
>>>     return J
>>> alpha = 0.1 # Specify the learning rate
>>> theta =  np.array([[0,0]]).T # Initial values of theta
>>> X = np.c_[np.ones(500),x]
>>> Y = np.c_[y]
>>> X_1=np.c_[x].T
>>> num_iters = 1000
>>> cost_history=[]
>>> theta_history=[]
>>> for i in range(num_iters):
>>>     a=np.sum(theta[0]- alpha * (1/len(Y)) * np.sum((np.dot(X,theta)- Y)))
>>>     b=np.sum(theta[1] - alpha * (1/len(Y)) * np.sum(np.dot(X_1,(np.dot(X,theta)-Y))))
>>>     theta= np.array([[a],[b]])
>>>     cost_history.append(cost(X,Y,theta))
>>>     theta_history.append(theta)
>>>     if i in (1,3,7,10,14,num_iters): 
>>>         plt.plot(x, a+x*b)
>>>         plt.suptitle('Linear regression by gradient descent')
>>>         plt.xlabel('x')
>>>         plt.ylabel('y')
>>>         plt.show()
>>>     elif i in range(20,num_iters,10):
>>>         plt.plot(x, a+x*b)
>>> print(theta)
[[ 2.51030574]
 [ 0.95243058]]

Following is the plot that is displayed when we execute the code above in Python:

Gradient Descent with R

# Generate random data
> x <- runif(500, -4, 4)  
> y <- x + rnorm(500) + 2.5 
# Define the squared error cost function	
> cost <- function(X, y, theta) {
> sum( (X %*% theta - y)^2 ) / (2*length(y))
> }
> alpha <- 0.1 # Specify the learning rate
> num_iters <- 1000 # Specify the number of iterations
> cost_history <- rep(0,num_iters) # will be used to store the value of cost function after
                                   # every iteration 
> theta_history <- list(num_iters) # will be used to store the value of theta after every 
                                   # iteration 
> theta <-  c(0,0) # Initial values of theta
> X <- cbind(1,x) # Add a column vector with all values  to be 1 to x so that hypothesis 
                  # function has an intercept 
> for (i in 1:num_iters) {
>   theta[1] <- theta[1] - alpha * (1/length(y)) * sum(((X%*%theta)- y))
>   theta[2] <- theta[2] - alpha * (1/length(y)) * sum(((X%*%theta)- y)*X[,2])
>   cost_history[i] <- cost(X, y, theta)
>   theta_history[[i]] <- theta
> } 
> print(theta)
[1] 2.470261 1.020998
# Plots the training dataset
> plot(x,y, col=rgb(0.2,0.4,0.6,0.4), main='Linear regression by gradient descent')
# Plots various lines during the course of convergence
> for (i in c(1,3,6,10,14,seq(20,num_iters,by=10))) {
>  abline(coef=theta_history[[i]], col=rgb(0.8,0,0,0.3))
> }
> abline(coef=theta, col='blue') # Plots a straight line with intercept as theta[1] and slope
                                 # as theta[2]

Following is the plot that is displayed when we execute the code above in R:

Linear regression via gradient descent is conceptually a simple algorithm. Although, for advanced learning algorithms, the basic concepts remain same but the linear model is replaced by a much more complex model and, correspondingly, a much more complex cost function.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

By Role

By need

Product

Content

Find out how HackerEarth can boost your tech recruiting

Gradient descent algorithm for linear regression

Supervised Learning

Gradient Descent

Gradient descent for univariate linear regression

Gradient Descent with Python

Gradient Descent with R

Get advanced recruiting insights delivered every month

Get advanced recruiting insights delivered every month

Get insightful articles from the world of tech recruiting straight to your inbox

Related reads

The complete guide to hiring a Full-Stack Developer using HackerEarth Assessments

Best Interview Questions For Assessing Tech Culture Fit in 2024

Best Hiring Platforms in 2024: Guide for All Recruiters

Best Assessment Software in 2024 for Tech Recruiting

Top Video Interview Softwares for Tech and Non-Tech Recruiting in 2024: A Comprehensive Review

8 Top Tech Skills to Hire For in 2024

Get advanced recruiting insights delivered every month

Top Products

Engage global developers through innovation

AI-driven advanced coding assessments

Real-time code editor for effective coding interviews

Tailored learning paths for continuous assessments

For Businesses

Solutions

Features

Knowledge

Company

Enterprise Readiness