You are probably using machine learning many times a day without realizing it. For instance, every time you check your inbox, a spam filter automatically weeds out junk mail, thanks to Machine Learning (ML). ML is the science of getting computers to learn from past data without being explicitly programmed.
There are two main types of ML techniques:
- Supervised Learning: The system learns from labeled training data to predict outcomes for new, unseen inputs.
- Unsupervised Learning: The system discovers hidden patterns in data without any labels, for example identifying groups of closely connected friends on Facebook.
Supervised Learning
Consider the following dataset showing house prices in Bengaluru, India:
| Living area (sq ft) | Price (USD) |
|---|---|
| 820 | 30105 |
| 1050 | 58448 |
| 1550 | 85911 |
| 1200 | 87967 |
| 1600 | 73722 |
| 1117 | 54630 |
| 550 | 42441 |
| 1162 | 79596 |
To predict housing prices from living area, we define a hypothesis function hθ(x) = θ₀ + θ₁x, where x is the living area and hθ(x) is the predicted price. Here, θ₀ and θ₁ are the parameters we want to learn. Our goal is to minimize the error between predicted and actual prices, which we quantify with a cost function:
J(θ₀, θ₁) = (1/2m) Σ(hθ(x(i)) − y(i))²
This cost function measures half the mean squared error over the m training examples. Our objective is to find the θ₀ and θ₁ that minimize this cost.
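To make this concrete, here is a small Python sketch (not part of the original derivation) that evaluates the cost on the eight houses from the table for two parameter guesses; the specific θ values are purely illustrative:

import numpy as np

# Living areas and prices from the table above
area = np.array([820, 1050, 1550, 1200, 1600, 1117, 550, 1162])
price = np.array([30105, 58448, 85911, 87967, 73722, 54630, 42441, 79596])

def J(theta0, theta1):
    # (1/2m) * sum of squared differences between predicted and actual prices
    predictions = theta0 + theta1 * area
    return np.sum((predictions - price) ** 2) / (2 * len(price))

print(J(0, 0))    # cost when both parameters are zero
print(J(0, 55))   # cost with an arbitrary slope of 55 USD per sq ft

A guess that fits the data better produces a lower cost, which is exactly what gradient descent will exploit below.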
Gradient Descent
Gradient descent is an optimization algorithm used to minimize functions like our cost function. Starting with initial guesses for θ₀ and θ₁, we iteratively update them using:
θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁)   (for j = 0 and j = 1)
Here, α is the learning rate, which controls the size of each step, and both parameters are updated simultaneously. The updates continue until convergence, that is, until the changes become negligible.
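As a toy illustration of the update rule, separate from the housing data, the sketch below runs gradient descent on the one-parameter function J(θ) = (θ − 3)², whose derivative is 2(θ − 3); the starting point and learning rate are arbitrary choices:

theta = 10.0   # arbitrary starting guess
alpha = 0.1    # learning rate

for _ in range(50):
    grad = 2 * (theta - 3)         # dJ/dtheta for J(theta) = (theta - 3)^2
    theta = theta - alpha * grad   # the gradient descent update

print(theta)   # approaches 3, the minimizer of J

With too large an α the iterates can overshoot and diverge; with too small an α convergence is slow.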
Applying Gradient Descent to Linear Regression
The partial derivatives of the cost function with respect to θ₀ and θ₁ are:
- ∂J/∂θ₀ = (1/m) Σ(hθ(x(i)) − y(i))
- ∂J/∂θ₁ = (1/m) Σ(hθ(x(i)) − y(i))x(i)
Using these, we can apply gradient descent and iteratively update θ₀ and θ₁ to minimize the cost function.
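As a rough sketch of how these derivatives translate into code (reusing the `area` and `price` arrays from the earlier snippet, with a deliberately small learning rate because the features are unscaled), a single simultaneous update looks like this:

m = len(price)
theta0, theta1 = 0.0, 0.0
alpha = 1e-7   # small learning rate chosen for this unscaled data (illustrative choice)

errors = (theta0 + theta1 * area) - price   # h(x) - y for every training example

# simultaneous update using the two partial derivatives above
theta0, theta1 = (theta0 - alpha * (1 / m) * np.sum(errors),
                  theta1 - alpha * (1 / m) * np.sum(errors * area))

Repeating this update many times drives the cost down; the next section does exactly that on synthetic data.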
Gradient Descent with Python
import numpy as np
import matplotlib.pyplot as plt

# Generate 500 synthetic points around the line y = x + 2.5 with Gaussian noise
x = np.random.uniform(-4, 4, 500)
y = x + np.random.standard_normal(500) + 2.5

def cost(X, Y, theta):
    # J(theta) = (1/2m) * sum of squared prediction errors
    error = np.dot(X, theta) - Y
    return np.dot(error.T, error).item() / (2 * len(Y))

alpha = 0.1                        # learning rate
num_iters = 1000                   # number of gradient descent iterations
theta = np.array([[0.0], [0.0]])   # initial guesses for theta0 and theta1
X = np.c_[np.ones(500), x]         # design matrix with an intercept column
Y = np.c_[y]                       # targets as a column vector
cost_history = []
theta_history = []

plt.plot(x, y, 'o')                # scatter plot of the training data

for i in range(num_iters):
    error = np.dot(X, theta) - Y
    # simultaneous update of theta0 and theta1
    a = theta[0, 0] - alpha * (1 / len(Y)) * np.sum(error)
    b = theta[1, 0] - alpha * (1 / len(Y)) * np.sum(error * np.c_[x])
    theta = np.array([[a], [b]])
    cost_history.append(cost(X, Y, theta))
    theta_history.append(theta)
    # overlay the current fit at a few early iterations and every 10th one afterwards
    if i in (1, 3, 7, 10, 14) or (i >= 20 and i % 10 == 0):
        plt.plot(x, a + b * x, color='red', alpha=0.2)

plt.title('Linear regression by gradient descent')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
print(theta)
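As a quick sanity check (not part of the original listing), the parameters found by gradient descent can be compared with NumPy's least-squares polynomial fit, which should return a slope close to 1 and an intercept close to 2.5 for this synthetic data:

slope, intercept = np.polyfit(x, y, 1)
print(intercept, slope)   # should be close to the theta printed above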
Gradient Descent with R
# Generate 500 synthetic points around the line y = x + 2.5 with Gaussian noise
x <- runif(500, -4, 4)
y <- x + rnorm(500) + 2.5

# J(theta) = (1/2m) * sum of squared prediction errors
cost <- function(X, y, theta) {
  sum((X %*% theta - y)^2) / (2 * length(y))
}

alpha <- 0.1                               # learning rate
num_iters <- 1000                          # number of gradient descent iterations
cost_history <- rep(0, num_iters)
theta_history <- vector("list", num_iters)
theta <- c(0, 0)                           # initial guesses for theta0 and theta1
X <- cbind(1, x)                           # design matrix with an intercept column

for (i in 1:num_iters) {
  error <- X %*% theta - y
  # simultaneous update of theta0 and theta1
  theta[1] <- theta[1] - alpha * (1 / length(y)) * sum(error)
  theta[2] <- theta[2] - alpha * (1 / length(y)) * sum(error * X[, 2])
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}

print(theta)

# Plot the data, a few intermediate fits, and the final regression line
plot(x, y, col = rgb(0.2, 0.4, 0.6, 0.4), main = 'Linear regression by gradient descent')
for (i in c(1, 3, 6, 10, 14, seq(20, num_iters, by = 10))) {
  abline(coef = theta_history[[i]], col = rgb(0.8, 0, 0, 0.3))
}
abline(coef = theta, col = 'blue')
Linear regression via gradient descent is simple and intuitive. Although advanced learning algorithms may use more complex models and cost functions, the underlying principles remain the same.