
# Univariate linear regression


Introduction
In regression analysis, the aim is to build a model that describes the underlying relationship between the variables under study. Univariate linear regression focuses on determining the relationship between one independent (explanatory) variable and one dependent variable. Regression comes in handy mainly in situations where the relationship between two features is not obvious to the naked eye. For example, it could be used to study how the frequency of terrorist attacks affects the economic growth of countries around the world, or what role unemployment plays in the bankruptcy of a country's government.

Simple linear regression
Given a dataset of pairs $(x_i,y_i)$, where $x_i$ is the explanatory variable and $y_i$ is the dependent variable that varies as $x_i$ does, the simplest model that could be applied to the relation between the two is a linear one. The simple linear regression model is as follows:

$$y_i = \alpha+ \beta*x_i + \epsilon_i$$

$\epsilon_i$ is the random component of the regression that accounts for the residual, i.e. the gap between the estimated and the actual value of the dependent variable. Each observed value $y_i$ is therefore made up of two components:
1. The core term $\alpha+\beta*x_i$, which is not random in nature. This term is the estimated value of the dependent variable, denoted $Y_i$ in what follows. $\alpha$ is known as the constant term or the intercept (it is the y-intercept of the regression line). $\beta$ is the coefficient term or the slope of the regression line.
2. The random component $\epsilon_i$ explained above.
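
To make the two components concrete, here is a minimal sketch that generates data from the model. The function name `simulate_linear_data`, the `seed` parameter, the choice of a normal distribution for $\epsilon_i$ and all numeric values are assumptions made for illustration only.

```python
import random


def simulate_linear_data(alpha, beta, xs, noise_sd, seed=0):
    """Return y_i = alpha + beta * x_i + eps_i for each x_i in xs.

    eps_i is drawn from a normal distribution with mean 0 and standard
    deviation noise_sd (normality is an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same data
    ys = []
    for x in xs:
        deterministic = alpha + beta * x  # component 1: not random
        eps = rng.gauss(0.0, noise_sd)    # component 2: random
        ys.append(deterministic + eps)
    return ys


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_noiseless = simulate_linear_data(2.0, 0.5, xs, noise_sd=0.0)
ys_noisy = simulate_linear_data(2.0, 0.5, xs, noise_sd=0.3)
print(ys_noiseless)  # [2.5, 3.0, 3.5, 4.0, 4.5]
print(ys_noisy)      # the same line with random scatter added
```

With `noise_sd=0.0` every point lies exactly on the line; a larger `noise_sd` scatters the points around it.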

Parameter Estimation
After hypothesizing that Y is linearly related to X, the next step is to estimate the parameters $\alpha$ and $\beta$. The guiding idea is that the estimate Y must be as close as possible to the real data. Hence we use the OLS (ordinary least squares) method to estimate the parameters.
In this method, the function used to estimate the parameters is the sum of squared errors in the estimate of Y, i.e. the sum of squares of the $\epsilon_i$ values. The equation is as follows:

$$E(\alpha,\beta) = \sum\epsilon_{i}^{2} = \sum_{i=1}^{n}(Y_{i}-y_{i})^2$$
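
As a rough illustration, the function $E(\alpha,\beta)$ can be evaluated directly for any candidate pair of parameters. The name `sum_squared_errors` and the data points are made up for this sketch.

```python
def sum_squared_errors(alpha, beta, xs, ys):
    """E(alpha, beta): sum of (Y_i - y_i)^2 with Y_i = alpha + beta * x_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        estimate = alpha + beta * x    # Y_i, the estimated value
        total += (estimate - y) ** 2   # squared error for this point
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up data lying exactly on y = 1 + 2x

print(sum_squared_errors(1.0, 2.0, xs, ys))  # 0.0, the true line fits perfectly
print(sum_squared_errors(0.0, 2.0, xs, ys))  # 4.0, every estimate is off by 1
```

The better the candidate line, the smaller the value; OLS picks the pair that makes it as small as possible.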

The function $E(\alpha,\beta)$ is to be minimized to get the best possible estimate for our model. This is done by substituting $Y_i = \alpha+\beta*x_i$ and setting the first partial derivatives with respect to $\alpha$ and $\beta$ to 0.

$$\frac{\partial E(\alpha,\beta)}{\partial \alpha} = -2\sum_{i=1}^{n}(y_i-\alpha-\beta*x_{i}) = 0$$

$$\frac{\partial E(\alpha,\beta)}{\partial \beta} = -2\sum_{i=1}^{n}(y_i-\alpha-\beta*x_{i})x_{i} = 0$$

Solving the system of equations for $\alpha$ and $\beta$ leads to the following values:

$$\beta = \frac{Cov(x,y)}{Var(x)} = \frac{\sum_{i=1}^{n}(y_i-y^{'})(x_i-x^{'})}{\sum_{i=1}^{n}(x_i-x^{'})^2}$$

$$\alpha = y^{'}-\beta*x^{'}$$

where $x^{'}$ and $y^{'}$ are the mean values of $x$ and $y$.
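
The closed-form solution translates almost line by line into code. The following is a minimal sketch in plain Python; the function name `fit_ols` and the sample data are illustrative assumptions.

```python
def fit_ols(xs, ys):
    """Return (alpha, beta) minimizing the sum of squared errors."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")

    beta = sxy / sxx
    alpha = y_mean - beta * x_mean
    return alpha, beta


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # made-up observations
alpha, beta = fit_ols(xs, ys)
print(alpha, beta)  # approximately 2.2 and 0.6
```

For this data the means are 3 and 4, the numerator is 6 and the denominator is 10, giving a slope of 0.6 and an intercept of 4 - 0.6 * 3 = 2.2.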

To verify that these parameters indeed minimize the function, the second-order partial derivatives should be collected into the Hessian matrix, which must be positive definite. For this problem that holds whenever the $x_i$ are not all equal.
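
The second-order check can be sketched numerically. Differentiating $E(\alpha,\beta)$ twice gives the entries $2n$, $2\sum x_i$ and $2\sum x_i^2$, and for a symmetric 2x2 matrix positive definiteness can be tested with Sylvester's criterion (positive top-left entry and positive determinant). The function names below are hypothetical.

```python
def hessian_of_sse(xs):
    """Hessian of E(alpha, beta); it depends only on the x values."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return [[2.0 * n, 2.0 * sum_x],
            [2.0 * sum_x, 2.0 * sum_x2]]


def is_positive_definite_2x2(h):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return h[0][0] > 0 and det > 0


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
h = hessian_of_sse(xs)
print(h)                            # [[10.0, 30.0], [30.0, 110.0]]
print(is_positive_definite_2x2(h))  # True

# If every x is identical the determinant is 0 and the check fails.
print(is_positive_definite_2x2(hessian_of_sse([2.0, 2.0, 2.0])))  # False
```

The determinant equals $4n\sum(x_i-x^{'})^2$, which is why the check only fails when all the $x_i$ coincide.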

Evaluating our model
To evaluate the estimated model, we use the coefficient of determination, which is given by the following formula:

$$R^{2} = 1-\frac{\mbox{Residual Square Sum}}{\mbox{Total Square Sum}} = 1-\frac{\sum_{i=1}^{n}(y_i-Y_i)^{2}}{\sum_{i=1}^{n}(y_i-y^{'})^{2}}$$

where $y^{'}$ is the mean value of $y$.

In the case of the OLS model, $\mbox{Total Square Sum - Residual Square Sum = Explained Square Sum }= \sum_{i=1}^{n}(Y_i-y^{'})^{2}$ and hence

$$R^{2} = \frac{\sum_{i=1}^{n}(Y_i-y^{'})^{2}}{\sum_{i=1}^{n}(y_i-y^{'})^{2}}$$
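
As a quick numerical check, the sketch below fits a line by OLS and computes $R^{2}$ in both forms so they can be compared. The function name `r_squared_both_ways` and the data are assumptions made for this example.

```python
def r_squared_both_ways(xs, ys):
    """Fit y = alpha + beta * x by OLS and return R^2 computed two ways."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # OLS estimates of the slope and intercept.
    beta = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
            / sum((x - x_mean) ** 2 for x in xs))
    alpha = y_mean - beta * x_mean
    fitted = [alpha + beta * x for x in xs]  # the estimates Y_i

    residual_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    total_ss = sum((y - y_mean) ** 2 for y in ys)
    explained_ss = sum((f - y_mean) ** 2 for f in fitted)
    if total_ss == 0:
        raise ValueError("y is constant, so R^2 is undefined")

    return 1.0 - residual_ss / total_ss, explained_ss / total_ss


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # made-up observations
r2_from_residuals, r2_from_explained = r_squared_both_ways(xs, ys)
print(r2_from_residuals, r2_from_explained)  # both approximately 0.6
```

Here the residual sum is 2.4, the total sum is 6 and the explained sum is 3.6, so both expressions give about 0.6. The two agree only because the line was fitted by OLS with an intercept.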

Contributed by: Shubhakar Reddy Tipireddy