Logistic regression

Tutorial

Introduction
Till now we have seen regression problems where the prediction was all about the value of a parameter. Logistic regression is used for a different class of problems known as classification problems. Here the aim is to predict the group to which the current object under observation belongs to. Classification is all about portioning the data with us into groups based on certain features. Let us dwell into the most commonly used form of example for this: A tumor has to be classified malignant or benign based on various features like size, location, etc. Let us look into simple logistic regression in which is used for binary classification purposes.

Logistic Regression
As discussed above, in linear regression our aim is to achieve binary classification and hence our hypothesis must be chosen accordingly. Modifying our previous regression models our hypothesis function, we can write $$$Y = C^{T}X$$$ where $$$C = \begin{bmatrix} \alpha \\ \beta_{1} \\ \beta_{2} \\ .. \\ \beta_{n} \\ \end{bmatrix} $$$ and $$$ X = \begin{bmatrix} 1\\ X^{1}\\ X^{2}\\ ..\\ X^{n}\\ \end{bmatrix} $$$ where $$X^{i}$$ is the vector containing $$i^{th}$$ feature value for all the entries in data set.
A sigmoid function is applied over the general known hypothesis function to get it into a range of (0,1). This will be more clear once we discuss about boundary estimation. Sigmoid function is as follows, $$$sg(y) = \frac{1}{1+e^{-y}}$$$ and therefore our new hypothesis function becomes $$$sg(y) = sg(C^{T}x) = \frac{1}{1+e^{-C^{T}x}}$$$ Boundary estimation
The new hypothesis function gives out a value between 0 & 1 and hence can be interpreted as the probability that y value will be 1 for that particular x. This interpretation can be formally put into the following form: $$$sg(y) = P(y = 1 | x ; C )$$$ and as y can take only 0 & 1, the other value probability is 1 minus the hypothesis value.
With the above interpretation we can safely decide the decision boundary with the following rule: $$y = 1$$ if $$sg(y) \ge 0.5$$, else $$y = 0$$. $$sg(C^Tx) \ge 0.5$$ implies $$C^Tx \ge 0$$ and similarly for less than condition. This constraint will give us the boundary for making a decision. When looked at a linear equation of two independent variables, it will be quite clear how a line cuts the coordinate plane into two halves with each class lying on either side of it.

Cost function
With the modified hypothesis function, taking a square error function won't work as it no longer convex in nature and tedious to minimize. We take up a new form of cost function which is as follows:
$$E(sg(C,x),y) = -log(sg(C,x))$$ if $$y=1$$
$$E(sg(C,x),y) = -log(1-sg(C,x))$$ if $$y=0$$
This can be written in a simpler form as: $$$E(sg(C,x),y) = -ylog(sg(C,x)) - (1-y)log(1-sg(C,x))$$$ and it is quiet evident that it is equivalent to the above cost function. For estimation of parameters, we take the mean of cost function over all points in the training data. So, $$$H(C) = \frac{1}{m}\sum_{i=1}^{m}E(sg(C,x_i),y_i)$$$

Parameter Estimation
For parameter estimation, we use an iterative method called gradient descent that improves the parameters over each step and minimizes the cost function $$H(C)$$ to the most possible value. Gradient descent needs a convex cost function so that the minimization step doesn't get stuck over in a local minima. In gradient descent, you start with random parameter values and then update their values at each step to minimize the cost function by a some amount at each step until we reach a minimum hopefully or until there is negligible change over certain no.of consecutive steps. The steps of gradient descent go as follows: $$$\beta_{i} = \beta_{i} - p \frac{\partial H(C)}{\partial \beta_{i}}$$$ for each $$i=1,...n$$ and $$p$$ is the learning rate at which we move along the slope on the curve to minimize the cost function.
We need to look at the above equation in computation terms, i.e. the $$\beta_{i}$$ on the left hand side is the new value while the one on right hand side is the one calculated in previous step. Also, the updating step of all parameters has to be done simultaneously as the value of $$H(C)$$ is calculated using the parameters and it changes when anyone of them is updated.

Contributed by: Shubhakar Reddy Tipireddy

View all comments