Logistic Regression for Classification in Machine Learning

Logistic regression is a supervised machine learning technique for classification tasks. It is a classification algorithm based on a probabilistic approach.

Logistic regression can be used to make predictions on a categorical dependent variable. The algorithm learns the relationship between a given set of independent input variables and the target variable, and uses it to make predictions on the target variable. The target variable has to be a categorical dependent variable. Logistic regression predicts the probability that the target variable belongs to a particular class, as a value that lies between 0 and 1.

The method is called logistic regression even though it performs classification, because the underlying technique is similar to that of linear regression. The algorithm computes a linear combination of the inputs, as linear regression does, and passes the result through a sigmoid function (the inverse of the logit function) to obtain the probability that an instance belongs to a particular class.
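As a minimal sketch of this idea, the snippet below computes a linear score and passes it through a sigmoid to get a class probability. The weights, bias, input values, and the 0.5 decision threshold are hypothetical choices made only for illustration.

```python
import math

def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability that `features` belongs to the positive class."""
    # Linear-regression-style score: w . x + b
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights and input, chosen only for illustration.
weights = [0.8, -0.4]
bias = 0.1
p = predict_proba(weights, bias, [2.0, 1.0])

# Assumed decision rule: predict the positive class when p >= 0.5.
label = 1 if p >= 0.5 else 0
```

Here the linear score is 1.3, so the predicted probability is above 0.5 and the point is assigned to the positive class.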

Criteria for logistic regression

The independent variables should not be correlated with one another, i.e., multicollinearity among input features has to be avoided.
The target variable in logistic regression has to be a categorical dependent variable.

Need for the sigmoid function in logistic regression

Logistic regression works on the principle of linear separability of data points. Like the Support Vector Machine method, a hyperplane separates the data points. The data points belonging to the negative class are denoted by -1 and the data points belonging to the positive class are denoted by +1.

Figure 1. Schematic of hyperplane separating data points belonging to different categories

Suppose $W$ is the normal vector to the hyperplane (Figure 1). Then the distance $d$ of the point $X_i$ from the hyperplane is

$$d=\frac{W^TX_i}{||W||}$$

If $W$ is a unit normal vector, then $d=W^TX_i$.

The classifier assigns the data points that lie on the side of the hyperplane toward which the normal vector points to the positive class, and the remaining data points to the negative class.

For this, $W^TX_i>0,Y_i=+1$ and $W^TX_i<0,Y_i=-1$

Where, $Y_i$ is the actual or true class label of the data points.

This implies that, for correctly classified points, $Y_iW^TX_i$ is always greater than zero.

Mathematically, $Y_iW^TX_i>0$ for correctly classified points, because positive points have $W^TX_i>0$ and $Y_i = +1$, and negative points have $W^TX_i<0$ and $Y_i = -1$.

Similarly, $Y_iW^TX_i<0$ for incorrectly classified points: these are positive points ($Y_i = +1$) with $W^TX_i<0$ and negative points ($Y_i = -1$) with $W^TX_i>0$.
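The sign rule above can be checked with a short sketch. The normal vector and the three labeled points are hypothetical, and the hyperplane is assumed to pass through the origin, as in the formulas above.

```python
import math

def signed_distance(w, x):
    """Signed distance of point x from the hyperplane w . x = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return sum(wi * xi for wi, xi in zip(w, x)) / norm

def is_correct(w, x, y):
    """A point is correctly classified when y * (w . x) > 0, with y in {-1, +1}."""
    return y * signed_distance(w, x) > 0

w = [1.0, 1.0]  # hypothetical normal vector
points = [
    ([2.0, 1.0], +1),    # positive point on the positive side
    ([-1.0, -3.0], -1),  # negative point on the negative side
    ([-2.0, 0.5], +1),   # positive point on the negative side
]
results = [is_correct(w, x, y) for x, y in points]
```

The first two points give a positive product and count as correctly classified; the third gives a negative product and counts as a misclassification.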

The objective of logistic regression is to perform the classification task with the minimum number of misclassifications.

Thus, the algorithm has to find a $W$ that maximizes the sum of $Y_iW^TX_i$ over all points, because $Y_iW^TX_i>0$ for correctly classified points.

i.e., for better performance, we have to find the $W$ given by $W = argmax \sum_{i=1}^{n} Y_iW^TX_i$

However, outliers have a large impact on the value of $W$, as can be seen in Figure 2.

Figure 2. Schematic to demonstrate impact of outlier

In this case, $\sum_{i=1}^{n}Y_iW^TX_i = 1+1+1+1+1+1-50$

This shows that a single outlier can dominate the sum, so the model is vulnerable to small changes in the data set. To overcome this problem, we need a function that squashes such large values into a bounded range. The sigmoid function has this squashing property.

Figure 3. Sigmoid function (source: Wikipedia)

Mathematically, $sigmoid(z)= \frac{1}{1+exp(-z)}$

Here $z = Y_iW^TX_i$, so $sigmoid(z)= \frac{1}{1+exp(-Y_iW^TX_i)}$

For points that lie on the hyperplane, $W^TX_i = 0$, which gives a probability of 0.5 that the point belongs to either the positive or the negative class.

From the graph, $sigmoid(0) = 0.5$
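A minimal sketch of the sigmoid function follows. The two-branch form is an implementation choice made here to avoid overflow in `exp` for inputs of large magnitude; it is not something the formula above requires.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: avoids overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) is in (0, 1) and cannot overflow
    return ez / (1.0 + ez)

on_hyperplane = sigmoid(0.0)   # a point with W^T X_i = 0
far_positive = sigmoid(1000.0)
far_negative = sigmoid(-1000.0)
```

A point on the hyperplane maps to exactly 0.5, while scores far from the hyperplane saturate toward 1 or 0.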

Using the sigmoid function makes logistic regression less sensitive to outliers. The sigmoid function squashes values from $(-\infty, +\infty)$ into $(0, 1)$.
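To see the squashing effect on the Figure 2 example, the sketch below compares the raw sum of $Y_iW^TX_i$ with the sum of the squashed values. The list of values (six points at +1 and one outlier at -50) is taken from the sum shown earlier; everything else is illustrative.

```python
import math

def sigmoid(z):
    """Squash a real value into (0, 1), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Y_i * W^T X_i for six correctly classified points and one outlier.
margins = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -50.0]

raw_sum = sum(margins)                            # dominated by the outlier
squashed_sum = sum(sigmoid(m) for m in margins)   # outlier contributes almost 0

# Contribution of the outlier alone, before and after squashing.
outlier_raw = margins[-1]
outlier_squashed = sigmoid(margins[-1])
```

In the raw sum the outlier outweighs all six correctly classified points. After squashing, each term is bounded between 0 and 1, so the outlier can lower the total by at most 1.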

The sigmoid function helps to solve the optimization problems in logistic regression because it is a differentiable function with probabilistic interpretations.

Optimization equation of logistic regression

As with many other regression and classification methods in machine learning, the optimization problem in logistic regression has two parts: the loss part and the regularization part.

The loss function used in logistic regression is called the logistic loss,

$logistic\_loss = \sum_{i=1}^{n} log(1+exp(-Y_iW^TX_i))$

Here $Y_i$ can have values $-1$ or $+1$, and the algorithm looks for the $W$ that minimizes this loss.
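The logistic loss can be computed directly from its definition, as in the sketch below. The data set and weight vectors are hypothetical, and the rewritten form of $log(1+exp(-m))$ is a numerical-stability choice made here, not part of the definition.

```python
import math

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w.x_i)) over all points, y_i in {-1, +1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
        # log(1 + exp(-m)) == max(-m, 0) + log1p(exp(-|m|)), which avoids overflow
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total

# Hypothetical two-point data set.
X = [[2.0, 0.0], [-1.0, 0.0]]
y = [+1, -1]

loss_zero = logistic_loss([0.0, 0.0], X, y)   # every margin is 0
loss_good = logistic_loss([1.0, 0.0], X, y)   # both points correctly classified
loss_bad = logistic_loss([-1.0, 0.0], X, y)   # both points misclassified
```

With all-zero weights each point contributes $log(2)$. Weights that classify both points correctly give a smaller loss, and weights that misclassify them give a larger one.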

Optimization problem with $L_1$ regularizer is,

$W = argmin \sum_{i=1}^{n} log(1+exp(-Y_iW^TX_i))+ \lambda ||W||_1$

Where $\lambda$ is the hyperparameter.

With the $L_1$ regularizer, the weights of less important features are driven to exactly zero, because the $L_1$ regularizer induces sparsity.

Optimization problem with $L_2$ regularizer is,

$W = argmin \sum_{i=1}^{n} log(1+exp(-Y_iW^TX_i))+ \lambda ||W||_2^2$

The $L_2$ regularizer gives small weights to less important features, and thereby preserves them instead of making them zero.

Optimization problem with a combination of $L_1$ and $L_2$ regularizers is called Elastic net,

$W = argmin \sum_{i=1}^{n} log(1+exp(-Y_iW^TX_i))+ \lambda_1 ||W||_1+\lambda_2 ||W||_2^2$

Where $\lambda_1$ and $\lambda_2$ are separate hyperparameters for the two penalties.
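The three regularized objectives differ only in the penalty term, which the sketch below makes explicit. It assumes the common squared form of the $L_2$ penalty and separate strengths `lam1` and `lam2` for the two penalties; the data, weights, and strengths are hypothetical.

```python
import math

def data_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w.x_i)) with labels y_i in {-1, +1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
        total += math.log(1.0 + math.exp(-margin))
    return total

def objective(w, X, y, lam1=0.0, lam2=0.0):
    """Logistic loss plus lam1 * ||w||_1 plus lam2 * ||w||_2^2 (assumed squared)."""
    l1 = sum(abs(wj) for wj in w)
    l2_squared = sum(wj * wj for wj in w)
    return data_loss(w, X, y) + lam1 * l1 + lam2 * l2_squared

# Hypothetical data and weights.
X = [[1.0, 0.5], [-1.0, -2.0]]
y = [+1, -1]
w = [3.0, -4.0]

base = objective(w, X, y)                         # no regularization
l1_only = objective(w, X, y, lam1=0.1)            # L1 penalty: 0.1 * 7
l2_only = objective(w, X, y, lam2=0.1)            # L2 penalty: 0.1 * 25
elastic = objective(w, X, y, lam1=0.1, lam2=0.1)  # both penalties
```

This only evaluates the objectives for a fixed $W$; finding the minimizing $W$ needs an optimizer such as gradient descent.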

The above formulations are based on the geometric interpretation. The probabilistic formulation of the optimization problem in logistic regression is,

$W = argmin \sum_{i=1}^{n} -Y_i log(p_i)-(1-Y_i)log(1-p_i)$

Where $p_i = sigmoid(W^TX_i)$, here $Y_i$ can have values 0 or 1.
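The geometric and probabilistic formulations give the same per-point loss once the labels are mapped between the two conventions, i.e. a 0/1 label $Y_i$ corresponds to the label $2Y_i - 1$ in the $-1$/$+1$ convention. The sketch below checks this numerically for a few hypothetical scores $W^TX_i$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(score, y01):
    """Probabilistic form: -y*log(p) - (1-y)*log(1-p), with y in {0, 1}."""
    p = sigmoid(score)
    return -y01 * math.log(p) - (1 - y01) * math.log(1 - p)

def logistic_loss(score, y_pm):
    """Geometric form: log(1 + exp(-y * score)), with y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y_pm * score))

# Hypothetical values of W^T X_i.
scores = [-2.0, -0.5, 0.0, 1.5, 3.0]

# Largest difference between the two forms over all scores and both labels.
max_gap = max(
    abs(cross_entropy(s, y01) - logistic_loss(s, 2 * y01 - 1))
    for s in scores
    for y01 in (0, 1)
)
```

The gap is zero up to floating-point rounding, which is consistent with the two formulations describing the same optimization problem.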

Advantages of logistic regression

New data can be incorporated into a logistic regression model using stochastic gradient descent.
A neural network can be understood as a collection of small logistic regression modules assembled together.
Logistic regression can be extended to multiclass classification using the softmax function. This approach is called multinomial logistic regression.
It performs well on low-dimensional data sets, with a low risk of overfitting.
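The first advantage, updating the model as new data arrives, can be sketched as one stochastic gradient step per incoming point. The learning rate, starting weights, and data stream below are hypothetical, and the unregularized logistic loss with labels in $\{-1, +1\}$ is assumed.

```python
import math

def sgd_step(w, x, y, lr=0.1):
    """One stochastic gradient step on log(1 + exp(-y * w.x)), y in {-1, +1}."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    # Gradient w.r.t. w is -y * x / (1 + exp(margin)); step in the opposite direction.
    scale = y / (1.0 + math.exp(margin))
    return [wj + lr * scale * xj for wj, xj in zip(w, x)]

# Hypothetical stream of (features, label) pairs arriving one at a time.
stream = [([1.0, 2.0], +1), ([-1.5, -0.5], -1), ([2.0, 1.0], +1)]

w = [0.0, 0.0]
first_update = sgd_step(w, stream[0][0], stream[0][1])
for x, y in stream:
    w = sgd_step(w, x, y)  # the model is updated without revisiting old data
```

Each step only needs the current weights and the new point, so previously seen data does not have to be stored or reprocessed.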

Disadvantages of logistic regression

Logistic regression is vulnerable to overfitting if the number of observations is less than the number of features.
Logistic regression requires a large data set with many training samples to classify with good performance.
It cannot be used for a target variable that is not discrete.
It is not a suitable choice for complex data sets, such as those whose classes are not linearly separable.

Logistic regression in machine learning for real world applications

Logistic regression can be used for real world applications such as:

Handwriting recognition
Email filtering, e.g. spam or not spam
Image segmentation in scans
Disease prediction, e.g. whether a cell is malignant or not
Detection of objects in video