The K-nearest neighbor (KNN) is a supervised machine learning algorithm. KNN is used mostly to classify data points although it can perform regression as well. The K-Nearest Neighbors Algorithm classify new data points to a particular category based on its similarity with the other data points in that category. The KNN algorithms grab the freedom to select any functional form from the training data to map the input data to the target data. KNN belongs to the category of nonparametric algorithms as this algorithm doesn’t take any specific assumption about the mapping function.
Real world use of KNN algorithm
Suppose if we want to understand a new review belongs to the category of positive review or negative review. Each word in review will be converted into a vector for the sake of interpretability of a machine learning algorithm.
The KNN algorithm classifies the text vectors into the category of positive and negative review based on the similarity in the text vector it finds during training. The KNN algorithm learns a mathematical function to classify new data points to suitable categories based on its similarity to a particular group of text vectors.
Working principle of KNN
Given a query point, our task is to find the suitable category to which the query point can be assigned. KNN algorithms performs the following steps to execute this task of classification,
- Decide the number $K$ in KNN
$K$ in KNN algorithms is the number of neighborhood points to a query point. The assignment of a class label to a query point depends on the hyperparameter $K$. i.e., if we assign a value of 5 for $K$, then we have to consider 5 nearest neighbor data points to the query point. The class label will be assigned based on the majority class label out of the 5 neighborhood points. $K$ can be assigned to any positive integer value but it is more likely to assign odd positive integers to avoid the tie arise while taking a majority vote to predict class label.
- Find the similarity based on Euclidean distance
The shortest distance between two data points is called Euclidean distance,
In 2 dimension, the distance between $A(x_1, y_1)$ and $B(x_2, y_2)$ is,
In this approach, the points which are closer are considered as more similar and grouped into the same category. The points which are farther away are treated as points belonging to different categories.
- Measure the Euclidean distance of query point to K nearest neighbors
In the case we discussed with $K=5$, the KNN algorithm analyzes 5 data points which are at the shortest distance from the query point using Euclidean distance measure.
- Assign the class label to the query point
The algorithm predicts the class label of the query point as the same as that of the majority data points in the neighborhood. It estimates out of the $K$ nearest neighbors, to which class the majority of the neighboring points belongs. It is presumed that the query point is more similar to the majority data points in the neighborhood based on Euclidean distance.
Other distance Measures in KNN
Though we discussed Euclidean distance as a distance measure for KNN algorithm in working principle. There are a couple of more distance measures which the KNN algorithm employs to find the distance and thereby the similarity between data points. The distance measures other than the widely used Euclidean distance are as follows,
- Manhattan distance
Manhattan distance is also called city block distance as it is more suitable to measure distance between points on a uniform grid such as a city block in Manhattan or like a chess board.
The Manhattan distance can be expressed as,
$D_M$=$ |X_1-X_2| + |Y_1-Y_2|$
As it is clear from the mathematical formulation, the Manhattan distance is the $L_1$ norm of vectors.
- Minkowski Distance
Minkowski distance is a generalized version of both Euclidean distance and Manhattan distance.
The mathematical formulation of Minkowski distance is,
- Cosine distance
Cosine similarity between two data points is the cosine of the angle between them.
i.e cosine similarity between $x_1$ and $x_2$ separated by an angle $\theta$ is $cos (\theta)$
The cosine similarity and cosine distance are related through the equation,
Cosine distance= 1-cosine similarity
KNN algorithms use cosine distance to find the similarity between nonlinear data points.
KNN algorithms for regression
The procedure for regression using KNN algorithm is almost the same as that of classification. The only difference is, in regression the algorithm takes the mean or median of nearest neighbors to assign the class label whereas for classification problems, the class label is assigned based on majority vote.
Advantages of KNN
- The effectiveness of KNN increase with large training data and so it supports sufficient data representation
- KNN can be used for both classification and regression
- KNN is non parametric and hence no assumptions associated with the underlying data. Hence, it can be used for nonlinear data as well
- Implementation of KNN is simple
Disadvantages of KNN algorithm
- Measuring the distance between all data points in training samples is a computationally expensive task
- If the training data is excessively randomized, then assigning a class label based on distance measure is messy
- KNN is not a suitable option for query points at farther distance from the training data
Underfitting and overfitting in KNN
When the $K$ value is very small like $K=1$ or 2, the decision surface which separates different classes will not be smooth. The decision surface tries to make predictions with high accuracy in this case and it will lead to overfitting.
On the other hand, when the value of $K$ is too high like $K=n$, the decision surface itself vanishes and it results in a situation of classifying every query point as the majority class. This overly simplified assumption causes high bias and underfitting.
Bias-variance trade-off has to be done with hyperparameter tuning of $K$ values in order to get a smooth decision surface. Smooth decision surfaces can guarantee an optimal model which neither overfit nor underfit. Such a decision surface will be less prone to noise.
Space and time complexity of KNN algorithms
Space complexity of any algorithm is a measure of memory space required by an algorithm to process the input for a particular task. The space complexity of KNN algorithm is of the order of $n \times d$, denoted as $O(nd)$. Here, $n$ is the total number of data points used for training and d is the total number of features present.
Time complexity of any algorithm is a measure of how much time it takes to process the input data. Here, in the case of KNN algorithms, the time complexity is also $O(nd)$. However, we can reduce the time complexity of KNN algorithms to $O(log(n)) $ using the kd tree.
If you are looking for more such tutorial style articles in machine learning, you can be find it here.