
# K-Nearest Neighbor (KNN) Algorithm

## The KNN algorithm stores all the existing data and classifies a new data point based on its similarity to the stored points.

Indirakumar S · Aug 10, 2022

• K-Nearest Neighbor Algorithm
• Algorithm
• Why KNN?
• Example
• Conclusion

# K-Nearest Neighbor Algorithm

The k-nearest neighbours (KNN) technique estimates which group a data point belongs to by looking at the groups of the data points closest to it.

The k-nearest neighbour algorithm is a supervised machine learning technique used to solve classification and regression problems. Its primary application, however, is classification.
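Classification gets most of the attention in this article, so here is a minimal sketch of the regression case: the prediction is simply the average target value of the K nearest training points. The function name `knn_regress` and the toy data (floor area versus price) are hypothetical and chosen only for illustration.

```python
import math

def knn_regress(train_X, train_y, query, k=3):
    """Predict a number as the mean target of the k nearest training points."""
    # Pair the distance from each training point to the query with its target.
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Sort by distance only and keep the k closest pairs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical toy data: one feature (floor area) and a numeric target (price).
train_X = [(50,), (60,), (80,), (100,), (120,)]
train_y = [150.0, 180.0, 240.0, 300.0, 360.0]

# The two nearest areas to 70 are 60 and 80, so the prediction is their mean price.
prediction = knn_regress(train_X, train_y, (70,), k=2)
print(prediction)
```

Averaging is the simplest choice; weighting each neighbour by the inverse of its distance is a common alternative.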

KNN is a lazy, non-parametric learning method. It is known as a lazy learning algorithm (or lazy learner) because it performs no training when you provide the training data: during the training phase it does no calculations and simply stores the data. No model is built until a query is run against the dataset.

This makes KNN a popular choice in data mining. It is classified as a non-parametric method because it makes no assumptions about the underlying data distribution. In short, KNN seeks to identify the group to which a data point belongs by examining the data points around it.

Consider two groups, A and B. To determine whether a data point belongs to group A or group B, the algorithm examines the labels of the nearby data points. If the majority of those neighbours are in group A, the data point in question is very likely in group A as well, and vice versa. KNN, also called the nearest-neighbour method, is thus a technique for categorizing data points by comparing them to their nearest labelled data points.

K-NN classification should not be confused with K-means clustering. KNN is a supervised classification algorithm that labels new data points according to the labelled points nearest to them. K-means clustering, on the other hand, is an unsupervised clustering algorithm that divides data into K clusters.
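The difference is easiest to see side by side. The sketch below uses the same six one-dimensional points twice: KNN needs a label for every stored point, while K-means receives only the points and finds cluster centres on its own. The helper names (`knn_classify`, `kmeans_1d`), the data, and the starting centroids are all assumptions made for this illustration.

```python
from collections import Counter

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]

# KNN (supervised): every stored point comes with a known label.
labels = ["A", "A", "A", "B", "B", "B"]

def knn_classify(query, k=3):
    # Take the k labelled points closest to the query and let them vote.
    nearest = sorted(zip(points, labels), key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# K-means (unsupervised): only the points are used, never the labels.
def kmeans_1d(data, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x in data:
            # Assign each point to its closest centroid.
            idx = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[idx].append(x)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

knn_label = knn_classify(7.0)                         # a label for a new point
centroids = kmeans_1d(points, centroids=[0.0, 10.0])  # cluster centres, no labels
print(knn_label, centroids)
```

KNN answers "which known class does this new point belong to?", whereas K-means answers "how do these unlabelled points group together?".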

# Algorithm

KNN can be implemented in programming languages such as Python and R. The pseudocode is as follows:

1. Load the data.

2. Select a value for K.

3. For every test point to be classified:

3.1. Compute the Euclidean distance from the test point to every training sample.

3.2. Store the distances in a list and sort it in ascending order.

3.3. Pick the first K items in the sorted list.

3.4. Label the test point with the majority class among the chosen points.

4. End
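The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function name `knn_predict` and the toy data are hypothetical, and ties in the vote are broken in favour of the class encountered first among the nearest points.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Step 3.1: Euclidean distance from the query to every training sample.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3.2: sort the list by distance, ascending.
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the first K items.
    nearest = distances[:k]
    # Step 3.4: majority vote among the K nearest labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Step 1: load the data (hypothetical two-feature points in groups A and B).
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
train_y = ["A", "A", "A", "B", "B", "B"]

# Step 2: select K.
k = 3

# Step 3: classify every test point.
test_points = [(1.2, 1.4), (6.8, 6.5)]
predicted = [knn_predict(train_X, train_y, q, k) for q in test_points]
print(predicted)
```

Note that all the work happens inside `knn_predict`; there is no separate training step, which is exactly what makes KNN a lazy learner.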

# Why KNN?

Classification is a central problem in data science and machine learning. KNN is one of the earliest algorithms for pattern classification and regression, and it is often accurate despite its simplicity.

The following are some applications for the k-nearest neighbour algorithm:

• Credit rating

• Loan approval

• Data preprocessing

• Pattern recognition

KNN can also be used in recommendation systems because it can help find users with similar characteristics. It can, for example, be used on an online video streaming platform to recommend content that a user is likely to watch, based on what similar users watch.

KNN is also used in computer vision for image classification. It is helpful in a variety of computer vision applications because it can group similar data points, for example placing cats in one class and dogs in another.

KNN has several advantages:

• It is simple to understand and implement.
• It applies to both classification and regression problems.
• It suits non-linear data because it makes no assumptions about the underlying data.
• It is naturally capable of handling multi-class cases.
• With enough representative data, it can perform well.

It also has some disadvantages:

• Because it stores all of the training data and compares every query against it, the computation cost is high.
• High memory storage is required.
• The value of K must be chosen.
• If the number of training samples N is large, prediction takes a long time.
• It is sensitive to irrelevant features.
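Since K has to be chosen, one common approach is to try several values and keep the one that performs best on held-out data. The sketch below uses leave-one-out evaluation on a tiny hypothetical dataset that contains one deliberately mislabelled point; the helper names and the candidate values of K are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    # Sort training samples by distance to the query and vote among the k nearest.
    ranked = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def leave_one_out_accuracy(X, y, k):
    correct = 0
    for i in range(len(X)):
        # Hold out sample i and predict it from all the others.
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical data: two clean clusters plus one noisy point labelled "B"
# that sits in the middle of the "A" cluster.
X = [(1, 1), (1, 2), (2, 1), (2, 2),
     (6, 6), (6, 7), (7, 6), (7, 7),
     (1.5, 1.5)]
y = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]

scores = {k: leave_one_out_accuracy(X, y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # the first K with the highest accuracy
print(scores, best_k)
```

With K = 1 the single noisy point misleads every "A" sample next to it, while a larger K outvotes it. This is why very small values of K tend to be sensitive to noise.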

# Example

## Image classification

``````
import numpy as np  # linear algebra
import pandas as pd  # data processing
import matplotlib.pyplot as plt  # to plot images and graphs
import time
%matplotlib inline
``````
``````
# dataset of handwritten digits (0-9)
from sklearn.datasets import load_digits

digits = load_digits()
``````
``````
digits.keys()
# dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
``````
``````
# dataset description
digits.DESCR
``````

``````
digits.images[0]  # the first image: an 8x8 array of pixel intensities

'''array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
[ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
[ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
[ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
[ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
[ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
[ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
[ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])'''
``````

The predictors (also called independent variables or features):

``````
digits.data

'''array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
[ 0.,  0.,  0., ..., 10.,  0.,  0.],
[ 0.,  0.,  0., ..., 16.,  9.,  0.],
...,
[ 0.,  0.,  1., ...,  6.,  0.,  0.],
[ 0.,  0.,  2., ..., 12.,  0.,  0.],
[ 0.,  0., 10., ..., 12.,  1.,  0.]])'''
``````

The target variable (also called the class or dependent variable):

``````
digits.target

# array([0, 1, 2, ..., 8, 9, 8])
``````

The dataset contains 1797 images, each 8 by 8 pixels (64 features when flattened).

``````
print('Image Data Shape', digits.images.shape)
# Image Data Shape (1797, 8, 8)

# 1797 labels
print('Label Data Shape', digits.target.shape)
# Label Data Shape (1797,)
``````
``````
X = digits.images
plt.figure(figsize=(20, 10))
columns = 5
# show the first five digit images side by side
for i in range(5):
    # integer division: subplot expects whole numbers for the grid size
    plt.subplot(5 // columns + 1, columns, i + 1)
    plt.imshow(X[i], cmap=plt.cm.gray_r, interpolation='nearest')

from sklearn.metrics import accuracy_score, confusion_matrix  # evaluation metrics
from sklearn.model_selection import train_test_split  # train/test splitting
X = digits.data
y = digits.target
``````

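Since this is a multi-class prediction problem, the example wraps the classifier in scikit-learn's OneVsRestClassifier. KNeighborsClassifier also handles multiple classes natively, so the wrapper is optional.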

``````
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

# hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
knn = OneVsRestClassifier(KNeighborsClassifier())
knn.fit(X_train, y_train)
# OneVsRestClassifier(estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
#          weights='uniform'),
#     n_jobs=None)
``````

Predict for one observation:

``````
# select the first test sample and reshape it to a 2D array of shape (1, 64)
knn.predict(X_test[0].reshape(1, -1))
# array([2])
``````

Predict for multiple observations (images) at once:

``````
knn.predict(X_test[0:10])
# array([2, 8, 2, 6, 6, 7, 1, 9, 8, 5])
``````

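Make predictions on the entire test data:

``````
predictions = knn.predict(X_test)
# Note: a bare %time line times an empty statement, which is why the timing below is near zero.
# Writing `%time predictions = knn.predict(X_test)` would time the prediction itself.
%time
# 98%
print('KNN Accuracy: %.3f' % accuracy_score(y_test, predictions))
# CPU times: user 0 ns, sys: 0 ns, total: 0 ns
# Wall time: 10 µs
# KNN Accuracy: 0.980
``````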

To plot the confusion matrix as a heatmap:

``````
import seaborn as sns

cm = confusion_matrix(y_test, predictions)
plt.figure(figsize=(9, 9))
# the cells are integer counts, so format the annotations as integers
sns.heatmap(cm, annot=True, fmt='d', linewidths=.5, square=True, cmap='Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Accuracy Score: {0}'.format(accuracy_score(y_test, predictions))
plt.title(all_sample_title, size=15)
# Text(0.5, 1.0, 'Accuracy Score: 0.98')
``````