K-Nearest Neighbor(KNN) Algorithm

The KNN algorithm stores all the existing data and classifies a new data point based on its similarity to the stored points.

Indirakumar S
·Aug 10, 2022·

Table of contents

  • K-Nearest Neighbor Algorithm
  • Algorithm
  • Why KNN?
  • Advantage
  • Disadvantage
  • Example
  • Conclusion

K-Nearest Neighbor Algorithm

The k-nearest neighbours (KNN) technique estimates which group a data point most likely belongs to by looking at the groups of the data points closest to it.

The k-nearest neighbour algorithm is a supervised machine learning technique used to solve classification and regression problems. Its primary application, however, is in classification problems.
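To make the regression case concrete, here is a minimal sketch in plain Python. It assumes a single numeric feature and predicts the average target of the K closest training samples. The function name `knn_regress` and the size/price numbers are made up for illustration.

```python
def knn_regress(train_x, train_y, query, k=3):
    """Predict a numeric target as the mean of the k nearest targets."""
    # Sort training samples by absolute distance to the query (1-D feature).
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical data: flat size in square metres and price in thousands.
sizes = [50, 60, 70, 80, 90, 100]
prices = [150, 180, 210, 240, 270, 300]

# The two nearest sizes to 75 are 70 and 80, so the prediction is (210 + 240) / 2.
print(knn_regress(sizes, prices, 75, k=2))  # 225.0
```

For classification the averaging step is replaced by a majority vote over the neighbours' labels.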

KNN is a lazy, non-parametric learning method. It is known as a lazy learning algorithm, or lazy learner, because it does not build a model when you provide the training data. During the training phase it performs no calculations and simply stores the data; all the work is deferred until a query is run on the dataset.

Because there is no training step, KNN is a convenient choice for data mining. It is classified as a non-parametric method because it makes no assumptions about the underlying data distribution. KNN, in short, seeks to identify the group to which a data point belongs by examining the data points around it.

Consider two groups, A and B. To decide whether a data point belongs to group A or group B, the algorithm examines the labels of the nearby data points. If the majority of those neighbours are in group A, the data point is very likely in group A as well, and vice versa. KNN, also called the nearest-neighbour method, categorizes data points by comparing them to their nearest annotated data points.

K-NN classification should not be confused with K-means clustering. KNN is a supervised classification algorithm that labels new data points according to the labelled points nearby. K-means, on the other hand, is an unsupervised clustering algorithm that divides unlabelled data into K clusters.
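The difference shows up directly in how the two estimators are fitted. The sketch below assumes scikit-learn is installed and uses a tiny made-up dataset: KNN is given the labels `y`, while K-means only ever sees `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D data: two well-separated blobs (values invented for illustration).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by KNN

# Supervised: KNN needs the labels to classify new points.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn_pred = knn.predict([[1.1, 1.0], [5.1, 5.0]])

# Unsupervised: K-means receives no labels and invents its own cluster ids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

print(knn_pred)     # class labels taken from y
print(cluster_ids)  # arbitrary cluster ids; 0 and 1 may be swapped relative to y
```

Note that the K in KNN counts neighbours, whereas the K in K-means counts clusters.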

Algorithm

The KNN algorithm can be implemented in programming languages such as Python and R. The KNN pseudocode is as follows:

1. Load the data.

2. Choose a value for K.

3. For each test point to be classified:

3.1. Compute the Euclidean distance from the test point to every training sample.

3.2. Store the distances in a list and sort it in ascending order.

3.3. Pick the first K entries of the sorted list.

3.4. Label the test point with the majority class among those K points.

4. End
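The pseudocode above can be sketched in a few lines of plain Python. This is a minimal illustration rather than an optimized implementation. The helper names `euclidean` and `knn_predict` and the toy points are assumptions made for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    # Step 3.1: distance from the query to every training sample.
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3.2: sort by distance, closest first.
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the labels of the first K entries.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 3.4: majority vote among the K neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


# Step 1: load the data (toy points in two groups, A and B).
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_y = ["A", "A", "A", "B", "B", "B"]

# Step 2: choose K, then classify two query points.
print(knn_predict(train_X, train_y, (1.2, 1.4), k=3))  # A
print(knn_predict(train_X, train_y, (6.4, 6.2), k=3))  # B
```

Every prediction scans the whole training set, which is why prediction cost grows with the number of stored samples.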

Why KNN?

Classification is a central problem in data science and machine learning. KNN is one of the oldest and simplest algorithms for pattern classification and regression.

The following are some applications for the k-nearest neighbour algorithm:

  • Credit rating

  • Loan approval

  • Data preprocessing

  • Pattern recognition

KNN can also be used in recommendation systems because it can help find users with similar characteristics. It can, for example, be used in an online video streaming platform to recommend content that a user is likely to watch based on what similar users watch.
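As a rough sketch of that idea, the snippet below finds the K users whose viewing history is closest to a given user and suggests titles those neighbours watched. The watch matrix, the titles, and the `recommend` helper are all hypothetical, and a real system would use far richer signals.

```python
import numpy as np

# Hypothetical watch matrix: rows are users, columns are titles, 1 = watched.
titles = ["Drama A", "Drama B", "Sci-Fi A", "Sci-Fi B", "Comedy A"]
watched = np.array([
    [1, 1, 0, 0, 0],  # user 0
    [1, 1, 0, 0, 1],  # user 1
    [0, 0, 1, 1, 0],  # user 2
    [0, 0, 1, 1, 1],  # user 3
    [1, 0, 0, 0, 1],  # user 4
])


def recommend(user, k=2):
    # Euclidean distance from this user's row to every user's row.
    dists = np.linalg.norm(watched - watched[user], axis=1)
    dists[user] = np.inf  # never pick the user as their own neighbour
    neighbours = np.argsort(dists)[:k]
    # Count how many of the K neighbours watched each title.
    votes = watched[neighbours].sum(axis=0)
    votes[watched[user] == 1] = 0  # skip titles the user has already seen
    return [titles[i] for i in np.argsort(-votes) if votes[i] > 0]


# Users 1 and 4 are closest to user 0, and both watched "Comedy A".
print(recommend(0))  # ['Comedy A']
```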

KNN is also used in computer vision for image classification. It's helpful in a variety of computer vision applications because it can group similar data points, for example placing cats in one class and dogs in another.

Advantage

  • It is simple to understand and implement.
  • It applies to both classification and regression problems.
  • It is well suited to non-linear data because it makes no assumptions about the underlying data.
  • It is naturally capable of handling multi-class cases.
  • With enough representative data, it can perform well.

Disadvantage

  • Because it stores all of the training data and compares each query against it, the computation cost at prediction time is high.
  • High memory storage is required.
  • The value of K must be chosen.
  • If N (the number of training samples) is large, prediction takes a long time.
  • It is sensitive to irrelevant features.
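One common way to choose K is to compare a few candidate values with cross-validation. The sketch below assumes scikit-learn and its built-in digits dataset. The candidate list and the 5-fold setting are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9]  # illustrative candidates
mean_scores = {}
for k in candidate_ks:
    # 5-fold cross-validated accuracy for this value of K.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

# Keep the K with the highest mean accuracy.
best_k = max(mean_scores, key=mean_scores.get)

for k, score in mean_scores.items():
    print(f"k={k}: mean accuracy {score:.3f}")
print("best k:", best_k)
```

Odd values of K are often preferred in two-class problems because they avoid tied votes.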

Example

Image classification

import numpy as np  # linear algebra
import pandas as pd  # data processing
import matplotlib.pyplot as plt  # to plot images and graphs
import time
# IPython/Jupyter magic: render plots inline (remove when running as a plain script)
%matplotlib inline
# dataset of handwritten digits (0-9)
from sklearn.datasets import load_digits

Load dataset

digits = load_digits()
digits.keys()
#dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
# dataset description
digits.DESCR

Already processed images


digits.images[0]

'''array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])'''

Predictors, independent variables, features

digits.data

'''array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])'''

Target variable, class, dependent variable

digits.target

#array([0, 1, 2, ..., 8, 9, 8])

There are 1797 images, each 8 by 8 pixels, which gives 64 features per image

print('Image Data Shape', digits.images.shape)
# Image Data Shape (1797, 8, 8)


# 1797 labels
print('Label Data Shape', digits.target.shape)
#Label Data Shape (1797,)
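X = digits.images
plt.figure(figsize=(20,10))
columns = 5
# show the first five digit images
for i in range(5):
    # integer division: plt.subplot requires integer grid sizes in Python 3
    plt.subplot(5 // columns + 1, columns, i + 1)
    plt.imshow(X[i],cmap=plt.cm.gray_r,interpolation='nearest')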

from sklearn.metrics import accuracy_score,confusion_matrix # metrics error
from sklearn.model_selection import train_test_split # resampling method
X = digits.data
y = digits.target

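Since this is a multi-class prediction problem, the classifier is wrapped in OneVsRestClassifier here. KNeighborsClassifier also supports multi-class targets directly, so the wrapper is optional.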

from sklearn.multiclass import OneVsRestClassifier
# hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
knn = OneVsRestClassifier(KNeighborsClassifier())
knn.fit(X_train,y_train)
# OneVsRestClassifier(estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
#           weights='uniform'),
#           n_jobs=None)

Predict for one observation

knn.predict(X_test[0].reshape(1,-1))
#array([2])

Predict for multiple observations (images) at once

knn.predict(X_test[0:10])
#array([2, 8, 2, 6, 6, 7, 1, 9, 8, 5])

Make predictions on the entire test data

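predictions = knn.predict(X_test)
# Note: %time on a line by itself times an empty statement, which is why the
# output below shows about 0 ns. To time the prediction itself, write
# "%time predictions = knn.predict(X_test)" on a single line.
%time
# accuracy is about 98%
print('KNN Accuracy: %.3f' % accuracy_score(y_test,predictions))
#CPU times: user 0 ns, sys: 0 ns, total: 0 ns
#Wall time: 10 µs
#KNN Accuracy: 0.980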

Plot the confusion matrix as a heatmap

import seaborn as sns
cm = confusion_matrix(y_test,predictions)
plt.figure(figsize=(9,9))
sns.heatmap(cm,annot=True, fmt='.3f', linewidths=.5, square=True,cmap='Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Accuracy Score: {0}'.format(accuracy_score(y_test,predictions))
plt.title(all_sample_title,size=15)
#Text(0.5, 1.0, 'Accuracy Score: 0.98')

Conclusion

This article walked through image classification with KNN, from the idea behind the algorithm to a working scikit-learn example. To learn more about machine learning and how to build machine learning models, check out my machine learning series. If you have any questions or doubts, leave them in the comments section of this article, and connect with me to learn more about machine learning.

Learning is an interesting habit..!

 