K-Nearest Neighbors From Scratch¶

Background¶

Introduced in 1951, K-Nearest Neighbors is a very simple classification algorithm. They are known as instance-based models because they aren't trained or fit to learn parameters, instead they store the entire training dataset and use it to make predictions based on similarity. Our model will use the Euclidean distance formula to determine similarity, however other models may use different distance metrics. The Euclidean distance formula is defined as:

$$ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$

Where:

  • $x$ and $y$ are points in the feature space
  • $n$ is the number of features
  • $i$ is the index of the current feature

The result of this function is interpreted by the model to determine how similar two points are to each other. The lower the distance, the closer the points.

Use cases¶

Although K-Nearest Neighbors is a simple algorithm, it has a variety of use cases in several industries, including:

  • Medical diagnosis
  • Anomaly Detection
  • Text Classification
  • Image recognition

These are just a few of the many real world applications of KNN models, proving that simplicity doesn't limit usefulness.

Implementation¶

Step 1: Data Generation

Before we can train the model, we need a dataset. The function below creates a synthetically generated dataset that is perfect for classification. The function takes input for:

  • samples or the amount of points or rows the dataset will have
  • features or the amount of variables or columns the dataset will have
  • classes or the amount of labels or categories a sample can be

The function starts by randomly generating a center for each class between -10 and 10. Then it randomly assigns a class for each sample. After that, it generates the features by randomly adding noise around the classes center. Finally, it returns the features and their labels.

In [1]:
import numpy as np # NumPy for math

def generate_data(samples=1000, features=3, classes=3):                                        # define function with input for samples, features, and classes
    centers = np.random.uniform(-10, 10, size=(classes, features))                             # random class centers for features
    y = np.random.randint(0, classes, size=samples)                                            # assign each sample to a class randomly
    X = np.array([centers[label] + np.random.normal(scale=1.0, size=features) for label in y]) # generate features by adding noise arround centers
    return X, y                                                                                # return features and classes

Step 2: Splitting Data

Next, we have to split the data into two parts, training data, or the learning set, and testing data, or the evaluation set. This is done so we can evaluate the model using different data than we trained it on. We define the function split_data below to do so. It takes input for:

  • X or the features
  • y or the classes
  • test_size or the percent of data used for testing

The function works by creating a list of indices corresponding to each sample in X. Next, it shuffles the indices to make the split random. Then, it calculates where to split the indices based on test_size, and uses this point to separate training indices from testing indices. Finally, it uses these separated indices to create train and test sets for the features and labels and then returns them.

In [2]:
def split_data(X, y, test_size=0.2):
    # sample indices
    indices = [i for i in range(len(X))]        # list indices corresponding to samples in X
    np.random.shuffle(indices)                  # shuffle indices for random split
    # where to split data
    split_index = int(len(X) * (1 - test_size)) # where to split the data 
    train_index = indices[:split_index]         # select indices for training set
    test_index = indices[split_index:]          # select indices for testing seu
    # train samples
    X_train = X[train_index]                    # get training features
    y_train = y[train_index]                    # get training labels
    # test samples
    X_test = X[test_index]                      # get testing features
    y_test = y[test_index]                      # get testing labels
    return X_train, X_test, y_train, y_test     # return train/test set

Step 3: Visualize Data

Now, we define the visualize_data function to help us understand the structure and distribution of the dataset. The function below takes input for features and labels and creates a scatter plot of the data points using matplotlib.

In [3]:
import matplotlib.pyplot as plt # for plotting

def visualize_data(X, y):                            # function to create graph
    plt.scatter(X[:,0], X[:,1], c=y, cmap='viridis') # make scatter plot of 2 features in X, colors based on label
    plt.title('Training Data')                       # add a title
    plt.xlabel('Feature 1')                          # label the x axis
    plt.ylabel('Feature 2')                          # label the y axis
    plt.show()                                       # print the graph

Step 4: Define Model

Finally, we can create the model function. Because K-Nearest Neighbors models are instance based, they do not have parameters. This means that they don't get trained or fit like other models, instead, they store a copy of the training data and compare it to testing data. They make predictions based on how close the samples are to ones seen in training. Our function takes the following inputs:

  • X_train or the samples the model uses for comparisons
  • y_train or the labels the model uses for comparisons
  • X_test or the samples the model is predicting
  • k or the amount of training samples to compare the test sample to

Our model starts by initializing a list to store predictions in and iterating through each test sample. Then it calculates the distance between the testing sample and all training samples, saving the indices of the k closest samples. Next it gets the labels of those indices, finds the most common one among them, and uses that as its prediction. Finally, it returns the list containing predictions.

In [4]:
from collections import Counter # to find label frequencies

def knn(X_train, y_train, X_test, k=3):                              # define the models function
    predictions = []                                                 # list to store predictions
    for i in X_test:                                                 # iterate through test data
        distances = [np.linalg.norm(i - j) for j in X_train]         # calculate the euclidean distance between this sample and all training samples
        k_indices = np.argsort(distances)[:k]                        # get the indices of the k closest samples  (or K-Nearest Neighbors)
        k_nearest_labels = [y_train[j] for j in k_indices]           # get the labels of the k nearest neigbors
        most_common = Counter(k_nearest_labels).most_common(1)[0][0] # find most common label of the k nearest neighbors
        predictions.append(most_common)                              # append the most common label
    return np.array(predictions)                                     # return predictions

Step 5: Using the Code

Now we get to use the code we wrote and test out our model.

First, we create a dataset, visualize it, and split it into training and testing sets.

In [5]:
X, y = generate_data()                              # generate data
visualize_data(X, y)                                # plot data
X_train, X_test, y_train, y_test = split_data(X, y) # split train/test
No description has been provided for this image

Next, we can get predictions for the test data by calling the model function

In [6]:
predictions = knn(X_train, y_train, X_test) # predict test set

Finally, we calculate the accuracy to evaluate performance

In [7]:
accuracy = np.mean(predictions == y_test)
print(f'Accuracy: {accuracy * 100:.2f}%')
Accuracy: 98.50%

Summary¶

  • K-Nearest Neighbors is an algorithm that makes predictions based on similarity
  • It is an instance based model, meaning no parameters are learned
  • Distance metrics, for example the Euclidean distance formula, are used to determine similarity
  • KNN's have a variety of real world applications in fields such as healthcare, cybersecurity, e-commerce, and more

Author and Liscense¶

This notebook was authored by Aiden Flynn and is available under the Apache 2.0 Liscense.

Kaggle | Github