K-Nearest Neighbors From Scratch¶
Background¶
Introduced in 1951, K-Nearest Neighbors is a very simple classification algorithm. They are known as instance-based models because they aren't trained or fit to learn parameters, instead they store the entire training dataset and use it to make predictions based on similarity. Our model will use the Euclidean distance formula to determine similarity, however other models may use different distance metrics. The Euclidean distance formula is defined as:
$$ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
Where:
- $x$ and $y$ are points in the feature space
- $n$ is the number of features
- $i$ is the index of the current feature
The result of this function is interpreted by the model to determine how similar two points are to each other. The lower the distance, the closer the points.
Use cases¶
Although K-Nearest Neighbors is a simple algorithm, it has a variety of use cases in several industries, including:
- Medical diagnosis
- Anomaly Detection
- Text Classification
- Image recognition
These are just a few of the many real world applications of KNN models, proving that simplicity doesn't limit usefulness.
Implementation¶
Step 1: Data Generation
Before we can train the model, we need a dataset. The function below creates a synthetically generated dataset that is perfect for classification. The function takes input for:
samplesor the amount of points or rows the dataset will havefeaturesor the amount of variables or columns the dataset will haveclassesor the amount of labels or categories a sample can be
The function starts by randomly generating a center for each class between -10 and 10. Then it randomly assigns a class for each sample. After that, it generates the features by randomly adding noise around the classes center. Finally, it returns the features and their labels.
import numpy as np # NumPy for math
def generate_data(samples=1000, features=3, classes=3): # define function with input for samples, features, and classes
centers = np.random.uniform(-10, 10, size=(classes, features)) # random class centers for features
y = np.random.randint(0, classes, size=samples) # assign each sample to a class randomly
X = np.array([centers[label] + np.random.normal(scale=1.0, size=features) for label in y]) # generate features by adding noise arround centers
return X, y # return features and classes
Step 2: Splitting Data
Next, we have to split the data into two parts, training data, or the learning set, and testing data, or the evaluation set. This is done so we can evaluate the model using different data than we trained it on. We define the function split_data below to do so. It takes input for:
Xor the featuresyor the classestest_sizeor the percent of data used for testing
The function works by creating a list of indices corresponding to each sample in X. Next, it shuffles the indices to make the split random. Then, it calculates where to split the indices based on test_size, and uses this point to separate training indices from testing indices. Finally, it uses these separated indices to create train and test sets for the features and labels and then returns them.
def split_data(X, y, test_size=0.2):
# sample indices
indices = [i for i in range(len(X))] # list indices corresponding to samples in X
np.random.shuffle(indices) # shuffle indices for random split
# where to split data
split_index = int(len(X) * (1 - test_size)) # where to split the data
train_index = indices[:split_index] # select indices for training set
test_index = indices[split_index:] # select indices for testing seu
# train samples
X_train = X[train_index] # get training features
y_train = y[train_index] # get training labels
# test samples
X_test = X[test_index] # get testing features
y_test = y[test_index] # get testing labels
return X_train, X_test, y_train, y_test # return train/test set
Step 3: Visualize Data
Now, we define the visualize_data function to help us understand the structure and distribution of the dataset. The function below takes input for features and labels and creates a scatter plot of the data points using matplotlib.
import matplotlib.pyplot as plt # for plotting
def visualize_data(X, y): # function to create graph
plt.scatter(X[:,0], X[:,1], c=y, cmap='viridis') # make scatter plot of 2 features in X, colors based on label
plt.title('Training Data') # add a title
plt.xlabel('Feature 1') # label the x axis
plt.ylabel('Feature 2') # label the y axis
plt.show() # print the graph
Step 4: Define Model
Finally, we can create the model function. Because K-Nearest Neighbors models are instance based, they do not have parameters. This means that they don't get trained or fit like other models, instead, they store a copy of the training data and compare it to testing data. They make predictions based on how close the samples are to ones seen in training. Our function takes the following inputs:
X_trainor the samples the model uses for comparisonsy_trainor the labels the model uses for comparisonsX_testor the samples the model is predictingkor the amount of training samples to compare the test sample to
Our model starts by initializing a list to store predictions in and iterating through each test sample. Then it calculates the distance between the testing sample and all training samples, saving the indices of the k closest samples. Next it gets the labels of those indices, finds the most common one among them, and uses that as its prediction. Finally, it returns the list containing predictions.
from collections import Counter # to find label frequencies
def knn(X_train, y_train, X_test, k=3): # define the models function
predictions = [] # list to store predictions
for i in X_test: # iterate through test data
distances = [np.linalg.norm(i - j) for j in X_train] # calculate the euclidean distance between this sample and all training samples
k_indices = np.argsort(distances)[:k] # get the indices of the k closest samples (or K-Nearest Neighbors)
k_nearest_labels = [y_train[j] for j in k_indices] # get the labels of the k nearest neigbors
most_common = Counter(k_nearest_labels).most_common(1)[0][0] # find most common label of the k nearest neighbors
predictions.append(most_common) # append the most common label
return np.array(predictions) # return predictions
Step 5: Using the Code
Now we get to use the code we wrote and test out our model.
First, we create a dataset, visualize it, and split it into training and testing sets.
X, y = generate_data() # generate data
visualize_data(X, y) # plot data
X_train, X_test, y_train, y_test = split_data(X, y) # split train/test
Next, we can get predictions for the test data by calling the model function
predictions = knn(X_train, y_train, X_test) # predict test set
Finally, we calculate the accuracy to evaluate performance
accuracy = np.mean(predictions == y_test)
print(f'Accuracy: {accuracy * 100:.2f}%')
Accuracy: 98.50%
Summary¶
K-Nearest Neighborsis an algorithm that makes predictions based on similarity- It is an instance based model, meaning no parameters are learned
- Distance metrics, for example the Euclidean distance formula, are used to determine similarity
- KNN's have a variety of real world applications in fields such as healthcare, cybersecurity, e-commerce, and more
Author and Liscense¶
This notebook was authored by Aiden Flynn and is available under the Apache 2.0 Liscense.