Implementing KNN with numpy

Code reference: Portal


The basic idea of KNN is to predict a test sample's value from the labels of the k training samples nearest to it. KNN has three elements: the choice of k, the distance measure, and the decision rule.

KNN has no explicit training process; essentially all of the computation happens at prediction time.

1. K value selection

Generally, a small k value is tried first, and the final value of k is then determined by cross validation.
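As a concrete illustration of this step, here is a minimal sketch of picking k by leave-one-out cross validation on toy data. The helper names (`knn_predict`, `choose_k`) and the toy dataset are my own, not from the original post.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Brute-force KNN: majority vote among the k nearest training points."""
    dists = np.sum((train_x - x) ** 2, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

def choose_k(train_x, train_y, candidates):
    """Leave-one-out cross validation: keep the k with the best accuracy."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        correct = 0
        for i in range(len(train_x)):
            mask = np.arange(len(train_x)) != i  # hold out sample i
            pred = knn_predict(train_x[mask], train_y[mask], train_x[i], k)
            correct += (pred == train_y[i])
        acc = correct / len(train_x)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# toy data: two well-separated clusters
train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
train_y = np.array([0, 0, 0, 1, 1, 1])
best_k = choose_k(train_x, train_y, candidates=[1, 3, 5])  # -> 1
```

With well-separated clusters even k = 1 classifies every held-out point correctly, so the smallest candidate wins; on noisier data a larger k would be selected.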

2. Distance measurement

Euclidean distance, Manhattan distance or cosine similarity are generally chosen.
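These three measures can each be computed in a line of numpy; a quick sketch on two vectors (the example vectors are my own):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Manhattan (L1) distance: sum of absolute coordinate differences
l1 = np.sum(np.abs(a - b))              # 1 + 2 + 3 = 6.0

# Euclidean (L2) distance: root of the sum of squared differences
l2 = np.sqrt(np.sum((a - b) ** 2))      # sqrt(14)

# cosine similarity: 1.0 means the same direction (here b = 2a)
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Note that cosine similarity ignores magnitude, so it suits cases where only the direction of the feature vector matters.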

3. Decision criteria

Generally, the majority voting method is used for classification and the average method is used for regression.
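Both decision rules are one-liners once the k nearest labels are in hand; a small sketch with made-up neighbor labels:

```python
import numpy as np
from collections import Counter

# labels of the k = 5 nearest neighbors of some test sample
neighbor_labels = ['cat', 'dog', 'cat', 'cat', 'dog']
# classification: majority vote
vote = Counter(neighbor_labels).most_common(1)[0][0]  # 'cat'

# target values of the k = 5 nearest neighbors in a regression task
neighbor_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# regression: average of the neighbors' values
mean = np.mean(neighbor_values)                       # 3.0
```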


Generally, we simply iterate over the training set, compute the distance between the test sample and every training sample, and then select the nearest k samples. However, this is inefficient when the training set is very large. The optimization is to use a KD tree or ball tree to find the k nearest neighbors. Details can be referred to: Portal
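As a sketch of the tree-based approach, SciPy's `cKDTree` builds a KD tree once and then answers k-nearest-neighbor queries without scanning every sample (this assumes SciPy is installed; it is not used in the listing below):

```python
import numpy as np
from scipy.spatial import cKDTree  # assumption: SciPy is available

rng = np.random.default_rng(0)
train = rng.random((1000, 3))
test = rng.random((5, 3))

tree = cKDTree(train)               # build the KD tree once
dists, idx = tree.query(test, k=3)  # k nearest neighbors per test point

# brute force finds the same neighbors, just by scanning all samples
brute = ((train[None, :, :] - test[:, None, :]) ** 2).sum(-1)
brute_idx = np.argsort(brute, axis=1)[:, :3]
```

The payoff is asymptotic: queries drop from O(n) per sample toward O(log n) in low dimensions, though KD trees degrade as the dimensionality grows.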



import numpy as np
from collections import Counter

class KNN:
    def __init__(self, task_type='classification'):
        self.train_data = None
        self.train_label = None
        self.task_type = task_type

    def fit(self, train_data, train_label):
        self.train_data = np.array(train_data)
        self.train_label = np.array(train_label)

    def predict(self, test_data, k=3, distance='l2'):
        test_data = np.array(test_data)
        preds = []
        for x in test_data:
            if distance == 'l1':
                dists = self.l1_distance(x)
            elif distance == 'l2':
                dists = self.l2_distance(x)
            else:
                raise ValueError('wrong distance type')
            sorted_idx = np.argsort(dists)
            knearest_labels = self.train_label[sorted_idx[:k]]
            if self.task_type == 'regression':
                # regression: average of the k nearest labels
                pred = np.mean(knearest_labels)
            elif self.task_type == 'classification':
                # classification: majority vote among the k nearest labels
                pred = Counter(knearest_labels).most_common(1)[0][0]
            else:
                raise ValueError('wrong task type')
            preds.append(pred)
        return preds

    def l1_distance(self, x):
        # Manhattan (L1) distance to every training sample
        return np.sum(np.abs(self.train_data - x), axis=1)

    def l2_distance(self, x):
        # squared Euclidean (L2) distance; skipping the square root does not
        # change the neighbor ordering
        return np.sum(np.square(self.train_data - x), axis=1)

if __name__ == '__main__':
    train_data = [[1, 1, 1], [2, 2, 2], [10, 10, 10], [13, 13, 13]]
    # train_label = ['aa', 'aa', 'bb', 'bb']
    train_label = [1, 2, 30, 60]
    test_data = [[3, 2, 4], [9, 13, 11], [10, 20, 10]]
    knn = KNN(task_type='regression')
    knn.fit(train_data, train_label)
    preds = knn.predict(test_data, k=2)
    print(preds)


Posted on Tue, 08 Oct 2019 16:21:13 -0700 by BobRoberts