Code reference: Portal

Introduction:

The basic idea of KNN (k-nearest neighbors) is to predict the value of a test sample from the labels of the k training samples closest to it. KNN has three elements: the choice of k, the distance metric, and the decision rule.

KNN has no explicit training process; essentially all of the computation happens at prediction time.

1. K value selection

Generally, a relatively small value of k is tried first, and the final k is then chosen by cross validation.
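As a minimal sketch of choosing k by cross validation (the data, the candidate k values, and the helper names below are illustrative, not from the original):

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    # Brute-force prediction for one sample: majority vote among the k nearest.
    dists = np.sum((train_x - x) ** 2, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

def loo_accuracy(data, labels, k):
    # Leave-one-out cross validation: predict each sample from all the others.
    hits = 0
    for i in range(len(data)):
        mask = np.arange(len(data)) != i
        hits += knn_predict(data[mask], labels[mask], data[i], k) == labels[i]
    return hits / len(data)

# Two well-separated clusters of toy points
data = np.array([[1., 1.], [1.2, 0.9], [0.9, 1.1],
                 [5., 5.], [5.1, 4.8], [4.9, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Pick the candidate k with the best leave-one-out accuracy
best_k = max([1, 3, 5], key=lambda k: loo_accuracy(data, labels, k))
print(best_k)
```

Note that k = 5 fails here: once a sample is held out, only two same-class neighbors remain, so the vote is dominated by the other class.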

2. Distance measurement

Euclidean distance, Manhattan distance, or cosine similarity are the usual choices.
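The three measures can be computed with NumPy as follows (the two vectors are just illustrative values):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean (L2) distance
l2 = np.sqrt(np.sum((a - b) ** 2))
# Manhattan (L1) distance
l1 = np.sum(np.abs(a - b))
# Cosine similarity: 1 means the same direction, 0 means orthogonal
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(l2, l1, cos)
```

Here b is a scaled copy of a, so the cosine similarity is exactly 1 even though both distances are nonzero; that is the key difference between direction-based and magnitude-based measures.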

3. Decision criteria

Generally, majority voting is used for classification and averaging is used for regression.
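Both decision rules are one-liners once the k nearest neighbors are known (the neighbor labels and values below are made up for illustration):

```python
import numpy as np
from collections import Counter

# Labels of the k nearest neighbours of one test sample (classification)
neighbour_labels = ['cat', 'dog', 'cat', 'cat', 'dog']
majority = Counter(neighbour_labels).most_common(1)[0][0]  # majority vote

# Values of the k nearest neighbours of one test sample (regression)
neighbour_values = np.array([1.0, 2.0, 3.0])
mean_pred = np.mean(neighbour_values)  # average of neighbour values

print(majority, mean_pred)
```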

The straightforward approach is to traverse the training set directly: compute the distance between the test sample and every training sample, then keep the k nearest. This is inefficient when the training set is very large. The usual optimization is to find the k nearest neighbors with a KD tree or a ball tree. Details can be found here: Portal

Code:

```python
import numpy as np
from collections import Counter


class KNN:
    def __init__(self, task_type='classification'):
        self.train_data = None
        self.train_label = None
        self.task_type = task_type

    def fit(self, train_data, train_label):
        # KNN has no real training step: just memorize the training set.
        self.train_data = np.array(train_data)
        self.train_label = np.array(train_label)

    def predict(self, test_data, k=3, distance='l2'):
        test_data = np.array(test_data)
        preds = []
        for x in test_data:
            if distance == 'l1':
                dists = self.l1_distance(x)
            elif distance == 'l2':
                dists = self.l2_distance(x)
            else:
                raise ValueError('wrong distance type')
            # Indices of training samples sorted by distance to x
            sorted_idx = np.argsort(dists)
            k_nearest_labels = self.train_label[sorted_idx[:k]]
            if self.task_type == 'regression':
                # Regression: average of the neighbours' values
                pred = np.mean(k_nearest_labels)
            elif self.task_type == 'classification':
                # Classification: majority vote among the neighbours' labels
                pred = Counter(k_nearest_labels).most_common(1)[0][0]
            else:
                raise ValueError('wrong task type')
            preds.append(pred)
        return preds

    def l1_distance(self, x):
        return np.sum(np.abs(self.train_data - x), axis=1)

    def l2_distance(self, x):
        # Squared Euclidean distance; the square root is omitted because it
        # does not change the ordering of the neighbours.
        return np.sum(np.square(self.train_data - x), axis=1)


if __name__ == '__main__':
    train_data = [[1, 1, 1], [2, 2, 2], [10, 10, 10], [13, 13, 13]]
    # train_label = ['aa', 'aa', 'bb', 'bb']
    train_label = [1, 2, 30, 60]
    test_data = [[3, 2, 4], [9, 13, 11], [10, 20, 10]]

    knn = KNN(task_type='regression')
    knn.fit(train_data, train_label)
    preds = knn.predict(test_data, k=2)
    print(preds)
```