K-means
A brief introduction to the K-means algorithm
 K-means (also written kmeans or KMeans) is usually the first clustering algorithm one learns when studying clustering.
 Here K is a constant that must be set in advance. Broadly speaking, the algorithm iteratively partitions M unlabeled samples into K clusters.
 When clustering samples, the distance between samples is typically used as the criterion for assigning them to clusters.
 A simple demo
As shown in the figure above, take K = 2 and a sample set of size M to illustrate the K-means algorithm. The steps are as follows:
 1. Select K points as the initial cluster centers (these need not be sample points);
 2. Compute the distance from each sample point to each of the K cluster centers (typically Euclidean or cosine distance), find the nearest center, and assign the sample to that cluster;
 3. Once every point has been assigned, the M points are divided into K clusters. Recompute the center of gravity (mean) of each cluster and define it as the new cluster center;
 4. Repeat steps 2-3 until a stopping condition is reached.
 Note: common stopping conditions are a maximum number of iterations, a minimum change in the squared error (MSE), and the change rate of the cluster centers.
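The steps above can be sketched as follows. This is a minimal NumPy illustration (not the MATLAB implementation used later in this article): it initializes centers from random samples, uses the change rate of the cluster centers as the stopping condition, and does not handle empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal K-means sketch: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centers barely move (change-rate condition).
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers
```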
From the demo above, we can see that the K-means algorithm has three important factors to consider, as follows:
Considerations for the K-means algorithm
Choice of K: the value of K strongly affects the final result, but it must be given in advance. Choosing a suitable K requires prior knowledge, so it is hard to estimate out of nothing, and a poor choice may lead to poor results.
Outliers: during iteration, K-means uses the mean of all points in a cluster as the new centroid (center point). If a cluster contains outliers, the mean deviates badly. For example, if a cluster contains the five values 2, 4, 6, 8, and 100, the new centroid is 24, which is clearly far from most of the points. In such cases, using the median 6 may work better than using the mean; the median-based method is known as K-medoids (K-medians) clustering.
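To make the outlier example concrete, here is the same five-value cluster computed with NumPy:

```python
import numpy as np

cluster = np.array([2, 4, 6, 8, 100])  # 100 is an outlier

mean_center = cluster.mean()        # new centroid under K-means: 24.0
median_center = np.median(cluster)  # robust alternative: 6.0

# The mean is dragged far from most of the points; the median is not.
print(mean_center, median_center)   # 24.0 6.0
```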
Sensitivity to initialization: K-means is sensitive to its initial values; different initial centers may lead to different clustering results. To avoid anomalous results caused by this sensitivity, we can initialize several sets of starting centers, build a clustering from each, and then keep the best one. From this idea derive variants of K-means such as bisecting K-means, K-means++, the Canopy algorithm, and so on.
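As one illustration of these variants, the following is a rough sketch (an assumption on my part, not the full algorithm) of the K-means++ seeding idea: each subsequent initial center is sampled with probability proportional to the squared distance to the nearest center already chosen, which tends to spread the initial centers apart and reduce the sensitivity described above.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """K-means++ style seeding sketch: spread the initial centers apart."""
    rng = np.random.default_rng(seed)
    # First center: one sample chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next center with probability proportional to d2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```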
Several commonly used distance measures
Generally, in clustering algorithms, the relationships between samples are mainly represented by their relative distances in feature space.
This makes the notion of distance very important for clustering. Below are some of the most common distance measures.

Euclidean distance (also known as the 2-norm distance)
In Euclidean space, the Euclidean distance between points x = (x1, …, xn) and y = (y1, …, yn) is:
d(x, y) = √((x1−y1)² + (x2−y2)² + … + (xn−yn)²)
In the Euclidean metric, the straight line segment between two points is the shortest path.
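The formula translates directly into a few lines of Python (a small illustrative helper, not part of the original article's code):

```python
import math

def euclidean(x, y):
    """2-norm distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean((0, 0), (3, 4)))  # 5.0 (the classic 3-4-5 right triangle)
```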
Cosine distance (also called cosine similarity)
The cosine value between two vectors can be calculated by using the Euclidean dot product formula:
a⋅b=∥a∥∥b∥cosθ
So:
cosθ = (a⋅b)/(∥a∥∥b∥)
That is, given two attribute vectors A and B, the cosine similarity (which can be understood as the cosine of the angle between the two vectors) is given by their dot product and their lengths, as follows:
cosθ = (A⋅B)/(∥A∥∥B∥) = (ΣAiBi)/(√(ΣAi²) · √(ΣBi²))
Here Ai and Bi denote the components of vectors A and B respectively.
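The componentwise formula can be sketched as follows (again a small illustrative helper):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity((1, 0), (0, 1)))  # 0.0 (orthogonal vectors)
print(cosine_similarity((1, 2), (2, 4)))  # close to 1.0 (same direction)
```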
Manhattan distance (also known as the 1-norm distance)
Manhattan distance is defined by the shortest driving path in a city (such as Manhattan) laid out as a grid of square blocks.
Suppose a city is divided entirely into such blocks. To travel from one point to another, one must drive along the edges of the blocks; there is no shortcut (as shown below):
Therefore, the Manhattan distance between two points is the sum of the lengths of the projections of the segment joining them onto the x and y axes of a rectangular coordinate system.
The Manhattan distance from point (x1, y1) to point (x2, y2) is:
|x1−x2| + |y1−y2|
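In code form (a small illustrative helper, generalized to any number of coordinates):

```python
def manhattan(x, y):
    """1-norm distance: sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# |1-4| + |1-5| = 3 + 4
print(manhattan((1, 1), (4, 5)))  # 7
```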
Advantages, disadvantages, and applicable scenarios of the K-means algorithm
Advantages
 It is easy to understand, and the clustering results are good;
 It scales well and remains efficient on large data sets;
 It works very well when the clusters approximate a Gaussian distribution.
Disadvantages
 The value of K is supplied by the user and is unknown before the data is processed; choosing a suitable K requires prior knowledge, so it is hard to estimate out of nothing, and the results may be poor;
 It is sensitive to the initial cluster centers;
 It is not well suited to finding non-convex clusters or clusters of very different sizes;
 Special values (outliers) have a large influence on the model.
Code
2D data
% Initialize workspace
clc; clear;

% Load data
load fisheriris

% Two-dimensional data
% Scatter plot of petal length and width (true labels)
figure;
speciesNum = grp2idx(species);
gscatter(meas(:,3),meas(:,4),speciesNum,['r','g','b'])
xlabel('Petal length')
ylabel('Petal width')
title('True mark')
set(gca,'FontSize',12)
set(gca,'FontWeight','bold')

% Scatter plot of petal length and width (unlabeled)
figure;
scatter(meas(:,3),meas(:,4),150,'.')
xlabel('Petal length')
ylabel('Petal width')
title('Unmarked')
set(gca,'FontSize',12)
set(gca,'FontWeight','bold')

% kmeans clustering
data=[meas(:,3), meas(:,4)];
K=3;
[idx,cen]=kmeans(data,K,'Distance','sqeuclidean','Replicates',5,'Display','Final');

% Relabel the clusters by the distance of their centers from the origin
dist=sum(cen.^2,2);
[dump,sortind]=sort(dist,'ascend');
newidx=zeros(size(idx));
for i=1:K
    newidx(idx==i)=find(sortind==i);
end

% Scatter plot of petal length and width (kmeans classification)
figure;
gscatter(data(:,1),data(:,2),newidx,['r','g','b'])
hold on
scatter(cen(:,1),cen(:,2),300,'m*')
hold off
xlabel('Petal length')
ylabel('Petal width')
title('kmeans classification')
set(gca,'FontSize',12)
set(gca,'FontWeight','bold')

% Scatter plot of petal length and width (true labels: dots + kmeans classification: circles)
figure;
gscatter(meas(:,3),meas(:,4),speciesNum,['r','g','b'])
hold on
gscatter(data(:,1),data(:,2),newidx,['r','g','b'],'o',10)
scatter(cen(:,1),cen(:,2),300,'m*')
hold off
xlabel('Petal length')
ylabel('Petal width')
title('True mark:disc + kmeans classification:circle')
set(gca,'FontSize',12)
set(gca,'FontWeight','bold')

%{
% Confusion matrix
confMat=confusionmat(speciesNum,newidx)
error23=speciesNum==2&newidx==3;
errDat23=data(error23,:)
error32=speciesNum==3&newidx==2;
errDat32=data(error32,:)
%}
3D data
% Initialize workspace
clc; clear;

% Load data
load fisheriris

% Three-dimensional data
% Scatter plot of petal length, petal width and sepal length (unlabeled)
figure;
speciesNum = grp2idx(species);
plot3(meas(:,3),meas(:,4),meas(:,1),'.','MarkerSize',20)
grid on
view([60,24])
xlabel('Petal length')
ylabel('Petal width')
zlabel('Sepal length')
title('Unmarked')
set(gca,'FontSize',12)
set(gca,'FontWeight','Bold')

% Scatter plot of petal length, petal width and sepal length (true labels)
figure;
hold on
colorArray=['r','g','b'];
for i=1:3
    plotind=speciesNum==i;
    plot3(meas(plotind,3),meas(plotind,4),meas(plotind,1),'.','MarkerSize',20,'MarkerEdgeColor',colorArray(i))
end
hold off
grid on
view([60,24])
xlabel('Petal length')
ylabel('Petal width')
zlabel('Sepal length')
title('True mark')
set(gca,'FontSize',12)
set(gca,'FontWeight','Bold')

% kmeans clustering
data2=[meas(:,3), meas(:,4), meas(:,1)];
K=3;
[idx2,cen2]=kmeans(data2,K,'Distance','sqeuclidean','Replicates',5,'Display','Final');

% Relabel the clusters by the distance of their centers from the origin
dist2=sum(cen2.^2,2);
[dump2,sortind2]=sort(dist2,'ascend');
newidx2=zeros(size(idx2));
for i=1:K
    newidx2(idx2==i)=find(sortind2==i);
end

% Scatter plot (true labels: dots + kmeans classification: circles)
figure;
hold on
colorArray=['r','g','b'];
for i=1:3
    plotind=speciesNum==i;
    plot3(meas(plotind,3),meas(plotind,4),meas(plotind,1),'.','MarkerSize',15,'MarkerEdgeColor',colorArray(i))
end
for i=1:3
    plotind=newidx2==i;
    plot3(meas(plotind,3),meas(plotind,4),meas(plotind,1),'o','MarkerSize',10,'MarkerEdgeColor',colorArray(i))
end
for i=1:3
    plot3(cen2(i,1),cen2(i,2),cen2(i,3),'*m')
end
hold off
grid on
view([60,24])
xlabel('Petal length')
ylabel('Petal width')
zlabel('Sepal length')
title('True mark:disc + kmeans classification:circle')
set(gca,'FontSize',12)
set(gca,'FontWeight','Bold')

%{
% Confusion matrix
confMat=confusionmat(speciesNum,newidx2)
error23=speciesNum==2&newidx2==3;
errDat23=data2(error23,:)
error32=speciesNum==3&newidx2==2;
errDat32=data2(error32,:)
[dump, errdatSort]=sort(errDat32(:,3));
errDat32Sort=errDat32(errdatSort,:)
%}
Code reference: https://www.bilibili.com/video/av38476149?from=search&seid=3151051302140581746