A Brief Introduction to the K-means Algorithm

1. K-means (also written k-means or K-Means) is usually the first clustering algorithm one learns.
2. Here K is a constant that must be set in advance. Broadly speaking, the algorithm iteratively groups M unlabeled samples into K clusters.
3. When clustering the samples, the distance between samples is typically used as the criterion for the partition.
• A simple demo

Consider a simple example with K = 2 and a sample set of M points. The K-means algorithm proceeds as follows:
1. Select K points as the initial cluster centers (they need not be sample points);
2. Compute the distance from each sample point to each of the K cluster centers (typically the Euclidean or cosine distance), find the nearest center, and assign the point to that cluster;
3. Once every point has been assigned, the M points are divided into K clusters. Recompute the center of gravity (mean) of each cluster and take it as the new cluster center;
4. Repeat steps 2-3 until a stopping condition is reached.
• Note: common stopping conditions are a maximum number of iterations, a minimum squared-error (MSE) threshold, and the rate of change of the cluster centers;
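The steps above can be sketched in a few lines. The following is a minimal illustration (in Python with NumPy, rather than the MATLAB used in the Code section below), with made-up two-blob sample data; it assumes no cluster goes empty during iteration:

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means on an (M, n) array X; returns labels and centers."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct sample points as the initial cluster centers
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        # (a robust version would also handle empty clusters)
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 4: stop once the centers barely move
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

# Six points forming two obvious blobs; K = 2 should separate them
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans(X, 2)
print(labels)
```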

From the demo above, we can see that there are three important factors to consider for the K-means algorithm, as follows:

Considerations for the K-means Algorithm

• Selection of the K value: the value of K strongly affects the final result, but it must be given in advance. Choosing a suitable K requires prior knowledge; it is hard to estimate out of thin air, and a poor choice may lead to poor results.
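One common heuristic for picking K, added here as an illustration rather than part of the original text, is the "elbow" method: run K-means for several values of K, record the within-cluster sum of squared errors (SSE) for each, and look for the bend where the curve flattens. A minimal Python/NumPy sketch on synthetic two-blob data:

```python
import numpy as np

def sse_for_k(X, K, seed=0, iters=50):
    """Within-cluster sum of squared errors after a basic K-means run."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # Recompute centers (keep the old one if a cluster goes empty)
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
    return float(((X - centers[labels]) ** 2).sum())

# Two well-separated blobs: the SSE drops sharply from K=1 to K=2,
# then flattens -- the "elbow" suggests K = 2.
rng = np.random.default_rng(1)
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)]) \
    + rng.normal(scale=0.1, size=(20, 2))
for k in (1, 2, 3):
    print(k, sse_for_k(X, k))
```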

• Sensitivity to outliers: during iteration, K-means uses the mean of all points in a cluster as the new centroid (center point). If the cluster contains outliers, the mean will deviate badly. For example, if a cluster contains the five values 2, 4, 6, 8, and 100, the new centroid is 24, which is clearly far from most of the points. In such a situation, using the median, 6, may work better than the mean; using the median instead of the mean is the idea behind K-Medians clustering (a related variant, K-Medoids, requires each center to be an actual sample point).
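The arithmetic in the example above is easy to verify with a short Python snippet (standard library only):

```python
import statistics

cluster = [2, 4, 6, 8, 100]
print(statistics.mean(cluster))    # 24 -- dragged toward the outlier 100
print(statistics.median(cluster))  # 6  -- stays with the bulk of the points
```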

• Sensitivity to initial values: K-means is sensitive to its initial centers; different initializations can lead to different clusterings. To avoid a poor final result caused by this sensitivity, one can run the algorithm with several sets of initial centers and keep the best clustering. From this observation, several variants of K-means have been derived, such as the K-Means++ algorithm, the K-Means|| algorithm, and the Canopy algorithm.
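The K-Means++ seeding mentioned above can be sketched as follows. This is a simplified Python/NumPy illustration, not a production implementation: after the first uniformly random center, each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far.

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """K-Means++ seeding: returns K initial centers drawn from the rows of X."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Sample the next center proportionally to those squared distances
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# With two tight, far-apart pairs, the second center almost always
# lands in the opposite pair from the first
X = np.array([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.1]])
centers = kmeanspp_init(X, 2)
print(centers)
```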

• Several commonly used distance calculation methods

Generally, in clustering algorithms, the relationships between samples are mainly represented by their relative distances in feature space.
This makes the concept of distance central to clustering. Below are some of the most common distance measures.

• Euclidean distance (also known as the 2-norm distance)
In Euclidean space, the Euclidean distance between the points x=(x1, …, xn) and y=(y1, …, yn) is:

d(x, y) = √((x1−y1)² + (x2−y2)² + … + (xn−yn)²)

In the Euclidean metric, the straight line segment between two points is the shortest path between them.
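As a quick numeric check of the formula (a Python snippet, standard library only):

```python
import math

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
# d(x, y) = square root of the sum of squared component differences
d = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
print(d)  # 5.0, since sqrt(3^2 + 4^2 + 0^2) = 5
```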

• Cosine distance (also called cosine similarity; strictly speaking, cosine distance is usually defined as 1 minus the cosine similarity)
The cosine of the angle between two vectors can be obtained from the Euclidean dot-product formula:
a⋅b=∥a∥∥b∥cosθ
So:
cosθ = (a⋅b) / (∥a∥∥b∥)
That is, given two attribute vectors A and B, the cosine similarity (which can be understood as the cosine of the angle between the two vectors) is obtained from their dot product and their lengths:
cosθ = ΣAiBi / (√(ΣAi²) · √(ΣBi²))
Here Ai and Bi denote the i-th components of vectors A and B.
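A short numeric check of the cosine formula (Python, standard library only):

```python
import math

A = [1.0, 0.0]
B = [1.0, 1.0]
dot = sum(a * b for a, b in zip(A, B))
norm_A = math.sqrt(sum(a * a for a in A))
norm_B = math.sqrt(sum(b * b for b in B))
cos_theta = dot / (norm_A * norm_B)
print(cos_theta)  # about 0.7071, the cosine of the 45-degree angle between A and B
```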

• Manhattan distance (also known as the 1-norm distance)
Manhattan distance is defined by the shortest driving route in a city (such as Manhattan) laid out as a grid of square blocks.
Suppose the city is completely divided into blocks. To get from one point to another, you must drive along the edges of the blocks; there is no shortcut through them.
Therefore, the Manhattan distance is the sum of the lengths of the projections, onto the x and y axes of a rectangular coordinate system, of the line segment joining the two points.
From point (x1, y1) to point (x2, y2), the Manhattan distance is:
|x1−x2|+|y1−y2|
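A one-line check of the Manhattan distance formula (Python):

```python
# From (1, 2) to (4, 6): walk 3 blocks along x and 4 blocks along y
x1, y1 = 1.0, 2.0
x2, y2 = 4.0, 6.0
manhattan = abs(x1 - x2) + abs(y1 - y2)
print(manhattan)  # 7.0
```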

Advantages

• It is easy to understand and clusters well;
• It scales well and remains efficient on large data sets;
• It works very well when the clusters approximately follow a Gaussian distribution.

Disadvantages

• The value of K must be supplied by the user and is unknown before the data are processed. Choosing a suitable K requires prior knowledge; it is hard to estimate out of thin air, and a poor choice may give poor results.
• It is sensitive to the initial cluster centers.
• It is not suited to finding non-convex clusters or clusters of very different sizes.
• Special values (outliers) have a large influence on the model.

Code

2D data

```matlab
% Initialize workspace
clc;
clear;
load fisheriris;   % loads meas (150x4 measurements) and species (labels)

% Two-dimensional data
% Scatter plot of petal length and width (true labels)
figure;
speciesNum = grp2idx(species);
gscatter(meas(:,3),meas(:,4),speciesNum,['r','g','b'])
xlabel('Petal length')
ylabel('petal width')
title('True mark')
set(gca,'FontSize',12)
set(gca,'FontWeight','bold')

% Scatter plot of petal length and width (without markers)
figure;
scatter(meas(:,3),meas(:,4),150,'.')
xlabel('Petal length')
ylabel('petal width')
title('Unmarked')
set(gca,'FontSize',12)
set(gca,'FontWeight','bold')

% kmeans clustering
data=[meas(:,3), meas(:,4)];
K=3;
[idx,cen]=kmeans(data,K,'Distance','sqeuclidean','Replicates',5,'Display','Final');
% Relabel clusters so indices are ordered by the centroids' squared
% distance from the origin (keeps colors consistent across runs)
dist=sum(cen.^2,2);
[~,sortind]=sort(dist,'ascend');
newidx=zeros(size(idx));
for i=1:K
    newidx(idx==i)=find(sortind==i);
end

% Scatter diagram of petal length and width(kmeans classification)
figure;
gscatter(data(:,1),data(:,2),newidx,['r','g','b'])
hold on
scatter(cen(:,1),cen(:,2),300,'m*')
hold off
xlabel('Petal length')
ylabel('petal width')
title('kmeans classification')
set(gca,'FontSize',12)
set(gca,'FontWeight','bold')

% Scatter plot of petal length and width (true labels: dots + kmeans clusters: circles)
figure;
gscatter(meas(:,3),meas(:,4),speciesNum,['r','g','b'])
hold on
gscatter(data(:,1),data(:,2),newidx,['r','g','b'],'o',10)
scatter(cen(:,1),cen(:,2),300,'m*')
hold off
xlabel('Petal length')
ylabel('petal width')
title('True mark:disc+kmeans classification:circle')
set(gca,'FontSize',12)
set(gca,'FontWeight','bold')

%{
% Confusion matrix ConfusionMatrix
confMat=confusionmat(speciesNum,newidx)
error23=speciesNum==2&newidx==3;
errDat23=data(error23,:)
error32=speciesNum==3&newidx==2;
errDat32=data(error32,:)
%}
```

3D data

```matlab
% Initialize workspace
clc;
clear;
load fisheriris;   % loads meas (150x4 measurements) and species (labels)

% Three dimensional data
% Scatter plot of petal length, petal width and sepal length (unlabeled)
figure;
speciesNum = grp2idx(species);
plot3(meas(:,3),meas(:,4),meas(:,1),'.','MarkerSize',20)
grid on
view([60,24])
xlabel('Petal length')
ylabel('petal width')
zlabel('sepal length')
title('Unmarked')
set(gca,'FontSize',12)
set(gca,'FontWeight','Bold')

% Scatter plot of petal length, petal width and sepal length (true labels)
figure;
hold on
colorArray=['r','g','b'];
for i=1:3
    plotind=speciesNum==i;
    plot3(meas(plotind,3),meas(plotind,4),meas(plotind,1),'.','MarkerSize',20,'MarkerEdgeColor',colorArray(i))
end
hold off
grid on
view([60,24])
xlabel('Petal length')
ylabel('petal width')
zlabel('sepal length')
title('True mark')
set(gca,'FontSize',12)
set(gca,'FontWeight','Bold')

% kmeans clustering
data2=[ meas(:,3), meas(:,4),meas(:,1)];
K=3;
[idx2,cen2]=kmeans(data2,K,'Distance','sqeuclidean','Replicates',5,'Display','Final');
% Relabel clusters so indices are ordered by the centroids' squared
% distance from the origin (keeps colors consistent across runs)
dist2=sum(cen2.^2,2);
[~,sortind2]=sort(dist2,'ascend');
newidx2=zeros(size(idx2));
for i=1:K
    newidx2(idx2==i)=find(sortind2==i);
end

% Scatter plot of petal length, petal width and sepal length (true labels: dots + kmeans clusters: circles)
figure;
hold on
colorArray=['r','g','b'];
for i=1:3
    plotind=speciesNum==i;
    plot3(meas(plotind,3),meas(plotind,4),meas(plotind,1),'.','MarkerSize',15,'MarkerEdgeColor',colorArray(i))
end

% Overlay the kmeans cluster assignments as circles
for i=1:3
    plotind=newidx2==i;
    plot3(meas(plotind,3),meas(plotind,4),meas(plotind,1),'o','MarkerSize',10,'MarkerEdgeColor',colorArray(i))
end

% Mark the cluster centroids
for i=1:3
    plot3(cen2(i,1),cen2(i,2),cen2(i,3),'*m')
end

hold off
grid on
view([60,24])
xlabel('Petal length')
ylabel('petal width')
zlabel('sepal length')
title('True mark:disc+kmeans classification:circle')
set(gca,'FontSize',12)
set(gca,'FontWeight','Bold')

%{
% Confusion matrix ConfusionMatrix
confMat=confusionmat(speciesNum,newidx2)
error23=speciesNum==2&newidx2==3;
errDat23=data2(error23,:)
error32=speciesNum==3&newidx2==2;
errDat32=data2(error32,:)
[dump, errdatSort]=sort(errDat32(:,3));
errDat32Sort=errDat32(errdatSort,:)
%}
```


Posted on Sat, 01 Feb 2020 00:44:06 -0800 by bodge