Unsupervised learning and sklearn Library

One. Unsupervised Learning Fundamentals

Unsupervised learning uses unlabeled data to learn the distribution of the data or the relationships among samples

The biggest difference between supervised and unsupervised learning is whether the data are labeled

The most common applications of unsupervised learning are clustering and dimensionality reduction

Two. Clustering

Clustering is the process of grouping data into multiple categories according to the "similarity" of the samples. To evaluate the similarity between two samples, the usual approach is to compute the "distance" between them; different ways of computing this distance affect the quality of the clustering results

1. Euclidean distance

Euclidean distance is the most commonly used distance measurement method, which originates from the distance between two points in Euclidean space

2. Manhattan distance

Manhattan distance, also known as "city block distance", is similar to the distance from one intersection to another when driving in a city

3. Mahalanobis distance

Mahalanobis distance represents the covariance distance of the data and is a scale-invariant measure: it first standardizes the attributes of the sample points and then computes the distance between samples

4. Cosine similarity (angle cosine)

Cosine similarity uses the cosine of the angle between two vectors to measure the difference between two samples. The closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees and the more similar the two vectors are
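As a quick illustration (not part of the original lecture), the four distance measures above can be computed with NumPy and SciPy on made-up sample vectors:

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(x, y))   # Euclidean distance: square root of the sum of squared differences
print(distance.cityblock(x, y))   # Manhattan (city block) distance: sum of absolute differences
print(distance.cosine(x, y))      # cosine distance = 1 - cosine similarity

# Mahalanobis distance needs the inverse covariance matrix estimated from a sample set
data = np.random.RandomState(0).rand(50, 3)     # hypothetical data set
VI = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance matrix
print(distance.mahalanobis(x, y, VI))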

5. sklearn Library

The common clustering algorithms provided by the scikit-learn library are contained in the sklearn.cluster module, for example k-means, affinity propagation, DBSCAN, etc. Applying different algorithms to the same data set may produce different results, and the time they take also differs; this is determined by the characteristics of each algorithm

Each clustering algorithm function provided by the sklearn.cluster module can use different data forms as input:

Standard data input format: a matrix of shape [number of samples, number of features]

Similarity matrix input format: a matrix of shape [number of samples, number of samples], where each element is the similarity between two samples; algorithms such as DBSCAN and affinity propagation accept this kind of input. Taking cosine similarity as an example, all diagonal elements are 1 and every element of the matrix lies in the range [0, 1]
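As a small sketch of the two input formats (the data here are randomly generated purely for illustration):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AffinityPropagation, DBSCAN

X = np.random.RandomState(0).rand(10, 4)  # standard input: [number of samples, number of features]

S = cosine_similarity(X)                  # similarity matrix: [number of samples, number of samples], diagonal all 1
ap = AffinityPropagation(affinity='precomputed').fit(S)  # affinity propagation accepts precomputed similarities

D = 1.0 - S                               # DBSCAN with precomputed input expects distances, so convert similarity to distance
db = DBSCAN(eps=0.3, min_samples=2, metric='precomputed').fit(D)
print(ap.labels_, db.labels_)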

sklearn.cluster module

Algorithm name    | Parameters                                            | Scalability                                     | Similarity measure
k-means           | number of clusters                                    | large-scale data                                | distance between points
DBSCAN            | neighborhood size                                     | large-scale data                                | distance between points
Gaussian Mixtures | number of clusters and other hyperparameters          | high complexity, not suited to large-scale data | Mahalanobis distance
Birch             | branching factor, threshold and other hyperparameters | large-scale data                                | Euclidean distance between points

Three. Dimensionality Reduction

Dimensionality reduction is the process of transforming high-dimensional data into low-dimensional data while preserving the characteristics or distribution of the data. It is commonly used for data visualization and for reducing the size of the data

Clustering and dimensionality reduction are both typical tasks of unsupervised learning, and the two are related. For example, some high-dimensional data can be classified better after dimensionality reduction. In addition, research has shown that representative clustering algorithms such as k-means and dimensionality reduction algorithms such as NMF are equivalent under certain conditions

Dimensionality reduction is an important research topic in machine learning, with many typical algorithms accepted by industry and academia. At present, sklearn provides seven dimensionality reduction algorithms

The dimensionality reduction process can also be understood as decomposing a data set into its components; accordingly, sklearn places the dimensionality reduction algorithms in the sklearn.decomposition module, which is used to call them

sklearn.decomposition module

Algorithm name | Parameters                                  | Scalability      | Typical tasks
PCA            | reduced dimension and other hyperparameters | large-scale data | signal processing, etc.
FastICA        | reduced dimension and other hyperparameters | large-scale data | image feature extraction
NMF            | reduced dimension and other hyperparameters | large-scale data | image feature extraction
LDA            | reduced dimension and other hyperparameters | large-scale data | text data, topic mining

Four. Algorithms

1. k-means clustering algorithm

(1) Basic knowledge

The k-means algorithm takes k as a parameter and divides n objects into k clusters, so that similarity within each cluster is high while similarity between clusters is low

(2) k-means clustering algorithm flow

① Randomly select k points as the initial clustering center;

② For each remaining point, assign it to the nearest cluster according to its distance from the cluster centers

③ For each cluster, compute the mean of all its points as the new cluster center

④ Repeat ② and ③ until the cluster centers no longer change
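For readers who want to see these four steps without the sklearn wrapper, here is a minimal NumPy sketch of the loop (it ignores corner cases such as a cluster becoming empty):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # step 1: pick k random points as initial centers
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                  # step 2: assign each point to its nearest center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 3: each new center is the mean of its cluster
        if np.allclose(new_centers, centers):          # step 4: stop when the centers no longer change
            break
        centers = new_centers
    return labels, centers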

(3) Specific application

① Data introduction

The data set contains eight variables of per-capita annual consumption expenditure of urban households in 31 provinces of China in 1999: food; clothing; household equipment, supplies and services; health care; transportation and communication; entertainment, education and cultural services; housing; and miscellaneous goods and services. The task is to cluster the 31 provinces using these data

② Purpose of the experiment

Through clustering, we can understand the consumption level of each province in China in 1999

③ Data instance

④ Text content

⑤ Implementation process

a. Build project and import sklearn related package

b. Load the data, create an instance of k-means algorithm, train it, and get the label

c. Output labels, viewing results

⑥ Results demonstration

Clustered into 4 categories

Clustered into 2 categories

Original code

......
km = KMeans(n_clusters=4)       # cluster into 4 categories
......
CityCluster = [[], [], [], []]  # group the cities into clusters by label
......

Modified to

......
km = KMeans(n_clusters=2)       # cluster into 2 categories
......
CityCluster = [[], []]          # group the cities into clusters by label
......

The result is

Clustered into 3 categories

Original code

......
km = KMeans(n_clusters=4)       # cluster into 4 categories
......
CityCluster = [[], [], [], []]  # group the cities into clusters by label
......

Modified to

......
km = KMeans(n_clusters=3)       # cluster into 3 categories
......
CityCluster = [[], [], []]      # group the cities into clusters by label
......

The result is

(4) Code analysis

① Full code

import numpy as np
from sklearn.cluster import KMeans

def loadData(filePath):
    fr = open(filePath, 'r+')
    lines = fr.readlines()
    fr.close()
    retData = []      # consumption figures for each city
    retCityName = []  # city names
    for line in lines:
        items = line.strip().split(",")
        retCityName.append(items[0])
        retData.append([float(items[i]) for i in range(1, len(items))])
    return retData, retCityName


if __name__ == '__main__':
    data, cityName = loadData('city.txt')  # read the data with loadData()
    km = KMeans(n_clusters=4)              # cluster into 4 categories
    label = km.fit_predict(data)           # fit_predict() computes the cluster centers and returns a cluster label for each sample
    expenses = np.sum(km.cluster_centers_, axis=1)  # sum of each cluster center's values, i.e. that cluster's average consumption level
    # print(expenses)
    CityCluster = [[], [], [], []]         # group the cities into clusters by label
    for i in range(len(cityName)):
        CityCluster[label[i]].append(cityName[i])
    for i in range(len(CityCluster)):
        print("Expenses:%.2f" % expenses[i])
        print(CityCluster[i])              # print each cluster's average expense and its cities

② Introduction to related packages

NumPy is an extension library for Python. It supports a large number of high-dimensional array and matrix operations and also provides a large library of mathematical functions for working with arrays.

Use sklearn.cluster.KMeans to call K-means algorithm for clustering

③ Parameters required to call KMeans() method

n_clusters: specifies the number of clusters (cluster centers)

init: the initialization method for the initial cluster centers

max_iter: the maximum number of iterations

In general only n_clusters is given; init defaults to 'k-means++' and max_iter defaults to 300
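For reference, a call that spells out these defaults explicitly (the values shown for init and max_iter are sklearn's documented defaults) looks like this:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=4, init='k-means++', max_iter=300)  # equivalent to KMeans(n_clusters=4)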

④ loadData() method interpretation

fr = open(filePath,'r+')

lines = fr.readlines()

r+: open a text file for both reading and writing

.read() reads the entire file at once; it is usually used to put the contents of the file into a single string variable

.readlines() also reads the entire file at once (similar to .read()) but returns a list of lines

.readline() reads only one line at a time, which is usually much slower than .readlines(); it should only be used when there is not enough memory to read the whole file at once

2. DBSCAN density clustering

(1) Basic knowledge

DBSCAN algorithm is a clustering algorithm based on density

When clustering, you do not need to specify the number of clusters in advance

Indefinite number of final clusters

DBSCAN algorithm divides data points into three categories:

Core points: points that have more than MinPts points within radius Eps

Boundary points: points that have fewer than MinPts points within radius Eps but fall within the neighborhood of a core point

Noise points: points that are neither core points nor boundary points

 

(2) DBSCAN algorithm flow

① Mark all points as core, boundary, or noise

② Delete noise point

③ Place an edge between every pair of core points that are within Eps of each other

④ Each group of connected core points forms a cluster

⑤ Assign each boundary point to the cluster of a core point whose neighborhood contains it

Example:

There are 13 sample points as follows, clustering with DBSCAN

Take Eps=3 and MinPts=3, and cluster all points with DBSCAN using the Manhattan distance

Compute, for each point, the set of points within its Eps=3 neighborhood

Points whose neighborhood contains more than MinPts=3 points are core points

Check whether each remaining point lies in the neighborhood of a core point; if so, it is a boundary point, otherwise it is a noise point

Connect core points that are within Eps=3 of each other; the points in each core point's neighborhood join its cluster, forming three clusters
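The 13 sample points themselves are not reproduced in this post, but the same setup can be tried with sklearn on a hypothetical 13-point set (the coordinates below are made up for illustration):

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points standing in for the 13 samples in the example
points = np.array([[1, 2], [2, 1], [2, 2], [3, 2],
                   [8, 7], [8, 8], [7, 8], [9, 8],
                   [1, 9], [2, 9], [2, 10],
                   [5, 5], [12, 1]])

# Eps=3, MinPts=3, Manhattan distance, as in the example above
db = DBSCAN(eps=3, min_samples=3, metric='manhattan').fit(points)
print(db.labels_)  # cluster index for each point; -1 marks noise points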

(3) Specific application

① Data introduction

The data set consists of campus-network log data for 290 university students, including user ID, device MAC address, IP address, start time, stop time, online duration, campus-network plan, and so on. The task is to use these data to analyze the students' Internet-access patterns

② Purpose of the experiment

Analyze the pattern of students' online time using DBSCAN clustering

③ Data instance

④ Text content

⑤ Implementation process

a. Build project and import sklearn related package

b. Read in and process data

c. Online time clustering, creating DBSCAN algorithm instance, training and obtaining label

d. Output labels, viewing results

e. Draw histogram and analyze the experimental results

⑥ Results demonstration

Labels shows the cluster that each record is assigned to

Noise ratio is the proportion of noise records

Estimated number of clusters

The silhouette coefficient is used as the evaluation index of the clustering effect

Most online start times fall between 22:00 and 24:00

(4) Code analysis

① Full code

import numpy as np
import sklearn.cluster as skc
from sklearn import metrics
import matplotlib.pyplot as plt

mac2id = dict()
onlinetimes = []
f = open('TestData.txt', encoding='utf-8')
for line in f:
    mac = line.split(',')[2]              # MAC address
    onlinetime = int(line.split(',')[6])  # online duration
    starttime = int(line.split(',')[4].split(' ')[1].split(':')[0])  # hour at which the session started
    if mac not in mac2id:  # mac2id is a dictionary: key is the MAC address, value is the index of that device's (start time, online duration) record
        mac2id[mac] = len(onlinetimes)
        onlinetimes.append((starttime, onlinetime))
    else:
        onlinetimes[mac2id[mac]] = (starttime, onlinetime)
f.close()
real_X = np.array(onlinetimes).reshape((-1, 2))

X = real_X[:, 0:1]

db = skc.DBSCAN(eps=0.01, min_samples=20).fit(X)  # train with DBSCAN
labels = db.labels_                               # cluster label of each record

print('Labels:')
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)
print('Noise ratio:', format(ratio, '.2%'))       # records labelled -1 are noise; print their proportion

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))  # number of clusters and silhouette score to evaluate the clustering

for i in range(n_clusters_):  # print each cluster label and the data in that cluster
    print('Cluster ', i, ':')
    print(list(X[labels == i].flatten()))

plt.hist(X, 24)
plt.show()  # draw the histogram of start times

② DBSCAN() method main parameters

eps: the maximum distance between two samples considered as neighbor nodes

min_samples: the minimum number of samples within a point's neighborhood for it to be treated as a core point

metric: distance calculation method
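As a small illustration (not from the original lecture), the call used in the code above can be written with the distance metric made explicit; 'euclidean' is the default:

import sklearn.cluster as skc

db = skc.DBSCAN(eps=0.01, min_samples=20, metric='euclidean')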

3. PCA (principal component analysis)

(1) Basic knowledge

Principal Component Analysis (PCA) is one of the most commonly used dimensionality reduction methods. It is typically used for exploring and visualizing high-dimensional data sets, as well as for data compression and preprocessing. PCA combines correlated high-dimensional variables into linearly independent low-dimensional variables, called principal components, which retain as much of the information in the original data as possible

Before introducing the principle of PCA, we need to review the related terms:

① Variance

Variance is the average of the squared differences between each sample and the sample mean; it measures how spread out a set of data is

② Covariance

Covariance measures the degree of linear correlation between two variables. If the covariance of two variables is 0, they can be considered linearly uncorrelated

③ Covariance matrix

The covariance matrix is made up of the covariances between pairs of variables and is a symmetric matrix

④ Eigenvectors and eigenvalues

The eigenvectors of a matrix are the non-zero vectors that describe the structure of the data set and satisfy the formula Av = λv

where A is a square matrix, v is an eigenvector, and λ is the corresponding eigenvalue

The principal components of a data matrix are the eigenvectors of its covariance matrix, sorted by the size of their corresponding eigenvalues: the eigenvector with the largest eigenvalue is the first principal component, the next largest is the second principal component, and so on

(2) PCA algorithm flow
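The slide showing the algorithm flow is not reproduced here. As a stand-in, here is a minimal NumPy sketch of the standard PCA procedure, consistent with the terms defined above (X is assumed to be a matrix of shape [number of samples, number of features]):

import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)          # 1. center the data by subtracting the mean of each feature
    cov = np.cov(X_centered, rowvar=False)   # 2. compute the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # 3. eigenvalues and eigenvectors of the covariance matrix
    order = np.argsort(eigvals)[::-1]        # 4. sort the eigenvectors by eigenvalue, largest first
    components = eigvecs[:, order[:k]]       # 5. keep the top k eigenvectors as the principal components
    return X_centered @ components           # 6. project the data onto the principal components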

(3) Specific application

① Data introduction

Iris dataset comes from Python's sklearn Library

② Purpose of the experiment

The iris data are 4-dimensional and contain three classes of samples. Use PCA to reduce the iris data to two dimensions and visualize them in the 2-D plane

③ Data instance

④ Results demonstration

It can be seen that the data can still be clearly separated into three classes after dimensionality reduction. This reduces the dimension of the data and the workload of subsequent classification while preserving classification quality

(4) Full code

import matplotlib.pyplot as plt         # matplotlib for visualization
from sklearn.decomposition import PCA   # PCA algorithm
from sklearn.datasets import load_iris  # iris dataset loader

data = load_iris()  # load the iris dataset (a dictionary-like object)
y = data.target     # y holds the labels of the dataset
X = data.data       # X holds the attribute data of the dataset
pca = PCA(n_components=2)         # PCA with 2 principal components after reduction
reduced_X = pca.fit_transform(X)  # reduce the dimension of the original data and store the result in reduced_X

red_x, red_y = [], []      # data points of the first class
blue_x, blue_y = [], []    # data points of the second class
green_x, green_y = [], []  # data points of the third class

for i in range(len(reduced_X)):  # save the reduced data points in different lists according to the iris class
    if y[i] == 0:
        red_x.append(reduced_X[i][0])
        red_y.append(reduced_X[i][1])
    elif y[i] == 1:
        blue_x.append(reduced_X[i][0])
        blue_y.append(reduced_X[i][1])
    else:
        green_x.append(reduced_X[i][0])
        green_y.append(reduced_X[i][1])

plt.scatter(red_x, red_y, c='r', marker='x')      # data points of the first class
plt.scatter(blue_x, blue_y, c='b', marker='D')    # data points of the second class
plt.scatter(green_x, green_y, c='g', marker='.')  # data points of the third class
plt.show()  # visualization

4. NMF (non-negative matrix factorization)

(1) Basic knowledge

Non-negative matrix factorization (NMF) is a method of factorizing a matrix under the constraint that all elements of the matrices are non-negative

Basic idea: given a non-negative matrix V, NMF finds a non-negative matrix W and a non-negative matrix H such that the product of W and H approximates the original matrix V:

V(n×m) ≈ W(n×k) × H(k×m)

W matrix: the basis image matrix, equivalent to features extracted from the original matrix V

H matrix: coefficient matrix

NMF can be widely used in image analysis, text mining, speech processing and other fields

The figure above is taken from the original NMF paper. The W matrix on the left shows the features extracted from the original images, and the H matrix is in the middle; the product of the two is very similar to the original image

Optimization objective of the factorization: minimize the difference between the product W × H and the original matrix V

The objective function (squared Frobenius norm) is:

argmin_{W,H} (1/2) ||V - WH||^2 = (1/2) Σ_{i,j} (V_ij - (WH)_ij)^2

The loss function based on KL divergence is:

argmin_{W,H} Σ_{i,j} ( V_ij ln( V_ij / (WH)_ij ) - V_ij + (WH)_ij )
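As an illustrative check (not part of the original material), the Frobenius-norm objective can be inspected directly after fitting sklearn's NMF on a hypothetical non-negative matrix:

import numpy as np
from sklearn.decomposition import NMF

V = np.random.RandomState(0).rand(100, 64)  # hypothetical non-negative data matrix
model = NMF(n_components=6, init='nndsvda', max_iter=500)
W = model.fit_transform(V)                  # basis matrix W
H = model.components_                       # coefficient matrix H

print(np.linalg.norm(V - W @ H, 'fro'))     # difference between W*H and the original matrix
print(model.reconstruction_err_)            # the same quantity as reported by sklearn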

(2) Specific application

① Data introduction

The Olivetti faces data set contains 400 face images, each 64 × 64 pixels

② Purpose of the experiment

Since the W matrix obtained from the NMF decomposition is equivalent to features extracted from the original matrix, NMF can be used to extract features from the 400 Olivetti face images

③ Data instance

④ Experimental method

The number of extracted features is set by choosing the size of k. In this experiment k = 6, and the extracted features are then displayed as images

⑤ Results demonstration

(3) Code analysis

① Full code

from numpy.random import RandomState               # RandomState provides the random seed
import matplotlib.pyplot as plt                    # matplotlib for visualization
from sklearn.datasets import fetch_olivetti_faces  # Olivetti faces dataset loader
from sklearn import decomposition                  # decomposition module (provides PCA and NMF)

n_row, n_col = 2, 3           # layout of the displayed images
n_components = n_row * n_col  # number of features to extract
image_shape = (64, 64)        # size of each face image

###############################################################################
# Load faces data
dataset = fetch_olivetti_faces(shuffle=True, random_state=RandomState(0))
faces = dataset.data          # load the data in shuffled order


###############################################################################
def plot_gallery(title, images, n_col=n_col, n_row=n_row):
    plt.figure(figsize=(2. * n_col, 2.26 * n_row))  # create a figure of the given size
    plt.suptitle(title, size=16)                    # set the title and font size

    for i, comp in enumerate(images):
        plt.subplot(n_row, n_col, i + 1)            # select a subplot
        vmax = max(comp.max(), -comp.min())

        plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray,
                   interpolation='nearest', vmin=-vmax, vmax=vmax)  # normalize the values and show them as a gray-scale image
        plt.xticks(())
        plt.yticks(())                              # remove the axis ticks of the subplot
    plt.subplots_adjust(0.01, 0.05, 0.99, 0.94, 0.04, 0.)  # adjust the position and spacing of the subplots


plot_gallery("First centered Olivetti faces", faces[:n_components])
###############################################################################

estimators = [
    ('Eigenfaces - PCA using randomized SVD',
     decomposition.PCA(n_components=6, whiten=True)),  # PCA instance

    ('Non-negative components - NMF',
     decomposition.NMF(n_components=6, init='nndsvda', tol=5e-3))  # NMF instance
]  # store both estimators in a list

###############################################################################

for name, estimator in estimators:
    print("Extracting the top %d %s..." % (n_components, name))
    print(faces.shape)                              # run PCA and NMF in turn
    estimator.fit(faces)                            # fit PCA or NMF to extract features
    components_ = estimator.components_             # the extracted features
    plot_gallery(name, components_[:n_components])  # display them with the fixed layout

plt.show()  # visualization

② NMF() method parameter

In the sklearn library, the NMF algorithm is loaded with sklearn.decomposition.NMF. Its main parameters are:

n_components: specifies k, the number of components in the factorization (the inner dimension of W and H)

init: the initialization method for the W and H matrices; the default is 'nndsvdar'

Source: Python Machine Learning Applications, Li Xin and Song Tian, Beijing Institute of Technology, MOOC
