Data discretization

Data discretization (also known as data grouping) is the process of converting a continuous attribute into a finite number of discrete intervals, or groups.

Depending on whether the class attribute is considered during discretization, algorithms can be divided into supervised and unsupervised. Because a supervised algorithm (such as entropy-based discretization) makes full use of the class attribute, it can generally achieve higher accuracy in classification.

The grouping methods described below all require sorted data; assume the data to be discretized is arranged in ascending order.

1. Equal width grouping

The principle of equal width grouping is: a fixed width is computed from the number of groups, and every group covers an interval of the same width.

For example, divide the variables (1, 7, 12, 12, 22, 30, 34, 38, 46) into three groups:

(1) The width is 15: subtract the minimum value (1) from the maximum value (46) in the variables, then divide the difference (45) by the number of groups (3).

(2) The range of each group is: [1, 16], (16, 31], (31, 46]

(3) The resulting groups are: (1, 7, 12, 12), (22, 30), (34, 38, 46)
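The steps above can be sketched in plain Python; the function name and structure are illustrative, not from a library:

```python
def equal_width_groups(values, k):
    """Split values into k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # Clamp the index so the maximum value falls into the last bin
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

data = [1, 7, 12, 12, 22, 30, 34, 38, 46]
print(equal_width_groups(data, 3))
# [[1, 7, 12, 12], [22, 30], [34, 38, 46]]
```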

2. Equal frequency grouping

Equal frequency grouping, also called quantile grouping, places the same number of variables in each group.

Grouping the same variables by equal frequency gives (1, 7, 12), (12, 22, 30), (34, 38, 46).
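A rough sketch of equal frequency grouping, assuming the number of values divides evenly by the number of groups (real quantile binning must also handle remainders):

```python
def equal_freq_groups(values, k):
    """Split sorted values into k groups of equal size."""
    ordered = sorted(values)
    size = len(ordered) // k
    return [ordered[i * size:(i + 1) * size] for i in range(k)]

data = [1, 7, 12, 12, 22, 30, 34, 38, 46]
print(equal_freq_groups(data, 3))
# [[1, 7, 12], [12, 22, 30], [34, 38, 46]]
```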

Be careful:

Equal width grouping and equal frequency grouping are simple to implement, but the number of groups must be specified manually.

The disadvantage of equal width grouping is that it is sensitive to outliers and distributes values unevenly across intervals: some intervals contain many variables while others contain few.

Equal frequency grouping avoids this disadvantage, but the same value may be assigned to different groups in order to keep the group sizes equal (note how the value 12 falls into two different groups above).

3. Single variable grouping

Univariate grouping is also called rank grouping. The principle is: all variables are sorted in ascending or descending order and grouped by rank, so variables with the same value fall into the same group.

The results are: (1), (7), (12,12), (22), (30), (34), (38), (46)
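Rank grouping can be sketched with the standard library's itertools.groupby, which groups consecutive equal values in a sorted sequence:

```python
from itertools import groupby

def rank_groups(values):
    """Sort values and put equal values into the same group."""
    return [list(g) for _, g in groupby(sorted(values))]

data = [1, 7, 12, 12, 22, 30, 34, 38, 46]
print(rank_groups(data))
# [[1], [7], [12, 12], [22], [30], [34], [38], [46]]
```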

4. Grouping based on information entropy

Before describing the method, let's review the concepts of information quantity and entropy.

(1) Amount of information

Shannon, known as "the father of information theory", held that information is what eliminates random uncertainty. In other words, the amount of information is measured by the degree to which it reduces uncertainty.

The amount of information is inversely related to the probability of the event: the rarer the event, the more information its occurrence carries. For an event x with probability p(x), the amount of information can be expressed as:

I(x) = -log2 p(x)

The amount of information measures the information brought by a single specific event.

(2) Entropy

Entropy is the expectation of the amount of information before the outcome is known, taken over all possible values of the random variable, that is, the expected information brought by all possible events:

H(X) = -Σ p(x_i) · log2 p(x_i), summed over i = 1, …, n

where p(x_i) is the probability of the i-th event and n is the number of possible categories of X.

When a data set S is partitioned into subsets S_1, …, S_m, the total entropy E of the partition is the weighted average of the entropies of the subsets:

E = Σ (|S_i| / |S|) · H(S_i), summed over i = 1, …, m

where |S_i| / |S| is the proportion of records that fall into the i-th subset, |S_i| is the number of records in that subset, and |S| is the total number of records.

Be careful:

Entropy measures the uncertainty of a sample set: the greater the entropy, the greater the uncertainty of the sample.
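As a small check of the entropy formula, the snippet below computes the entropy of a binary label list with 12 ones and 8 zeros (the same label proportions as the sample data used in the code further down):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

labels = [1] * 12 + [0] * 8
print(round(entropy(labels), 4))
# 0.971
```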

The specific method of data grouping based on information entropy is as follows:

(1) Sort all values of attribute A in ascending order;

(2) Traverse each value of attribute A as a candidate split point; splitting at that value divides the data set into two intervals S1 and S2. Choose the split point that minimizes the weighted entropy E after the split.

(3) If the entropy after the split is still greater than the set threshold and the number of groups is less than the specified maximum, recursively apply step (2) to S1 and S2.

# -*- coding: utf-8 -*-
import numpy as np
import math


class DiscreteByEntropy:
    def __init__(self, group, threshold):
        self.maxGroup = group  # Maximum number of groups
        self.minInfoThreshold = threshold  # Stop splitting below this entropy
        self.result = dict()  # Holds the partition result

    # Prepare the sample data: each row is (value, class label)
    def loadData(self):
        data = np.array(
            [
                [56, 1], [87, 1], [129, 0], [23, 0], [342, 1],
                [641, 1], [63, 0], [2764, 1], [2323, 0], [453, 1],
                [10, 1], [9, 0], [88, 1], [222, 0], [97, 0],
                [2398, 1], [592, 1], [561, 1], [764, 0], [121, 1],
            ]
        )
        return data

    # Compute the Shannon entropy of a data set; this is the score used
    # when evaluating candidate split points below.
    def calEntropy(self, data):
        numData = len(data)
        labelCounts = {}
        for feature in data:
            # The class label is the last column
            oneLabel = feature[-1]
            # Create the entry if this label is not in the dictionary yet
            labelCounts.setdefault(oneLabel, 0)
            # Count the records under this label
            labelCounts[oneLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            # Probability of this label
            prob = float(labelCounts[key]) / numData
            # Logarithm to base 2
            shannonEnt -= prob * math.log(prob, 2)
        return shannonEnt

    # Find the best split point: try every candidate position and keep the
    # split that minimizes the weighted average entropy of the two halves.
    def split(self, data):
        # np.inf is positive infinity
        minEntropy = np.inf
        # Index of the final split point
        index = -1
        # Sort the data in ascending order by the first column
        sortData = data[np.argsort(data[:, 0])]
        # Entropies of the two halves at the best split found so far
        lastE1, lastE2 = -1, -1
        # Returned structures: each holds the data and its entropy
        S1 = dict()
        S2 = dict()
        # Stop one short of the end: splitting after the last row would
        # leave the second half empty
        for i in range(len(sortData) - 1):
            # Split the data set at position i
            splitData1, splitData2 = sortData[: i + 1], sortData[i + 1:]
            entropy1, entropy2 = (
                self.calEntropy(splitData1),
                self.calEntropy(splitData2),
            )
            # Weighted average entropy of the two halves
            entropy = entropy1 * len(splitData1) / len(sortData) + \
                entropy2 * len(splitData2) / len(sortData)
            # Keep this split if its weighted entropy is the smallest so far
            if entropy < minEntropy:
                minEntropy = entropy
                index = i
                lastE1 = entropy1
                lastE2 = entropy2
        S1["entropy"] = lastE1
        S1["data"] = sortData[: index + 1]
        S2["entropy"] = lastE2
        S2["data"] = sortData[index + 1:]
        # Return the minimum entropy, not the entropy of the last candidate
        return S1, S2, minEntropy

    # Discretize the data: recursively split groups until the entropy falls
    # below the threshold or the maximum number of groups is reached
    def train(self, data):
        # Keys of the groups that still need splitting
        needSplitKey = [0]
        # Start with the whole data set as one group
        self.result.setdefault(0, {})
        self.result[0]["entropy"] = np.inf
        self.result[0]["data"] = data
        group = 1
        for key in needSplitKey:
            S1, S2, entropy = self.split(self.result[key]["data"])
            # Split further only while the entropy is above the threshold
            # and the group limit has not been reached
            if entropy > self.minInfoThreshold and group < self.maxGroup:
                self.result[key] = S1
                newKey = max(self.result.keys()) + 1
                self.result[newKey] = S2
                needSplitKey.extend([key])
                needSplitKey.extend([newKey])
                group += 1
            else:
                break


if __name__ == "__main__":
    dbe = DiscreteByEntropy(group=6, threshold=0.5)
    data = dbe.loadData()
    dbe.train(data)
    print("result is {}".format(dbe.result))