# Data preprocessing -- data discretization

Data discretization (also known as data binning or grouping) refers to mapping continuous data into a set of discrete intervals.
Depending on whether the class attribute is considered during discretization, algorithms can be divided into supervised and unsupervised. Because supervised algorithms (such as entropy-based discretization) make full use of the class attribute, they can achieve higher accuracy in classification.
The grouping methods described below all require sorted data; assume the data to be discretized is arranged in ascending order.
1. Equal width grouping
The principle of equal-width grouping is: a fixed width is computed from the number of groups, so that each group covers an interval of equal width.
For example, divide the set of variables (1, 7, 12, 12, 22, 30, 34, 38, 46) into three groups:
(1) The width is 15: subtract the minimum value (1) from the maximum value (46) in the variables, then divide the difference by the number of groups: (46 - 1) / 3 = 15
(2) The range of each group is: [1, 16], (16, 31], (31, 46]
(3) The resulting groups are: (1, 7, 12, 12), (22, 30), (34, 38, 46)
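The equal-width steps above can be sketched with NumPy; `values` and the group count of 3 are taken from the example:

```python
import numpy as np

# The sample variables from the example above
values = np.array([1, 7, 12, 12, 22, 30, 34, 38, 46])

# Equal-width bin edges: width (46 - 1) / 3 = 15 gives edges [1, 16, 31, 46]
edges = np.linspace(values.min(), values.max(), 4)

# Assign each value to a bin; right=True makes the intervals right-closed
bins = np.digitize(values, edges[1:-1], right=True)

for b in range(3):
    print(b, values[bins == b].tolist())
```

This prints the three groups from the example: `[1, 7, 12, 12]`, `[22, 30]`, and `[34, 38, 46]`.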
2. Equal frequency grouping
Equal-frequency grouping is also called quantile grouping: after grouping, each group contains the same number of variables.
Grouping the variables above by equal frequency gives: (1, 7, 12), (12, 22, 30), (34, 38, 46)
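Equal-frequency grouping can be sketched with `np.array_split`, again using the sample variables from the text:

```python
import numpy as np

# The sample variables, sorted in ascending order
values = np.sort(np.array([1, 7, 12, 12, 22, 30, 34, 38, 46]))

# Split into 3 groups with the same number of values in each;
# note that the two occurrences of 12 land in different groups
groups = [g.tolist() for g in np.array_split(values, 3)]
print(groups)
```

The result matches the text: `[[1, 7, 12], [12, 22, 30], [34, 38, 46]]`.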
Note:
Equal-width and equal-frequency grouping are both relatively simple to implement, but the number of groups must be specified manually.
The drawback of equal-width grouping is that it is sensitive to outliers and distributes values unevenly across the intervals: some intervals contain many variables, while others contain few.
Equal-frequency grouping avoids this drawback, but identical variable values may be placed in different groups (to keep the number of variables in each group the same).
3. Single variable grouping
Univariate grouping is also called rank grouping. The principle is: all variables are sorted in ascending or descending order, and the rank order determines the grouping, so variables with the same value fall into the same group.
The resulting groups are: (1), (7), (12, 12), (22), (30), (34), (38), (46)
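Rank grouping reduces to merging consecutive equal values after sorting, which `itertools.groupby` does directly:

```python
from itertools import groupby

# The sample variables, sorted so that equal values are adjacent
values = sorted([1, 7, 12, 12, 22, 30, 34, 38, 46])

# groupby merges runs of consecutive equal values into one group each
groups = [list(g) for _, g in groupby(values)]
print(groups)
```

This yields `[[1], [7], [12, 12], [22], [30], [34], [38], [46]]`, matching the result above.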
4. Grouping based on information entropy
First, let's review the concepts of information amount and entropy.
(1) Amount of information
Shannon, known as "the father of information theory", held that "information is what eliminates random uncertainty". That is, the amount of information is measured by the degree to which it eliminates uncertainty.
The amount of information is inversely related to the probability of the event, and can be expressed as:

$$I(x_i) = -\log_2 p(x_i)$$

The information amount measures the information brought by a single, specific event.

(2) Entropy
Entropy is the expectation of the amount of information before the outcome is known: considering all possible values of the random variable, it is the expected information amount over all possible events:

$$E(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$

where $p(x_i)$ is the probability of event $x_i$, and $n$ is the number of categories in $X$.
When the data are split into parts, the total entropy $E$ of the data is the weighted average of the entropies of all parts:

$$E = \sum_{i=1}^{k} \frac{m_i}{m} E_i$$

where $m_i/m$ is the proportion of values falling into the $i$-th part, $m_i$ is the number of values in the $i$-th part, and $m$ is the total number of values.
Note:
Entropy represents the uncertainty of a sample set: the greater the entropy, the greater the uncertainty of the samples.
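As a quick check of the entropy formula, it can be computed from label counts. This is a sketch using made-up class labels (6 positives, 4 negatives):

```python
import math
from collections import Counter

# Hypothetical class labels: 6 positives and 4 negatives
labels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 1]

counts = Counter(labels)
n = len(labels)

# E(X) = -sum(p_i * log2(p_i)) over all classes
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
print(round(entropy, 3))  # ≈ 0.971
```

A 6/4 split gives an entropy close to 1 bit, the maximum for two classes, reflecting high uncertainty.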
The specific procedure for entropy-based data grouping is as follows:
(1) Sort all values of attribute A in ascending order;
(2) Traverse each value of attribute A as a candidate split point that divides the data set into two intervals S1 and S2, and choose the split point that minimizes the total entropy after the split;
(3) While the entropy after the split is greater than the set threshold and the number of groups is less than the specified maximum, recursively apply step (2) to S1 and S2.

```python
# -*- coding: utf-8 -*-

import math

import numpy as np


class DiscreteByEntropy:
    def __init__(self, group, threshold):
        self.maxGroup = group  # maximum number of groups
        self.minInfoThreshold = threshold  # entropy threshold below which splitting stops
        self.result = dict()  # stores the partition results

    # Prepare the sample data: column 0 is the attribute value, column 1 the class label
    def loadData(self):
        data = np.array(
            [
                [56, 1], [87, 1], [129, 0], [23, 0], [342, 1],
                [641, 1], [63, 0], [2764, 1], [2323, 0], [453, 1],
                [10, 1], [9, 0], [88, 1], [222, 0], [97, 0],
                [2398, 1], [592, 1], [561, 1], [764, 0], [121, 1],
            ]
        )
        return data

    # Compute the Shannon entropy of a data set; this prepares for the split step.
    def calEntropy(self, data):
        numData = len(data)
        labelCounts = {}
        for feature in data:
            # Get the class label (last column)
            oneLabel = feature[-1]
            # Create the key with count 0 if the label is not in the dictionary yet
            labelCounts.setdefault(oneLabel, 0)
            # Count the number of records under this label
            labelCounts[oneLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            # Probability of this label
            prob = float(labelCounts[key]) / numData
            # Logarithm base 2
            shannonEnt -= prob * math.log(prob, 2)
        return shannonEnt

    # Find the best split point: traverse all attribute values and split the data
    # where the weighted average entropy of the two halves is minimal.
    def split(self, data):
        # np.inf is positive infinity
        minEntropy = np.inf
        # Index of the final split point
        index = -1
        # Sort the data in ascending order by the first column
        sortData = data[np.argsort(data[:, 0])]
        # Entropies of the two halves at the best split found so far
        lastE1, lastE2 = -1, -1
        # Returned structures, each holding one half of the data and its entropy
        S1 = dict()
        S2 = dict()
        # Stop one short of the end so the second half is never empty
        for i in range(len(sortData) - 1):
            # Split the data set after position i
            splitData1, splitData2 = sortData[: i + 1], sortData[i + 1:]
            # Information entropy of each half
            entropy1, entropy2 = (
                self.calEntropy(splitData1),
                self.calEntropy(splitData2),
            )
            # Weighted average entropy after the split
            entropy = (entropy1 * len(splitData1) +
                       entropy2 * len(splitData2)) / len(sortData)
            # Keep the split with the smallest weighted average entropy
            if entropy < minEntropy:
                minEntropy = entropy
                index = i
                lastE1 = entropy1
                lastE2 = entropy2
        S1["entropy"] = lastE1
        S1["data"] = sortData[: index + 1]
        S2["entropy"] = lastE2
        S2["data"] = sortData[index + 1:]
        return S1, S2, minEntropy

    # Discretize the data: recursively split groups until the stop condition is met
    def train(self, data):
        # Keys of the groups that still need to be examined for splitting
        needSplitKey = [0]
        # Treat the entire data set as one group at first
        self.result.setdefault(0, {})
        self.result[0]["entropy"] = np.inf
        self.result[0]["data"] = data
        group = 1
        for key in needSplitKey:
            S1, S2, entropy = self.split(self.result[key]["data"])
            # Split further only while the entropy is above the threshold
            # and the number of groups is below the maximum
            if entropy > self.minInfoThreshold and group < self.maxGroup:
                self.result[key] = S1
                newKey = max(self.result.keys()) + 1
                self.result[newKey] = S2
                needSplitKey.extend([key, newKey])
                group += 1
            else:
                break


if __name__ == "__main__":
    dbe = DiscreteByEntropy(group=6, threshold=0.5)
    data = dbe.loadData()
    dbe.train(data)
    print("Discretization result:", dbe.result)
```