Data preprocessing -- data discretization

Data discretization
Data discretization (also known as data grouping or binning) refers to mapping continuous data into a set of discrete intervals.
Depending on whether the class attribute is taken into account during discretization, discretization algorithms can be divided into supervised and unsupervised ones. Supervised algorithms (such as discretization based on information entropy) make full use of the class attribute, so they usually achieve higher accuracy in classification.
All of the grouping methods described below require sorted data; in the following it is assumed that the data to be discretized is arranged in ascending order.
1. Equal width grouping
The principle of equal width grouping is that a fixed width is derived from the number of groups, so that every group covers an interval of the same width.
For example, divide the set of values (1, 7, 12, 12, 22, 30, 34, 38, 46) into three groups:
(1) The width is 15: subtract the minimum value (1) from the maximum value (46) and divide the difference by the number of groups, 3, giving (46 - 1) / 3 = 15.
(2) The ranges of the groups are: [1, 16], (16, 31], (31, 46]
(3) The resulting groups are: (1, 7, 12, 12), (22, 30), (34, 38, 46)
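
A minimal Python sketch of this calculation (using NumPy; the variable names are illustrative and not part of the original example):

import numpy as np

values = np.array([1, 7, 12, 12, 22, 30, 34, 38, 46])
groups = 3
width = (values.max() - values.min()) / groups  # (46 - 1) / 3 = 15
edges = values.min() + width * np.arange(groups + 1)  # 1, 16, 31, 46
# Assign each value to the bin whose edges contain it
bins = np.digitize(values, edges[1:-1], right=True)
for g in range(groups):
    print(g, values[bins == g])
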
2. Equal frequency grouping
Equal frequency grouping is also called quantile grouping: after grouping, every group contains the same number of values.
Grouping the values above in this way gives (1, 7, 12), (12, 22, 30), (34, 38, 46).
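
A rough equal frequency sketch (again with NumPy; np.array_split simply cuts the sorted values into chunks of nearly equal size):

import numpy as np

values = np.sort(np.array([1, 7, 12, 12, 22, 30, 34, 38, 46]))
groups = 3
# Cut the sorted values into three chunks of equal size
for i, chunk in enumerate(np.array_split(values, groups)):
    print(i, chunk)
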
Note:
Equal width grouping and equal frequency grouping are simple to implement, but the number of groups must be specified manually.
The disadvantage of equal width grouping is that it is sensitive to outliers and distributes the attribute values unevenly across the intervals: some intervals contain many values while others contain very few.
Equal frequency grouping avoids this drawback, but identical values may be assigned to different groups (in order to keep the number of values in each group the same).
3. Single variable grouping
Univariate grouping is also called rank grouping. The principle is to sort all values in ascending (or descending) order and use the ranks as the grouping result, so that values that are equal fall into the same group.
The results are: (1), (7), (12,12), (22), (30), (34), (38), (46)
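
The same result can be sketched with itertools.groupby on the sorted values (names here are illustrative):

from itertools import groupby

values = sorted([1, 7, 12, 12, 22, 30, 34, 38, 46])
# Consecutive equal values form one group
result = [list(g) for _, g in groupby(values)]
print(result)  # [[1], [7], [12, 12], [22], [30], [34], [38], [46]]
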
4. Grouping based on information entropy
Let us first review the concepts of information content and entropy.
(1) Amount of information
Shannon, known as "the father of information theory", held that "information is what eliminates random uncertainty". That is, the amount of information is measured by how much uncertainty the information removes.
The amount of information is inversely related to the probability of the event and can be expressed as

    I(x) = -log2 p(x)

where p(x) is the probability that event x occurs. The information content measures the information brought by one specific event.

(2) Entropy
Entropy is the expected amount of information before the outcome is known: it takes all possible values of the random variable into account and is the expectation of the information content over all possible events:

    H(X) = -Σ p(x_i) · log2 p(x_i),  i = 1, 2, ..., n

where p(x_i) is the probability of event x_i and n is the number of categories of X.
When the data are divided into subsets according to the possible values of the random variable, the total entropy E of the data is the weighted average of the entropies of the subsets:

    E = Σ (m_i / m) · H(S_i)

where m_i / m is the proportion of values falling into the i-th subset S_i, m_i is the number of occurrences of the i-th possible value, and m is the total number of occurrences of all values.
Note:
Entropy represents the uncertainty of a sample set: the greater the entropy, the greater the uncertainty of the samples.
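As a small worked example (not from the original text): if a set contains four samples, three with label 1 and one with label 0, its entropy is -(3/4)·log2(3/4) - (1/4)·log2(1/4) ≈ 0.811. The same calculation in Python:

import math

labels = [1, 1, 1, 0]
counts = {label: labels.count(label) for label in set(labels)}
entropy = -sum(
    (c / len(labels)) * math.log(c / len(labels), 2) for c in counts.values()
)
print(entropy)  # roughly 0.811
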
The specific method of data grouping based on information entropy is as follows:
(1) Sort all values of attribute A from small to large;
(2) Traverse each value of attribute A, using it as a candidate split point that divides the data set into two intervals S1 and S2, and choose the split point for which the weighted entropy after the split is minimal;
(3) While the entropy after the split is still greater than the specified threshold and the number of groups is less than the specified maximum, recursively apply step (2) to S1 and S2. The Python implementation below follows these steps.

# -*-coding:utf-8-*-


import numpy as np
import math


class DiscreteByEntropy:
    def __init__(self, group, threshold):
        self.maxGroup = group  # Maximum number of groups
        self.minInfoThreshold = threshold  # Minimum entropy of stopping partition
        self.result = dict()  # Save partition results

    # Prepare the sample data
    def loadData(self):
        data = np.array(
            [
                [56, 1], [87, 1], [129, 0], [23, 0], [342, 1],
                [641, 1], [63, 0], [2764, 1], [2323, 0], [453, 1],
                [10, 1], [9, 0], [88, 1], [222, 0], [97, 0],
                [2398, 1], [592, 1], [561, 1], [764, 0], [121, 1],
            ]
        )
        return data

    # Calculate the information entropy of the data, in preparation for splitting the data set in the next step.

    # Calculate Shannon entropy after data grouping
    def calEntropy(self, data):
        numData = len(data)
        labelCounts = {}
        for feature in data:
            # Get labels
            oneLabel = feature[-1]
            # If the label is not yet in the dictionary, initialize its count to 0
            labelCounts.setdefault(oneLabel, 0)
            # The number of data under this type of label
            labelCounts[oneLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            # Probability of this label
            prob = float(labelCounts[key]) / numData
            # Base 2 for logarithm
            shannonEnt -= prob * math.log(prob, 2)
        return shannonEnt

    # To find the best split point, traverse all attribute values and choose the split that minimizes the weighted average entropy.

    # Split the data set so that the weighted entropy after the split is minimal
    def split(self, data):
        # inf is positive infinity
        minEntropy = np.inf
        # Record final split index
        index = -1
        # Sort the data in ascending order according to the first column
        sortData = data[np.argsort(data[:, 0])]
        # Entropies of the two final sub-sets, initialized before the search
        lastE1, lastE2 = -1, -1
        # Data structure returned, including data and corresponding entropy
        S1 = dict()
        S2 = dict()
        for i in range(len(sortData)):
            # Split data set
            splitData1, splitData2 = sortData[: i + 1], sortData[i + 1:]
            entropy1, entropy2 = (
                self.calEntropy(splitData1),
                self.calEntropy(splitData2),
            )  # Calculating information entropy
            entropy = entropy1 * len(splitData1) / len(sortData) + \
                      entropy2 * len(splitData2) / len(sortData)
            # If the weighted entropy is smaller than the current minimum
            if entropy < minEntropy:
                minEntropy = entropy
                index = i
                lastE1 = entropy1
                lastE2 = entropy2
        S1["entropy"] = lastE1
        S1["data"] = sortData[: index + 1]
        S2["entropy"] = lastE2
        S2["data"] = sortData[index + 1:]
        # Return the weighted entropy of the best split, not that of the last candidate
        return S1, S2, minEntropy

    # Discretization of data

    # Group data
    def train(self, data):
        # Keys of the groups that still need to be examined
        needSplitKey = [0]
        # Treat the entire data as a group
        self.result.setdefault(0, {})
        self.result[0]["entropy"] = np.inf
        self.result[0]["data"] = data
        group = 1
        for key in needSplitKey:
            S1, S2, entropy = self.split(self.result[key]["data"])
            # Split further only if the entropy is above the threshold and the group limit has not been reached
            if entropy > self.minInfoThreshold and group < self.maxGroup:
                self.result[key] = S1
                newKey = max(self.result.keys()) + 1
                self.result[newKey] = S2
                needSplitKey.extend([key])
                needSplitKey.extend([newKey])
                group += 1
            else:
                # This group needs no further splitting; continue with the remaining groups
                continue


if __name__ == "__main__":
    dbe = DiscreteByEntropy(group=6, threshold=0.5)
    data = dbe.loadData()
    dbe.train(data)
    print("result is {}".format(dbe.result))
