# Basic notes of decision tree in machine learning practice

#### Advantages and disadvantages of decision tree

• Low computational complexity
• Output is easy to understand
• Insensitive to missing median
• Can process irrelevant feature data
###### shortcoming
• Over matching problems may occur

#### Decision tree principle

One example of A game with 20 problems is in the book "machine learning practice": one party who participates in the game thinks something in his mind. Other participants can ask him 29 questions, but the answer can only be right or wrong. For example, the simplest guessing game. I think A number is 7. Then A says that the number in your mind is smaller than 100. Then I said it right. Then B said that the number in your mind is greater than 10, and I said that the answer is wrong. Then C, D, E and others continue to ask questions until the range of that number becomes smaller and smaller, until the answer is guessed. The principle of decision tree is that the user gives A series of data, and then gives the answer of the game.
Compared with the email example in the book, I prefer the example of Jack Cui, which is easy to understand. So all the examples below are mainly Jack Cui's.
The following is a tree view of a decision tree. The structure of decision tree is mainly composed of nodes and directed edges. Nodes are divided into internal nodes and leaf nodes. Internal nodes are used to represent a feature or an attribute. A leaf node represents a class. Rectangle and ellipse are nodes. Rectangle belongs to internal node, and there are branches or leaf nodes below. The ellipse is a leaf node with no branches below. It's the end.

Explain some of the elements in the figure above:

• Rectangle represents judgment module
• The ellipse represents the termination module, which is used to get the conclusion
• The arrow from the rectangle (judgment module) is a branch that can reach another module or terminate the module
###### Flow chart explanation

This is a simple classification model of mother-in-law dating. When your mother-in-law comes, ask if you have a house or a car. If so, you can be considered.
But if you don't have a car or a house, but you are very motivated and hard-working, you can still take it into consideration. And then it's a spare wheel. If there's no car, no room and no ambition, it certainly doesn't meet the standard of mother-in-law. I'll see you soon. Of course, this is just a simple classification model with only two attributes.

###### Summary

The model of decision tree is actually a combination of many conditions or rules. Every node from root to leaf is a rule, or a class of people corresponding to the above model. Each path or rule of decision tree is mutually exclusive and covers.

#### General process of decision tree

• collecting data
• Prepare and process data
• Analyze data: you can use any method to check whether the graph meets our expectations after the tree is constructed
• Training algorithm: data structure of construction tree
• Test algorithm: use experience tree to calculate error rate
• Using algorithm and decision tree to solve practical problems, this step is applicable to any supervised learning algorithm

#### Construction of decision tree

Before constructing a decision tree, we first use information theory to divide the data set, then write code to the specific data set, and finally build the decision tree.

###### feature selection

Before constructing a decision tree, we first need to select features. What is the feature selection? **Feature selection is to select features that can classify training data. **What we choose is that the feature of the current dataset plays a decisive role in classifying the dataset. So if we want to distinguish the best results, we need to evaluate each feature. After the test, the original data set will be divided into several data subsets, which are distributed on all branches of the first decision point. If the data on a branch belongs to the same type, the data has been correctly divided, and further data division is not necessary. If the data in the data subset is not of the same type, it is necessary to divide the data subset again until the data type under the branch is the same.

If after a partition, all the branches have the same type of data, for example, the data of youth, middle age and old age, then the partition is successful, and then it can be divided recursively under this data subset.
What are the criteria for feature selection? Generally, the standard of feature selection is information gain or information gain ratio.
Let's learn from an example.

The above array is the data of a loan application. We train the above data to bring a decision tree of loan application. When a new user comes to apply for a loan, our tree can judge whether to lend or not.
Feature selection is to choose which feature is used to divide feature space. We use the example of the loan above to construct two different decision trees.

The above is the decision tree constructed according to the two characteristics of age and whether there is work or not. Which of these two decision trees is better. As we said above, we divide the training data set into our own according to the current feature, so that each of us can better classify under the current conditions, then this feature is what we should choose. In this case, we divide the data set according to the two characteristics of age and work, which is better to classify and which is better.
So that's the concept of information gain. The change of information after data set partition is information gain. It is the change of information after we divide the data set according to age and work. The change of information before and after the data set is called information gain. Knowing how to calculate information gain, we can
Calculating the information gain of each eigenvalue partition data set, the feature with the highest information gain is the best choice.
Before we can evaluate which data partition is the best, we must learn how to calculate the information gain. The measure of set information is called Shannon entropy or entropy for short. The name comes from Claude Shannon, the father of information theory.
Entropy is defined as the expected value of information. Entropy is a variable that represents the uncertainty of random variables. The greater the entropy, the greater the uncertainty of random variables. The formula for calculating information gain and entropy is as follows:

Where p(xi) is the probability of selecting the classification. It is the proportion of this classification in the total data. For example, there are two kinds of goods: A and B. There are 3 in a and 2 in B. Then the probability of p(x) of a is 3 / 5. The probability of p(x) of B is 2 / 5. The base number above can also be based on e.
In order to calculate entropy, we need to calculate the expected value of information contained in all possible values of all categories. It is the process of summing all possible values of all categories. The formula is as follows:

Where n is the number of categories. The greater the entropy, the greater the uncertainty of random variables.
When the probability in entropy is obtained by data statistics (especially maximum likelihood estimation), this entropy is the empirical entropy. Data statistics is the probability of classification obtained by data statistics (in short, we get it according to the number of data).
We define the data in the loan application sample data table as training data set D, then the empirical entropy of training data set D is H(D), | d| represents its sample capacity and sample number. There are k classes Ck, = 1,2,3 , K and|Ck|are the number of samples belonging to the class Ck, so the empirical entropy formula can be written as follows:

The loan data given above are available. There are two types of classification results. There are 15 data in total, 9 results are lending, 6 results are not lending. Then the empirical entropy is as follows:

###### Writing code
• Before writing the code, we annotate the attribute of the data set we lent.
• Age: 0 for youth, 1 for middle age, 2 for old age
• Working condition: 0 means no work, 1 means no work
• Whether you have your own house: 0 means you don't have your own house, 1 means you have your own house
• Credit situation: 0 for average, 1 for good, 2 for very good
• Loan or not: no means no yes means yes
```from math import log
'''
//Create test data set
Returns:
dataSet - data set
labels - Categorical attribute
'''
def createDataSet():
dataSet=[
[0, 0, 0, 0, 'no'],  # data set
[0, 0, 0, 1, 'no'],
[0, 1, 0, 1, 'yes'],
[0, 1, 1, 0, 'yes'],
[0, 0, 0, 0, 'no'],
[1, 0, 0, 0, 'no'],
[1, 0, 0, 1, 'no'],
[1, 1, 1, 1, 'yes'],
[1, 0, 1, 2, 'yes'],
[1, 0, 1, 2, 'yes'],
[2, 0, 1, 2, 'yes'],
[2, 0, 1, 1, 'yes'],
[2, 1, 0, 1, 'yes'],
[2, 1, 0, 2, 'yes'],
[2, 0, 0, 0, 'no']
]
labels=['lending','No lending']
return dataSet,labels

'''
//Calculating empirical entropy
Parameters:
dataSet - data set
Returns:
shannonEnt - Empirical entropy(Shannon entropy)
'''
def calShannonEnt(dataSet):
##Returns the number of rows in the dataset
numEntires=len(dataSet)
#Save the number of times each label appears
labelCounts={}
#Statistics for each eigenvector
for featVec in dataSet:
#Get the label
currentLabel=featVec[-1]
if currentLabel not in labelCounts.keys():
#If the label is not in labelCounts, put it in
labelCounts[currentLabel]=0
#Count if the label is in labelCounts
labelCounts[currentLabel]+=1
#Calculating empirical entropy
ShannonEnt=0.0
for key in labelCounts:
#Calculate the probability of this tag
prob=float(labelCounts[key])/numEntires
ShannonEnt -= prob*log(prob,2)
return ShannonEnt
if __name__ == '__main__':
#Call function to create dataset
dataSet,features=createDataSet()
print(dataSet)
#Calculating empirical entropy
print(calShannonEnt(dataSet))
```

Print results:

#### Partition data set

After calculating the entropy, we use entropy to divide the data set.

#### Conditional entropy

Information gain is used to measure features. The larger the information gain, the greater the impact of features on the final classification results.
We need to understand conditional entropy before learning information gain.
Conditional entropy H(Y|X) indicates the uncertainty of random variable Y under the condition of known random variable X.

When the probability of conditional entropy is obtained from data statistics (especially maximum likelihood estimation), the corresponding conditional entropy is called conditional empirical entropy.

#### information gain

Information gain is used to measure features. The larger the information gain, the greater the impact of features on the final classification results. Therefore, the information gain g(D,A) of feature A for training set D is defined as the difference between the empirical entropy H(D) of set D and the empirical conditional entropy H(D|A) of feature A under A given condition. The formula is as follows:

The gain of information is equivalent to the difference between empirical entropy and conditional empirical entropy.

Let characteristic a have n different values {a1,a2 The training set D is divided into n subsets {D1, D2,Dn}. |Di|is the number of samples of di. Note that the set of samples belonging to Ck in subset Di is Dik, that is, Dik = Di ∩ Ck, an d| dik | is the number of samples of Dik.

Let's use the above example (loan application model) to sort it out.
The data in the age column, feature A1, is divided into three categories: youth, middle age and old age. For the time being, let's look at the youth data. The data of youth is 5, so the probability of the data of age is youth appearing in the training set is 5 / 15. Data on age as middle-aged and old are also five fifths more likely to appear. Young people's data loans are two fifths more likely to succeed. The probability of success for loans aged middle and old is three fifths and four fifths. The information gain for calculating age is as follows:

Calculate the information gain of working g(D,A2), housing g(D,A3) and credit g(D,A4) as follows:

By comparison, the gain of g(D,A3) is the largest, which shows that the feature is the best.

#### Calculate information gain

```from math import log
'''
//Create test data set
Returns:
dataSet - data set
labels - Categorical attribute
'''
def createDataSet():
dataSet=[
[0, 0, 0, 0, 'no'],  # data set
[0, 0, 0, 1, 'no'],
[0, 1, 0, 1, 'yes'],
[0, 1, 1, 0, 'yes'],
[0, 0, 0, 0, 'no'],
[1, 0, 0, 0, 'no'],
[1, 0, 0, 1, 'no'],
[1, 1, 1, 1, 'yes'],
[1, 0, 1, 2, 'yes'],
[1, 0, 1, 2, 'yes'],
[2, 0, 1, 2, 'yes'],
[2, 0, 1, 1, 'yes'],
[2, 1, 0, 1, 'yes'],
[2, 1, 0, 2, 'yes'],
[2, 0, 0, 0, 'no']
]
labels=['lending','No lending']
return dataSet,labels

'''
//Calculating empirical entropy
Parameters:
dataSet - data set
Returns:
shannonEnt - Empirical entropy(Shannon entropy)
'''
def calShannonEnt(dataSet):
##Returns the number of rows in the dataset
numEntires=len(dataSet)
#Save the number of times each label appears
labelCounts={}
#Statistics for each eigenvector
for featVec in dataSet:
#Get the label
currentLabel=featVec[-1]
if currentLabel not in labelCounts.keys():
#If the label is not in labelCounts, put it in
labelCounts[currentLabel]=0
#Count if the label is in labelCounts
labelCounts[currentLabel]+=1
#Calculating empirical entropy
ShannonEnt=0.0
for key in labelCounts:
#Calculate the probability of this tag
prob=float(labelCounts[key])/numEntires
ShannonEnt -= prob*log(prob,2)
return ShannonEnt

def calcShannonEnt(dataSet):
numEntires = len(dataSet)  # Returns the number of rows in the dataset
labelCounts = {}  # Save a dictionary of the number of occurrences of each label
for featVec in dataSet:  # Statistics of each group of eigenvectors
currentLabel = featVec[-1]  # Extract label information
if currentLabel not in labelCounts.keys():  # If the label is not put into the dictionary of statistics times, add it
labelCounts[currentLabel] = 0
labelCounts[currentLabel] += 1  # Label count
shannonEnt = 0.0  # Empirical entropy (Shannon entropy)
for key in labelCounts:  # Calculating Shannon entropy
prob = float(labelCounts[key]) / numEntires  # Probability of selecting the label
shannonEnt -= prob * log(prob, 2)  # Calculation by formula
return shannonEnt  # Return to empirical entropy (Shannon entropy)

'''
//Divide data set according to features
Parameters:
dataSet - Data set to be divided
axis - Characteristics of partitioned datasets
value - The value of the characteristic to be returned
Returns:
//nothing
'''
def splitDataSet(dataSet, axis, value):
retDataSet = []                                        #Create a list of returned datasets
for featVec in dataSet:                             #Traversal data set
if featVec[axis] == value:
reducedFeatVec = featVec[:axis]                #Remove axis features
reducedFeatVec.extend(featVec[axis+1:])     #Add eligible to returned dataset
retDataSet.append(reducedFeatVec)
return retDataSet                                      #Return the partitioned dataset

'''
//Description: select the best feature
Parameters:
dataSet - data set
Returns:
bestFeature - Maximum information gain(optimal)Index value of characteristic
'''
def chooseBestFeatureToSplit(dataSet):
#Number of features
numFeatures=len(dataSet[0])-1
#Calculating entropy of data set
baseEntropy=calShannonEnt(dataSet)
#Information gain variable
bestInfoGain=0.0
#Optimal characteristic index value
bestFeature=-1
#Traverse all features
for i in range(numFeatures):
#Get the i th feature of dataSet
featList=[example[i] for example in dataSet]
#Note that the elements in the set are not repeatable
uniqueVals = set(featList)
#Empirical conditional entropy
newEntropy=0.0
#Calculate information gain
for value in uniqueVals:
#Partition subset
subDataSet=splitDataSet(dataSet,i,value)
#Calculate subset probability
prob=len(subDataSet)/float(len(dataSet))
#Calculating empirical entropy
newEntropy += prob * calShannonEnt(subDataSet)
#information gain
infoGain=baseEntropy-newEntropy
#Print the information gain of each feature
print("The first%d The gain of the features is%.3f" % (i, infoGain))
#Calculate information gain
if(infoGain > bestInfoGain):
#Update information gain to maximum information gain
bestInfoGain=infoGain
#Index value of the feature with the largest gain of recorded information
bestFeature=i
#Returns the index value of the maximum gain
return bestFeature

if __name__ == '__main__':
#Call function to create dataset
dataSet,features=createDataSet()
print("Optimal characteristic index value:" + str(chooseBestFeatureToSplit(dataSet)))

#print(dataSet)
#Calculating empirical entropy
# print(calShannonEnt(dataSet))
```

#### Learning materials

1. Jack Cui's personal blog
2. Machine learning practice
3. Machine learning
116 original articles published, 72 praised, 210000 visitors+

Tags: Attribute

Posted on Wed, 29 Jan 2020 07:16:00 -0800 by klance