sklearn Classification Algorithms: Decision Trees

Information entropy (the concept)


H(X) = -Σᵢ p(xᵢ) · log p(xᵢ)

That is: multiply each category's probability by the log of that probability, sum the terms, and negate.

Information entropy measures uncertainty: the smaller the entropy, the smaller the uncertainty; the greater the entropy, the greater the uncertainty.
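A quick sketch of this idea: a fair coin is maximally uncertain (1 bit of entropy), while a heavily biased coin is much more predictable. The `entropy` helper below is written for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_fair = entropy([0.5, 0.5])    # fair coin: 1.0 bit, maximum uncertainty
h_biased = entropy([0.9, 0.1])  # biased coin: about 0.469 bits, less uncertain
print(h_fair, h_biased)
```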



Information gain:

The degree to which the information entropy decreases once feature A is known:
Gain(D, A) = Initial Information Entropy − Conditional Information Entropy of A

Initial Information Entropy: look only at the target values, e.g. yes (9/15) or no (6/15).
Initial Information Entropy = -(9/15 · log(9/15) + 6/15 · log(6/15))

Age Conditional Information Entropy: the feature splits the samples into Youth (5/15), Middle Age (5/15), and Old Age (5/15), so we take the weighted average of each group's entropy:

Age Information Entropy = 5/15 · H(Youth) + 5/15 · H(Middle Age) + 5/15 · H(Old Age)

where, for example, with 2 yes and 3 no among the Youth samples:

H(Youth) = -(2/5 · log(2/5) + 3/5 · log(3/5))
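The calculation above can be sketched in a few lines. The initial entropy and the Youth split come from the text; the yes/no counts for the Middle Age and Old Age groups are assumed here purely to make the example complete.

```python
import math

def entropy(counts):
    """Entropy in bits from raw class counts, e.g. [9, 6]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Initial entropy: 9 "yes" and 6 "no" out of 15 samples
h_initial = entropy([9, 6])  # about 0.971 bits

# Conditional entropy of the age feature. Youth is 2 yes / 3 no as in the
# text; the Middle Age and Old Age counts are assumed for illustration.
groups = {'Youth': [2, 3], 'Middle Age': [3, 2], 'Old Age': [4, 1]}
h_age = sum(sum(c) / 15 * entropy(c) for c in groups.values())

gain_age = h_initial - h_age  # information gain of the age feature
print(h_initial, h_age, gain_age)
```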





Decision Tree: features with greater information gain (i.e., those that reduce uncertainty the most) are placed closer to the root of the tree.

API:

from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier(criterion='gini', max_depth=None,random_state=None)

criterion: defaults to 'gini' (the Gini coefficient); you can also choose 'entropy' to split by information gain.

max_depth: the maximum depth of the tree.
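A minimal usage sketch of these parameters, on a tiny XOR-like toy dataset (invented here purely for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: class is 1 when exactly one feature is 1 (XOR)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# Split by information gain instead of the default Gini coefficient
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.predict([[0, 1]]))  # a depth-3 tree separates XOR perfectly
```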

Code

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

def decision_tree():
    data = pd.read_csv('../taitanlike.txt')
    data = data.drop(['row.names'], axis=1)

    # Fill in missing ages with the column mean
    data['age'] = data['age'].fillna(data['age'].mean())
    print(data.info())
    y = data['survived']
    x = data[['pclass', 'age', 'sex']]
    # Categorical features (pclass, sex) are strings, so one-hot encode them
    x = x.to_dict(orient='records')
    dv = DictVectorizer(sparse=False)
    x = dv.fit_transform(x)
    print(dv.get_feature_names())  # on sklearn >= 1.2, use dv.get_feature_names_out()
    print(x)
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)
    # Train the model
    dtc = DecisionTreeClassifier(criterion='gini', max_depth=9)
    dtc.fit(x_train, y_train)
    dtc.predict(x_test)
    # Accuracy on the held-out test set
    score = dtc.score(x_test, y_test)
    print(score)

decision_tree()

['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
[[29.          1.          0.          0.          1.          0.        ]
 [ 2.          1.          0.          0.          1.          0.        ]
 [30.          1.          0.          0.          0.          1.        ]
 ...
 [31.19418104  0.          0.          1.          0.          1.        ]
 [31.19418104  0.          0.          1.          1.          0.        ]
 [31.19418104  0.          0.          1.          0.          1.        ]]
0.8022813688212928

Tags: encoding

Posted on Sun, 06 Oct 2019 14:22:40 -0700 by PcGeniusProductions