Machine Learning Feature Engineering

Python machine learning

3-Day Quick Start to Python Machine Learning in 2018 [Dark Horse Programmer]

(2) Feature Engineering

1. Dictionary feature extraction

from sklearn.feature_extraction import DictVectorizer


def dict_demo():
    '''
    Dictionary feature extraction
    :return:
    '''
    data = [{'city': 'Beijing', 'temperature': 100},
            {'city': 'Shanghai', 'temperature': 60},
            {'city': 'Shenzhen', 'temperature': 30}]
    # 1. Instantiate a converter class
    transfer = DictVectorizer()  
    # 2. call
    data_new = transfer.fit_transform(data)
    print('data_new:\n', data_new)
    print('Feature name: \n', transfer.get_feature_names())

    return None


if __name__ == '__main__':
    dict_demo()

data_new:
   (0, 1)	1.0
  (0, 3)	100.0
  (1, 0)	1.0
  (1, 3)	60.0
  (2, 2)	1.0
  (2, 3)	30.0
Feature name: 
 ['city=Shanghai', 'city=Beijing', 'city=Shenzhen', 'temperature']
 
Process finished with exit code 0

The output here is of sparse matrix type.

  • Sparse matrix: only the coordinates and values of the non-zero entries are stored (and printed)
  • You can pass sparse=False to the constructor to get a more readable dense result (usually not needed)

Let's look at the output as a non-sparse (dense) matrix:

transfer = DictVectorizer(sparse=False)  # sparse defaults to True, which returns a sparse matrix that records only the positions (coordinates) of the non-zero values
data_new:
 [[  0.   1.   0. 100.]
 [  1.   0.   0.  60.]
 [  0.   0.   1.  30.]]
Feature name: 
 ['city=Shanghai', 'city=Beijing', 'city=Shenzhen', 'temperature']
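
If you keep the default sparse=True, you can also obtain the dense form on demand by calling .toarray() on the returned sparse matrix (the text examples below do exactly this). A minimal sketch reusing the same data:

from sklearn.feature_extraction import DictVectorizer

data = [{'city': 'Beijing', 'temperature': 100},
        {'city': 'Shanghai', 'temperature': 60},
        {'city': 'Shenzhen', 'temperature': 30}]
transfer = DictVectorizer()              # sparse=True by default
data_new = transfer.fit_transform(data)  # scipy sparse matrix
print(data_new.toarray())                # convert to a dense ndarray for display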

2. Text feature extraction

General process:

  • 1. Instantiate a converter class
  • 2. Call fit_transform (see the sketch below)
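
fit_transform is simply fit (learn the vocabulary or statistics) followed by transform (apply them). To reuse a fitted converter on new data, call transform alone. A minimal sketch with made-up example sentences:

from sklearn.feature_extraction.text import CountVectorizer

train_data = ["life is short, i like python"]
new_data = ["life is too long"]

transfer = CountVectorizer()
transfer.fit(train_data)                        # learn the vocabulary from the training text
print(transfer.transform(train_data).toarray())
print(transfer.transform(new_data).toarray())   # reuse the same vocabulary on new text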

2.1 CountVectorizer

2.1.1 English text: words are separated by spaces

from sklearn.feature_extraction.text import CountVectorizer

def count_demo():
    '''
    Text feature extraction: CountVectorizer
    :return:
    '''
    data = ["life is short,i like like python", "life is too long,i dislike python"]
    # Instantiate a converter class
    transfer = CountVectorizer()

    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print('data_new:\n', data_new.toarray())
    print('Feature name:\n', transfer.get_feature_names())
    return None


if __name__ == '__main__':
    count_demo()

data_new:
 [[0 1 1 2 0 1 1 0]
 [1 1 1 0 1 1 0 1]]
Feature name:
 ['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']

Process finished with exit code 0
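
Note that the single-character word "i" does not appear in the feature names: CountVectorizer's default token pattern only keeps tokens of two or more characters. Common English function words can also be dropped with the built-in stop-word list; a minimal sketch on the same data:

from sklearn.feature_extraction.text import CountVectorizer

data = ["life is short,i like like python", "life is too long,i dislike python"]
# stop_words='english' removes common English words such as 'is' and 'too'
transfer = CountVectorizer(stop_words='english')
data_new = transfer.fit_transform(data)
print(data_new.toarray())
print(transfer.get_feature_names())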

2.1.2 Chinese text (manually segmented)

from sklearn.feature_extraction.text import CountVectorizer

def count_chinese_demo():
    '''
    Chinese text feature extraction: CountVectorizer
    :return:
    '''
    data = ["I love Tiananmen, Beijing", 'The sun rises on Tian'anmen Gate']
    # Instantiate a converter class
    transfer = CountVectorizer()

    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print('data_new:\n', data_new.toarray())
    print('Feature name:\n', transfer.get_feature_names())
    return None

if __name__ == '__main__':
    count_chinese_demo()

data_new:
 [[1 1 0 1]
 [0 1 1 0]]
Feature name:
 ['Beijing', 'Tiananmen', 'sunlight', 'I love']

Process finished with exit code 0

However, here the words could only be separated because we added spaces manually. Is there a way to segment Chinese words automatically? Next we use the jieba library (which needs to be installed, e.g. with pip install jieba).

2.1.3 automatic segmentation of Chinese text

from sklearn.feature_extraction.text import CountVectorizer
import jieba


def cut_word(text):
    '''
    Segment Chinese text with jieba and join the words with spaces,
    so that CountVectorizer can split the sentence into words
    :param text:
    :return:
    '''
    a = jieba.cut(text)  # jieba.cut returns a generator
    a = ' '.join(list(a))
    return a


def count_chinese_demo2():
    '''
    Chinese text feature extraction, automatic word segmentation
    :return:
    '''
    data = ["One is that today is cruel, tomorrow is crueler, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.",
            "We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.",
            "If you only know something in one way, you won't really know it. The secret of knowing the true meaning of things depends on how we relate them to what we know."]
    # 1. Chinese text is segmented and put into data_new
    # data_new = []
    # for sent in data:
    #     data_new.append(cut_word(sent))
    data_new = [cut_word(sent) for sent in data]
    # 2. Instantiate a converter class
    # stop_words specifies words to exclude; they will not be used as feature words
    transfer = CountVectorizer(stop_words=['one kind', 'therefore'])

    # 3. Call fit_transform
    data_final = transfer.fit_transform(data_new)
    print('data_new:\n', data_final.toarray())
    print('Feature name:\n', transfer.get_feature_names())
    return None


if __name__ == '__main__':
    count_chinese_demo2()

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\14360\AppData\Local\Temp\jieba.cache
Loading model cost 0.668 seconds.
Prefix dict has been built successfully.
data_new:
 [[0 1 0 0 0 2 0 0 0 0 0 1 0 1 0 0 0 0 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 1 0]
 [0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 3 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 1]
 [1 0 0 4 3 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0 0]]
Feature name:
 ['Can't', 'Do not', 'before', 'understand', 'Thing', 'Today', 'Just in', 'Millions of years', 'Issue', 'Depending on', 'only need', 'The day after tomorrow', 'Meaning', 'Gross', 'How', 'If', 'universe', 'We', 'give up', 'mode', 'Tomorrow', 'Galaxy', 'Night', 'Certain sample', 'cruel', 'each', 'notice', 'real', 'Secret', 'Absolutely', 'fine', 'contact', 'Past times', 'still', 'such']

Process finished with exit code 0

2.2 TfidfVectorizer
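
TF-IDF weights a word by how important it is to one document within the corpus. Ignoring the smoothing and L2 row normalization that sklearn applies by default, the basic formula is:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}$$

where tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents that contain t.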


from sklearn.feature_extraction.text import TfidfVectorizer
import jieba


def cut_word(text):
    '''
    Segment Chinese text with jieba and join the words with spaces,
    so that TfidfVectorizer can split the sentence into words
    :param text:
    :return:
    '''
    a = jieba.cut(text)  # jieba.cut returns a generator
    a = ' '.join(list(a))
    return a


def tfidf_demo():
    '''
    Text feature extraction with TF-IDF
    :return:
    '''
    data = ["One is that today is cruel, tomorrow is crueler, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.",
            "We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.",
            "If you only know something in one way, you won't really know it. The secret of knowing the true meaning of things depends on how we relate them to what we know."]
    # 1. Chinese text is segmented and put into data_new
    data_new = [cut_word(sent) for sent in data]
    # 2. Instantiate a converter class
    # stop_words specifies words to exclude; they will not be used as feature words
    transfer = TfidfVectorizer(stop_words=['one kind', 'therefore'])

    # 3. Call fit_transform
    data_final = transfer.fit_transform(data_new)
    print('data_new:\n', data_final.toarray())
    print('Feature name:\n', transfer.get_feature_names())
    return None


if __name__ == '__main__':
    tfidf_demo()

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\14360\AppData\Local\Temp\jieba.cache
Loading model cost 0.721 seconds.
Prefix dict has been built successfully.
data_new:
 [[0.         0.21821789 0.         0.         0.         0.43643578
  0.         0.         0.         0.         0.         0.21821789
  0.         0.21821789 0.         0.         0.         0.
  0.21821789 0.         0.43643578 0.         0.21821789 0.
  0.43643578 0.21821789 0.         0.         0.         0.21821789
  0.21821789 0.         0.         0.21821789 0.        ]
 [0.         0.         0.2410822  0.         0.         0.
  0.2410822  0.2410822  0.2410822  0.         0.         0.
  0.         0.         0.         0.         0.2410822  0.55004769
  0.         0.         0.         0.2410822  0.         0.
  0.         0.         0.48216441 0.         0.         0.
  0.         0.         0.2410822  0.         0.2410822 ]
 [0.15895379 0.         0.         0.63581516 0.47686137 0.
  0.         0.         0.         0.15895379 0.15895379 0.
  0.15895379 0.         0.15895379 0.15895379 0.         0.12088845
  0.         0.15895379 0.         0.         0.         0.15895379
  0.         0.         0.         0.31790758 0.15895379 0.
  0.         0.15895379 0.         0.         0.        ]]
Feature name:
 ['Can't', 'Do not', 'before', 'understand', 'Thing', 'Today', 'Just in', 'Millions of years', 'Issue', 'Depending on', 'only need', 'The day after tomorrow', 'Meaning', 'Gross', 'How', 'If', 'universe', 'We', 'give up', 'mode', 'Tomorrow', 'Galaxy', 'Night', 'Certain sample', 'cruel', 'each', 'notice', 'real', 'Secret', 'Absolutely', 'fine', 'contact', 'Past times', 'still', 'such']

Process finished with exit code 0

3. Feature preprocessing

3.1 normalization
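
Min-max normalization maps each column to the range given by feature_range (default [0, 1]). Writing the target range as [mn, mx], the formula used in the code comment below is:

$$X' = \frac{x - \min}{\max - \min}, \qquad X'' = X' \times (mx - mn) + mn$$

where min and max are the minimum and maximum of the column.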


Code:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd
def minmax_demo():
    '''
    Normalization (min-max scaling): not very robust; the result is distorted
    when the maximum or minimum value is abnormal (or missing)
    :return:
    '''
    data = pd.read_csv('dating.txt')
    data = data.iloc[:, :3]  # Take the first three columns; the last column is not needed
    # print('data:\n', data)

    # Instantiate a converter class; feature_range sets the output range, default is [0, 1]
    # Formula: (x - min) / (max - min), where min and max are the column minimum and maximum
    transfer = MinMaxScaler(feature_range=[2, 3])
    data_new = transfer.fit_transform(data)
    print('data_new:\n', data_new)
    return None


if __name__ == '__main__':
    minmax_demo()

data_new:
 [[2.44832535 2.39805139 2.56233353]
 [2.15873259 2.34195467 2.98724416]
 [2.28542943 2.06892523 2.47449629]
 ...
 [2.29115949 2.50910294 2.51079493]
 [2.52711097 2.43665451 2.4290048 ]
 [2.47940793 2.3768091  2.78571804]]

Process finished with exit code 0

3.2 standardization
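
Standardization (z-score scaling) transforms each column to zero mean and unit standard deviation:

$$x' = \frac{x - \mathrm{mean}}{\sigma}$$

where the mean and the standard deviation σ are computed per column.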

from sklearn.preprocessing import StandardScaler
import pandas as pd


def stand_demo():
    '''
    Standardization: (x - mean) / std; more robust, since a small number of outliers
    has little effect on the mean and standard deviation of a large sample
    :return:
    '''
    data = pd.read_csv('dating.txt')
    data = data.iloc[:, :3]

    # Instantiate a converter class
    transfer = StandardScaler()

    data_new = transfer.fit_transform(data)
    print('data_new:\n', data_new)
    return None


if __name__ == '__main__':
    stand_demo()

data_new:
 [[ 0.33193158  0.41660188  0.24523407]
 [-0.87247784  0.13992897  1.69385734]
 [-0.34554872 -1.20667094 -0.05422437]
 ...
 [-0.32171752  0.96431572  0.06952649]
 [ 0.65959911  0.60699509 -0.20931587]
 [ 0.46120328  0.31183342  1.00680598]]

Process finished with exit code 0

4. Feature dimension reduction

4.1 Filtering low-variance features

Calculation formula of correlation coefficient:
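
This is the Pearson correlation coefficient, whose value lies between -1 and 1:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where x̄ and ȳ are the sample means of the two variables.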


The first value returned by pearsonr in the scipy library is this correlation coefficient (the second value is the p-value).

from sklearn.feature_selection import VarianceThreshold
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr


# Feature dimensionality reduction works on 2-D arrays (samples x features)
# It reduces the number of features to obtain a set of uncorrelated principal variables
def variance_demo():
    '''
    Filter low-variance features
    :return:
    '''
    # 1. Access to data
    data = pd.read_csv('factor_returns.csv')
    print('data:\n', data)
    data = data.iloc[:, 1:-2]  # Keep the feature columns: drop the index column and the last two columns (date and return)
    # 2. Instantiate a converter class
    transfer = VarianceThreshold()
    # transfer = VarianceThreshold(threshold=10)  # Set a threshold to filter out features whose variance is below 10; the default threshold is 0
    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print('data_new:\n', data_new, data_new.shape)
    # Calculate the correlation coefficient between two variables
    r1 = pearsonr(data['pe_ratio'], data['pb_ratio'])
    print('correlation coefficient:\n', r1)  # The first value returned is the correlation coefficient, the second is the p-value
    r2 = pearsonr(data['revenue'], data['total_expense'])
    print('correlation coefficient:\n', r2)

    # Draw a picture
    plt.figure(figsize=(20, 8), dpi=100)  # Set size
    plt.scatter(data['revenue'], data['total_expense'])  # Draw scatter chart, set x,y coordinates
    plt.show()  # Show the figure

    return None


if __name__ == '__main__':
    variance_demo()

data:
             index  pe_ratio  pb_ratio  ...  total_expense        date    return
0     000001.XSHE    5.9572    1.1818  ...   1.088254e+10  2012-01-31  0.027657
1     000002.XSHE    7.0289    1.5880  ...   2.378348e+10  2012-01-31  0.082352
2     000008.XSHE -262.7461    7.0003  ...   1.203008e+07  2012-01-31  0.099789
3     000060.XSHE   16.4760    3.7146  ...   7.935543e+09  2012-01-31  0.121595
4     000069.XSHE   12.5878    2.5616  ...   7.091398e+09  2012-01-31 -0.002681
...           ...       ...       ...  ...            ...         ...       ...
2313  601888.XSHG   25.0848    4.2323  ...   1.041419e+10  2012-11-30  0.060727
2314  601901.XSHG   59.4849    1.6392  ...   1.089783e+09  2012-11-30  0.179148
2315  601933.XSHG   39.5523    4.0052  ...   1.749295e+10  2012-11-30  0.137134
2316  601958.XSHG   52.5408    2.4646  ...   6.009007e+09  2012-11-30  0.149167
2317  601989.XSHG   14.2203    1.4103  ...   4.132842e+10  2012-11-30  0.183629

[2318 rows x 12 columns]
data_new:
 [[ 5.95720000e+00  1.18180000e+00  8.52525509e+10 ...  2.01000000e+00
   2.07014010e+10  1.08825400e+10]
 [ 7.02890000e+00  1.58800000e+00  8.41133582e+10 ...  3.26000000e-01
   2.93083692e+10  2.37834769e+10]
 [-2.62746100e+02  7.00030000e+00  5.17045520e+08 ... -6.00000000e-03
   1.16798290e+07  1.20300800e+07]
 ...
 [ 3.95523000e+01  4.00520000e+00  1.70243430e+10 ...  2.20000000e-01
   1.78908166e+10  1.74929478e+10]
 [ 5.25408000e+01  2.46460000e+00  3.28790988e+10 ...  1.21000000e-01
   6.46539204e+09  6.00900728e+09]
 [ 1.42203000e+01  1.41030000e+00  5.91108572e+10 ...  2.47000000e-01
   4.50987171e+10  4.13284212e+10]] (2318, 9)
correlation coefficient:
 (-0.004389322779936271, 0.8327205496564927)
correlation coefficient:
 (0.9958450413136115, 0.0)

pe_ratio and pb_ratio are essentially uncorrelated (r ≈ -0.004), while revenue and total_expense are strongly correlated (r ≈ 0.996), which the scatter plot also confirms.

4.2 principal component analysis



from sklearn.decomposition import PCA


def pca_demo():
    '''
    PCA Dimension reduction
    :return:
    '''
    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
    # 1. Instantiate a converter class
    # n_components:
    #       a decimal keeps that fraction of the information (variance)
    #       an integer is the number of features (dimensions) to reduce to
    transfer = PCA(n_components=2)
    # 2. Call fit_transform
    data_new = transfer.fit_transform(data)
    print('data_new:\n', data_new)
    return None

if __name__ == '__main__':
    pca_demo()

data_new:
 [[ 1.28620952e-15  3.82970843e+00]
 [ 5.74456265e+00 -1.91485422e+00]
 [-5.74456265e+00 -1.91485422e+00]]

Process finished with exit code 0
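
As the comment above notes, n_components can also be a decimal. A minimal sketch contrasting the two forms on the same toy data (0.95 is just an illustrative choice):

from sklearn.decomposition import PCA

data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

# A float in (0, 1) keeps enough components to explain that fraction of the variance
transfer = PCA(n_components=0.95)
print(transfer.fit_transform(data))
print(transfer.explained_variance_ratio_)  # variance fraction carried by each kept component

# An integer keeps exactly that many components (dimensions)
transfer = PCA(n_components=3)
print(transfer.fit_transform(data))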
