Before building a machine learning model, we usually perform exploratory analysis on the features we have. Exploratory analysis can be divided into single-factor (univariate) analysis and multi-factor (multivariate) analysis. Single-factor analysis examines one feature at a time; common methods are statistical indicators (mean, median, mode, skewness coefficient, kurtosis coefficient, etc.) and graphical visualization. Multi-factor analysis jointly examines two or more features; common methods include hypothesis tests (such as the t-test, analysis of variance, and the chi-square test), correlation analysis, principal component analysis, factor analysis, and so on.
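As a quick illustration of the single-factor indicators, here is a minimal sketch using pandas (the series values are made-up sample data, not from any dataset used later):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 4, 4, 10])  # made-up sample data
print(s.mean())    # mean
print(s.median())  # median
print(s.mode())    # mode (may return several values)
print(s.skew())    # skewness coefficient
print(s.kurt())    # kurtosis coefficient (pandas reports excess kurtosis)
```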

## 1. Hypothesis test


`scipy.stats.normaltest` is a method to test whether a variable follows a normal distribution, based on skewness and kurtosis.

```python
import numpy as np
from scipy import stats

pts = 1000
np.random.seed(28041990)
a = np.random.normal(0, 1, size=pts)  # random numbers with mean 0 and standard deviation 1
b = np.random.normal(2, 1, size=pts)  # random numbers with mean 2 and standard deviation 1
x = np.concatenate((a, b))            # merge the two arrays

k2, p = stats.normaltest(x)  # k2 is the test statistic, p is the p-value
alpha = 1e-3                 # significance threshold
print("p = {:g}".format(p))
# p = 3.27207e-11
if p < alpha:  # null hypothesis: x comes from a normal distribution
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")
```

### 1.1 t-test

The two-sample t-test is mainly used to check whether the means of two groups are consistent, i.e., whether the two groups differ significantly.

```python
from scipy import stats as ss

# Two samples drawn from the same standard normal distribution
print(ss.ttest_ind(ss.norm.rvs(size=10), ss.norm.rvs(size=20)))
# Ttest_indResult(statistic=1.9250976356002707, pvalue=0.06443061130874687)

# Two samples drawn from clearly different distributions
print(ss.ttest_ind(ss.norm.rvs(size=10), ss.norm.rvs(loc=1, scale=0.1, size=20)))
# Ttest_indResult(statistic=-3.3034115592617534, pvalue=0.002617523871754732)
```

### 1.2 Chi-square test

The chi-square test, also called the four-fold (2×2 contingency) table test, is mainly used to test whether two categorical factors are strongly related. For example, let's see whether there is a relationship between gender and wearing makeup.

H0: there is no relationship between gender and wearing makeup

H1: there is a relationship between gender and wearing makeup

The computed chi-square statistic is 129.3, which is far greater than the critical value of 3.841 (the chi-square value for 1 degree of freedom at the 0.05 significance level), so we reject the null hypothesis and accept the alternative: there is a strong relationship between gender and whether a person wears makeup.

```python
from scipy import stats

# 2x2 contingency table: rows are gender, columns are makeup / no makeup
k2, p, dof, expected = stats.chi2_contingency([[15, 95], [85, 5]], correction=False)
print(k2, p)
# k2 = 129.29292929292927, p = 5.8513140262808924e-30
```

### 1.3 Analysis of variance

SST: total sum of squares (the total variation)

SSM: between-group sum of squares (also called the model sum of squares)

SSE: within-group sum of squares (the residual sum of squares)
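These quantities satisfy SST = SSM + SSE. For $k$ groups and $n$ total observations, the standard one-way ANOVA test statistic (stated here for completeness, not taken from the original text) is

$$F = \frac{SSM / (k-1)}{SSE / (n-k)} \sim F(k-1,\; n-k)$$

A large $F$ (small p-value) means the between-group variation dominates the within-group variation.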

H0: there is no difference in the average life of the three batteries

H1: there is a difference in the average life of the three batteries

The p-value is less than the significance level, so the null hypothesis is rejected; that is, the average lives of the three batteries are different.

```python
from scipy import stats as ss

# One-way ANOVA on the lifetimes of three battery batches
print(ss.f_oneway([49, 50, 39, 40, 43], [28, 32, 30, 26, 34], [38, 40, 45, 42, 48]))
```

### 1.4 Q-Q plot

```python
from scipy import stats as ss
from statsmodels.graphics.api import qqplot
from matplotlib import pyplot as plt

# Q-Q plot of 100 samples drawn from a standard normal distribution
qqplot(ss.norm.rvs(size=100))
plt.show()
```

If the sample comes from a normal distribution, the points (theoretical normal quantiles vs. sample quantiles) fall on the diagonal.

## 2. Correlation coefficient

### 2.1 Pearson correlation coefficient
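As a reminder (this is the standard definition, not part of the original snippet), the Pearson coefficient of two variables $X$ and $Y$ is

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$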

```python
import pandas as pd

s1 = pd.Series([0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5])
s2 = pd.Series([0.5, 0.4, 1.2, 2.5, 1.1, 0.7, 0.1])

# Pearson correlation between the two series (the default method of corr)
print(s1.corr(s2))

# Put the series in columns so that DataFrame.corr() compares them pairwise
df = pd.DataFrame({"s1": s1, "s2": s2})
print(df.corr())
```

### 2.2 Spearman correlation coefficient
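The Spearman coefficient is the Pearson coefficient computed on ranks; when there are no ties it reduces (a standard result, added here for context) to

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the rank difference of the $i$-th observation and $n$ is the number of observations.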

```python
import pandas as pd

# Same two series as above, arranged as columns
df = pd.DataFrame({"s1": [0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5],
                   "s2": [0.5, 0.4, 1.2, 2.5, 1.1, 0.7, 0.1]})
print(df.corr(method="spearman"))
```

## 3. Composite analysis

### 3.1 cross analysis

(1) Hypothesis-test method: use the HR_data.csv data to check whether the employee turnover rate differs between departments.

```python
# To see whether the turnover rate differs between two departments, use a t-test
import pandas as pd
import numpy as np
import scipy.stats as ss
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_context(context="poster", font_scale=1.2)
df = pd.read_csv("./data/HR_data.csv")
```

```python
dp_indices = df.groupby(by="department").indices
sales_values = df["left"].iloc[dp_indices["sales"]].values
technical_values = df["left"].iloc[dp_indices["technical"]].values
print(ss.ttest_ind(sales_values, technical_values))

# Pairwise t-tests over all departments
dp_keys = list(dp_indices.keys())
dp_t_mat = np.zeros((len(dp_keys), len(dp_keys)))
for i in range(len(dp_keys)):
    for j in range(len(dp_keys)):
        p_value = ss.ttest_ind(df["left"].iloc[dp_indices[dp_keys[i]]].values,
                               df["left"].iloc[dp_indices[dp_keys[j]]].values)[1]
        if p_value < 0.05:
            dp_t_mat[i][j] = -1  # reject H0: the two departments' turnover rates differ
        else:
            dp_t_mat[i][j] = p_value  # fail to reject H0

sns.heatmap(dp_t_mat, xticklabels=dp_keys, yticklabels=dp_keys)
plt.show()
```

As shown in the figure above, a black cell (value -1) indicates a significant difference in turnover rate between the corresponding pair of departments.

(2) Pivot table method

```python
# Mean turnover rate, with department/salary as rows and time_spend_company as columns
piv_tb = pd.pivot_table(df, values="left", index=["department", "salary"],
                        columns=["time_spend_company"], aggfunc=np.mean)

# Other aggregations: sums or counts instead of means
# piv_tb = pd.pivot_table(df, values="left", index=["department", "salary"],
#                         columns=["time_spend_company"], aggfunc=np.sum)
# piv_tb = pd.pivot_table(df, values="left", index=["department", "salary"],
#                         columns=["time_spend_company"], aggfunc=len)
```

### 3.2 group analysis

(1) Discrete values

sns.barplot(x="salary",y="left",hue="department",data=df) plt.show() #Group by department (legend) hue parameter, salary is x-axis

(2) Continuous values are first binned into groups and then aggregated. Common criteria for choosing the bins are:

- inflection points (where the second derivative changes sign)
- clustering
- the Gini coefficient

The plot below sorts the values so that inflection points can be spotted; a concrete binning sketch follows it.

sl_s=df["satisfaction_level"] sns.barplot(range(len(sl_s)),sl_s.sort_values())
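As a minimal binning sketch continuing with the HR DataFrame from above (the three-bin split and the labels are arbitrary choices, not from the original analysis), equal-frequency bins can be built with `pd.qcut` and then aggregated:

```python
# Bin satisfaction_level into 3 equal-frequency groups, then aggregate turnover per group
df["sl_group"] = pd.qcut(df["satisfaction_level"], q=3, labels=["low", "mid", "high"])
print(df.groupby("sl_group")["left"].mean())
```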

### 3.3 factor analysis

(1) Exploratory factor analysis

By analyzing the covariance matrix, we can study the essential structure of multivariate data, transform it to reduce its dimensionality, and extract the most important factors that affect the target attribute. Principal component analysis (PCA) is a typical example.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

sns.set_context(context="poster", font_scale=1.2)
df = pd.read_csv("./data/HR.csv")

# Correlation heat map of the features
sns.heatmap(df.corr(), vmax=1, vmin=-1)
plt.show()

# PCA dimensionality reduction (drop the categorical and target columns first)
my_pca = PCA(n_components=7)
lower_mat = my_pca.fit_transform(df.drop(labels=["salary", "department", "left"], axis=1).values)
print(my_pca.explained_variance_ratio_)

# The principal components are mutually uncorrelated:
# sns.heatmap(pd.DataFrame(lower_mat).corr())
# plt.show()
```

The reduced-dimension matrix makes all components orthogonal: the correlation coefficient between any two different components is 0 (only each component's correlation with itself is 1).

(2) Confirmatory factor analysis

Tests whether the relationship between a factor and its corresponding measurement items is consistent with the theoretical relationship designed by the researcher.
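The original gives no snippet for this step. As a minimal sketch, the third-party factor_analyzer package provides a ConfirmatoryFactorAnalyzer; the two-factor model specification, file name, and column names below are purely hypothetical:

```python
import pandas as pd
from factor_analyzer import ConfirmatoryFactorAnalyzer, ModelSpecificationParser

# Hypothetical data: six measurement items assumed to load on two factors
df = pd.read_csv("./data/survey_items.csv")  # hypothetical file with columns x1..x6
model_dict = {"F1": ["x1", "x2", "x3"],      # theoretical assignment of items to factors
              "F2": ["x4", "x5", "x6"]}
model_spec = ModelSpecificationParser.parse_model_specification_from_dict(df, model_dict)

cfa = ConfirmatoryFactorAnalyzer(model_spec, disp=False)
cfa.fit(df.values)
print(cfa.loadings_)  # compare the fitted loadings with the designed structure
```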