Exploratory data analysis of multiple factors with basic statistics

Before building a machine learning model, we usually run exploratory analysis on the features we have. Exploratory analysis can be divided into single-factor analysis and multi-factor analysis. Single-factor analysis looks at one feature at a time, typically through summary statistics (mean, median, mode, skewness coefficient, kurtosis coefficient, etc.) and graphical visualization. Multi-factor analysis studies two or more features jointly; common methods include hypothesis tests (such as the t test, analysis of variance and the chi-square test), correlation analysis, principal component analysis, factor analysis and so on.
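For example, the single-factor statistics listed above can be read directly off a pandas Series (a minimal sketch on an arbitrary series, not data from the original text):

import pandas as pd
s = pd.Series([0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5])
print(s.mean())    ## mean
print(s.median())  ## median
print(s.mode())    ## mode(s)
print(s.skew())    ## skewness coefficient
print(s.kurt())    ## kurtosis coefficient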

1. Hypothesis test


## Test whether a variable follows a normal distribution; normaltest combines skewness and kurtosis.
import pandas as pd
import numpy as np
from scipy import stats
pts = 1000
np.random.seed(28041990)
a = np.random.normal(0, 1, size=pts) ## sample with mean 0 and standard deviation 1
b = np.random.normal(2, 1, size=pts) ## sample with mean 2 and standard deviation 1
x = np.concatenate((a, b)) ## merge the two samples
k2, p = stats.normaltest(x) ## k2 is the test statistic, p is the p value
alpha = 1e-3 ## significance level
print("p = {:g}".format(p))
## out: p = 3.27207e-11
if p < alpha: # null hypothesis: x comes from a normal distribution
	print("The null hypothesis can be rejected")
else:
	print("The null hypothesis cannot be rejected")


1.1 t test

The t test is mainly used to check whether the means of two groups differ significantly (assuming the samples come from normal distributions).

import pandas as pd
import numpy as np
from scipy import stats as ss
ss.ttest_ind(ss.norm.rvs(size=10), ss.norm.rvs(size=20))
##out:Ttest_indResult(statistic=1.9250976356002707, pvalue=0.06443061130874687)
ss.ttest_ind(ss.norm.rvs(size=10), ss.norm.rvs(loc=1,scale=0.1,size=20))
## out:Ttest_indResult(statistic=-3.3034115592617534, pvalue=0.002617523871754732)
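By default ttest_ind assumes the two groups have equal variances; if that assumption is doubtful, passing equal_var=False gives Welch's t test (a small sketch, not part of the original code):

from scipy import stats as ss
ss.ttest_ind(ss.norm.rvs(size=10), ss.norm.rvs(loc=1, scale=0.1, size=20), equal_var=False) ## Welch's t test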

1.2 Chi-square test

The chi-square test, also known as the fourfold-table test when applied to a 2x2 contingency table, is mainly used to test whether two categorical factors are associated. As an example, consider whether there is a relationship between gender and wearing makeup:
H0: there is no relationship between gender and makeup
H1: there is a relationship between gender and makeup



The chi-square statistic is 129.3, which is far larger than the critical value of 3.841 (significance level 0.05 with 1 degree of freedom), so we reject the null hypothesis and accept the alternative: there is a strong relationship between gender and whether a person wears makeup.

import pandas as pd
import numpy as np
from scipy import stats
k2, p, _, _ = stats.chi2_contingency([[15, 95], [85, 5]], False) ## False disables Yates' continuity correction
## out: k2=129.29292929292927, p=5.8513140262808924e-30
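The critical value 3.841 mentioned above can be obtained from the chi-square distribution itself (a small sketch; 1 degree of freedom for a 2x2 table):

from scipy import stats
print(stats.chi2.ppf(0.95, df=1)) ## ~3.841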

1.3 Analysis of variance (ANOVA)


SST: total sum of squares (total variation)
SSM: between-group sum of squares (also called the model sum of squares)
SSE: within-group sum of squares (residual sum of squares)
SST = SSM + SSE, and the F statistic is the ratio of the between-group mean square to the within-group mean square.


H0: there is no difference in the average life of the three batteries
H1: there is a difference in the average life of the three batteries

The p value is less than the significance level, so the null hypothesis is rejected; that is, the average life of the three batteries differs.

from scipy import stats as ss
ss.f_oneway([49, 50, 39, 40, 43], [28, 32, 30, 26, 34], [38, 40, 45, 42, 48]) ## returns the F statistic and the p value
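The F statistic returned by f_oneway can be reproduced from the sums of squares defined above (a minimal sketch on the same battery data):

import numpy as np
groups = [[49, 50, 39, 40, 43], [28, 32, 30, 26, 34], [38, 40, 45, 42, 48]]
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
ssm = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)   ## between-group sum of squares
sse = sum(((np.array(g) - np.mean(g)) ** 2).sum() for g in groups)   ## within-group (residual) sum of squares
f = (ssm / (len(groups) - 1)) / (sse / (len(all_obs) - len(groups))) ## F = mean square between / mean square within
print(f)  ## matches the statistic from ss.f_oneway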

1.4 Q-Q plot

from scipy import stats as ss
from statsmodels.graphics.api import qqplot
from matplotlib import pyplot as plt
qqplot(ss.norm.rvs(size=100)) ## Q-Q plot of a standard normal sample
plt.show()


If the sample comes from a normal distribution, the theoretical quantiles and the sample quantiles fall close to the diagonal of the Q-Q plot.

2 correlation coefficient

2.1 Pearson correlation coefficient

import pandas as pd
s = pd.Series([0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5])
s2 = pd.Series([0.5, 0.4, 1.2, 2.5, 1.1, 0.7, 0.1])
df = pd.DataFrame([s, s2]).T ## transpose so that each variable is a column; corr() correlates columns
## correlation analysis
print(s.corr(s2)) ## Pearson is the default method
print(df.corr())

2.2 Spearman correlation coefficient

import pandas as pd
df = pd.DataFrame([[0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5], [0.5, 0.4, 1.2, 2.5, 1.1, 0.7, 0.1]]).T ## variables as columns
df.corr(method="spearman")
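If a significance test of the coefficient is also needed, scipy returns the coefficient together with a p value (a small sketch on the same two series, not part of the original code):

from scipy import stats as ss
x = [0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5]
y = [0.5, 0.4, 1.2, 2.5, 1.1, 0.7, 0.1]
print(ss.pearsonr(x, y))   ## Pearson coefficient and p value
print(ss.spearmanr(x, y))  ## Spearman coefficient and p value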

3 composite analysis

3.1 cross analysis

(1) Hypothesis-test method: use the HR_data.csv data to check whether the employee turnover rate differs between departments.

## Use a t test to check whether the turnover rate differs between two departments.
import pandas as pd
import numpy as np
import scipy.stats as ss
import seaborn as sns
sns.set_context(context="poster",font_scale=1.2)
import matplotlib.pyplot as plt
df=pd.read_csv("./data/HR_data.csv")

dp_indices=df.groupby(by="department").indices
sales_values=df["left"].iloc[dp_indices["sales"]].values
technical_values=df["left"].iloc[dp_indices["technical"]].values
print(ss.ttest_ind(sales_values,technical_values))
dp_keys=list(dp_indices.keys())
dp_t_mat=np.zeros((len(dp_keys),len(dp_keys)))

for i in range(len(dp_keys)):
    for j in range(len(dp_keys)):
        p_value = ss.ttest_ind(df["left"].iloc[dp_indices[dp_keys[i]]].values,
                               df["left"].iloc[dp_indices[dp_keys[j]]].values)[1]
        if p_value < 0.05:
            dp_t_mat[i][j] = -1  ## reject the null hypothesis: the turnover rates of the two departments differ
        else:
            dp_t_mat[i][j] = p_value  ## accept the null hypothesis
sns.heatmap(dp_t_mat, xticklabels=dp_keys, yticklabels=dp_keys)
plt.show()


As shown in the heatmap above, the dark cells (value -1) mark department pairs whose turnover rates differ significantly.
(2) Pivot table method

piv_tb=pd.pivot_table(df, values="left", index=["department", "salary"], columns=["time_spend_company"],aggfunc=np.mean)
#piv_tb=pd.pivot_table(df, values="left", index=["department", "salary"], columns=["time_spend_company"],aggfunc=np.sum)
#piv_tb=pd.pivot_table(df, values="left", index=["department", "salary"], columns=["time_spend_company"],aggfunc=len)
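Continuing the snippet above, the pivot table can be drawn as a heatmap to make it easier to read (a small sketch, not from the original text; vmin/vmax are set to [0, 1] because "left" is a 0/1 indicator and aggfunc is the mean):

sns.heatmap(piv_tb, vmin=0, vmax=1)
plt.show()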


3.2 group analysis

(1) Discrete value

sns.barplot(x="salary",y="left",hue="department",data=df)
plt.show()   ## salary on the x axis, grouped by department via the hue parameter

(2) Continuous values are first binned (discretized) and then aggregated. The binning can be based on:
inflection points (turning points of the distribution, e.g. located via the second derivative)
clustering
the Gini coefficient
(a binning sketch follows the bar plot below)

sl_s = df["satisfaction_level"]
sns.barplot(x=list(range(len(sl_s))), y=sl_s.sort_values()) ## sorted values make turning points easier to spot
plt.show()
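Continuing the snippet above, a minimal binning sketch: bin satisfaction_level with pd.cut (pd.qcut would give quantile-based bins instead) and aggregate the turnover rate per bin. The column name "sl_bin" and the choice of 5 equal-width bins are illustrative assumptions:

df["sl_bin"] = pd.cut(df["satisfaction_level"], bins=5)  ## equal-width binning
print(df.groupby("sl_bin")["left"].mean())  ## turnover rate per satisfaction bin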

3.3 factor analysis


(1) Exploratory factor analysis
By analyzing the covariance (correlation) matrix we can study the underlying structure of the multivariate data, transform it and reduce its dimensionality, and extract the factors that matter most for the target attribute; principal component analysis is a typical example.

import pandas as pd
import numpy as np
import scipy.stats as ss
import seaborn as sns
sns.set_context(context="poster",font_scale=1.2)
import matplotlib.pyplot as plt
import math
from sklearn.decomposition import PCA
df=pd.read_csv("./data/HR.csv")
#Correlation diagram
sns.heatmap(df.corr())
sns.heatmap(df.corr(), vmax=1, vmin=-1)
plt.show()
#PCA dimension reduction
my_pca=PCA(n_components=7)
lower_mat=my_pca.fit_transform(df.drop(labels=["salary","department","left"],axis=1).values)
print(my_pca.explained_variance_ratio_)
#sns.heatmap(pd.DataFrame(lower_mat).corr())
#plt.show()


After dimension reduction the components are mutually orthogonal, so the off-diagonal correlation coefficients are 0 and only the diagonal of the correlation matrix equals 1.

(2) Confirmatory factor analysis
Tests whether the relationship between a factor and its measured indicator variables is consistent with the theoretical relationship hypothesized by the researcher.
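A minimal confirmatory sketch, assuming the semopy package is available (it is not used in the original text) and that HR.csv contains the columns last_evaluation and number_project; the factor name "Engagement" is purely hypothetical:

import pandas as pd
import semopy
df = pd.read_csv("./data/HR.csv")
desc = "Engagement =~ satisfaction_level + last_evaluation + number_project"  ## hypothesized factor structure
model = semopy.Model(desc)
model.fit(df)
print(model.inspect())  ## estimated loadings and their significance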
