This chapter is a summary of data cleaning experience, involving the following functions: preview, exception value processing, data type conversion, string operation, row and column selection, rule judgment by defining functions, etc., which is still code + comment + summary.
Task objective: preprocess all kinds of data in the open income table to meet the analysis requirements.
Target disassembly: open the way and bridge the water until the data meets the available standards.
#Only show ideas and simplify the code
import os,sys import numpy as np import pandas as pd import openpyxl import csv import xlwt #Traverse the folder and output all file paths and names under the folder def walk(path): if not os.path.exists(path): return -1 for root,dirs,names in os.walk(path): for filename in names: if os.path.splitext(filename) == '.csv': docs = os.path.join(root,filename) print(docs) csvlist.append(docs) #Define function implementation rule judgment def pro(a): if "Inner Mongolia" in a or "Shanxi" in a: return "Jin Meng Theater" elif "Beijing" in a or "Tianjin" in a or "Shandong" in a or "Jin Meng" in a or "Hebei" in a: return a else: return "Other" csvlist =  cur_path = os.getcwd() walk(cur_path) for doc in csvlist: df = pd.read_csv(doc,encoding = "gbk") df = df[df['Self employment contract'].isin()][df['External single internal B business'].isin()][["Month","Customer name","Primary Theater","Merchant channel name","Self employment contract","External single internal B business","Single volume","weight","income","cost"]] df.head(10) df.dtypes() df.info() df["Merchant channel name"] = df["Merchant channel name"].astype(str) df = df[~df["Merchant channel name"].str.contains('Cold chain|fresh|medicine')] df["Attribution Theater"] = df.apply(lambda df:pro(df.Primary Theater),axis=1) df[["income","cost"]].fillna(0) df.groupby([["Month","Primary Theater"]]).sum().reset_index() df["Single income"] = df[["income","Single volume"]].apply(lambda df:df["income"]/df["Single volume"],axis=1) df["Single kg income"] = df[["income","weight"]].apply(lambda df:df["income"]/df["weight"],axis=1) columns =["Month","Primary Theater","Single volume","weight","income","Single income","Single kg income"] df.to_csv("Aggregate data.csv",columns =columns,encoding = "gbk")
Traversing the folder and reading the csv part are the same as the previous one, which will not be covered here.
df.head(10)#Preview the first 10 rows of data df.dtypes()#Show data types for each field df.info()#Display information of each field, including number of digits, data type, etc df.shape()#Show row and column shape of data file #In addition, numpy supports reshape() to change the row column structure when creating arrays
Duplicate value de duplication, missing value filling, deviation value adjustment
Duplicate value de duplication: df.drop_duplicates(subset=, keep='first', inplace=False) #The subset parameter is the selected column, the keep parameter determines the reserved value, and inplace determines whether to replace it //Missing value handling: df.isnull()#Judge whether it is a missing value, and output a column of True or False Boolean df.dropna()#Remove the missing value. At this time, the whole row in the data table is removed. Other functions can be realized by adjusting parameters df.dropna(how="all",axis=1)#Remove all columns with empty values df.fillna(0)#Fill missing values with 0
About the parameters of fillna() function, @ Denver_Liao's article is very detailed:
The adjustment of deviation value involves deeper statistical knowledge, which I don't know very well and don't discuss;
Data type conversion
When operating on the value of a row or column, there are requirements for the data type of the column. For example, if a column has multiple data types, pandas will report an error when reading csv. At this time, you need to specify the data type of the read column.
There are two ways to convert data types: using library function conversion or using custom function conversion
- Through the astype function and pandas built-in function
df.dtypes()#View field data types df.astype(str)#Convert to string df.astype(float)#Convert to floating-point number (requires a full number string, cannot contain special characters) df.astype(int)#Convert to integer pd.to_numeric(df["income"],errors='coerce').fillna(0)#Convert to numeric, unrecognized string pd.to_datetime(df[['Month', 'Day', 'Year']])#Merge mm / DD / yyyy
- Through custom functions
Define function method: def convert(a): b = a.replace(r'[^\w\s]+', '') return float(b) #Remove the special symbols in column a, and then convert them to float floating-point numbers. string library support is required //Shortcut: UsingapplyFunctions, etc. df["Single volume"] = df["Single volume"].str.replace(r'[^\w\s]+', '').astype("int") df["Amount of money"].apply(lambda x: x.replace("￥","")).astype("float64")
About data type conversion, this blog is good: https://www.cnblogs.com/onemorepoint/p/9404753.html
This paper does not discuss the conversion of time and date data types, and will have the opportunity to add another one in the future
Split and merge strings, adjust and replace strings, etc
val.split('), for val press, split;
pieces = [x.strip() for x in val.split(',')]split and strip are used to clear spaces;
See the data type conversion replace section for string replacement
df = df[df['Self employment contract'].isin()][df['External single internal B business'].isin()][["Month","Customer name","income","cost"]] df.isin(["a","b")Function to select a specific column value that contains a perhaps b Row; df = df[~df["Merchant channel name"].str.contains('Cold chain|fresh|medicine')]#Filter the row containing the keyword through the str.contains() function, or represented by a "|" df = df[df["Self employment contract"] == "0"]#Judge by Boolean value df.iloc[[0:5],[0:5]]Selection6×6Cells in the range, first row and then column, only: select all
User defined function rule judgment
It is a simple and effective way to show several examples:
list1 = ["Beijing Theater","Hebei Theater","Shandong Theater","Tianjin Theater","Jin Meng Theater","Other"] def pro(a): if "Inner Mongolia" in a or "Shanxi" in a: return "Jin Meng Theater" elif a in a list1: return a else: return "Other"
#Bugs to avoid 0 / 0: def div(a,b): if b == 0 or a == 0: return 0 else: return a/b
User defined function implementation rule is a very effective means, and it is very simple and convenient to apply. It can be realized by applying & lambda function. For example, df ["home theater"] = df.apply(lambda df:pro(df. First level theater), axis=1). If the parameter is a column, you need to set the parameter by df. Column name
The above is this sharing. It's scattered and talked a lot. I hope you can avoid detours~