Python Data Analysis & Office Automation: Data Preprocessing / Data Cleaning

Brief introduction
This chapter summarizes data-cleaning experience and covers the following functions: previewing data, handling abnormal values, converting data types, string operations, selecting rows and columns, rule judgment with user-defined functions, and so on. As before, it is code + comments + summary.
Task objective: preprocess all kinds of data in the income table so that it meets the analysis requirements.
Target breakdown: build roads over the mountains and bridges over the water until the data meets the usable standard.

#Code display#
#Only the main ideas are shown; the code is simplified

import os,sys
import numpy as np
import pandas as pd
import openpyxl
import csv
import xlwt

#Traverse the folder and output all file paths and names under the folder
def walk(path):
    if not os.path.exists(path):
        return -1
    for root,dirs,names in os.walk(path):
        for filename in names:
            if os.path.splitext(filename)[1] == '.csv':
                docs = os.path.join(root,filename)
                print(docs)
                csvlist.append(docs)
#Define function implementation rule judgment
def pro(a):
    if "Inner Mongolia" in a or "Shanxi" in a:
        return "Jin Meng Theater"
    elif "Beijing" in a or "Tianjin" in a or "Shandong" in a or "Jin Meng" in a or "Hebei" in a:
        return a
    else:
        return "Other"

csvlist = [] 
cur_path = os.getcwd()
walk(cur_path)

for doc in csvlist:
    df = pd.read_csv(doc,encoding = "gbk")
    #Keep only rows where both flag columns equal 0, then keep the needed columns
    df = df[(df['Self employment contract'].isin([0])) & (df['External single internal B business'].isin([0]))]
    df = df[["Month","Customer name","Primary Theater","Merchant channel name","Self employment contract","External single internal B business","Single volume","weight","income","cost"]]
    df.head(10)
    df.dtypes
    df.info()
    df["Merchant channel name"] = df["Merchant channel name"].astype(str)
    df = df[~df["Merchant channel name"].str.contains('Cold chain|fresh|medicine')]
    df["Attribution Theater"] = df.apply(lambda row: pro(row["Primary Theater"]), axis=1)
    df[["income","cost"]] = df[["income","cost"]].fillna(0)
    df = df.groupby(["Month","Primary Theater"]).sum(numeric_only=True).reset_index()
    df["Single income"] = df.apply(lambda row: row["income"]/row["Single volume"], axis=1)
    df["Single kg income"] = df.apply(lambda row: row["income"]/row["weight"], axis=1)
    columns =["Month","Primary Theater","Single volume","weight","income","Single income","Single kg income"]
    df.to_csv("Aggregate data.csv",columns =columns,encoding = "gbk")

#Code disassembly#
Traversing the folder and reading the csv files are the same as in the previous article, so they are not covered again here.
Data preview

df.head(10)#Preview the first 10 rows of data
df.dtypes#Show the data type of each field (an attribute, not a method)
df.info()#Show information about each field, including non-null counts and data types
df.shape#Show the (rows, columns) shape of the data file (also an attribute)
#In addition, numpy arrays support reshape() to change the row/column structure
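
As a quick aside, a minimal reshape() sketch (the toy array is just for illustration):

import numpy as np
arr = np.arange(12)          #12 elements: 0..11
arr2 = arr.reshape(3, 4)     #Rearrange into 3 rows x 4 columns
print(arr2.shape)            #(3, 4)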

Abnormal value handling
Removing duplicate values, filling missing values, adjusting outliers

Removing duplicate values:
df.drop_duplicates(subset=[], keep='first', inplace=False)
#subset selects the columns to compare, keep decides which duplicate to retain, and inplace decides whether to modify the DataFrame in place

Missing value handling:
df["column"].isnull()#Returns a Boolean Series of True/False marking the missing values
df.dropna()#Drop the rows that contain missing values; other behaviors can be obtained by adjusting the parameters
df.dropna(how="all",axis=1)#Drop the columns whose values are all empty
df.fillna(0)#Fill missing values with 0
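
A minimal runnable sketch of the two operations above (the small DataFrame and its values are made up for illustration); note that drop_duplicates() and fillna() return new objects, so assign the result back (or pass inplace=True):

import pandas as pd

demo = pd.DataFrame({"income": [10, 10, None], "cost": [3, 3, 5]})
demo = demo.drop_duplicates(subset=["income", "cost"], keep='first')   #Drop the repeated row
demo[["income", "cost"]] = demo[["income", "cost"]].fillna(0)          #Fill NaN with 0 and assign back
print(demo)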

@Denver_Liao's article explains the parameters of the fillna() function in detail:

https://blog.csdn.net/weixin_39549734/article/details/81221276

Adjusting outliers involves deeper statistical knowledge that I am not very familiar with, so it is not discussed here.

Data type conversion
When operating on the values of a row or column, the column's data type matters. For example, if a column contains a mixture of data types, pandas may warn or mis-infer the type when reading the csv, so you need to specify the data type of that column at read time.
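
For example, a minimal sketch of fixing a column's type at read time with the dtype parameter of pd.read_csv (the file name and the column choices here are placeholders):

import pandas as pd

df = pd.read_csv("income.csv", encoding="gbk",
                 dtype={"Merchant channel name": str, "Single volume": "float64"})
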
There are two ways to convert data types: with library functions, or with a custom function.

  1. Through the astype function and pandas built-in functions
df.dtypes#View field data types
df["column"].astype(str)#Convert to string
df["column"].astype(float)#Convert to a floating-point number (requires a purely numeric string with no special characters)
df["column"].astype(int)#Convert to an integer
pd.to_numeric(df["income"],errors='coerce').fillna(0)#Convert to numeric; strings that cannot be parsed become NaN and are then filled with 0
pd.to_datetime(df[['Month', 'Day', 'Year']])#Combine the Month / Day / Year columns into a datetime
  2. Through custom functions
Define a function:
import re

def convert(a):
    b = re.sub(r'[^\w\s]+', '', a)
    return float(b)
   #Remove the special symbols from the string a with re.sub, then convert it to a float (requires the re module)

Shortcut: use the apply() function or the vectorized .str methods
df["Single volume"] = df["Single volume"].str.replace(r'[^\w\s]+', '', regex=True).astype("int")
df["Amount of money"] = df["Amount of money"].apply(lambda x: x.replace("¥","")).astype("float64")

About data type conversion, this blog is good: https://www.cnblogs.com/onemorepoint/p/9404753.html

This post does not cover converting time and date data types; I may add another post on that in the future.

String operations
Splitting and merging strings, adjusting and replacing strings, etc.
val.split(',') splits val on commas;
pieces = [x.strip() for x in val.split(',')] combines split and strip to remove the surrounding spaces;
For string replacement, see the replace part of the data type conversion section above.
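
A small runnable sketch of these string operations (val is a made-up example):

val = " Beijing , Tianjin ,Hebei"
pieces = [x.strip() for x in val.split(',')]   #Split on commas, then strip the surrounding spaces
merged = '|'.join(pieces)                      #Merge the pieces back with a separator
print(pieces)                                  #['Beijing', 'Tianjin', 'Hebei']
print(merged)                                  #Beijing|Tianjin|Hebei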

Row selection

df = df[(df['Self employment contract'].isin([0])) & (df['External single internal B business'].isin([0]))][["Month","Customer name","income","cost"]]
df["column"].isin(["a","b"])#Select the rows whose value in that column is a or b
df = df[~df["Merchant channel name"].str.contains('Cold chain|fresh|medicine')]#Drop the rows containing the keywords via str.contains(); multiple keywords are separated by "|" and ~ negates the condition
df = df[df["Self employment contract"] == "0"]#Filter with a Boolean comparison
df.iloc[0:5, 0:5]#Select the first 5 rows and first 5 columns (rows first, then columns); a bare : selects everything

Rule judgment with user-defined functions
This is a simple and effective approach; a few examples:

list1 = ["Beijing Theater","Hebei Theater","Shandong Theater","Tianjin Theater","Jin Meng Theater","Other"]

def pro(a):
    if "Inner Mongolia" in a or "Shanxi" in a:
        return "Jin Meng Theater"
    elif a in list1:
        return a
    else:
        return "Other"
#Avoid 0 / 0 division bugs:
def div(a,b):
    if b == 0 or a == 0:
        return 0
    else:
        return a/b

Rule judgment with user-defined functions is a very effective technique, and applying it with apply() and a lambda is simple and convenient. For example, df["Attribution Theater"] = df.apply(lambda row: pro(row["Primary Theater"]), axis=1). When the argument is a column, access it inside the lambda as row["column name"].
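
As a hedged sketch tying this to the per-unit income calculation in the main code (the values are toy data), the div() helper above can be plugged into apply() so rows with a zero denominator yield 0 instead of NaN or an error:

demo = pd.DataFrame({"income": [100, 0], "Single volume": [20, 0]})
demo["Single income"] = demo.apply(lambda row: div(row["income"], row["Single volume"]), axis=1)
print(demo)   #The 0/0 row gets 0 thanks to div()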

That's all for this share. It's a bit scattered and covers a lot, but I hope it helps you avoid some detours~
