Python 3 Quick Start - Pandas Data Processing

Python 3 Quick Start (15) - Pandas Data Processing

1. Function Application

1. Introduction to Function Application

There are three ways to use a custom function or other library function on a Pandas object.pipe() uses functions for tables, apply() for rows or columns, and applymap() for elements.

2. Application of Table Functions

Custom actions can be performed by using the function object and parameters as parameters to the pipe function, which operates on the entire DataFrame.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

def adder(x, y):
    return x + y

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(5, 3),columns=['col1', 'col2', 'col3'])
    print(df)
    df = df.pipe(adder, 1)
    print(df)

# output:
#        col1      col2      col3
# 0  0.390803  0.940306 -1.300635
# 1 -0.349588 -1.290132  0.415693
# 2 -0.079585 -0.083825  0.262867
# 3  0.582377  0.171701 -1.011748
# 4 -0.466655  1.746269  1.281538
#        col1      col2      col3
# 0  1.390803  1.940306 -0.300635
# 1  0.650412 -0.290132  1.415693
# 2  0.920415  0.916175  1.262867
# 3  1.582377  1.171701 -0.011748
# 4  0.533345  2.746269  2.281538

3. Application of Row and Column Functions

Use the apply() function to execute the application function along the axis of the DataFrame or Panel with optional axis parameters.By default, operations are performed by column.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

def adder(x, y):
    return x + y

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    print(df)
    # Execute by column
    result = df.apply(np.sum)
    print(result)
    # Execute by line
    result = df.apply(np.sum, axis=1)
    print(result)

# output:
#        col1      col2      col3
# 0 -1.773775 -0.608478  0.602059
# 1 -0.208412  0.969435 -0.292108
# 2  0.776864 -0.768559 -0.389092
# 3 -2.088412  1.133090  1.006486
# 4  0.693241  1.808845  0.772191
# col1   -2.600494
# col2    2.534332
# col3    1.699536
# dtype: float64
# 0   -1.780194
# 1    0.468915
# 2   -0.380788
# 3    0.051164
# 4    3.274277
# dtype: float64

4. Application of Element Functions

The applymap() function in the DataFrame can accept any Python function and return a single value.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    print(df)
    df = df.applymap(lambda x: x + 1)
    print(df)

# output:
#        col1      col2      col3
# 0  2.396185 -0.263581 -0.090799
# 1  1.718716  0.876074 -1.067746
# 2 -1.033945 -0.078448  1.036566
# 3  0.553849  0.251312 -0.422640
# 4 -0.896062  1.605349 -0.089430
#        col1      col2      col3
# 0  3.396185  0.736419  0.909201
# 1  2.718716  1.876074 -0.067746
# 2 -0.033945  0.921552  2.036566
# 3  1.553849  1.251312  0.577360
# 4  0.103938  2.605349  0.910570

2. Data cleaning

1. Introduction to Data Cleaning

Data cleaning is a complex and tedious work, and it is also the most important link in the data analysis process.The purpose of data cleaning is to make the data available through cleaning and to make the data more suitable for data analysis.Therefore, clean data should be cleaned as well as dirty data.In the actual data analysis, data cleaning will take up about 70% of the project time.

2. Treatment of missing values

See how many missing values each column has.
df.isnull().sum()
See how much complete data each column has
df.shape[0]-df.isnull().sum()

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    print(df.isnull().sum())
    print(df.shape[0] - df.isnull().sum())

# output:
#                    A         B         C
# 2019-01-01  1.138325  0.981597  1.359580
# 2019-01-02 -1.622074  0.812393 -0.946351
# 2019-01-03  0.049815  1.194241  0.807209
# 2019-01-04  1.500074 -0.570367 -0.328529
# 2019-01-05  0.465869  1.049651 -0.112453
# 2019-01-06 -1.399495  0.492769  1.961198
# A    0
# B    0
# C    0
# dtype: int64
# A    6
# B    6
# C    6
# dtype: int64

Delete Columns

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    del df['D']
    # Delete Column 2
    df.drop(df.columns[2], axis=1, inplace=True)
    # Delete Column B
    df.drop('B', axis=1, inplace=True)
    print(df)

# output:
#                    A         B         C
# 2019-01-01 -0.703151  0.753482 -0.624376
# 2019-01-02 -0.396221 -0.832279 -1.419897
# 2019-01-03 -0.179341 -0.368501 -0.300810
# 2019-01-04  0.464156  0.117461  1.502114
# 2019-01-05 -1.022012 -1.612456  1.611377
# 2019-01-06 -0.677521  0.001020 -0.342290
#                    A
# 2019-01-01 -0.703151
# 2019-01-02 -0.396221
# 2019-01-03 -0.179341
# 2019-01-04  0.464156
# 2019-01-05 -1.022012
# 2019-01-06 -0.677521

Delete NaN Value

df.dropna(self, axis=0, how='any', thresh=None, subset=None,
           inplace=False)

Axis is an axis, 0 means to operate on rows, and 1 means to operate on columns.
how is the operation type,'any'means that whenever a row or column with NaN is deleted,'all' means that the entire row or column with NaN is deleted.
Threshold of Threshold:NaN, deleted when Threshold is reached.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df.iloc[1, 3] = None
    df.iloc[2, 2] = None
    print(df)
    print(df.dropna(axis=1))
    print(df.dropna(how='any'))

# output:
#                    A         B         C         D
# 2019-01-01 -0.152239 -2.315100 -0.504998 -0.987549
# 2019-01-02 -1.884801  1.046506 -1.618871       NaN
# 2019-01-03  0.976682 -1.043107       NaN  0.391338
# 2019-01-04  0.143389  0.951518  0.040632 -0.443944
# 2019-01-05  3.092766  0.787921 -2.408260 -1.111238
# 2019-01-06 -0.179249  0.573734 -0.912023  0.261517
#                    A         B
# 2019-01-01 -0.152239 -2.315100
# 2019-01-02 -1.884801  1.046506
# 2019-01-03  0.976682 -1.043107
# 2019-01-04  0.143389  0.951518
# 2019-01-05  3.092766  0.787921
# 2019-01-06 -0.179249  0.573734
#                    A         B         C         D
# 2019-01-01 -0.152239 -2.315100 -0.504998 -0.987549
# 2019-01-04  0.143389  0.951518  0.040632 -0.443944
# 2019-01-05  3.092766  0.787921 -2.408260 -1.111238
# 2019-01-06 -0.179249  0.573734 -0.912023  0.261517

Fill in NaN values

df.fillna(self, value=None, method=None, axis=None, inplace=False,limit=None, downcast=None, **kwargs)

Value: The populated value can be a dictionary with the key of the dictionary as the column name.
inplace: Indicates whether the source data has been modified, defaulting to False.
Filna returns new objects by default, but it can also modify existing objects in place.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df.iloc[1, 3] = None
    df.iloc[2, 2] = None
    print(df)
    print(df.fillna({'C': 3.14, 'D': 0.0}))
    # Fill with specified values
    df.fillna(value=3.14, inplace=True)
    print(df)

# output:
#                    A         B         C         D
# 2019-01-01  0.490727 -0.603079  0.202922  2.012060
# 2019-01-02 -0.855106  0.305557  0.851141       NaN
# 2019-01-03 -0.324215  0.629637       NaN -0.174930
# 2019-01-04  0.085996  0.173265  0.416938 -0.903989
# 2019-01-05  0.009368  0.410056 -1.297822 -2.202893
# 2019-01-06  0.021892 -0.359749 -0.608556 -0.859454
#                    A         B         C         D
# 2019-01-01  0.490727 -0.603079  0.202922  2.012060
# 2019-01-02 -0.855106  0.305557  0.851141  0.000000
# 2019-01-03 -0.324215  0.629637  3.140000 -0.174930
# 2019-01-04  0.085996  0.173265  0.416938 -0.903989
# 2019-01-05  0.009368  0.410056 -1.297822 -2.202893
# 2019-01-06  0.021892 -0.359749 -0.608556 -0.859454
#                    A         B         C         D
# 2019-01-01  0.490727 -0.603079  0.202922  2.012060
# 2019-01-02 -0.855106  0.305557  0.851141  3.140000
# 2019-01-03 -0.324215  0.629637  3.140000 -0.174930
# 2019-01-04  0.085996  0.173265  0.416938 -0.903989
# 2019-01-05  0.009368  0.410056 -1.297822 -2.202893
# 2019-01-06  0.021892 -0.359749 -0.608556 -0.859454

Boolean Filling of Data

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df.iloc[1, 3] = None
    df.iloc[2, 2] = None
    print(df)
    print(pd.isnull(df))

# output:
#                    A         B         C         D
# 2019-01-01 -1.337471  0.154446  0.493862  1.278946
# 2019-01-02  2.853301 -0.151376  0.318281       NaN
# 2019-01-03  1.094465  0.059063       NaN  0.216805
# 2019-01-04 -0.983091 -1.052905  0.416604 -1.431156
# 2019-01-05 -1.421142  1.015465 -1.851315 -0.680514
# 2019-01-06  0.224378 -0.636699 -0.749040 -0.728368
#                 A      B      C      D
# 2019-01-01  False  False  False  False
# 2019-01-02  False  False  False   True
# 2019-01-03  False  False   True  False
# 2019-01-04  False  False  False  False
# 2019-01-05  False  False  False  False
# 2019-01-06  False  False  False  False

3. Row and Column Processing

A column selection is made using the dictionary key to get a column of data in the DataFrame.
Specify index and columns when generating DataFrame

import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)

# output:
#                    A         B         C         D
# 2013-01-01  1.116914 -0.221035 -0.577299 -0.328831
# 2013-01-02  1.764656  1.462838 -0.360678  1.176134
# 2013-01-03  0.144396 -0.594359 -0.548543  1.281829
# 2013-01-04  0.632378  0.895123 -0.757924 -1.325917
# 2013-01-05  0.219125 -1.247446  0.335363 -0.676052
# 2013-01-06  0.963715 -0.131331  0.326482 -0.718461

index and columns can also be specified after the DataFrame is created

import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    df.index = pd.date_range('20130201', periods=df.shape[0])
    df.columns = list('abcd')
    print(df)
    df.index = pd.date_range('20130301', periods=len(df))
    df.columns = list('ABCD')
    print(df)

# output:
#                    A         B         C         D
# 2013-01-01  1.588442  1.548420  0.132539  0.410512
# 2013-01-02  0.200415  1.515354  2.275575 -1.533603
# 2013-01-03  0.838294  0.067409 -1.157181  0.401973
# 2013-01-04  0.551363 -0.749296  0.343762 -1.558969
# 2013-01-05 -0.799507 -1.343379 -0.006312  1.091014
# 2013-01-06  0.012188 -0.382384  0.280008 -2.333430
#                    a         b         c         d
# 2013-02-01  1.588442  1.548420  0.132539  0.410512
# 2013-02-02  0.200415  1.515354  2.275575 -1.533603
# 2013-02-03  0.838294  0.067409 -1.157181  0.401973
# 2013-02-04  0.551363 -0.749296  0.343762 -1.558969
# 2013-02-05 -0.799507 -1.343379 -0.006312  1.091014
# 2013-02-06  0.012188 -0.382384  0.280008 -2.333430
#                    A         B         C         D
# 2013-03-01  1.588442  1.548420  0.132539  0.410512
# 2013-03-02  0.200415  1.515354  2.275575 -1.533603
# 2013-03-03  0.838294  0.067409 -1.157181  0.401973
# 2013-03-04  0.551363 -0.749296  0.343762 -1.558969
# 2013-03-05 -0.799507 -1.343379 -0.006312  1.091014
# 2013-03-06  0.012188 -0.382384  0.280008 -2.333430

You can specify a column as index

import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
    df['date'] = dates
    print(df)
    df = df.set_index('date', drop=True)
    print(df)

# output:
#           A         B         C         D       date
# 0  0.910416 -0.378195  0.332562 -0.194766 2013-01-01
# 1  0.533733  0.888629 -0.358143  1.583278 2013-01-02
# 2  0.482362 -0.905558  1.045753 -0.874653 2013-01-03
# 3  0.901622 -0.535862 -0.439763 -0.640594 2013-01-04
# 4 -1.273577 -0.746785  1.448309 -0.368285 2013-01-05
# 5  0.191289 -1.246213  0.184757 -1.143074 2013-01-06
#                    A         B         C         D
# date
# 2013-01-01  0.910416 -0.378195  0.332562 -0.194766
# 2013-01-02  0.533733  0.888629 -0.358143  1.583278
# 2013-01-03  0.482362 -0.905558  1.045753 -0.874653
# 2013-01-04  0.901622 -0.535862 -0.439763 -0.640594
# 2013-01-05 -1.273577 -0.746785  1.448309 -0.368285
# 2013-01-06  0.191289 -1.246213  0.184757 -1.143074

On the basis of the original DataFrame, you can create a new DataFrame, or you can create a new DataFrame by summarizing the original DataFrame in rows.

import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    df1 = pd.DataFrame()
    df1['min'] = df.min()
    df1['max'] = df.max()
    df1['std'] = df.std()
    print(df1)
    df['min'] = df.min(axis=1)
    df['max'] = df.max(axis=1)
    df['std'] = df.std(axis=1)
    print(df)

# output:
#                    A         B         C
# 2013-01-01  0.901073  1.706925 -0.503194
# 2013-01-02  0.379870  0.729674  0.579337
# 2013-01-03 -1.285323 -0.665951 -0.161148
# 2013-01-04 -0.714282  0.423376  0.586061
# 2013-01-05 -0.895171 -0.413328  0.485803
# 2013-01-06  1.926472 -0.718467  1.113522
#         min       max       std
# A -1.285323  1.926472  1.234084
# B -0.718467  1.706925  0.955797
# C -0.503194  1.113522  0.582913
#                    A         B         C       min       max       std
# 2013-01-01  0.901073  1.706925 -0.503194 -0.503194  1.706925  1.113132
# 2013-01-02  0.379870  0.729674  0.579337  0.379870  0.729674  0.175247
# 2013-01-03 -1.285323 -0.665951 -0.161148 -1.285323 -0.161148  0.562671
# 2013-01-04 -0.714282  0.423376  0.586061 -0.714282  0.586061  0.685749
# 2013-01-05 -0.895171 -0.413328  0.485803 -0.895171  0.485803  0.696763
# 2013-01-06  1.926472 -0.718467  1.113522 -0.718467  1.926472  1.341957

axis=0, each column of the DataFrame is statistically calculated to get a row.Axis = 0, calculates each row of data in the DataFrame to get a column.
DataFrame can modify index es and columns.

import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))
    print(df)
    df = df.rename(index=lambda x: x + 5, columns={'A': 'newA', 'B': 'newB'})
    print(df)

# output:
#                   A         B         C
# 2013-01-01  0.834910  0.652175  0.537611
# 2013-01-02  1.083902  0.836208 -1.466876
# 2013-01-03 -0.044256  0.932547  1.843682
# 2013-01-04  1.610113 -0.705734 -0.145042
# 2013-01-05  1.114897  0.273569 -0.047725
# 2013-01-06 -0.541942 -0.112752  1.644338
#                 newA      newB         C
# 2013-01-06  0.834910  0.652175  0.537611
# 2013-01-07  1.083902  0.836208 -1.466876
# 2013-01-08 -0.044256  0.932547  1.843682
# 2013-01-09  1.610113 -0.705734 -0.145042
# 2013-01-10  1.114897  0.273569 -0.047725
# 2013-01-11 -0.541942 -0.112752  1.644338

Unit Unification of Column Data

import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df['D'] = [10000, 34000, 60000, 34000, 56000, 80000]
    print(df)
    for i in range(len(df['D'])):
        weight = float(df.iloc[i, 3]) / 10000
        df.iloc[i, 3] = '{}ten thousand'.format(weight)
    print(df)

# output:
#                    A         B         C      D
# 2019-01-01 -0.889533 -0.411451  0.563969  10000
# 2019-01-02 -0.573239  0.264805 -0.058530  34000
# 2019-01-03  1.224993 -1.815338 -2.075301  60000
# 2019-01-04  0.266483  1.841926 -0.759681  34000
# 2019-01-05 -0.167595  0.432617  0.533577  56000
# 2019-01-06 -0.973877  0.700821  1.093101  80000
#                    A         B         C     D
# 2019-01-01-0.889533-0.411451 0.563969.10 million
# 2019-01-02-0.573239 0.264805-0.05853034,000
# 2019-01-03 1.224993-1.815338-20.075301.66 million
# 2019-01-04 0.266483 1.841926-0.759681.34 million
# 2019-01-05-0.167595 0.432617 0.533577.56 million
# 2019-01-06-0.973877 0.700821.093101.8 million

4. Delete duplicate values

df.duplicated(self, subset=None, keep='first')
Check the DataFrame for duplicate data.
Subset: subset, the sequence of column labels or column labels
keep: The optional values are first, last, False, first to preserve the first occurrence, last to preserve the last occurrence, and False to preserve all values.
df.drop_duplicates(self, subset=None, keep='first', inplace=False)
Remove duplicate data from the DataFrame.
Subset: subset, the sequence of column labels or column labels
keep: The optional values are first, last, False, first to preserve the first occurrence, last to preserve the last occurrence, and False to preserve all values.
inplace: a value of True means modifying the source data, and a value of False means not modifying the source data

import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', np.nan, 80], ['Bob', 25, 90], ['Bob', 25, 90]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    # Use bool filtering to remove duplicate values
     print(df[df.duplicated(keep=False)])
    # Delete duplicate values, modify source data
     df.drop_duplicates(keep='last', inplace=True)
    print(df)

# output:
#    Name   Age  Score
# 0  Alex   NaN     80
# 1   Bob  25.0     90
# 2   Bob  25.0     90
#   Name   Age  Score
# 1  Bob  25.0     90
# 2  Bob  25.0     90
#    Name   Age  Score
# 0  Alex   NaN     80
# 2   Bob  25.0     90

5. Handling of outliers

There are two types of outliers, one is illegal data, such as a number column with some Chinese characters or symbols in the middle; the other is outlier data, unusual large or small values.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

def swap(x):
    if type(x) == str:
        if x[-1] == 'year':
            x = int(x[:-1])
        elif x[-1] == 'branch':
            x = int(x[:-1])
    return x

if __name__ == "__main__":
    data = [['Alex', np.nan, '89 branch'], ['Bob', '25 year', '90 branch'], ['Bob', '28 year', '90 branch']]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    df = df.applymap(swap)
    print(df)

# output:
#    Name  Age Score
# 0 Alex NaN 89 points
# 1 Bob 25, 90
# 2 Bob, 28, 90
#    Name   Age  Score
# 0  Alex   NaN     89
# 1   Bob  25.0     90
# 2   Bob  28.0     90

6. Data Format Cleaning

Clear the space before and after field characters
df['city']=df['city'].map(str.strip)
Convert fields to uppercase and lowercase:
df['city']=df['city'].str.lower()

import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', np.nan, 80], [' Bob ', 25, 90], [' Bob', 25, 90]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    # Clear the space before and after the string
    print(df['Name'].map(str.strip))
    # toggle case
    print(df['Name'].str.lower())

# output:
#     Name   Age  Score
# 0   Alex   NaN     80
# 1   Bob   25.0     90
# 2    Bob  25.0     90
# 0    Alex
# 1     Bob
# 2     Bob
# Name: Name, dtype: object
# 0     alex
# 1     bob 
# 2      bob
# Name: Name, dtype: object

Change the data type of the column:
df['price'].astype('int')

7. Data Substitution

df['city'].replace('sh', 'shanghai')
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', np.nan, 80], ['Bob', 25, 90], ['Bob', 25, 90]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df['Name'].replace('Bob', 'Bauer'))

# output:
#    Name   Age  Score
# 0  Alex   NaN     80
# 1   Bob  25.0     90
# 2   Bob  25.0     90
# 0     Alex
# 1    Bauer
# 2    Bauer
# Name: Name, dtype: object

When replacing a string, there must be no spaces before or after it. Strict matches are required.

3. Data Processing

1. Sorting

(1) Sort by label

sort_index(self, axis=0, level=None, ascending=True, inplace=False,
               kind='quicksort', na_position='last', sort_remaining=True,
               by=None)

The DataFrame can be sorted by passing axis parameters and sort order using the sort_index() function.By default, row labels are sorted in ascending order.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col1', 'col2', 'col3'])
    print(df)
    df = df.sort_index()
    print(df)

# output:
#            col1      col2      col3
# rank2 -0.627700 -0.361006 -1.126366
# rank1 -1.997538  1.569461  0.454773
# rank4 -0.598688  1.348594  0.777791
# rank3 -0.190794 -1.209312  0.830699
#            col1      col2      col3
# rank1 -1.997538  1.569461  0.454773
# rank2 -0.627700 -0.361006 -1.126366
# rank3 -0.190794 -1.209312  0.830699
# rank4 -0.598688  1.348594  0.777791

The sort order can be controlled by passing a Boolean value to the ascending parameter ascending; the column labels can be sorted by passing an axis parameter value of 1.By default, axis = 0 sorts the row labels.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    # Sort by column label
    df = df.sort_index(ascending=True, axis=1)
    print(df)

# output:
#            col3      col2      col1
# rank2 -0.715319 -0.245760 -1.282737
# rank1  0.046705 -0.202133  0.185576
# rank4 -1.608270 -0.491281  0.047686
# rank3 -1.013456 -0.020197  1.184151
#            col1      col2      col3
# rank2 -1.282737 -0.245760 -0.715319
# rank1  0.185576 -0.202133  0.046705
# rank4  0.047686 -0.491281 -1.608270
# rank3  1.184151 -0.020197 -1.013456

(2) Sort by value

sort_values(self, by, axis=0, ascending=True, inplace=False,
                kind='quicksort', na_position='last')

Using the sort_values function, you can sort by value, receive a by parameter, use the column name of the DataFrame as the value, and sort by a column.By can be a list of column names.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    df = df.sort_values(by="col2")
    print(df)

# output:
#            col3      col2      col1
# rank2 -0.706054 -2.135880  1.066836
# rank1  0.290660 -2.214451 -1.724394
# rank4  1.211874  0.475177 -0.711855
# rank3 -0.253331  1.211301 -0.208633
#            col3      col2      col1
# rank1  0.290660 -2.214451 -1.724394
# rank2 -0.706054 -2.135880  1.066836
# rank4  1.211874  0.475177 -0.711855
# rank3 -0.253331  1.211301 -0.208633

sort_values() provides three sort algorithms, mergesort, heapsort, and quicksort. Mergesort is the only stable sort algorithm that is passed by the parameter kind.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    df = df.sort_values(by="col2", kind='mergesort')
    print(df)

# output:
#            col3      col2      col1
# rank2 -0.243768 -0.344846  0.535481
# rank1 -1.491950  0.690749 -2.023808
# rank4 -0.656292 -0.704788  0.655129
# rank3  0.468007 -0.250702  0.079670
#            col3      col2      col1
# rank4 -0.656292 -0.704788  0.655129
# rank2 -0.243768 -0.344846  0.535481
# rank3  0.468007 -0.250702  0.079670
# rank1 -1.491950  0.690749 -2.023808

Multi-column descending sort in order

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    df = df.sort_values(by=['col1', 'col3'], ascending=True, axis=0)
    print(df)

# output:
#            col3      col2      col1
# rank2  1.035965  1.048124 -0.341586
# rank1  2.391899 -1.575462  0.616940
# rank4  0.968523 -0.932288 -0.553498
# rank3  0.585521  1.907344 -0.264500
#            col3      col2      col1
# rank4  0.968523 -0.932288 -0.553498
# rank2  1.035965  1.048124 -0.341586
# rank3  0.585521  1.907344 -0.264500
# rank1  2.391899 -1.575462  0.616940

2. Grouping

Pandas can use the groupby function to split the DataFrame to get grouped objects.

df.groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
            group_keys=True, squeeze=False, observed=False, **kwargs)

by: Grouping, which can be a dictionary, function, label, label list

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90], ['Jack', 26, 80]]
    df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'], columns=['Name', 'Age', 'A'])
    print(df)
    group_obj1 = df.groupby('Name')
    print(group_obj1.groups)
    print('===================================')
    # Single-Layer Group Iteration
    for key, data in group_obj1:
        print(key)
        print(data)
    group_obj2 = df.groupby(['Name', 'A'])
    # Group Information View
    print(group_obj2.groups)
    print('===================================')
    # Multilayer Group Iteration
    for key, data in group_obj2:
        print(key)
        print(data)

# output:
#     Name  Age   A
# a   Alex   24  80
# b    Bob   25  90
# c  Bauer   25  90
# d   Jack   26  80
# {'Alex': Index(['a'], dtype='object'), 'Bauer': Index(['c'], dtype='object'), 'Bob': Index(['b'], dtype='object'), 'Jack': Index(['d'], dtype='object')}
# ===================================
# Alex
#    Name  Age   A
# a  Alex   24  80
# Bauer
#     Name  Age   A
# c  Bauer   25  90
# Bob
#   Name  Age   A
# b  Bob   25  90
# Jack
#    Name  Age   A
# d  Jack   26  80
# {('Alex', 80): Index(['a'], dtype='object'), ('Bauer', 90): Index(['c'], dtype='object'), ('Bob', 90): Index(['b'], dtype='object'), ('Jack', 80): Index(['d'], dtype='object')}
# ===================================
# ('Alex', 80)
#    Name  Age   A
# a  Alex   24  80
# ('Bauer', 90)
#     Name  Age   A
# c  Bauer   25  90
# ('Bob', 90)
#   Name  Age   A
# b  Bob   25  90
# ('Jack', 80)
#    Name  Age   A
# d  Jack   26  80

The filter() function can be used to filter data.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [['Alex', 24, 80], ['Bob', 25, 92], ['Bauer', 25, 90], ['Jack', 26, 80]]
    df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'], columns=['Name', 'Age', 'A'])
    print(df)
    group_obj1 = df.groupby('Age')
    print(group_obj1.groups)
    # Filter people of the same age
    group = group_obj1.filter(lambda x: len(x) > 1)
    print(group)

# output:
#     Name  Age   A
# a   Alex   24  80
# b    Bob   25  92
# c  Bauer   25  90
# d   Jack   26  80
# {24: Index(['a'], dtype='object'), 25: Index(['b', 'c'], dtype='object'), 26: Index(['d'], dtype='object')}
#     Name  Age   A
# b    Bob   25  92
# c  Bauer   25  90

3. Consolidation

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)
Merge two DataFrame objects.
Left, left DataFrame object.
Right, right DataFrame object.
on, column (name) connection, must exist (found) in the left and right DataFrame objects.
left_on, the column in the left DataFrame used as the key, can be a column name or an array of length equal to the length of the DataFrame.
right_on, a column from the right DataFrame as a key, can be a column name or an array of length equal to the length of the DataFrame.
left_index, if True, uses the index (row label) in the left DataFrame as its connection key.In the case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of connection keys from the right DataFrame.
right_index, which has the same usage as left_index of the right DataFrame.
how, the optional value is left, right, outer, inner, defaulting to inner.
sort, sorting the result DataFrame by connection keys in dictionary order.The default is True, which greatly improves performance when set to False.
An example of merging two DataFrame s on one key is as follows:

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['Name', 'Age', 'A'])

    data2 = [['Alex', 87, 78], ['Bob', 67, 87], ['Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['Name', 'B', 'C'])

    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='Name')
    print(df)

# output:
#     Name  Age   A
# 0   Alex   24  80
# 1    Bob   25  90
# 2  Bauer   25  90
# ==================================
#     Name   B   C
# 0   Alex  87  78
# 1    Bob  67  87
# 2  Bauer  98  78
# ==================================
#     Name  Age   A   B   C
# 0   Alex   24  80  87  78
# 1    Bob   25  90  67  87
# 2  Bauer   25  90  98  78

An example of merging two DataFrame s on multiple keys is as follows:

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on=['ID', 'Name'])
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID   Name  Age   A   B   C
# 0   1   Alex   24  80  87  78
# 1   3  Bauer   25  90  98  78

Use the "how" parameter for merging, and the merge parameter specifies how to determine which keys will be included in the result table.If the key combination does not appear in the left or right table, the value in the join table will be NA.
Left:LEFT OUTER JOIN, using the keys of the left object.
Right:RIGHT OUTER JOIN, using the key of the right object.
outer:FULL OUTER JOIN, union using keys.
inner:INNER JOIN, using the intersection of keys.
Left Join example:

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='ID', how='left')
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID Name_x  Age   A Name_y     B     C
# 0   1   Alex   24  80   Alex  87.0  78.0
# 1   2    Bob   25  90    NaN   NaN   NaN
# 2   3  Bauer   25  90  Bauer  98.0  78.0

Right Join example:

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='ID', how='right')
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID Name_x   Age     A Name_y   B   C
# 0   1   Alex  24.0  80.0   Alex  87  78
# 1   3  Bauer  25.0  90.0  Bauer  98  78
# 2   4    NaN   NaN   NaN    Bob  67  87

Outer Join example:

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='ID', how='outer')
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID Name_x   Age     A Name_y     B     C
# 0   1   Alex  24.0  80.0   Alex  87.0  78.0
# 1   2    Bob  25.0  90.0    NaN   NaN   NaN
# 2   3  Bauer  25.0  90.0  Bauer  98.0  78.0
# 3   4    NaN   NaN   NaN    Bob  67.0  87.0

Inner Join example:

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

    data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]
    right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

    print(left)
    print('==================================')
    print(right)
    print('==================================')
    df = pd.merge(left, right, on='ID', how='inner')
    print(df)

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ==================================
#    ID   Name   B   C
# 0   1   Alex  87  78
# 1   4    Bob  67  87
# 2   3  Bauer  98  78
# ==================================
#    ID Name_x  Age   A Name_y   B   C
# 0   1   Alex   24  80   Alex  87  78
# 1   3  Bauer   25  90  Bauer  98  78

4. Cascade

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
           keys=None, levels=None, names=None, verify_integrity=False,
           sort=None, copy=True)

Cascade along an axis.
objs, a sequence or dictionary of Series, DataFrame, or Panel objects.
Axis, {0, 1,...}, defaults to 0, axis=0 means cascade by index, axis=1 means cascade by columns.
join, {'inner','outer'}, default inner, indicates how indexes on other axes are handled.
ignore_index, Boolean value, defaults to False.If True is specified, the index value on the join axis is not used.The result axis will be marked as 0,..., n-1.
join_axes, a list of Index objects.Used for specific indexes on other (n-1) axes instead of performing internal/external set logic.
sort: sort or not, True sorts, False does not.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90]]
    one = pd.DataFrame(data1, columns=['Name', 'Age', 'A'])

    data2 = [['Alex', 87, 78], ['Bob', 67, 87], ['Bauer', 98, 78]]
    two = pd.DataFrame(data2, columns=['Name', 'B', 'C'])

    print(one)
    print('==================================')
    print(two)
    print('==================================')
    df = pd.concat([one, two], axis=1, sort=False)
    print(df)

# output:
#     Name  Age   A
# 0   Alex   24  80
# 1    Bob   25  90
# 2  Bauer   25  90
# ==================================
#     Name   B   C
# 0   Alex  87  78
# 1    Bob  67  87
# 2  Bauer  98  78
# ==================================
#     Name  Age   A   Name   B   C
# 0   Alex   24  80   Alex  87  78
# 1    Bob   25  90    Bob  67  87
# 2  Bauer   25  90  Bauer  98  78

When the index of the result is duplicated, you need to set ignore_index to True if the object you want to generate must follow its own index.
Pandas provides an append method to connect the DataFrame, connecting along axis=0.

df.append(self, other, ignore_index=False,
           verify_integrity=False, sort=None)

Add a new row to the DataFrame object, and if the column name added is not in the DataFrame object, it will be added as a new column.
other: DataFrame,series,dict,list
ignore_index: The default value is False, and if True, the index tag is not used.
verify_integrity: The default value is False, which throws a ValueError exception when creating the same index for True.
sort:boolean, default is None.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90]]
    one = pd.DataFrame(data1, columns=['Name', 'Age', 'A'])

    data2 = [['Alex', 87, 78], ['Bob', 67, 87], ['Bauer', 98, 78]]
    two = pd.DataFrame(data2, columns=['Name', 'B', 'C'])

    print(one)
    print('==================================')
    print(two)
    print('==================================')
    df = one.append(two, sort=False)
    print(df)

# output:
#     Name  Age   A
# 0   Alex   24  80
# 1    Bob   25  90
# 2  Bauer   25  90
# ==================================
#     Name   B   C
# 0   Alex  87  78
# 1    Bob  67  87
# 2  Bauer  98  78
# ==================================
#     Name   Age     A     B     C
# 0   Alex  24.0  80.0   NaN   NaN
# 1    Bob  25.0  90.0   NaN   NaN
# 2  Bauer  25.0  90.0   NaN   NaN
# 0   Alex   NaN   NaN  87.0  78.0
# 1    Bob   NaN   NaN  67.0  87.0
# 2  Bauer   NaN   NaN  98.0  78.0

Pandas provides a join method to join a DataFrame along axis=1, which combines different column indexes from two DataFrames into one DataFrame.

df.join(self, other, on=None, how='left', lsuffix='', rsuffix='',
         sort=False)

The join method provides a Join operation for SQL, defaulting to left outer connection how=left.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90],['Jack', 26, 80]]
    one = pd.DataFrame(data1, index=['a', 'b', 'c', 'd'], columns=['Name', 'Age', 'A'])

    data2 = [[87, 78], [67, 87], [98, 78]]
    two = pd.DataFrame(data2, index=['a', 'b', 'c'], columns=['B', 'C'])

    print(one)
    print('==================================')
    print(two)
    print('==================================')
    df = one.join(two)
    print(df)

# output:
#     Name  Age   A
# a   Alex   24  80
# b    Bob   25  90
# c  Bauer   25  90
# d   Jack   26  80
# ==================================
#     B   C
# a  87  78
# b  67  87
# c  98  78
# ==================================
#     Name  Age   A     B     C
# a   Alex   24  80  87.0  78.0
# b    Bob   25  90  67.0  87.0
# c  Bauer   25  90  98.0  78.0
# d   Jack   26  80   NaN   NaN

5. Iteration

Iterate DataFrame to provide column names.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    for col in df:
        print(col, end=' ')

# output:
#                    A         B         C         D
# 2019-01-01 -0.415754 -1.214340 -0.103952  1.232414
# 2019-01-02 -0.367888  0.257199 -1.615029 -0.335322
# 2019-01-03  0.552697  0.202993 -1.000219 -0.530897
# 2019-01-04  0.503410 -1.610091  1.660362  0.649700
# 2019-01-05  0.575416 -1.962578 -1.681379 -0.425239
# 2019-01-06  1.075917 -0.499081  1.886878 -0.073895
# A B C D 

df.iteritems() is used for iteration (key, value) pairs, with each column label as the key and value as the Series object.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    dates = pd.date_range('20190101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    for key, value in df.iteritems():
        print(key, value)

# output:
#                    A         B         C         D
# 2019-01-01 -0.302021  1.343811 -0.070351 -0.409479
# 2019-01-02 -0.365564  0.743572 -0.475075  1.026054
# 2019-01-03  0.025748  1.395340 -0.987686  0.141003
# 2019-01-04 -0.291348 -1.173600 -2.286905  0.528416
# 2019-01-05 -1.844523 -0.052567  0.575980  0.260001
# 2019-01-06  0.271046 -0.583334 -0.596251  0.772095
# A 2019-01-01   -0.302021
# 2019-01-02   -0.365564
# 2019-01-03    0.025748
# 2019-01-04   -0.291348
# 2019-01-05   -1.844523
# 2019-01-06    0.271046
# Freq: D, Name: A, dtype: float64
# B 2019-01-01    1.343811
# 2019-01-02    0.743572
# 2019-01-03    1.395340
# 2019-01-04   -1.173600
# 2019-01-05   -0.052567
# 2019-01-06   -0.583334
# Freq: D, Name: B, dtype: float64
# C 2019-01-01   -0.070351
# 2019-01-02   -0.475075
# 2019-01-03   -0.987686
# 2019-01-04   -2.286905
# 2019-01-05    0.575980
# 2019-01-06   -0.596251
# Freq: D, Name: C, dtype: float64
# D 2019-01-01   -0.409479
# 2019-01-02    1.026054
# 2019-01-03    0.141003
# 2019-01-04    0.528416
# 2019-01-05    0.260001
# 2019-01-06    0.772095
# Freq: D, Name: D, dtype: float64

df.iterrows() is used to return iterators, producing each index and a Series containing each row of data.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
    print(df)
    for index, value in df.iterrows():
        print(index, value)

# output:
#           A         B         C         D
# 0 -1.097851  0.785749 -1.727198 -1.120925
# 1 -1.420429  0.094384 -1.566202  0.237084
# 2 -0.761957  0.552395  0.680884 -0.290955
# 3  0.357713 -0.323331  1.438013 -1.334616
# 4  0.015467 -2.431556 -0.717285 -0.094409
# 5 -1.198224 -1.370170  0.201725  0.258093
# 0 A   -1.097851
# B    0.785749
# C   -1.727198
# D   -1.120925
# Name: 0, dtype: float64
# 1 A   -1.420429
# B    0.094384
# C   -1.566202
# D    0.237084
# Name: 1, dtype: float64
# 2 A   -0.761957
# B    0.552395
# C    0.680884
# D   -0.290955
# Name: 2, dtype: float64
# 3 A    0.357713
# B   -0.323331
# C    1.438013
# D   -1.334616
# Name: 3, dtype: float64
# 4 A    0.015467
# B   -2.431556
# C   -0.717285
# D   -0.094409
# Name: 4, dtype: float64
# 5 A   -1.198224
# B   -1.370170
# C    0.201725
# D    0.258093
# Name: 5, dtype: float64

The df.itertuples() method returns an iterator that produces a named tuple for each row in the DataFrame.The first element of the tuple is the row index, while the remaining values are the row values.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
    print(df)
    for row in df.itertuples():
        print(row)

# output:
#           A         B         C         D
# 0  0.681324  1.047734 -1.909570 -0.845900
# 1 -0.879077 -0.897085 -0.795461 -0.634519
# 2  0.484502 -0.061608  0.605827 -0.321721
# 3 -0.051974  1.533112 -1.011544 -0.922280
# 4 -0.634157 -0.173692  1.228584 -1.229581
# 5  0.236769 -0.933609  0.111948  1.048215
# Pandas(Index=0, A=0.6813238552921729, B=1.0477343302788706, C=-1.909570436815022, D=-0.8459001766064564)
# Pandas(Index=1, A=-0.8790771200969485, B=-0.8970849190216943, C=-0.7954606477323869, D=-0.6345188867416923)
# Pandas(Index=2, A=0.48450157948338324, B=-0.061608014575315506, C=0.6058267522125123, D=-0.32172144100965605)
# Pandas(Index=3, A=-0.05197447447575398, B=1.5331115391025778, C=-1.0115444345763995, D=-0.9222798204619236)
# Pandas(Index=4, A=-0.6341570074338677, B=-0.173692444412635, C=1.2285839004083785, D=-1.2295807166909738)
# Pandas(Index=5, A=0.23676890089548117, B=-0.9336090868233837, C=0.11194794444517034, D=1.0482154173833818)

Iterations are used for reading and the iterator returns a copy of the original object (view), so changes during iteration will not be reflected on the original object.

6. SQL Operation

In SQL, SELECT is done using a comma-separated list of columns (or selecting all columns).
SELECT ID, Name FROM tablename LIMIT 5;
In Pandas, column selection is passed to the DataFrame by column name.
df[['ID', 'Name']].head(5)
Example SELECT operation:

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    df = pd.DataFrame(data, columns=['ID', 'Name', 'Age', 'A'])
    print(df)
    print(df[['ID', 'Name']].head(5))

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
#    ID   Name
# 0   1   Alex
# 1   2    Bob
# 2   3  Bauer

In SQL, conditional filtering is done using WHERE.
SELECT * FROM tablename WHERE Name = 'Bauer' LIMIT 5;
In Pandas, Boolean indexes are often used for filtering.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]
    df = pd.DataFrame(data, columns=['ID', 'Name', 'Age', 'A'])
    print(df)
    print('===========================')
    print(df[df['Name'] == 'Bauer'].head(5))

# output:
#    ID   Name  Age   A
# 0   1   Alex   24  80
# 1   2    Bob   25  90
# 2   3  Bauer   25  90
# ===========================
#    ID   Name  Age   A
# 2   3  Bauer   25  90

4. Data Analysis

1. Descriptive Statistics

(1)sum
Returns the sum of the values of the requested axis.By default, the axis is an index (axis=0).

import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df.sum())
    print(df.sum(1))

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Name     AlexBobBauer
# Age                75
# Score             257
# dtype: object
# 0    105
# 1    116
# 2    111
# dtype: int64

(2)mean
Returns the average value.

import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df.mean())

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Age      25.000000
# Score    85.666667
# dtype: float64

(3)std
Returns the Bressel standard deviation of a numeric column.

import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df.std())

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Age      1.000000
# Score    5.131601
# dtype: float64

(4)median
Find the median of all values.

import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df.median())

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Age      25.0
# Score    87.0
# dtype: float64

(5)min
Find the minimum of all values.

import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df.min())

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Name     Alex
# Age        24
# Score      80
# dtype: object

(6)max
Find the maximum of all values.

import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df.max())

# output:
#     Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
# Name     Bob
# Age       26
# Score     90
# dtype: object

(7)describe
Summary of statistical information about DataFrame columns.
def describe(self, percentiles=None, include=None, exclude=None)
Incude is a parameter used to pass the necessary information about which columns need to be considered for summary.Gets a list of values, number by default.
object - Summary string column
Number - Summary number column
All - Summarize all columns together (they should not be passed as list values)

import pandas as pd

if __name__ == "__main__":
    data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])
    print(df)
    print(df.describe(include="all"))

# output:
# Name  Age  Score
# 0   Alex   25     80
# 1    Bob   26     90
# 2  Bauer   24     87
#         Name   Age      Score
# count      3   3.0   3.000000
# unique     3   NaN        NaN
# top     Alex   NaN        NaN
# freq       1   NaN        NaN
# mean     NaN  25.0  85.666667
# std      NaN   1.0   5.131601
# min      NaN  24.0  80.000000
# 25%      NaN  24.5  83.500000
# 50%      NaN  25.0  87.000000
# 75%      NaN  25.5  88.500000
# max      NaN  26.0  90.000000

abs: evaluates the absolute value of all values
prod: product of all values
cumsum: cumulative total
cumprod: cumulative product

2. Percentage Change

Series, DatFrames, and Panel all have a pct_change() function that compares each element to its previous one and calculates the percentage change.By default, pct_change() operates on columns; if you want to apply it to rows, you can use the axis = 1 parameter.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])
    print(df)
    print(df.pct_change())

# output:
#            col3      col2      col1
# rank2  0.988739  2.062798  1.400892
# rank1  0.394663 -0.988307  1.583098
# rank4 -0.768109 -0.163727 -1.801323
# rank3  0.999816 -1.224068  1.470020
#            col3      col2      col1
# rank2       NaN       NaN       NaN
# rank1 -0.600842 -1.479110  0.130064
# rank4 -2.946241 -0.834336 -2.137846
# rank3 -2.301659  6.476294 -1.816078

3. Covariance

Covariance applies to Series data, and the Series object has a method, cov, that calculates the covariance between Series objects, and NA is automatically excluded.When applied to a DataFrame object, the covariance method calculates the covariance (cov) values between all columns.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])
    print(df)
    print(df['a'].cov(df['b']))
    print(df.cov())

# output:
#           a         b         c         d         e
# 0  1.168443 -0.343905  2.254448  0.269765 -0.928009
# 1  0.542551 -1.303205 -1.767313 -0.349884 -0.352578
# 2 -2.028410 -1.176339  0.156047  1.426468 -1.338805
# 0.48923631972868176
#           a         b         c         d         e
# a  2.870241  0.489236  0.713430 -1.312818  0.581441
# b  0.489236  0.271550  0.974811 -0.023849 -0.055862
# c  0.713430  0.974811  4.046193  0.580236 -0.558184
# d -1.312818 -0.023849  0.580236  0.812892 -0.430603
# e  0.581441 -0.055862 -0.558184 -0.430603  0.245420

4. Relevance

The correlation shows a linear relationship between any two values (Series).There are several ways to calculate correlation, such as Pearson (default), spearman, and kendall.If any non-numeric columns exist in the DataFrame, they are automatically excluded.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])
    print(df)
    print(df['a'].corr(df['b']))
    print(df.corr())

# output:
#           a         b         c         d         e
# 0 -2.110756  0.693665  0.405701 -0.628349 -1.062029
# 1 -1.331364  1.283434  1.619166 -0.025866  1.742287
# 2 -1.159944  0.435840 -0.251710 -0.347102 -0.026825
# 0.052396578025987336
#           a         b         c         d         e
# a  1.000000  0.052397 -0.000006  0.743940  0.664845
# b  0.052397  1.000000  0.998626  0.706309  0.780790
# c -0.000006  0.998626  1.000000  0.668242  0.746977
# d  0.743940  0.706309  0.668242  1.000000  0.993772
# e  0.664845  0.780790  0.746977  0.993772  1.000000

5. Data Ranking

Data rankings generate rankings for each element in the element array.In the case of relationships, the average rank is assigned.

# -*- coding=utf-8 -*-
import pandas as pd
import numpy as np

if __name__ == "__main__":
    s = pd.Series(np.random.randn(5), index=list('abcde'))
    print(s)
    s['a'] = s['c']
    print(s.rank())

# output:
# a    1.597684
# a    1.597684
# b    1.107413
# c   -0.298296
# d   -0.281076
# e   -0.667954
# dtype: float64
# a    2.5
# b    5.0
# c    2.5
# d    4.0
# e    1.0
# dtype: float64

rank uses an ascending parameter that defaults to True; when False, the data is sorted backwards and larger values are assigned a smaller sort.(

Tags: Python SQL Lambda

Posted on Wed, 04 Sep 2019 17:59:04 -0700 by ghostrider1