Pandas - Series and DataFrame

1. Comparison of NumPy and pandas

If you compare a Python list with a dictionary, NumPy is like the list: its elements are addressed by position only and carry no labels. pandas is like the dictionary: each element has a label. pandas is built on top of NumPy and makes NumPy-centric applications easier to write.
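The analogy above can be sketched in a few lines. This is only an illustration: the labels 'a', 'b', 'c' are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by position only, like a list.
arr = np.array([10, 20, 30])
print(arr[1])    # 20, the second element by position

# A pandas Series attaches a label to each value, like a dictionary.
# The labels 'a', 'b', 'c' are hypothetical, chosen for this illustration.
ser = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(ser['b'])  # 20, the same element looked up by label
```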

To use pandas, you first need to understand its two main data structures: Series and DataFrame.

2. Series

import pandas as pd
import numpy as np

# A Series is a one-dimensional, array-like object similar to a NumPy array.
# Besides the data it also holds an index, so you can think of it as an indexed array.
s = pd.Series([10, 20, 30, np.nan, 40, np.nan])  # np.nan marks a missing value (NaN)
print(s)

A Series is printed with the index on the left and the values on the right. Because we did not specify an index for the data, an integer index from 0 to N-1 (where N is the length of the data) is created automatically.
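As a small sketch of the difference between the default index and an explicit one, the example below builds the same data twice. The labels in `labels` are an assumption made for illustration and do not come from the original article.

```python
import numpy as np
import pandas as pd

values = [10, 20, 30, np.nan, 40, np.nan]

# No index given: pandas creates the integers 0 to N-1.
s_default = pd.Series(values)
print(s_default.index)          # RangeIndex(start=0, stop=6, step=1)

# Explicit index: hypothetical labels, one per value.
labels = ['a', 'b', 'c', 'd', 'e', 'f']
s_labeled = pd.Series(values, index=labels)
print(s_labeled['c'])           # 30.0
print(s_labeled.isna().sum())   # 2, the number of missing values
```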

3. DataFrame

import pandas as pd  
import numpy as np

dates = pd.date_range('20200314', periods=10)  # generate a range of 10 consecutive dates
df = pd.DataFrame(np.random.randn(10, 5), index=dates, columns=['A', 'B', 'C', 'D', 'E'])  # table of 10 rows and 5 columns of random numbers

print(df)


A DataFrame is a tabular data structure that contains an ordered set of columns, each of which can hold a different value type (numeric, string, Boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a big dictionary of Series that share the same row index.
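To make the "dictionary of Series" picture concrete, here is a minimal sketch. The names `price`, `in_stock` and `shop`, and the data in them, are invented for this example.

```python
import pandas as pd

# Two Series with the same (hypothetical) index.
price = pd.Series([1.5, 2.0, 0.5], index=['apple', 'pear', 'plum'])
in_stock = pd.Series([True, False, True], index=['apple', 'pear', 'plum'])

# Each dictionary key becomes a column name; the shared index becomes the row index.
shop = pd.DataFrame({'price': price, 'in_stock': in_stock})
print(shop)

# Looking up a key gives back a Series, just as with a dictionary of Series.
print(type(shop['price']).__name__)  # Series

# Each column keeps its own value type.
print(shop.dtypes)
```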

4. Simple use of DataFrame

import pandas as pd  
import numpy as np

dates = pd.date_range('20200314', periods=10)  # generate a range of 10 consecutive dates
df = pd.DataFrame(np.random.randn(10, 5), index=dates, columns=['A', 'B', 'C', 'D', 'E'])  # table of 10 rows and 5 columns of random numbers

print(df['A'])    # select column 'A' by its label; the result is a Series
print(df['E'])
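Column selection by label can be extended slightly, as sketched below. To keep the result predictable this sketch fills the table with `np.arange` instead of random numbers, which is a change from the example above. Selecting several columns with a list and selecting a row with `.loc` are standard pandas operations, shown here only as an illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20200314', periods=10)
# Deterministic values 0..49 instead of random numbers (an assumption for this sketch).
df = pd.DataFrame(np.arange(50).reshape(10, 5), index=dates, columns=list('ABCDE'))

col_a = df['A']             # one label -> a Series
cols_ae = df[['A', 'E']]    # a list of labels -> a smaller DataFrame
row = df.loc[pd.Timestamp('2020-03-15')]  # one row label -> a Series indexed by column name

print(cols_ae.head(3))
print(row)
```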


Next, we create a DataFrame named data_frame without specifying row or column labels:

data_frame = pd.DataFrame(np.arange(20).reshape((4,5)))
print(data_frame)


In this case both the rows and the columns get the default 0-based integer index. A DataFrame can also be built from a dictionary, as with data_frame2:

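data_frame2 = pd.DataFrame({'A' : 1,                               # scalar, repeated in every row
                            'B' : pd.Timestamp('20200314'),    # timestamp, repeated in every row
                            'C' : pd.Series(1, index = list(range(4)), dtype = 'float32'),  # Series of four 1.0 values with index 0 to 3
                            'D' : np.array([3] * 4, dtype = 'int32'),  # NumPy array of four 3s
                            'E' : pd.Categorical(["test","train","test","train"]),  # categorical data
                            'F' : 'foo'                            # string, repeated in every row
})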
print(data_frame2)


With this approach each dictionary key becomes a column, each column can hold its own type of data, and scalar values are repeated to fill every row. To see the type of each column, use the dtypes attribute:

print(data_frame2.dtypes)


To see the row labels (the index):

print(data_frame2.index)


Similarly, to see the column names:

print(data_frame2.columns)


To see only the values of data_frame2, without the row and column labels:

print(data_frame2.values)
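One detail worth knowing is what kind of array `values` returns when the columns have different types. The sketch below uses a reduced, hypothetical frame called `mixed` with just two of the columns; `to_numpy()` is the method the pandas documentation recommends as the equivalent of `values`.

```python
import numpy as np
import pandas as pd

# Reduced stand-in for data_frame2: one integer column and one string column.
mixed = pd.DataFrame({'D': np.array([3] * 4, dtype='int32'), 'F': 'foo'})

arr = mixed.values
print(type(arr).__name__)   # ndarray
print(arr.dtype)            # object, because no single numeric type fits both columns

# Selecting only the numeric column keeps its numeric dtype.
numeric = mixed[['D']].to_numpy()
print(numeric.dtype)        # int32
```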


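To get a statistical summary of the data, use describe(). By default it summarizes the numeric columns, which is why only A, C and D appear in the output below (recent pandas versions may also include the datetime column B):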

print(data_frame2.describe())

#output
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
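If the non-numeric columns should be summarized too, `describe()` accepts an `include` argument. The following is a small sketch on a hypothetical two-column frame named `frame`, not on data_frame2 itself.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column.
frame = pd.DataFrame({'D': [3, 3, 3, 3], 'F': ['foo'] * 4})

# Default: only the numeric column is summarized.
numeric_only = frame.describe()
print(numeric_only)

# include='all' also reports count, unique, top and freq for non-numeric columns.
everything = frame.describe(include='all')
print(everything)
```

Statistics that do not apply to a column (for example the mean of a string column) appear as NaN in the combined table.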

To transpose the data, swapping rows and columns, use the T attribute:

print(data_frame2.T)

#output
                     0                    1                    2  \
A                    1                    1                    1   
B  2020-03-14 00:00:00  2020-03-14 00:00:00  2020-03-14 00:00:00   
C                    1                    1                    1   
D                    3                    3                    3   
E                 test                train                 test   
F                  foo                  foo                  foo   

                     3  
A                    1  
B  2020-03-14 00:00:00  
C                    1  
D                    3  
E                train  
F                  foo  

To sort by index labels, use sort_index(). Here axis = 1 sorts the column labels, and ascending = False puts them in descending order:

print(data_frame2.sort_index(axis = 1, ascending = False))

#output
     F      E  D    C          B  A
0  foo   test  3  1.0 2020-03-14  1
1  foo  train  3  1.0 2020-03-14  1
2  foo   test  3  1.0 2020-03-14  1
3  foo  train  3  1.0 2020-03-14  1

To sort the rows by the values in a column, use sort_values():

print(data_frame2.sort_values(by = 'E'))

#output
   A          B    C  D      E    F
0  1 2020-03-14  1.0  3   test  foo
2  1 2020-03-14  1.0  3   test  foo
1  1 2020-03-14  1.0  3  train  foo
3  1 2020-03-14  1.0  3  train  foo
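sort_values() can also take several columns and a sort direction for each one. The sketch below uses a hypothetical frame `scores` with invented numbers, since every row of data_frame2 has the same values outside column E.

```python
import pandas as pd

# Hypothetical data for illustration.
scores = pd.DataFrame({
    'split': ['test', 'train', 'test', 'train'],
    'score': [0.7, 0.9, 0.8, 0.6],
})

# Sort by 'split' ascending, then by 'score' descending within each split.
ordered = scores.sort_values(by=['split', 'score'], ascending=[True, False])
print(ordered)

# sort_values returns a new DataFrame; the original keeps its row order.
print(scores.index.tolist())   # [0, 1, 2, 3]
```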

Tags: Python

Posted on Sat, 14 Mar 2020 20:10:17 -0700 by XeroXer