Python data analysis - Chapter 3 Introduction to Pandas

Pandas is an open source, BSD-licensed Python library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is used in a wide range of academic and business fields, including finance, economics, statistics, and analytics. In this chapter we learn to use pandas for some simple data analysis work.
The getting_started guide on the official Pandas website is a good place for an initial introduction. For deeper study, you can follow up with a book dedicated to pandas.

1. Installation

Generally, you can install pandas with conda or pip. Here we use the pip command inside an environment created with miniconda. If the download is slow, you can use the -i option to point pip at a domestic mirror (here, the Tsinghua University PyPI mirror):

pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple

Installation output:

(python36) xxxx@master:~$ pip install pandas
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting pandas
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/08/ec/b5dd8cfb078380fb5ae9325771146bccd4e8cad2d3e4c72c7433010684eb/pandas-1.0.1-cp36-cp36m-manylinux1_x86_64.whl (10.1 MB)
     |████████████████████████████████| 10.1 MB 194 kB/s 
Requirement already satisfied: pytz>=2017.2 in ./miniconda3/envs/python36/lib/python3.6/site-packages (from pandas) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in ./.local/lib/python3.6/site-packages (from pandas) (2.8.1)
Collecting numpy>=1.13.3
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/62/20/4d43e141b5bc426ba38274933ef8e76e85c7adea2c321ecf9ebf7421cedf/numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl (20.1 MB)
     |████████████████████████████████| 20.1 MB 61 kB/s 
Requirement already satisfied: six>=1.5 in ./.local/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas) (1.13.0)
Installing collected packages: numpy, pandas
Successfully installed numpy-1.18.1 pandas-1.0.1
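
A quick way to confirm that the installation worked is to import the library and print its version. This is a minimal sketch; the versions printed depend on your environment (the log above installed pandas 1.0.1 and numpy 1.18.1).

```python
import numpy as np
import pandas as pd

# Print the installed versions; the exact numbers depend on your environment.
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```

If the import fails, the package was most likely installed into a different environment than the one the interpreter is running in.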

2. Pandas data structures

Pandas data structures are built on top of NumPy arrays. The main ones are:

  • Series
  • DataFrame
  • Panel (deprecated in pandas 0.20 and removed in 0.25)

2.1. Series

A Series is a one-dimensional array-like object. It consists of a sequence of values (of any NumPy data type) and an associated set of data labels, called its index. The simplest Series is created from just a sequence of values:

from pandas import Series, DataFrame
import pandas as pd
obj = Series([4, 7, -5, 3])
print(obj)
Operation result
0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series shows the index on the left and the values on the right. Since we did not specify an index for the data, a default integer index from 0 to N-1 (where N is the length of the data) is created automatically. You can get the array representation and the index object through the values and index attributes of the Series:

obj.values
Operation result
array([ 4,  7, -5,  3])
obj.index
Operation result
RangeIndex(start=0, stop=4, step=1)
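
The two attributes can also be checked programmatically. The sketch below confirms that the default index runs from 0 to N-1 and that values is a NumPy array; the variable names n, default_labels, and is_ndarray are introduced only for illustration.

```python
import numpy as np
from pandas import Series

obj = Series([4, 7, -5, 3])

# The default index runs from 0 to N-1, where N is the number of values.
n = len(obj)
default_labels = list(obj.index)

# .values exposes the underlying data as a NumPy array.
is_ndarray = isinstance(obj.values, np.ndarray)

print(default_labels)
print(is_ndarray)
```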

Often you will want to create a Series whose index labels identify each data point:

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
Operation result
d    4
b    7
a   -5
c    3
dtype: int64

Both the Series object and its index have a name attribute, which other key areas of pandas functionality make use of. The index of a Series can also be replaced in place by assignment:

obj2.name = 'obj2Name'
obj2.index = ['one', 'two', 'three', 'four']
obj2.index.name = 'obj2IndexName'
obj2
Operation result
obj2IndexName
one      4
two      7
three   -5
four     3
Name: obj2Name, dtype: int64

A single value or a set of values in a Series can be selected by index label:

obj2['one']
Operation result
4
obj2[['two', 'one']]
Operation result
obj2IndexName
two    7
one    4
Name: obj2Name, dtype: int64
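
Label-based selection can also be written with .loc, and it combines naturally with boolean filtering and membership tests. The following is a minimal sketch, not part of the original walkthrough; the variable names single, subset, positive, and has_two are introduced only for illustration.

```python
from pandas import Series

obj2 = Series([4, 7, -5, 3], index=['one', 'two', 'three', 'four'])

# .loc selects explicitly by label, which avoids ambiguity with positions.
single = obj2.loc['one']
subset = obj2.loc[['two', 'one']]

# A boolean mask keeps only the entries that satisfy a condition;
# the index labels travel with the values.
positive = obj2[obj2 > 0]

# Membership tests check the index labels, as with dictionary keys.
has_two = 'two' in obj2

print(single)
print(subset)
print(positive)
print(has_two)
```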

You can also create a Series directly from a dictionary. The dictionary keys become the index labels:

sdata = {'Ohio': 35000, 'Texas': 72000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
Operation result
Ohio      35000
Texas     72000
Oregon    16000
Utah       5000
dtype: int64
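
When a dictionary is passed together with an explicit index, pandas matches the keys against the requested labels. The sketch below illustrates this; the states list and the 'California' label are hypothetical additions that do not appear in the original example.

```python
import pandas as pd
from pandas import Series

sdata = {'Ohio': 35000, 'Texas': 72000, 'Oregon': 16000, 'Utah': 5000}

# 'California' is a hypothetical label that has no entry in sdata.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)

# Labels missing from the dictionary get NaN; keys that are not
# requested ('Utah') are dropped from the result.
print(obj4)
print(pd.isnull(obj4))
```

Because NaN is a floating-point value, the resulting Series holds floats rather than integers.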

2.2. DataFrame

A DataFrame is a tabular data structure that contains an ordered collection of columns, each of which can hold a different value type (numeric, string, Boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series that all share the same index. Compared with other similar data structures (such as R's data.frame), row-oriented and column-oriented operations in a DataFrame are treated roughly symmetrically. Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dictionary, or other collection of one-dimensional arrays.

Although a DataFrame stores data in a two-dimensional format, you can easily use it to represent higher-dimensional data in tabular form by means of hierarchical indexing, which is a key ingredient of many of the more advanced data-handling features in pandas.
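
As a rough illustration of hierarchical indexing, the sketch below keys a handful of illustrative population figures by (state, year) and then pivots the inner level into columns. The names index, pop, ohio, and table are chosen only for this example.

```python
import pandas as pd

# Illustrative data: population keyed by (state, year).
index = pd.MultiIndex.from_tuples(
    [('Ohio', 2000), ('Ohio', 2001), ('Nevada', 2001), ('Nevada', 2002)],
    names=['state', 'year'],
)
pop = pd.Series([1.5, 1.7, 2.4, 2.9], index=index)

# Selecting on the outer level returns the inner level for that state.
ohio = pop.loc['Ohio']

# unstack() pivots the inner level into columns, giving a 2-D table;
# (state, year) combinations with no data become NaN.
table = pop.unstack()

print(ohio)
print(table)
```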

There are many ways to build a DataFrame. The most common is to pass in a dictionary of equal-length lists or NumPy arrays. The resulting DataFrame gets an automatically assigned index (as with Series). On Python 3.6+ with pandas 0.23 or later, the columns follow the dictionary's insertion order; older versions sorted the columns instead:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
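
The structure of the resulting DataFrame can be inspected through a few attributes. This is a small sketch that rebuilds the same frame so that it runs on its own; the column order shown assumes a pandas version that preserves dictionary insertion order.

```python
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

print(frame.shape)           # (number of rows, number of columns)
print(list(frame.columns))   # column labels
print(list(frame.index))     # automatically assigned row labels
print(frame.dtypes)          # one dtype per column
```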

If a column sequence is specified, the columns of the DataFrame are arranged in the specified order:

DataFrame(data, columns=['year', 'state', 'pop'])
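
If the column list names a column that is not present in the dictionary, pandas still creates it and fills it with missing values. The sketch below shows this; the 'debt' column and the name frame2 are hypothetical and used only for illustration.

```python
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is a hypothetical column that does not exist in data.
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

# The columns follow the requested order, and 'debt' holds only NaN.
print(frame2)
```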

A column of a DataFrame can be retrieved as a Series either with dictionary-like notation (frame['state']) or as an attribute (frame.state). The returned Series has the same index as the original DataFrame, and its name attribute is set to the column name:

frame.state
Operation result
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
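
Attribute access only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute or method. In the example data, the 'pop' column collides with the DataFrame.pop method, so it has to be retrieved with bracket notation. A minimal sketch, with variable names chosen for illustration:

```python
from pandas import DataFrame, Series

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

# Both notations return the same Series for the 'state' column.
by_key = frame['state']
by_attr = frame.state
same = by_key.equals(by_attr)

# 'pop' collides with the DataFrame.pop method, so attribute access
# returns the method, not the column. Bracket notation is unambiguous.
pop_is_method = callable(frame.pop)
pop_column = frame['pop']

print(same)
print(pop_is_method)
print(pop_column.name)
```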

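2.3. Panel

Panel was a container for three-dimensional data and was rarely used. It was deprecated in pandas 0.20 and removed in pandas 0.25, so it is not available in the pandas 1.0.1 installed above. A DataFrame with a hierarchical index covers the same use cases.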
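
Since Panel is not available in pandas 0.25 and later, three-dimensional data is commonly represented as a DataFrame with a hierarchical index. The following is one possible sketch of that approach, not a drop-in replacement for Panel; the tables item_a and item_b and the level names 'item' and 'row' are hypothetical.

```python
import pandas as pd

# Hypothetical data: two small tables that would have been two
# "items" of a three-dimensional container.
item_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
item_b = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})

# Concatenating a dictionary of frames stacks them under an outer
# index level whose labels are the dictionary keys.
stacked = pd.concat({'a': item_a, 'b': item_b}, names=['item', 'row'])

# Select one "item" by its outer label.
only_b = stacked.loc['b']

print(stacked)
print(only_b)
```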


Posted on Tue, 24 Mar 2020 08:44:48 -0700 by j007ha