Data analysis case -- data analysis of 2012 US presidential campaign sponsorship

An analysis of the data of US presidential campaign sponsorship

This article is from Alibaba cloud Tianchi laboratory, Original address of case
For reference, Xiao Wang, a self-taught data analysis student, wrote and analyzed the data by himself. The codes and results are as follows
1. Import related python data analysis library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

2. Data loading and overview
2.1 because the single data is too large, we divide the data source into three 0-50w lines, 50-100W lines and 100W + lines

#data fetch
data_01=pd.read_csv(r'H:\Ali cloud\2012 An analysis of the data of US presidential campaign sponsorship\data_01.csv')
data_02=pd.read_csv(r'H:\Ali cloud\2012 An analysis of the data of US presidential campaign sponsorship\data_02.csv')
data_03=pd.read_csv(r'H:\Ali cloud\2012 An analysis of the data of US presidential campaign sponsorship\data_03.csv')

2.2 data display:
Read the first five lines of data

Read the first five lines of data
Read the first five lines of data
2.3 consolidated data (data ABCD 01, data ABCD 02, data ABCD 03)

data=pd.concat([data_01,data_02,data_03])
data.head()

2.4 view data information, including the name, non empty quantity and data type of each field

data.info()


We can see that the number of fields in the two columns of contbr ﹣ employee and contbr ﹣ occupation is a little less, indicating that there are empty values in them
2.5 view the summary of data

data.describe()


2.6 treatment of missing value
From data.info(), we can see that there are a small number of missing values in contbr? Employee and contbr? Occupation, which we fill with not provided

data['contbr_employer'].fillna('not provided',inplace=True)
data['contbr_occupation'].fillna('not provided',inplace=True)

2.7 view of missing values

data[data['contbr_employer'].isnull()]
data[data['contbr_occupation'].isnull()]
data.info()


As you can see, there is no missing value
2.8 check who are the presidential candidates in the data

print('Share{}Candidates, respectively'.format(len(data['cand_nm'].unique())))
data['cand_nm'].unique()


2.9 through search engine and other ways, obtain the Party of each presidential candidate, establish dictionary parties, with candidate name as key and party as corresponding value

parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}

2.10 add a column of parties to store party information through map mapping function

data['party']=data['cand_nm'].map(parties)#map mapping
#Look at the two parties
data['party'].value_counts()

#It can be seen that the total amount of sponsorship received by the Republican Party is higher, and the number of sponsorship received by Democrat is higher

2.11 check the data sheet

Meaning of each field name
Can? Nm – name of the candidate receiving the donation
contbr_nm - donor name
contbr_st – donor's state
Contbr? Employer – donor's company
contbr_occupation
Contb ﹐ receive ﹐ AMT – donation amount (USD)
Contb · receive · DT – date of receipt of donation

2.12 sorting: sort the total sponsorship amount according to the occupation summary, summarize it according to the position, calculate the total sponsorship amount, display the top 20 items, and find that many professions are the same, just different expressions, such as C.E.O. and CEO, are all one occupation

data.groupby('contbr_occupation')['contb_receipt_amt'].sum().sort_values(ascending=False)[:20]


2.13 data conversion using functions: analysis of occupation and employer information
Many occupations involve the same basic type of work. Let's clean up the data (dict.get is cleverly used here to allow non mapping occupations to "pass")

#Set up a profession corresponding dictionary, map different expressions of the same profession to corresponding professions, for example, map C.E.O. to CEO
occupation_map = {
  'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
  'INFORMATION REQUESTED':'NOT PROVIDED',
  'SELF' : 'SELF-EMPLOYED',
  'SELF EMPLOYED' : 'SELF-EMPLOYED',
  'C.E.O.':'CEO',
  'LAWYER':'ATTORNEY',
}

# If not in dictionary, returns x
f = lambda x: occupation_map.get(x, x)
data.contbr_occupation = data.contbr_occupation.map(f)
#Data.contbr occupation is equivalent to the x in the above statement
#Data.contbr ﹣ the result returned by the location is mapped with the key in the location ﹣ map. If the same value is returned, the corresponding value is returned. If the different values are returned, the default value is returned
#contbr_occupation - donor occupation
data.contbr_occupation.head()


Similarly, similar transformation of employer information

emp_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'SELF' : 'SELF-EMPLOYED',
   'SELF EMPLOYED' : 'SELF-EMPLOYED',
}

# If no mapping provided, return x
f = lambda x: emp_mapping.get(x, x)
data.contbr_employer = data.contbr_employer.map(f)

Let's demonstrate the use of dict.get()

dict_data ={1:'one',2:'two',3:'three',4:'four'}
print(dict_data.get(1))
print(dict_data.get(3))
print(dict_data.get(5))
print(dict_data.get(5,'notfound'))

Output result

 # get method of dictionary
        # For example: list.get(k,d), where get is equivalent to an if...else... Statement, if the parameter k is in the dictionary, the dictionary will return list[k]; if the parameter k is not in the dictionary, then it will return the parameter D, if the parameter k is in the dictionary, then it will return the value value corresponding to K
        # l = {5:2,3:4}
        # The value returned by print l.get(3,0) is 4;
        # The return value of Print l.get (1,0) is 0

3.1 data filtering
Sponsorship includes refunds (negative contributions). To simplify the analysis process, we limit the dataset to only positive contributions

data = data[data['contb_receipt_amt']>0]


3.2 check the total amount of sponsorship received by each candidate, contb ﹣ receive ﹣ AMT – donation amount (USD), and can ﹣ nm – name of the candidate receiving the donation

data.groupby('cand_nm')['contb_receipt_amt'].sum().sort_values(ascending=False)


As can be seen from the above, sponsorship is mainly between Obama and Romney. In order to better focus on the competition between the two, we select the data subsets of these two candidates for further analysis
3.3 select a subset of data whose candidates are Obama and Romney

data_vs = data[data['cand_nm'].isin(['Obama, Barack','Romney, Mitt'])].copy()
#data['cand_nm'].isin(['Obama, Barack','Romney, Mitt']) returns a bool type value
# data[bool] returns specific data, all of which are true
#data.copy() copies the returned data
data_vs.head()


3.4 discrete data, discrete data with cut

bins = np.array([0,1,10,100,1000,10000,100000,1000000,10000000])
labels = pd.cut(data_vs['contb_receipt_amt'],bins)
labels.head()


3.6 summarize the sponsorship amount according to the party and occupation, similar to the pivot table operation in excel, and the aggregate function is sum

by_occupation = data.pivot_table('contb_receipt_amt',index='contbr_occupation',columns='party',aggfunc='sum')
#Filter out data with sponsorship amount less than 200W

over_2mm = by_occupation[by_occupation.sum(1)>2000000]
over_2mm


3.7 drawing

over_2mm.plot(kind='bar')


3.8 grouping based on occupation and employer information
Let's take a look at the occupations and employers with the highest total contributions to Obama and Romney. Note that dict.get is cleverly used here to allow careers without mapping to "pass"

def get_top_amounts(group,key,n=5):
#Pass in the object after groupby grouping, and return the data before sorting summarized by key field
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.sort_values(ascending=False)[:n]
  
grouped = data_vs.groupby('cand_nm')
grouped.apply(get_top_amounts,'contbr_occupation',n=7)


3.9 similarly, use get top amounts() to analyze the employer

grouped.apply(get_top_amounts,'contbr_employer',n=10)


3.10 labels is the Series after the amount of sponsorship is discretized

grouped_bins = data_vs.groupby(['cand_nm',labels])
grouped_bins.size().unstack(0)


3.11 next, we will count the sponsorship amount of each interval

bucket_sums=grouped_bins['contb_receipt_amt'].sum().unstack(0)
bucket_sums

Insert a small experiment

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
data11=DataFrame(np.arange(6).reshape((2,3)),index=pd.Index(['street1','street2']),columns=pd.Index(['one','two','three']))
print(data11)
print('-----------------------------------------\n')
data2=data11.stack()
data3=data2.unstack()
print(data2)
print('-----------------------------------------\n')
print(data3)

Result:
Use the stack function to convert the row index of data ['one', 'two', 'three'] into column index (the second layer), and then you get a hierarchical Series (data2). Use the unstack function to convert the column index of the second layer of data2 into row index (the default can be changed), and then you get DataFrame (data3)
This experiment mainly demonstrates the stack function and unstack function

3.12 total sponsorship amount of Obama and Romney

bucket_sums.plot(kind='bar')


3.13 calculate the proportion of the total amount of sponsorship received by two candidates in each interval

normed_sums = bucket_sums.div(bucket_sums.sum(axis=1),axis=0)
normed_sums


3.14 using histogram and specifying stacked=True to stack, the percentage stacked graph can be completed
It can be seen that in terms of micro sponsorship, Obama received much more amount and quantity than Romney
3.15 count the top 20 sponsors by name

data.groupby('contbr_nm')['contbr_nm'].count().sort_values(ascending=False)[:20]


4.1 STR to datatime
We can use the to "datetime method to parse a variety of different date representations. The parsing of standard date formats (such as ISO8601) is very fast. We can also specify a specific date resolution format, such as PD. To "datetime (series, format = '% Y%m%d')

data_vs['time'] = pd.to_datetime(data_vs['contb_receipt_dt'])
data_vs['time'].head()


4.2 time as index

data_vs.set_index('time',inplace=True)
data_vs.head()


4.3 resampling and frequency conversion

Resampling refers to the process of changing the frequency of time series into another frequency. Changing high-frequency data to low-frequency data is called downsampling. Example groups the data and then calls the aggregate function. Here we convert the frequency from daily to monthly, which belongs to the sampling of high-frequency to low-frequency.

vs_time = data_vs.groupby('cand_nm').resample('M')['cand_nm'].count()
vs_time.unstack(0)


4.4 by comparing the number of sponsorship received by the two presidential candidates from April, 2011 to April, 2012, we can see that the closer to the election, the higher the enthusiasm for sponsorship, and Obama has an absolute advantage in all periods

fig1, ax1=plt.subplots(figsize=(32,8))
vs_time.unstack(0).plot(kind='area',ax=ax1,alpha=0.6)
plt.show()

Published 2 original articles, won praise 1, visited 32
Private letter follow

Tags: Lambda less Python Excel

Posted on Thu, 12 Mar 2020 05:46:17 -0700 by sam2004