IST 718 | LAB 3

01-23-20

V1 DATA SOURCE: Coaches9

O: OBTAIN

In [16]:
import pandas as pd
data = pd.read_csv("Coaches9.csv")
In [17]:
data.head()
Out[17]:
School Conference Coach SchoolPay TotalPay Bonus BonusPaid AssistantPay Buyout
0 Air Force Mt. West Troy Calhoun 885000 885000 247000 -- $0 --
1 Akron MAC Terry Bowden $411,000 $412,500 $225,000 $50,000 $0 $688,500
2 Alabama SEC Nick Saban $8,307,000 $8,307,000 $1,100,000 $500,000 $0 $33,600,000
3 Alabama at Birmingham C-USA Bill Clark $900,000 $900,000 $950,000 $165,471 $0 $3,847,500
4 Appalachian State Sun Belt Scott Satterfield $712,500 $712,500 $295,000 $145,000 $0 $2,160,417

S: SCRUB

  1. Change the columns with dollar amounts as strings to floats
  2. Inspect and possibly replace missing values
In [18]:
attrs = list(data[data.columns[3:]].columns)
data[attrs] = data[attrs].replace({'\$':'', ',': '', '--':None}, regex=True).astype('float')
In [19]:
data.dtypes
Out[19]:
School           object
Conference       object
Coach            object
SchoolPay       float64
TotalPay        float64
Bonus           float64
BonusPaid       float64
AssistantPay    float64
Buyout          float64
dtype: object
In [20]:
data.isnull().sum()
Out[20]:
School           0
Conference       0
Coach            0
SchoolPay        4
TotalPay         4
Bonus           22
BonusPaid       41
AssistantPay     0
Buyout          22
dtype: int64

Replacing Missing Values

Four options:

  1. Drop the schools with missing values
  2. Drop the attributes with missing values (e.g. 'BonusPaid')
  3. Replace the missing values with some value (e.g. mean, 0)
  4. Find additional data

For V1, we are going to

  1. Drop the schools without SchoolPay
  2. Drop the attribute BonusPaid
  3. Fill Bonus & Buyout with the attribute mean*

For V2, we are going to use #4 and a more advanced version of #3* taking within conference mean

In [22]:
data.dropna(subset=['SchoolPay'], inplace=True)
In [23]:
data.drop('BonusPaid', axis=1, inplace=True)
In [24]:
median_bonus = data['Bonus'].median()
median_buyout = data['Buyout'].median()
data["Bonus"].fillna(median_bonus, inplace = True)
data["Buyout"].fillna(median_buyout, inplace = True)
In [25]:
data.isnull().sum()
Out[25]:
School          0
Conference      0
Coach           0
SchoolPay       0
TotalPay        0
Bonus           0
AssistantPay    0
Buyout          0
dtype: int64
In [ ]: