IST 718 | LAB 3

01-23-20

V1 DATA SOURCE: Coaches9

O: OBTAIN

In [32]:
import pandas as pd
data = pd.read_csv("Coaches9.csv")
In [33]:
data.head()
Out[33]:
School Conference Coach SchoolPay TotalPay Bonus BonusPaid AssistantPay Buyout
0 Air Force Mt. West Troy Calhoun 885000 885000 247000 -- $0 --
1 Akron MAC Terry Bowden $411,000 $412,500 $225,000 $50,000 $0 $688,500
2 Alabama SEC Nick Saban $8,307,000 $8,307,000 $1,100,000 $500,000 $0 $33,600,000
3 Alabama at Birmingham C-USA Bill Clark $900,000 $900,000 $950,000 $165,471 $0 $3,847,500
4 Appalachian State Sun Belt Scott Satterfield $712,500 $712,500 $295,000 $145,000 $0 $2,160,417

S: SCRUB

  1. Convert the dollar-amount columns from strings to floats
  2. Inspect and possibly replace missing values
In [34]:
attrs = list(data[data.columns[3:]].columns)
data[attrs] = data[attrs].replace({r'\$': '', ',': '', '--': None}, regex=True).astype('float')
In [35]:
data.dtypes
Out[35]:
School           object
Conference       object
Coach            object
SchoolPay       float64
TotalPay        float64
Bonus           float64
BonusPaid       float64
AssistantPay    float64
Buyout          float64
dtype: object
In [36]:
data.isnull().sum()
Out[36]:
School           0
Conference       0
Coach            0
SchoolPay        4
TotalPay         4
Bonus           22
BonusPaid       41
AssistantPay     0
Buyout          22
dtype: int64

Replacing Missing Values

Four options:

  1. Drop the schools with missing values
  2. Drop the attributes with missing values (e.g. 'BonusPaid')
  3. Replace the missing values with some value (e.g. mean, 0)
  4. Find additional data
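Each of the four options maps to a short pandas idiom; a minimal sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'School': ['A', 'B', 'C'],
    'SchoolPay': [900000.0, np.nan, 700000.0],
    'BonusPaid': [np.nan, 50000.0, np.nan],
})

# Option 1: drop rows (schools) with missing values in a key column
opt1 = toy.dropna(subset=['SchoolPay'])

# Option 2: drop an attribute that is mostly missing
opt2 = toy.drop('BonusPaid', axis=1)

# Option 3: replace missing values with some statistic (median here)
opt3 = toy.fillna({'SchoolPay': toy['SchoolPay'].median()})

# Option 4 (find additional data) is a merge -- see the M step below.
```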

For V1, we are going to

  1. Drop the schools without SchoolPay
  2. Drop the attribute BonusPaid
  3. Fill Bonus & Buyout with the attribute median*

For V2, we are going to use #4 and a more advanced version of #3*, taking the within-conference median
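The within-conference fill planned for V2 can be sketched with a groupby-transform (toy values for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Conference': ['SEC', 'SEC', 'MAC', 'MAC'],
    'Bonus': [1100000.0, np.nan, 225000.0, np.nan],
})

# Fill each missing Bonus with the median of its own conference,
# rather than the column-wide median
df['Bonus'] = df['Bonus'].fillna(
    df.groupby('Conference')['Bonus'].transform('median'))
```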

In [37]:
data.dropna(subset=['SchoolPay'], inplace=True)
In [38]:
data.drop('BonusPaid', axis=1, inplace=True)
In [39]:
median_bonus = data['Bonus'].median()
median_buyout = data['Buyout'].median()
data["Bonus"].fillna(median_bonus, inplace = True)
data["Buyout"].fillna(median_buyout, inplace = True)
In [40]:
data.isnull().sum()
Out[40]:
School          0
Conference      0
Coach           0
SchoolPay       0
TotalPay        0
Bonus           0
AssistantPay    0
Buyout          0
dtype: int64

E: EXPLORE

In [41]:
%matplotlib inline
import matplotlib.pyplot as plt
In [42]:
from pandas.plotting import scatter_matrix
scatter_matrix(data, figsize=(12,8))
plt.show()

From the scatter matrix, it's clear we need to:

  1. Drop either SchoolPay or TotalPay
  2. Remove (or supplement) AssistantPay
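A quick numeric check behind both decisions: SchoolPay and TotalPay are almost perfectly correlated, and AssistantPay has essentially no variance. A sketch using the values visible in the head() above:

```python
import pandas as pd

df = pd.DataFrame({
    'SchoolPay': [885000.0, 411000.0, 8307000.0],
    'TotalPay':  [885000.0, 412500.0, 8307000.0],
    'AssistantPay': [0.0, 0.0, 0.0],
})

# Near-perfect correlation means one of the pair is redundant
corr = df['SchoolPay'].corr(df['TotalPay'])

# A column with (almost) a single value carries no signal
n_unique = df['AssistantPay'].nunique()
```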
In [43]:
data.drop(['TotalPay','AssistantPay'], axis=1, inplace=True) 
scatter_matrix(data, figsize=(12,8))
plt.show()
In [44]:
data.head()
Out[44]:
School Conference Coach SchoolPay Bonus Buyout
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0
3 Alabama at Birmingham C-USA Bill Clark 900000.0 950000.0 3847500.0
4 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0

Add More Data

We've lost a lot of data at this point and we need to add more. To do so we:

  1. O: Obtain data
  2. S: Scrub data
  3. M: MERGE data! Add a "key" column to both tables for merging

This sub-OSM pass occurred in an attached auxiliary file

In [52]:
data['school_key'] = data.apply(lambda x: ''.join(x['School'].lower().split(' '))[:8], axis=1)
In [53]:
data
Out[53]:
School Conference Coach SchoolPay Bonus Buyout school_key
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0 airforce
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0 akron
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0 alabama
3 Alabama at Birmingham C-USA Bill Clark 900000.0 950000.0 3847500.0 alabamaa
4 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0 appalach
... ... ... ... ... ... ... ...
124 West Virginia Big 12 Dana Holgorsen 3605000.0 1000000.0 7150000.0 westvirg
125 Western Kentucky C-USA Mike Sanford Jr. 800000.0 400000.0 1200000.0 westernk
126 Western Michigan MAC Tim Lester 800000.0 346500.0 800000.0 westernm
127 Wisconsin Big Ten Paul Chryst 3750000.0 775000.0 6000000.0 wisconsi
128 Wyoming Mt. West Craig Bohl 1412000.0 450000.0 8016667.0 wyoming

125 rows × 7 columns

In [55]:
winloss = pd.read_csv('winloss_withkey.csv')
In [56]:
test = data.copy()
In [59]:
data_wl = test.merge(winloss, left_on="school_key", right_on="team_key")
In [60]:
data_wl
Out[60]:
School Conference Coach SchoolPay Bonus Buyout school_key Rank Team W L T Pct team_key
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0 airforce 10 Air Force (Mountain West) 11 2 0 0.846 airforce
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0 akron 130 Akron (MAC) 0 12 0 0.000 akron
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0 alabama 10 Alabama (SEC) 11 2 0 0.846 alabama
3 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0 appalach 3 Appalachian State (Sun Belt) 13 1 0 0.929 appalach
4 Arizona Pac-12 Kevin Sumlin 1600000.0 2025000.0 10000000.0 arizona 95 Arizona (Pac-12) 4 8 0 0.333 arizona
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
125 West Virginia Big 12 Dana Holgorsen 3605000.0 1000000.0 7150000.0 westvirg 81 West Virginia (Big 12) 5 7 0 0.417 westvirg
126 Western Kentucky C-USA Mike Sanford Jr. 800000.0 400000.0 1200000.0 westernk 28 Western Ky. (C-USA) 9 4 0 0.692 westernk
127 Western Michigan MAC Tim Lester 800000.0 346500.0 800000.0 westernm 57 Western Mich. (MAC) 7 6 0 0.538 westernm
128 Wisconsin Big Ten Paul Chryst 3750000.0 775000.0 6000000.0 wisconsi 27 Wisconsin (Big Ten) 10 4 0 0.714 wisconsi
129 Wyoming Mt. West Craig Bohl 1412000.0 450000.0 8016667.0 wyoming 34 Wyoming (Mountain West) 8 5 0 0.615 wyoming

130 rows × 14 columns
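Note the merge returned 130 rows from a 125-row left table, which usually means duplicate keys on one side (the truncated 8-character key can collide for similarly named schools). A sketch of how pandas can surface this, on toy frames with a deliberately duplicated key:

```python
import pandas as pd

# 'miami' appears twice on the right, as a truncated key might for
# two similarly named schools
left = pd.DataFrame({'school_key': ['miami', 'akron'], 'pay': [1.0, 2.0]})
right = pd.DataFrame({'team_key': ['miami', 'miami', 'akron'],
                      'wins': [6, 9, 0]})

# The plain inner merge silently duplicates the 'miami' row: 3 rows from 2
merged = left.merge(right, left_on='school_key', right_on='team_key')

# validate= raises instead of silently duplicating
try:
    left.merge(right, left_on='school_key', right_on='team_key',
               validate='one_to_one')
    ok = True
except pd.errors.MergeError:
    ok = False  # duplicate 'miami' on the right side triggers this
```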

In [66]:
data_for_m = data_wl.drop(['school_key','Team','team_key'], axis=1)
In [67]:
columns = ['school','conf','coach','school_pay', 'bonus', 'buyout', 'rank', 'W','L', 'T', 'ratio']
data_for_m.columns = columns
data_for_m
Out[67]:
school conf coach school_pay bonus buyout rank W L T ratio
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0 10 11 2 0 0.846
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0 130 0 12 0 0.000
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0 10 11 2 0 0.846
3 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0 3 13 1 0 0.929
4 Arizona Pac-12 Kevin Sumlin 1600000.0 2025000.0 10000000.0 95 4 8 0 0.333
... ... ... ... ... ... ... ... ... ... ... ...
125 West Virginia Big 12 Dana Holgorsen 3605000.0 1000000.0 7150000.0 81 5 7 0 0.417
126 Western Kentucky C-USA Mike Sanford Jr. 800000.0 400000.0 1200000.0 28 9 4 0 0.692
127 Western Michigan MAC Tim Lester 800000.0 346500.0 800000.0 57 7 6 0 0.538
128 Wisconsin Big Ten Paul Chryst 3750000.0 775000.0 6000000.0 27 10 4 0 0.714
129 Wyoming Mt. West Craig Bohl 1412000.0 450000.0 8016667.0 34 8 5 0 0.615

130 rows × 11 columns

In [68]:
scatter_matrix(data_for_m, figsize=(12,8))
plt.show()
In [69]:
data_for_m.drop('T', axis=1, inplace=True)
In [71]:
data = data_for_m.copy()
In [72]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

data = train_set.drop('school_pay', axis=1)
data_labels = train_set['school_pay'].copy()
In [73]:
from sklearn.compose import ColumnTransformer

data_num = data.drop(['school', 'conf', 'coach'], axis=1)
num_attribs = list(data_num)
cat_attribs = ['school','conf','coach']
In [74]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
#     ('attribs_addr', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
In [75]:
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    # handle_unknown='ignore' keeps transform() from failing on schools or
    # coaches that appear only in the held-out test set
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)
])
In [76]:
data_prepared = full_pipeline.fit_transform(data)

LINEAR REGRESSION

In [77]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(data_prepared, data_labels)
Out[77]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [78]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import numpy as np

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard Deviation:", scores.std())
In [79]:
scores = cross_val_score(lin_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
lin_reg_rmse_scores = np.sqrt(-scores)
display_scores(lin_reg_rmse_scores)
Scores: [554331.56457865 229560.8218585  565768.22700361 820419.40507848
 331154.63261208 321707.67778108 644754.00356616 812049.90453042
 466793.88780985 755509.88511905]
Mean: 550205.0009937885
Standard Deviation: 200721.60411036602

DECISION TREE

In [80]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(data_prepared, data_labels)
Out[80]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
In [81]:
scores = cross_val_score(tree_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
tree_reg_rmse_scores = np.sqrt(-scores)
display_scores(tree_reg_rmse_scores)
Scores: [1061315.48217972  388101.64569291  838352.48061063 1310105.90729247
  761111.21968192  813957.7190166  1191476.11762561 1072436.80091342
 1048970.92139129 1632316.56249617]
Mean: 1011814.4856900737
Standard Deviation: 321697.2361820451

RANDOM FOREST

In [82]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(data_prepared, data_labels)
/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Out[82]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)
In [83]:
scores = cross_val_score(forest_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
forest_reg_rmse_scores = np.sqrt(-scores)
display_scores(forest_reg_rmse_scores)
Scores: [ 673351.8422406   606464.01825049  425498.26999178 1090543.6350158
  690811.54849218  850426.329261    839837.59854235 1300692.87382977
  816062.72166777 1095827.29603885]
Mean: 838951.6133330595
Standard Deviation: 248497.0665433372
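Pulling the three cross-validation means together (numbers copied from the outputs above) makes the comparison explicit:

```python
import pandas as pd

# Mean CV RMSE for each model, copied from the cross_val_score outputs
cv_rmse_means = pd.Series({
    'LinearRegression': 550205.0,
    'DecisionTree': 1011814.5,
    'RandomForest': 838951.6,
})

# Lower RMSE is better; the linear model wins on these 10-fold scores
best = cv_rmse_means.idxmin()
```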
In [ ]: