IST 718 | LAB 3

01-23-20

V1 DATA SOURCE: Coaches9

O: OBTAIN

In [32]:
import pandas as pd
data = pd.read_csv("Coaches9.csv")
In [33]:
data.head()
Out[33]:
School Conference Coach SchoolPay TotalPay Bonus BonusPaid AssistantPay Buyout
0 Air Force Mt. West Troy Calhoun 885000 885000 247000 -- $0 --
1 Akron MAC Terry Bowden $411,000 $412,500 $225,000 $50,000 $0 $688,500
2 Alabama SEC Nick Saban $8,307,000 $8,307,000 $1,100,000 $500,000 $0 $33,600,000
3 Alabama at Birmingham C-USA Bill Clark $900,000 $900,000 $950,000 $165,471 $0 $3,847,500
4 Appalachian State Sun Belt Scott Satterfield $712,500 $712,500 $295,000 $145,000 $0 $2,160,417

S: SCRUB

  1. Convert the dollar-amount columns from strings to floats
  2. Inspect and possibly replace missing values
In [34]:
attrs = list(data[data.columns[3:]].columns)
data[attrs] = data[attrs].replace({r'\$': '', ',': '', '--': None}, regex=True).astype('float')
In [35]:
data.dtypes
Out[35]:
School           object
Conference       object
Coach            object
SchoolPay       float64
TotalPay        float64
Bonus           float64
BonusPaid       float64
AssistantPay    float64
Buyout          float64
dtype: object
In [36]:
data.isnull().sum()
Out[36]:
School           0
Conference       0
Coach            0
SchoolPay        4
TotalPay         4
Bonus           22
BonusPaid       41
AssistantPay     0
Buyout          22
dtype: int64

Replacing Missing Values

Four options:

  1. Drop the schools with missing values
  2. Drop the attributes with missing values (e.g. 'BonusPaid')
  3. Replace the missing values with some value (e.g. mean, 0)
  4. Find additional data
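Each of the four options maps to a short pandas idiom; a minimal sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'School': ['A', 'B', 'C'],
    'SchoolPay': [900000.0, np.nan, 700000.0],
    'BonusPaid': [np.nan, 50000.0, np.nan],
})

# Option 1: drop rows (schools) with missing values in a key column
opt1 = toy.dropna(subset=['SchoolPay'])

# Option 2: drop an attribute that is mostly missing
opt2 = toy.drop('BonusPaid', axis=1)

# Option 3: replace missing values with some statistic (median here)
opt3 = toy.fillna({'SchoolPay': toy['SchoolPay'].median()})

# Option 4 (find additional data) is a merge -- see the M step below.
```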

For V1, we are going to

  1. Drop the schools without SchoolPay
  2. Drop the attribute BonusPaid
  3. Fill Bonus & Buyout with the attribute median*

For V2, we are going to use #4 and a more advanced version of #3*, taking the within-conference median
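The within-conference fill planned for V2 can be sketched with a groupby-transform (toy values for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Conference': ['SEC', 'SEC', 'MAC', 'MAC'],
    'Bonus': [1100000.0, np.nan, 225000.0, np.nan],
})

# Fill each missing Bonus with the median of its own conference,
# rather than the column-wide median
df['Bonus'] = df['Bonus'].fillna(
    df.groupby('Conference')['Bonus'].transform('median'))
```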

In [37]:
data.dropna(subset=['SchoolPay'], inplace=True)
In [38]:
data.drop('BonusPaid', axis=1, inplace=True)
In [39]:
median_bonus = data['Bonus'].median()
median_buyout = data['Buyout'].median()
data["Bonus"].fillna(median_bonus, inplace = True)
data["Buyout"].fillna(median_buyout, inplace = True)
In [40]:
data.isnull().sum()
Out[40]:
School          0
Conference      0
Coach           0
SchoolPay       0
TotalPay        0
Bonus           0
AssistantPay    0
Buyout          0
dtype: int64

E: EXPLORE

In [41]:
%matplotlib inline
import matplotlib.pyplot as plt
In [42]:
from pandas.plotting import scatter_matrix
scatter_matrix(data, figsize=(12,8))
plt.show()

From the scatter matrix, it's clear we need to:

  1. Drop either SchoolPay or TotalPay
  2. Remove (or supplement) AssistantPay
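A quick numeric check behind both decisions: SchoolPay and TotalPay are almost perfectly correlated, and AssistantPay has essentially no variance. A sketch using the values visible in the head() above:

```python
import pandas as pd

df = pd.DataFrame({
    'SchoolPay': [885000.0, 411000.0, 8307000.0],
    'TotalPay':  [885000.0, 412500.0, 8307000.0],
    'AssistantPay': [0.0, 0.0, 0.0],
})

# Near-perfect correlation means one of the pair is redundant
corr = df['SchoolPay'].corr(df['TotalPay'])

# A column with (almost) a single value carries no signal
n_unique = df['AssistantPay'].nunique()
```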
In [43]:
data.drop(['TotalPay','AssistantPay'], axis=1, inplace=True) 
scatter_matrix(data, figsize=(12,8))
plt.show()
In [44]:
data.head()
Out[44]:
School Conference Coach SchoolPay Bonus Buyout
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0
3 Alabama at Birmingham C-USA Bill Clark 900000.0 950000.0 3847500.0
4 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0

Add More Data

We've lost a lot of data at this point and we need to add more. To do so we:

  1. O: Obtain data
  2. S: Scrub data
  3. M: MERGE data! Add a "key" column to both tables for merging

This sub-OSM pass occurred in an attached auxiliary file

In [52]:
data['school_key'] = data.apply(lambda x: ''.join(x['School'].lower().split(' '))[:8], axis=1)
In [53]:
data
Out[53]:
School Conference Coach SchoolPay Bonus Buyout school_key
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0 airforce
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0 akron
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0 alabama
3 Alabama at Birmingham C-USA Bill Clark 900000.0 950000.0 3847500.0 alabamaa
4 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0 appalach
... ... ... ... ... ... ... ...
124 West Virginia Big 12 Dana Holgorsen 3605000.0 1000000.0 7150000.0 westvirg
125 Western Kentucky C-USA Mike Sanford Jr. 800000.0 400000.0 1200000.0 westernk
126 Western Michigan MAC Tim Lester 800000.0 346500.0 800000.0 westernm
127 Wisconsin Big Ten Paul Chryst 3750000.0 775000.0 6000000.0 wisconsi
128 Wyoming Mt. West Craig Bohl 1412000.0 450000.0 8016667.0 wyoming

125 rows × 7 columns

In [55]:
winloss = pd.read_csv('winloss_withkey.csv')
In [56]:
test = data.copy()
In [59]:
data_wl = test.merge(winloss, left_on="school_key", right_on="team_key")
In [60]:
data_wl
Out[60]:
School Conference Coach SchoolPay Bonus Buyout school_key Rank Team W L T Pct team_key
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0 airforce 10 Air Force (Mountain West) 11 2 0 0.846 airforce
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0 akron 130 Akron (MAC) 0 12 0 0.000 akron
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0 alabama 10 Alabama (SEC) 11 2 0 0.846 alabama
3 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0 appalach 3 Appalachian State (Sun Belt) 13 1 0 0.929 appalach
4 Arizona Pac-12 Kevin Sumlin 1600000.0 2025000.0 10000000.0 arizona 95 Arizona (Pac-12) 4 8 0 0.333 arizona
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
125 West Virginia Big 12 Dana Holgorsen 3605000.0 1000000.0 7150000.0 westvirg 81 West Virginia (Big 12) 5 7 0 0.417 westvirg
126 Western Kentucky C-USA Mike Sanford Jr. 800000.0 400000.0 1200000.0 westernk 28 Western Ky. (C-USA) 9 4 0 0.692 westernk
127 Western Michigan MAC Tim Lester 800000.0 346500.0 800000.0 westernm 57 Western Mich. (MAC) 7 6 0 0.538 westernm
128 Wisconsin Big Ten Paul Chryst 3750000.0 775000.0 6000000.0 wisconsi 27 Wisconsin (Big Ten) 10 4 0 0.714 wisconsi
129 Wyoming Mt. West Craig Bohl 1412000.0 450000.0 8016667.0 wyoming 34 Wyoming (Mountain West) 8 5 0 0.615 wyoming

130 rows × 14 columns
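Note the merge returned 130 rows from a 125-row left table, which usually means duplicate keys on one side (the truncated 8-character key can collide for similarly named schools). A sketch of how pandas can surface this, on toy frames with a deliberately duplicated key:

```python
import pandas as pd

# 'miami' appears twice on the right, as a truncated key might for
# two similarly named schools
left = pd.DataFrame({'school_key': ['miami', 'akron'], 'pay': [1.0, 2.0]})
right = pd.DataFrame({'team_key': ['miami', 'miami', 'akron'],
                      'wins': [6, 9, 0]})

# The plain inner merge silently duplicates the 'miami' row: 3 rows from 2
merged = left.merge(right, left_on='school_key', right_on='team_key')

# validate= raises instead of silently duplicating
try:
    left.merge(right, left_on='school_key', right_on='team_key',
               validate='one_to_one')
    ok = True
except pd.errors.MergeError:
    ok = False  # duplicate 'miami' on the right side triggers this
```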

In [66]:
data_for_m = data_wl.drop(['school_key','Team','team_key'], axis=1)
In [67]:
columns = ['school','conf','coach','school_pay', 'bonus', 'buyout', 'rank', 'W','L', 'T', 'ratio']
data_for_m.columns = columns
data_for_m
Out[67]:
school conf coach school_pay bonus buyout rank W L T ratio
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0 10 11 2 0 0.846
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0 130 0 12 0 0.000
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0 10 11 2 0 0.846
3 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0 3 13 1 0 0.929
4 Arizona Pac-12 Kevin Sumlin 1600000.0 2025000.0 10000000.0 95 4 8 0 0.333
... ... ... ... ... ... ... ... ... ... ... ...
125 West Virginia Big 12 Dana Holgorsen 3605000.0 1000000.0 7150000.0 81 5 7 0 0.417
126 Western Kentucky C-USA Mike Sanford Jr. 800000.0 400000.0 1200000.0 28 9 4 0 0.692
127 Western Michigan MAC Tim Lester 800000.0 346500.0 800000.0 57 7 6 0 0.538
128 Wisconsin Big Ten Paul Chryst 3750000.0 775000.0 6000000.0 27 10 4 0 0.714
129 Wyoming Mt. West Craig Bohl 1412000.0 450000.0 8016667.0 34 8 5 0 0.615

130 rows × 11 columns

In [68]:
scatter_matrix(data_for_m, figsize=(12,8))
plt.show()
In [69]:
data_for_m.drop('T', axis=1, inplace=True)
In [71]:
data = data_for_m.copy()
In [72]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

data = train_set.drop('school_pay', axis=1)
data_labels = train_set['school_pay'].copy()
In [73]:
from sklearn.compose import ColumnTransformer

data_num = data.drop(['school', 'conf', 'coach'], axis=1)
num_attribs = list(data_num)
cat_attribs = ['school','conf','coach']
In [74]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
#     ('attribs_addr', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
In [75]:
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    # handle_unknown='ignore' keeps transform() from failing on schools or
    # coaches that appear only in the held-out test set
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)
])
In [76]:
data_prepared = full_pipeline.fit_transform(data)

LINEAR REGRESSION

In [77]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(data_prepared, data_labels)
Out[77]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [78]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import numpy as np

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard Deviation:", scores.std())
In [79]:
scores = cross_val_score(lin_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
lin_reg_rmse_scores = np.sqrt(-scores)
display_scores(lin_reg_rmse_scores)
Scores: [554331.56457865 229560.8218585  565768.22700361 820419.40507848
 331154.63261208 321707.67778108 644754.00356616 812049.90453042
 466793.88780985 755509.88511905]
Mean: 550205.0009937885
Standard Deviation: 200721.60411036602

DECISION TREE

In [80]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(data_prepared, data_labels)
Out[80]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
In [81]:
scores = cross_val_score(tree_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
tree_reg_rmse_scores = np.sqrt(-scores)
display_scores(tree_reg_rmse_scores)
Scores: [1061315.48217972  388101.64569291  838352.48061063 1310105.90729247
  761111.21968192  813957.7190166  1191476.11762561 1072436.80091342
 1048970.92139129 1632316.56249617]
Mean: 1011814.4856900737
Standard Deviation: 321697.2361820451

RANDOM FOREST

In [82]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(data_prepared, data_labels)
/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Out[82]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)
In [83]:
scores = cross_val_score(forest_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
forest_reg_rmse_scores = np.sqrt(-scores)
display_scores(forest_reg_rmse_scores)
Scores: [ 673351.8422406   606464.01825049  425498.26999178 1090543.6350158
  690811.54849218  850426.329261    839837.59854235 1300692.87382977
  816062.72166777 1095827.29603885]
Mean: 838951.6133330595
Standard Deviation: 248497.0665433372
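Pulling the three cross-validation means together (numbers copied from the outputs above) makes the comparison explicit:

```python
import pandas as pd

# Mean CV RMSE for each model, copied from the cross_val_score outputs
cv_rmse_means = pd.Series({
    'LinearRegression': 550205.0,
    'DecisionTree': 1011814.5,
    'RandomForest': 838951.6,
})

# Lower RMSE is better; the linear model wins on these 10-fold scores
best = cv_rmse_means.idxmin()
```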
In [ ]: