IST 718 | LAB 3¶

01-23-20

V1 DATA SOURCE: Coaches9¶

O: OBTAIN¶

import pandas as pd
data = pd.read_csv("Coaches9.csv")

data.head()

S: SCRUB¶

Change the columns with dollar amounts as strings to floats
Inspect and possibly replace missing values

attrs = list(data[data.columns[3:]].columns)
data[attrs] = data[attrs].replace({'\$':'', ',': '', '--':None}, regex=True).astype('float')

data.dtypes

School           object
Conference       object
Coach            object
SchoolPay       float64
TotalPay        float64
Bonus           float64
BonusPaid       float64
AssistantPay    float64
Buyout          float64
dtype: object

data.isnull().sum()

School           0
Conference       0
Coach            0
SchoolPay        4
TotalPay         4
Bonus           22
BonusPaid       41
AssistantPay     0
Buyout          22
dtype: int64

Replacing Missing Values¶

Four options:

Drop the schools with missing values
Drop the attributes with missing values (e.g. 'BonusPaid')
Replace the missing values with some value (e.g. mean, 0)
Find additional data

For V1, we are going to

Drop the schools without SchoolPay
Drop the attribute BonusPaid
Fill Bonus & Buyout with the attribute mean*

For V2, we are going to use #4 and a more advanced version of #3* taking within conference mean

data.dropna(subset=['SchoolPay'], inplace=True)

data.drop('BonusPaid', axis=1, inplace=True)

median_bonus = data['Bonus'].median()
median_buyout = data['Buyout'].median()
data["Bonus"].fillna(median_bonus, inplace = True)
data["Buyout"].fillna(median_buyout, inplace = True)

data.isnull().sum()

School          0
Conference      0
Coach           0
SchoolPay       0
TotalPay        0
Bonus           0
AssistantPay    0
Buyout          0
dtype: int64

E: EXPLORE¶

%matplotlib inline
import matplotlib.pyplot as plt

from pandas.plotting import scatter_matrix
scatter_matrix(data, figsize=(12,8))
plt.show()

From these two visualizations, it's clear we need to:

Drop either SchoolPay or TotalPay
Remove (or supplement) AssistantPay

data.drop(['TotalPay','AssistantPay'], axis=1, inplace=True) 
scatter_matrix(data, figsize=(12,8))
plt.show()

data.head()

Add More Data¶

We've lost a lot of data at this point and we need to add more. To do so we:

O: Obtain data
S: Scrub data
M: MERGE data! Add a "key" to both columns for merging

This sub OSM occured in an attached auxiliary file

+ Add Win Loss Data¶

data['school_key'] = data.apply(lambda x: ''.join(x['School'].lower().split(' '))[:8], axis=1)

data

winloss = pd.read_csv('winloss_withkey.csv')

test = data.copy()

data_wl = test.merge(winloss, left_on="school_key", right_on="team_key")

data_wl

+ Add Stadium Size and Graduation Rate¶

data_wl['team_coach_key'] = data_wl.apply(lambda x: (''.join(x['Coach'].split())+x['School'][:4]).lower(), axis=1)

gsr = pd.read_csv('c_modify_small.csv')
test = data_wl.copy()
data_wl_gsr = test.merge(gsr, left_on="team_coach_key", right_on="team_coach_key")
data_wl_gsr.columns

Index(['School', 'Conference', 'Coach', 'SchoolPay', 'Bonus', 'Buyout',
       'school_key', 'Rank', 'Team', 'W', 'L', 'T', 'Pct', 'team_key_x',
       'team_coach_key', 'school', 'stad_size', 'gsr', 'med_conf_sal',
       'seat_rank', 'combo_rank', 'true_rank', 'gs_rank', 'team_key_y',
       'coach', 'team_coach'],
      dtype='object')

data_wl_gsr.isna().sum()

School            0
Conference        0
Coach             0
SchoolPay         0
Bonus             0
Buyout            0
school_key        0
Rank              0
Team              0
W                 0
L                 0
T                 0
Pct               0
team_key_x        0
team_coach_key    0
school            0
stad_size         0
gsr               0
med_conf_sal      0
seat_rank         0
combo_rank        0
true_rank         0
gs_rank           0
team_key_y        0
coach             0
team_coach        0
dtype: int64

data_wl_gsr.seat_rank.value_counts()

20     5
88     3
64     3
26     2
103    2
      ..
86     1
87     1
90     1
91     1
128    1
Name: seat_rank, Length: 93, dtype: int64

data_for_m = data_wl_gsr.drop(['school_key','Team','team_key_y',
                               'team_key_x', 'team_coach_key', 'coach', 'team_coach', 'school'], axis=1)

data_for_m.columns

Index(['School', 'Conference', 'Coach', 'SchoolPay', 'Bonus', 'Buyout', 'Rank',
       'W', 'L', 'T', 'Pct', 'stad_size', 'gsr', 'med_conf_sal', 'seat_rank',
       'combo_rank', 'true_rank', 'gs_rank'],
      dtype='object')

columns = ['school','conf','coach','school_pay', 'bonus', 'buyout', 'rank', 'W','L', 'T', 'ratio', 'stad_size', 'gsr', 'med_conf_sal', 'seat_rank',
       'combo_rank', 'true_rank', 'gs_rank' ]
data_for_m.columns = columns

data_for_m['med_conf_sal'] = data_for_m['med_conf_sal'].replace({'\$':'', ',': '', '--':0}, regex=True).astype('float')
data_for_m['stad_size'] = data_for_m['stad_size'].replace({'\$':'', ',': '', '--':0}, regex=True).astype('float')

scatter_matrix(data_for_m, figsize=(12,8))
plt.show()

/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/pandas/plotting/_matplotlib/misc.py:80: UserWarning: Attempting to set identical left == right == 0.0 results in singular transformations; automatically expanding.
  ax.set_xlim(boundaries_list[j])
/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/pandas/plotting/_matplotlib/misc.py:81: UserWarning: Attempting to set identical bottom == top == 0.0 results in singular transformations; automatically expanding.
  ax.set_ylim(boundaries_list[i])
/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/pandas/plotting/_matplotlib/misc.py:71: UserWarning: Attempting to set identical left == right == 0.0 results in singular transformations; automatically expanding.
  ax.set_xlim(boundaries_list[i])

data_for_m.drop('T', axis=1, inplace=True)

data_for_m.to_csv('data_for_m_v3.csv', index=False)

data = data_for_m.copy()

data.head()

MODELS IN NEW WORKBOOK -- IST718_LAB3_V2_MODELS¶

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

data = train_set.drop('school_pay', axis=1)
data_labels = train_set['school_pay'].copy()

from sklearn.compose import ColumnTransformer

data_num = data.drop(['school', 'conf', 'coach'],axis =1) 
num_attribs = list(data_num)
# cat_attribs = ['school','conf','coach']
cat_attribs = ['conf']

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
#     ('attribs_addr', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs)
])

data_prepared = full_pipeline.fit_transform(data)
# data.dtypes

LINEAR REGRESSION¶

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(data_prepared, data_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import numpy as np

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Stardard Deviation:", scores.std())

With only win-loss Scores: [ 627138.42590247 528461.92031606 740615.40460434 772653.8213664 421934.01610918 189714.87363032 913332.69201488 1110949.79108439 559555.59486329 1070420.75902697] Mean: 693477.7298918304 Stardard Deviation: 273911.52769763477

scores = cross_val_score(lin_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
lin_reg_rmse_scores = np.sqrt(-scores)
display_scores(lin_reg_rmse_scores)

Scores: [415759.78957098 335939.64663732 501500.88316226 291747.89978931
 330147.67958576 633813.27857645 803408.59272099 619665.81749927
 870888.03389595 389391.02186592]
Mean: 519226.26433042076
Stardard Deviation: 194064.85413625545

data_whole = data_for_m.copy()
data_whole_y = data_whole['school_pay']
data_whole_X = data_whole.drop(['school_pay', 'school','coach'], axis=1)

# data_whole_prepared = full_pipeline.fit_transform(data_whole_X)
# lin_reg.fit(data_whole_prepared, data_whole_y)
# lin_reg.fit()
# data_whole_X
# lin_reg.fit(data_whole_X, data_whole_y)

# post_prediction = data_for_m.copy()
# post_prediction['prediction'] = lin_reg.predict(data_whole_prepared)

# post_prediction['prediction_clear'] = post_prediction.apply(lambda x: '%.1f' % x['prediction'], axis=1)

# post_prediction

DECISION TREE¶

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(data_prepared, data_labels)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

With only win-loss Scores: [1061315.48217972 388101.64569291 838352.48061063 1310105.90729247 761111.21968192 813957.7190166 1191476.11762561 1072436.80091342 1048970.92139129 1632316.56249617] Mean: 1011814.4856900737 Stardard Deviation: 321697.2361820451

scores = cross_val_score(tree_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
tree_reg_rmse_scores = np.sqrt(-scores)
display_scores(tree_reg_rmse_scores)

Scores: [ 695960.02224984  190985.32132078  865917.51308222  495846.21566096
  715740.05141881  474615.74502199  801647.18805615 1653859.92548697
 1004394.43767377  704543.86440259]
Mean: 760351.028437408
Stardard Deviation: 368211.9671187752

RANDOM FOREST¶

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(data_prepared, data_labels)

/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

With only win-loss Scores: [ 673351.8422406 606464.01825049 425498.26999178 1090543.6350158 690811.54849218 850426.329261 839837.59854235 1300692.87382977 816062.72166777 1095827.29603885] Mean: 838951.6133330595 Stardard Deviation: 248497.0665433372

scores = cross_val_score(forest_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
forest_reg_rmse_scores = np.sqrt(-scores)
display_scores(forest_reg_rmse_scores)

Scores: [ 468710.30629126  356629.96098513  688266.15774732  622175.40880706
  572174.37416687  601459.50130731  997561.09378467  565678.85222655
 1121408.95217627  549767.80651937]
Mean: 654383.2414011809
Stardard Deviation: 221262.96170275135

MORE MODELING¶

import statsmodels.formula.api as smf

MODEL 1¶

RESPONSE: school_pay | PREDICTORS: bonus, rank, buyout -- 71%¶

RESULT: (adj. R-Squared) 71%

data = data_for_m.copy()
my_model = str('school_pay ~ bonus + rank + buyout')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.732
Model:                            OLS   Adj. R-squared:                  0.723
Method:                 Least Squares   F-statistic:                     82.76
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           6.54e-26
Time:                        16:19:53   Log-Likelihood:                -1443.7
No. Observations:                  95   AIC:                             2895.
Df Residuals:                      91   BIC:                             2906.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.037e+06   2.78e+05      3.725      0.000    4.84e+05    1.59e+06
bonus          0.4464      0.209      2.137      0.035       0.032       0.861
rank       -2978.9510   2938.326     -1.014      0.313   -8815.574    2857.672
buyout         0.1705      0.013     12.721      0.000       0.144       0.197
==============================================================================
Omnibus:                       11.613   Durbin-Watson:                   2.075
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               12.196
Skew:                           0.857   Prob(JB):                      0.00225
Kurtosis:                       3.380   Cond. No.                     3.31e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.31e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

MODEL 2¶

RESPONSE: school_pay | PREDICTORS: conf, bonus, buyout -- 88%¶

RESULT: (adj. R-Squared) 88%

my_model = str('school_pay ~ conf + bonus + buyout')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.881
Model:                            OLS   Adj. R-squared:                  0.865
Method:                 Least Squares   F-statistic:                     55.87
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           1.20e-33
Time:                        16:19:53   Log-Likelihood:                -1405.1
No. Observations:                  95   AIC:                             2834.
Df Residuals:                      83   BIC:                             2865.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         1.489e+06   3.18e+05      4.680      0.000    8.56e+05    2.12e+06
conf[T.ACC]       8.803e+05   3.59e+05      2.455      0.016    1.67e+05    1.59e+06
conf[T.Big 12]    1.408e+06   3.87e+05      3.640      0.000    6.39e+05    2.18e+06
conf[T.Big Ten]   1.542e+06   3.61e+05      4.275      0.000    8.24e+05    2.26e+06
conf[T.Ind.]     -6.208e+05   4.46e+05     -1.393      0.167   -1.51e+06    2.66e+05
conf[T.MAC]      -9.176e+05   3.71e+05     -2.477      0.015   -1.65e+06   -1.81e+05
conf[T.Mt. West] -6.707e+05   3.63e+05     -1.850      0.068   -1.39e+06    5.05e+04
conf[T.Pac-12]     7.94e+05   3.74e+05      2.123      0.037    5.02e+04    1.54e+06
conf[T.SEC]       1.656e+06   3.59e+05      4.608      0.000    9.41e+05    2.37e+06
conf[T.Sun Belt] -8.755e+05    3.6e+05     -2.433      0.017   -1.59e+06    -1.6e+05
bonus               -0.3289      0.193     -1.707      0.092      -0.712       0.054
buyout               0.1225      0.011     11.647      0.000       0.102       0.143
==============================================================================
Omnibus:                       20.520   Durbin-Watson:                   1.955
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               61.965
Skew:                           0.623   Prob(JB):                     3.50e-14
Kurtosis:                       6.755   Cond. No.                     1.57e+08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.57e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

1.489e+06 + 8.803e+05 
# + -0.3289(bonus) + 0.1225(buyout)

2369300.0

MODEL 3¶

RESPONSE: school_pay | PREDICTORS: ratio, conf, buyout -- 87.7%¶

RESULT: 87.7% (adj. R-Squared)

my_model = str('school_pay ~ conf + buyout')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.877
Model:                            OLS   Adj. R-squared:                  0.862
Method:                 Least Squares   F-statistic:                     59.80
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           6.16e-34
Time:                        16:19:53   Log-Likelihood:                -1406.7
No. Observations:                  95   AIC:                             2835.
Df Residuals:                      84   BIC:                             2864.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         1.243e+06   2.87e+05      4.333      0.000    6.73e+05    1.81e+06
conf[T.ACC]       7.582e+05   3.55e+05      2.134      0.036    5.17e+04    1.46e+06
conf[T.Big 12]    1.415e+06   3.91e+05      3.617      0.001    6.37e+05    2.19e+06
conf[T.Big Ten]   1.484e+06   3.63e+05      4.087      0.000    7.62e+05    2.21e+06
conf[T.Ind.]     -5.425e+05   4.48e+05     -1.210      0.230   -1.43e+06    3.49e+05
conf[T.MAC]       -7.92e+05   3.67e+05     -2.157      0.034   -1.52e+06   -6.17e+04
conf[T.Mt. West] -6.262e+05   3.66e+05     -1.712      0.091   -1.35e+06    1.01e+05
conf[T.Pac-12]    5.776e+05   3.56e+05      1.623      0.108    -1.3e+05    1.29e+06
conf[T.SEC]       1.531e+06   3.56e+05      4.303      0.000    8.23e+05    2.24e+06
conf[T.Sun Belt] -7.309e+05   3.54e+05     -2.066      0.042   -1.43e+06   -2.75e+04
buyout               0.1213      0.011     11.429      0.000       0.100       0.142
==============================================================================
Omnibus:                       17.441   Durbin-Watson:                   1.990
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               42.045
Skew:                           0.589   Prob(JB):                     7.41e-10
Kurtosis:                       6.039   Cond. No.                     1.56e+08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.56e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

1.243e+06 + 7.582e+05
# + 0.1213(buyout)

2001200.0

MODEL 4¶

RESPONSE: school_pay | PREDICTORS: conf¶

RESULT: 87.7% (adj. R-Squared)

my_model = str('school_pay ~ conf')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.652
Method:                 Least Squares   F-statistic:                     20.57
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           6.49e-18
Time:                        16:19:54   Log-Likelihood:                -1451.3
No. Observations:                  95   AIC:                             2923.
Df Residuals:                      85   BIC:                             2948.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         1.767e+06    4.5e+05      3.924      0.000    8.72e+05    2.66e+06
conf[T.ACC]       1.838e+06   5.44e+05      3.376      0.001    7.55e+05    2.92e+06
conf[T.Big 12]    2.697e+06   5.96e+05      4.528      0.000    1.51e+06    3.88e+06
conf[T.Big Ten]   2.707e+06   5.51e+05      4.910      0.000    1.61e+06     3.8e+06
conf[T.Ind.]     -7.675e+05   7.12e+05     -1.078      0.284   -2.18e+06    6.48e+05
conf[T.MAC]      -1.171e+06   5.81e+05     -2.014      0.047   -2.33e+06    -1.5e+04
conf[T.Mt. West] -6.408e+05   5.81e+05     -1.103      0.273    -1.8e+06    5.15e+05
conf[T.Pac-12]    1.153e+06    5.6e+05      2.061      0.042    4.06e+04    2.27e+06
conf[T.SEC]       2.424e+06   5.51e+05      4.397      0.000    1.33e+06    3.52e+06
conf[T.Sun Belt] -1.106e+06    5.6e+05     -1.975      0.051   -2.22e+06    7186.614
==============================================================================
Omnibus:                       27.553   Durbin-Watson:                   1.947
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               57.186
Skew:                           1.090   Prob(JB):                     3.82e-13
Kurtosis:                       6.114   Cond. No.                         13.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

1.767e+06 + 1.838e+06

3605000.0

MODEL 5¶

RESPONSE: school_pay | PREDICTORS: conf, ratio¶

my_model = str('school_pay ~ conf + ratio')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.738
Model:                            OLS   Adj. R-squared:                  0.707
Method:                 Least Squares   F-statistic:                     23.71
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           1.75e-20
Time:                        16:25:25   Log-Likelihood:                -1442.5
No. Observations:                  95   AIC:                             2907.
Df Residuals:                      84   BIC:                             2935.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         6.742e+05    4.9e+05      1.375      0.173   -3.01e+05    1.65e+06
conf[T.ACC]       1.765e+06      5e+05      3.534      0.001    7.72e+05    2.76e+06
conf[T.Big 12]    2.467e+06   5.49e+05      4.494      0.000    1.38e+06    3.56e+06
conf[T.Big Ten]    2.66e+06   5.06e+05      5.259      0.000    1.65e+06    3.67e+06
conf[T.Ind.]     -5.567e+05   6.55e+05     -0.850      0.398   -1.86e+06    7.46e+05
conf[T.MAC]       -9.97e+05   5.35e+05     -1.864      0.066   -2.06e+06    6.64e+04
conf[T.Mt. West] -5.329e+05   5.34e+05     -0.998      0.321   -1.59e+06    5.28e+05
conf[T.Pac-12]    1.109e+06   5.13e+05      2.161      0.034    8.83e+04    2.13e+06
conf[T.SEC]       2.322e+06   5.06e+05      4.586      0.000    1.32e+06    3.33e+06
conf[T.Sun Belt] -1.274e+06   5.15e+05     -2.475      0.015    -2.3e+06    -2.5e+05
ratio             2.061e+06   4.99e+05      4.127      0.000    1.07e+06    3.05e+06
==============================================================================
Omnibus:                       18.217   Durbin-Watson:                   2.055
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               32.977
Skew:                           0.754   Prob(JB):                     6.90e-08
Kurtosis:                       5.461   Cond. No.                         15.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

6.742e+05 +  1.765e+06 + (2.061e+06*0.417)

3298637.0

my_model = str('school_pay ~ conf + rank')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.733
Model:                            OLS   Adj. R-squared:                  0.701
Method:                 Least Squares   F-statistic:                     23.03
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           4.16e-20
Time:                        16:29:07   Log-Likelihood:                -1443.5
No. Observations:                  95   AIC:                             2909.
Df Residuals:                      84   BIC:                             2937.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         2.467e+06   4.55e+05      5.420      0.000    1.56e+06    3.37e+06
conf[T.ACC]        1.72e+06   5.05e+05      3.402      0.001    7.15e+05    2.73e+06
conf[T.Big 12]    2.406e+06   5.57e+05      4.318      0.000     1.3e+06    3.51e+06
conf[T.Big Ten]   2.663e+06   5.11e+05      5.208      0.000    1.65e+06    3.68e+06
conf[T.Ind.]     -6.251e+05   6.61e+05     -0.946      0.347   -1.94e+06    6.89e+05
conf[T.MAC]      -1.041e+06    5.4e+05     -1.928      0.057   -2.11e+06    3.25e+04
conf[T.Mt. West]  -5.69e+05   5.39e+05     -1.055      0.294   -1.64e+06    5.03e+05
conf[T.Pac-12]    1.065e+06   5.19e+05      2.050      0.043    3.21e+04     2.1e+06
conf[T.SEC]        2.34e+06   5.12e+05      4.573      0.000    1.32e+06    3.36e+06
conf[T.Sun Belt] -1.269e+06   5.21e+05     -2.437      0.017    -2.3e+06   -2.33e+05
rank             -1.154e+04   2990.706     -3.859      0.000   -1.75e+04   -5594.947
==============================================================================
Omnibus:                       19.207   Durbin-Watson:                   2.055
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               36.686
Skew:                           0.773   Prob(JB):                     1.08e-08
Kurtosis:                       5.623   Cond. No.                         862.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

2.467e+06 + 1.72e+06 + (-1.154e+04*81)

3252260.0

my_model = str('school_pay ~ conf + stad_size + grad_rate')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.787
Model:                            OLS   Adj. R-squared:                  0.761
Method:                 Least Squares   F-statistic:                     31.00
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           4.11e-24
Time:                        16:33:38   Log-Likelihood:                -1432.8
No. Observations:                  95   AIC:                             2888.
Df Residuals:                      84   BIC:                             2916.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         5.201e+04   4.61e+05      0.113      0.910   -8.65e+05    9.69e+05
conf[T.ACC]       1.061e+06   4.67e+05      2.271      0.026    1.32e+05    1.99e+06
conf[T.Big 12]    1.638e+06   5.21e+05      3.145      0.002    6.02e+05    2.67e+06
conf[T.Big Ten]   1.502e+06   4.95e+05      3.037      0.003    5.18e+05    2.49e+06
conf[T.Ind.]     -1.038e+06   5.91e+05     -1.756      0.083   -2.21e+06    1.37e+05
conf[T.MAC]      -5.696e+05   4.91e+05     -1.161      0.249   -1.55e+06    4.06e+05
conf[T.Mt. West] -3.397e+05   4.84e+05     -0.702      0.484    -1.3e+06    6.22e+05
conf[T.Pac-12]    6.669e+05    4.7e+05      1.420      0.159   -2.67e+05     1.6e+06
conf[T.SEC]       9.958e+05   5.09e+05      1.955      0.054   -1.71e+04    2.01e+06
conf[T.Sun Belt] -5.234e+05   4.72e+05     -1.108      0.271   -1.46e+06    4.16e+05
stad_size           39.8866      6.307      6.324      0.000      27.344      52.429
==============================================================================
Omnibus:                       10.884   Durbin-Watson:                   1.840
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               13.431
Skew:                           0.588   Prob(JB):                      0.00121
Kurtosis:                       4.417   Cond. No.                     7.69e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

train_model_fit.params[0] + train_model_fit.params[1] + (train_model_fit.params[-1] * syracuse['stad_size'])
# train_model_fit.params[-1]

92    3.077273e+06
Name: stad_size, dtype: float64

syracuse = data[data['school'] == 'Syracuse']

syracuse

I : INTERPRET¶

What is the recommended salary for the Syracuse football coach?¶

Drop Syracuse from the training data
Rerun our model
Apply our linear regression equation

data = data_for_m.copy()

data[data['school'] == 'Syracuse']

len(data)

130

	School	Conference	Coach	SchoolPay	TotalPay	Bonus	BonusPaid	AssistantPay	Buyout
0	Air Force	Mt. West	Troy Calhoun	885000	885000	247000	--	$0	--
1	Akron	MAC	Terry Bowden	$411,000	$412,500	$225,000	$50,000	$0	$688,500
2	Alabama	SEC	Nick Saban	$8,307,000	$8,307,000	$1,100,000	$500,000	$0	$33,600,000
3	Alabama at Birmingham	C-USA	Bill Clark	$900,000	$900,000	$950,000	$165,471	$0	$3,847,500
4	Appalachian State	Sun Belt	Scott Satterfield	$712,500	$712,500	$295,000	$145,000	$0	$2,160,417

	School	Conference	Coach	SchoolPay	Bonus	Buyout
0	Air Force	Mt. West	Troy Calhoun	885000.0	247000.0	4000000.0
1	Akron	MAC	Terry Bowden	411000.0	225000.0	688500.0
2	Alabama	SEC	Nick Saban	8307000.0	1100000.0	33600000.0
3	Alabama at Birmingham	C-USA	Bill Clark	900000.0	950000.0	3847500.0
4	Appalachian State	Sun Belt	Scott Satterfield	712500.0	295000.0	2160417.0

	School	Conference	Coach	SchoolPay	Bonus	Buyout	school_key
0	Air Force	Mt. West	Troy Calhoun	885000.0	247000.0	4000000.0	airforce
1	Akron	MAC	Terry Bowden	411000.0	225000.0	688500.0	akron
2	Alabama	SEC	Nick Saban	8307000.0	1100000.0	33600000.0	alabama
3	Alabama at Birmingham	C-USA	Bill Clark	900000.0	950000.0	3847500.0	alabamaa
4	Appalachian State	Sun Belt	Scott Satterfield	712500.0	295000.0	2160417.0	appalach
...	...	...	...	...	...	...	...
124	West Virginia	Big 12	Dana Holgorsen	3605000.0	1000000.0	7150000.0	westvirg
125	Western Kentucky	C-USA	Mike Sanford Jr.	800000.0	400000.0	1200000.0	westernk
126	Western Michigan	MAC	Tim Lester	800000.0	346500.0	800000.0	westernm
127	Wisconsin	Big Ten	Paul Chryst	3750000.0	775000.0	6000000.0	wisconsi
128	Wyoming	Mt. West	Craig Bohl	1412000.0	450000.0	8016667.0	wyoming

	School	Conference	Coach	SchoolPay	Bonus	Buyout	school_key	Rank	Team	W	L	T	Pct	team_key
0	Air Force	Mt. West	Troy Calhoun	885000.0	247000.0	4000000.0	airforce	10	Air Force (Mountain West)	11	2	0	0.846	airforce
1	Akron	MAC	Terry Bowden	411000.0	225000.0	688500.0	akron	130	Akron (MAC)	0	12	0	0.000	akron
2	Alabama	SEC	Nick Saban	8307000.0	1100000.0	33600000.0	alabama	10	Alabama (SEC)	11	2	0	0.846	alabama
3	Appalachian State	Sun Belt	Scott Satterfield	712500.0	295000.0	2160417.0	appalach	3	Appalachian State (Sun Belt)	13	1	0	0.929	appalach
4	Arizona	Pac-12	Kevin Sumlin	1600000.0	2025000.0	10000000.0	arizona	95	Arizona (Pac-12)	4	8	0	0.333	arizona
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
121	West Virginia	Big 12	Dana Holgorsen	3605000.0	1000000.0	7150000.0	westvirg	81	West Virginia (Big 12)	5	7	0	0.417	westvirg
122	Western Kentucky	C-USA	Mike Sanford Jr.	800000.0	400000.0	1200000.0	westernk	28	Western Ky. (C-USA)	9	4	0	0.692	westernk
123	Western Michigan	MAC	Tim Lester	800000.0	346500.0	800000.0	westernm	57	Western Mich. (MAC)	7	6	0	0.538	westernm
124	Wisconsin	Big Ten	Paul Chryst	3750000.0	775000.0	6000000.0	wisconsi	27	Wisconsin (Big Ten)	10	4	0	0.714	wisconsi
125	Wyoming	Mt. West	Craig Bohl	1412000.0	450000.0	8016667.0	wyoming	34	Wyoming (Mountain West)	8	5	0	0.615	wyoming