IST 718 | LAB 3

01-23-20

V1 DATA SOURCE: Coaches9

O: OBTAIN

In [187]:
import pandas as pd
data = pd.read_csv("Coaches9.csv")
In [188]:
data.head()
Out[188]:
School Conference Coach SchoolPay TotalPay Bonus BonusPaid AssistantPay Buyout
0 Air Force Mt. West Troy Calhoun 885000 885000 247000 -- $0 --
1 Akron MAC Terry Bowden $411,000 $412,500 $225,000 $50,000 $0 $688,500
2 Alabama SEC Nick Saban $8,307,000 $8,307,000 $1,100,000 $500,000 $0 $33,600,000
3 Alabama at Birmingham C-USA Bill Clark $900,000 $900,000 $950,000 $165,471 $0 $3,847,500
4 Appalachian State Sun Belt Scott Satterfield $712,500 $712,500 $295,000 $145,000 $0 $2,160,417

S: SCRUB

  1. Change the columns with dollar amounts as strings to floats
  2. Inspect and possibly replace missing values
In [189]:
attrs = list(data[data.columns[3:]].columns)
data[attrs] = data[attrs].replace({'\$':'', ',': '', '--':None}, regex=True).astype('float')
In [190]:
data.dtypes
Out[190]:
School           object
Conference       object
Coach            object
SchoolPay       float64
TotalPay        float64
Bonus           float64
BonusPaid       float64
AssistantPay    float64
Buyout          float64
dtype: object
In [191]:
data.isnull().sum()
Out[191]:
School           0
Conference       0
Coach            0
SchoolPay        4
TotalPay         4
Bonus           22
BonusPaid       41
AssistantPay     0
Buyout          22
dtype: int64

Replacing Missing Values

Four options:

  1. Drop the schools with missing values
  2. Drop the attributes with missing values (e.g. 'BonusPaid')
  3. Replace the missing values with some value (e.g. mean, 0)
  4. Find additional data

For V1, we are going to

  1. Drop the schools without SchoolPay
  2. Drop the attribute BonusPaid
  3. Fill Bonus & Buyout with the attribute mean*

For V2, we are going to use #4 and a more advanced version of #3* taking within conference mean

In [192]:
data.dropna(subset=['SchoolPay'], inplace=True)
In [193]:
data.drop('BonusPaid', axis=1, inplace=True)
In [194]:
median_bonus = data['Bonus'].median()
median_buyout = data['Buyout'].median()
data["Bonus"].fillna(median_bonus, inplace = True)
data["Buyout"].fillna(median_buyout, inplace = True)
In [195]:
data.isnull().sum()
Out[195]:
School          0
Conference      0
Coach           0
SchoolPay       0
TotalPay        0
Bonus           0
AssistantPay    0
Buyout          0
dtype: int64

E: EXPLORE

In [196]:
%matplotlib inline
import matplotlib.pyplot as plt
In [197]:
from pandas.plotting import scatter_matrix
scatter_matrix(data, figsize=(12,8))
plt.show()

From these two visualizations, it's clear we need to:

  1. Drop either SchoolPay or TotalPay
  2. Remove (or supplement) AssistantPay
In [198]:
data.drop(['TotalPay','AssistantPay'], axis=1, inplace=True) 
scatter_matrix(data, figsize=(12,8))
plt.show()
In [199]:
data.head()
Out[199]:
School Conference Coach SchoolPay Bonus Buyout
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0
3 Alabama at Birmingham C-USA Bill Clark 900000.0 950000.0 3847500.0
4 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0

Add More Data

We've lost a lot of data at this point and we need to add more. To do so we:

  1. O: Obtain data
  2. S: Scrub data
  3. M: MERGE data! Add a "key" to both columns for merging

This sub OSM occured in an attached auxiliary file

+ Add Win Loss Data

In [244]:
data['school_key'] = data.apply(lambda x: ''.join(x['School'].lower().split(' '))[:8], axis=1)
In [245]:
data
Out[245]:
School Conference Coach SchoolPay Bonus Buyout school_key
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0 airforce
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0 akron
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0 alabama
3 Alabama at Birmingham C-USA Bill Clark 900000.0 950000.0 3847500.0 alabamaa
4 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0 appalach
... ... ... ... ... ... ... ...
124 West Virginia Big 12 Dana Holgorsen 3605000.0 1000000.0 7150000.0 westvirg
125 Western Kentucky C-USA Mike Sanford Jr. 800000.0 400000.0 1200000.0 westernk
126 Western Michigan MAC Tim Lester 800000.0 346500.0 800000.0 westernm
127 Wisconsin Big Ten Paul Chryst 3750000.0 775000.0 6000000.0 wisconsi
128 Wyoming Mt. West Craig Bohl 1412000.0 450000.0 8016667.0 wyoming

125 rows × 7 columns

In [246]:
winloss = pd.read_csv('winloss_withkey.csv')
In [247]:
test = data.copy()
In [248]:
data_wl = test.merge(winloss, left_on="school_key", right_on="team_key")
In [249]:
data_wl
Out[249]:
School Conference Coach SchoolPay Bonus Buyout school_key Rank Team W L T Pct team_key
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0 airforce 10 Air Force (Mountain West) 11 2 0 0.846 airforce
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0 akron 130 Akron (MAC) 0 12 0 0.000 akron
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0 alabama 10 Alabama (SEC) 11 2 0 0.846 alabama
3 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0 appalach 3 Appalachian State (Sun Belt) 13 1 0 0.929 appalach
4 Arizona Pac-12 Kevin Sumlin 1600000.0 2025000.0 10000000.0 arizona 95 Arizona (Pac-12) 4 8 0 0.333 arizona
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
121 West Virginia Big 12 Dana Holgorsen 3605000.0 1000000.0 7150000.0 westvirg 81 West Virginia (Big 12) 5 7 0 0.417 westvirg
122 Western Kentucky C-USA Mike Sanford Jr. 800000.0 400000.0 1200000.0 westernk 28 Western Ky. (C-USA) 9 4 0 0.692 westernk
123 Western Michigan MAC Tim Lester 800000.0 346500.0 800000.0 westernm 57 Western Mich. (MAC) 7 6 0 0.538 westernm
124 Wisconsin Big Ten Paul Chryst 3750000.0 775000.0 6000000.0 wisconsi 27 Wisconsin (Big Ten) 10 4 0 0.714 wisconsi
125 Wyoming Mt. West Craig Bohl 1412000.0 450000.0 8016667.0 wyoming 34 Wyoming (Mountain West) 8 5 0 0.615 wyoming

126 rows × 14 columns

+ Add Stadium Size and Graduation Rate

In [306]:
data_wl['team_coach_key'] = data_wl.apply(lambda x: (''.join(x['Coach'].split())+x['School'][:4]).lower(), axis=1)

gsr = pd.read_csv('c_modify_small.csv')
test = data_wl.copy()
data_wl_gsr = test.merge(gsr, left_on="team_coach_key", right_on="team_coach_key")
data_wl_gsr.columns
Out[306]:
Index(['School', 'Conference', 'Coach', 'SchoolPay', 'Bonus', 'Buyout',
       'school_key', 'Rank', 'Team', 'W', 'L', 'T', 'Pct', 'team_key_x',
       'team_coach_key', 'school', 'stad_size', 'gsr', 'med_conf_sal',
       'seat_rank', 'combo_rank', 'true_rank', 'gs_rank', 'team_key_y',
       'coach', 'team_coach'],
      dtype='object')
In [307]:
data_wl_gsr.isna().sum()
Out[307]:
School            0
Conference        0
Coach             0
SchoolPay         0
Bonus             0
Buyout            0
school_key        0
Rank              0
Team              0
W                 0
L                 0
T                 0
Pct               0
team_key_x        0
team_coach_key    0
school            0
stad_size         0
gsr               0
med_conf_sal      0
seat_rank         0
combo_rank        0
true_rank         0
gs_rank           0
team_key_y        0
coach             0
team_coach        0
dtype: int64
In [308]:
data_wl_gsr.seat_rank.value_counts()
Out[308]:
20     5
88     3
64     3
26     2
103    2
      ..
86     1
87     1
90     1
91     1
128    1
Name: seat_rank, Length: 93, dtype: int64
In [309]:
data_for_m = data_wl_gsr.drop(['school_key','Team','team_key_y',
                               'team_key_x', 'team_coach_key', 'coach', 'team_coach', 'school'], axis=1)
In [310]:
data_for_m.columns
Out[310]:
Index(['School', 'Conference', 'Coach', 'SchoolPay', 'Bonus', 'Buyout', 'Rank',
       'W', 'L', 'T', 'Pct', 'stad_size', 'gsr', 'med_conf_sal', 'seat_rank',
       'combo_rank', 'true_rank', 'gs_rank'],
      dtype='object')
In [311]:
columns = ['school','conf','coach','school_pay', 'bonus', 'buyout', 'rank', 'W','L', 'T', 'ratio', 'stad_size', 'gsr', 'med_conf_sal', 'seat_rank',
       'combo_rank', 'true_rank', 'gs_rank' ]
data_for_m.columns = columns

data_for_m['med_conf_sal'] = data_for_m['med_conf_sal'].replace({'\$':'', ',': '', '--':0}, regex=True).astype('float')
data_for_m['stad_size'] = data_for_m['stad_size'].replace({'\$':'', ',': '', '--':0}, regex=True).astype('float')
In [312]:
scatter_matrix(data_for_m, figsize=(12,8))
plt.show()
/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/pandas/plotting/_matplotlib/misc.py:80: UserWarning: Attempting to set identical left == right == 0.0 results in singular transformations; automatically expanding.
  ax.set_xlim(boundaries_list[j])
/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/pandas/plotting/_matplotlib/misc.py:81: UserWarning: Attempting to set identical bottom == top == 0.0 results in singular transformations; automatically expanding.
  ax.set_ylim(boundaries_list[i])
/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/pandas/plotting/_matplotlib/misc.py:71: UserWarning: Attempting to set identical left == right == 0.0 results in singular transformations; automatically expanding.
  ax.set_xlim(boundaries_list[i])
In [313]:
data_for_m.drop('T', axis=1, inplace=True)
In [314]:
data_for_m.to_csv('data_for_m_v3.csv', index=False)
In [335]:
data = data_for_m.copy()
In [336]:
data.head()
Out[336]:
school conf coach school_pay bonus buyout rank W L ratio stad_size gsr med_conf_sal seat_rank combo_rank true_rank gs_rank
0 Air Force Mt. West Troy Calhoun 885000.0 247000.0 4000000.0 10 11 2 0.846 46692.0 83 879288.0 60 65 67 97
1 Akron MAC Terry Bowden 411000.0 225000.0 688500.0 130 0 12 0.000 30000.0 45 492413.0 20 11 129 5
2 Alabama SEC Nick Saban 8307000.0 1100000.0 33600000.0 10 11 2 0.846 101821.0 79 3929800.0 124 111 5 84
3 Appalachian State Sun Belt Scott Satterfield 712500.0 295000.0 2160417.0 3 13 1 0.929 24050.0 57 675000.0 11 20 122 13
4 Arizona Pac-12 Kevin Sumlin 1600000.0 2025000.0 10000000.0 95 4 8 0.333 51811.0 74 2752233.0 73 66 62 60
In [ ]:
 
In [ ]:
 

MODELS IN NEW WORKBOOK -- IST718_LAB3_V2_MODELS

In [344]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

data = train_set.drop('school_pay', axis=1)
data_labels = train_set['school_pay'].copy()
In [345]:
from sklearn.compose import ColumnTransformer

data_num = data.drop(['school', 'conf', 'coach'],axis =1) 
num_attribs = list(data_num)
# cat_attribs = ['school','conf','coach']
cat_attribs = ['conf']
In [346]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
#     ('attribs_addr', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
In [347]:
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs)
])
In [348]:
data_prepared = full_pipeline.fit_transform(data)
# data.dtypes

LINEAR REGRESSION

In [349]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(data_prepared, data_labels)
Out[349]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [350]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import numpy as np

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Stardard Deviation:", scores.std())

With only win-loss Scores: [ 627138.42590247 528461.92031606 740615.40460434 772653.8213664 421934.01610918 189714.87363032 913332.69201488 1110949.79108439 559555.59486329 1070420.75902697] Mean: 693477.7298918304 Stardard Deviation: 273911.52769763477

In [351]:
scores = cross_val_score(lin_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
lin_reg_rmse_scores = np.sqrt(-scores)
display_scores(lin_reg_rmse_scores)
Scores: [415759.78957098 335939.64663732 501500.88316226 291747.89978931
 330147.67958576 633813.27857645 803408.59272099 619665.81749927
 870888.03389595 389391.02186592]
Mean: 519226.26433042076
Stardard Deviation: 194064.85413625545
In [352]:
data_whole = data_for_m.copy()
data_whole_y = data_whole['school_pay']
data_whole_X = data_whole.drop(['school_pay', 'school','coach'], axis=1)
In [353]:
# data_whole_prepared = full_pipeline.fit_transform(data_whole_X)
# lin_reg.fit(data_whole_prepared, data_whole_y)
# lin_reg.fit()
# data_whole_X
# lin_reg.fit(data_whole_X, data_whole_y)
In [354]:
# post_prediction = data_for_m.copy()
# post_prediction['prediction'] = lin_reg.predict(data_whole_prepared)
In [355]:
# post_prediction['prediction_clear'] = post_prediction.apply(lambda x: '%.1f' % x['prediction'], axis=1)
In [356]:
# post_prediction

DECISION TREE

In [357]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(data_prepared, data_labels)
Out[357]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

With only win-loss Scores: [1061315.48217972 388101.64569291 838352.48061063 1310105.90729247 761111.21968192 813957.7190166 1191476.11762561 1072436.80091342 1048970.92139129 1632316.56249617] Mean: 1011814.4856900737 Stardard Deviation: 321697.2361820451

In [358]:
scores = cross_val_score(tree_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
tree_reg_rmse_scores = np.sqrt(-scores)
display_scores(tree_reg_rmse_scores)
Scores: [ 695960.02224984  190985.32132078  865917.51308222  495846.21566096
  715740.05141881  474615.74502199  801647.18805615 1653859.92548697
 1004394.43767377  704543.86440259]
Mean: 760351.028437408
Stardard Deviation: 368211.9671187752

RANDOM FOREST

In [359]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(data_prepared, data_labels)
/Users/danielcaraway/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Out[359]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

With only win-loss Scores: [ 673351.8422406 606464.01825049 425498.26999178 1090543.6350158 690811.54849218 850426.329261 839837.59854235 1300692.87382977 816062.72166777 1095827.29603885] Mean: 838951.6133330595 Stardard Deviation: 248497.0665433372

In [360]:
scores = cross_val_score(forest_reg, data_prepared, data_labels, scoring="neg_mean_squared_error", cv=10)
forest_reg_rmse_scores = np.sqrt(-scores)
display_scores(forest_reg_rmse_scores)
Scores: [ 468710.30629126  356629.96098513  688266.15774732  622175.40880706
  572174.37416687  601459.50130731  997561.09378467  565678.85222655
 1121408.95217627  549767.80651937]
Mean: 654383.2414011809
Stardard Deviation: 221262.96170275135

MORE MODELING

In [361]:
import statsmodels.formula.api as smf

MODEL 1

RESPONSE: school_pay | PREDICTORS: bonus, rank, buyout -- 71%

RESULT: (adj. R-Squared) 71%

In [362]:
data = data_for_m.copy()
my_model = str('school_pay ~ bonus + rank + buyout')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.732
Model:                            OLS   Adj. R-squared:                  0.723
Method:                 Least Squares   F-statistic:                     82.76
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           6.54e-26
Time:                        16:19:53   Log-Likelihood:                -1443.7
No. Observations:                  95   AIC:                             2895.
Df Residuals:                      91   BIC:                             2906.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.037e+06   2.78e+05      3.725      0.000    4.84e+05    1.59e+06
bonus          0.4464      0.209      2.137      0.035       0.032       0.861
rank       -2978.9510   2938.326     -1.014      0.313   -8815.574    2857.672
buyout         0.1705      0.013     12.721      0.000       0.144       0.197
==============================================================================
Omnibus:                       11.613   Durbin-Watson:                   2.075
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               12.196
Skew:                           0.857   Prob(JB):                      0.00225
Kurtosis:                       3.380   Cond. No.                     3.31e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.31e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

MODEL 2

RESPONSE: school_pay | PREDICTORS: conf, bonus, buyout -- 88%

RESULT: (adj. R-Squared) 88%

In [363]:
my_model = str('school_pay ~ conf + bonus + buyout')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.881
Model:                            OLS   Adj. R-squared:                  0.865
Method:                 Least Squares   F-statistic:                     55.87
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           1.20e-33
Time:                        16:19:53   Log-Likelihood:                -1405.1
No. Observations:                  95   AIC:                             2834.
Df Residuals:                      83   BIC:                             2865.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         1.489e+06   3.18e+05      4.680      0.000    8.56e+05    2.12e+06
conf[T.ACC]       8.803e+05   3.59e+05      2.455      0.016    1.67e+05    1.59e+06
conf[T.Big 12]    1.408e+06   3.87e+05      3.640      0.000    6.39e+05    2.18e+06
conf[T.Big Ten]   1.542e+06   3.61e+05      4.275      0.000    8.24e+05    2.26e+06
conf[T.Ind.]     -6.208e+05   4.46e+05     -1.393      0.167   -1.51e+06    2.66e+05
conf[T.MAC]      -9.176e+05   3.71e+05     -2.477      0.015   -1.65e+06   -1.81e+05
conf[T.Mt. West] -6.707e+05   3.63e+05     -1.850      0.068   -1.39e+06    5.05e+04
conf[T.Pac-12]     7.94e+05   3.74e+05      2.123      0.037    5.02e+04    1.54e+06
conf[T.SEC]       1.656e+06   3.59e+05      4.608      0.000    9.41e+05    2.37e+06
conf[T.Sun Belt] -8.755e+05    3.6e+05     -2.433      0.017   -1.59e+06    -1.6e+05
bonus               -0.3289      0.193     -1.707      0.092      -0.712       0.054
buyout               0.1225      0.011     11.647      0.000       0.102       0.143
==============================================================================
Omnibus:                       20.520   Durbin-Watson:                   1.955
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               61.965
Skew:                           0.623   Prob(JB):                     3.50e-14
Kurtosis:                       6.755   Cond. No.                     1.57e+08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.57e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
In [372]:
1.489e+06 + 8.803e+05 
# + -0.3289(bonus) + 0.1225(buyout)
Out[372]:
2369300.0

MODEL 3

RESPONSE: school_pay | PREDICTORS: ratio, conf, buyout -- 87.7%

RESULT: 87.7% (adj. R-Squared)

In [364]:
my_model = str('school_pay ~ conf + buyout')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.877
Model:                            OLS   Adj. R-squared:                  0.862
Method:                 Least Squares   F-statistic:                     59.80
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           6.16e-34
Time:                        16:19:53   Log-Likelihood:                -1406.7
No. Observations:                  95   AIC:                             2835.
Df Residuals:                      84   BIC:                             2864.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         1.243e+06   2.87e+05      4.333      0.000    6.73e+05    1.81e+06
conf[T.ACC]       7.582e+05   3.55e+05      2.134      0.036    5.17e+04    1.46e+06
conf[T.Big 12]    1.415e+06   3.91e+05      3.617      0.001    6.37e+05    2.19e+06
conf[T.Big Ten]   1.484e+06   3.63e+05      4.087      0.000    7.62e+05    2.21e+06
conf[T.Ind.]     -5.425e+05   4.48e+05     -1.210      0.230   -1.43e+06    3.49e+05
conf[T.MAC]       -7.92e+05   3.67e+05     -2.157      0.034   -1.52e+06   -6.17e+04
conf[T.Mt. West] -6.262e+05   3.66e+05     -1.712      0.091   -1.35e+06    1.01e+05
conf[T.Pac-12]    5.776e+05   3.56e+05      1.623      0.108    -1.3e+05    1.29e+06
conf[T.SEC]       1.531e+06   3.56e+05      4.303      0.000    8.23e+05    2.24e+06
conf[T.Sun Belt] -7.309e+05   3.54e+05     -2.066      0.042   -1.43e+06   -2.75e+04
buyout               0.1213      0.011     11.429      0.000       0.100       0.142
==============================================================================
Omnibus:                       17.441   Durbin-Watson:                   1.990
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               42.045
Skew:                           0.589   Prob(JB):                     7.41e-10
Kurtosis:                       6.039   Cond. No.                     1.56e+08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.56e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
In [370]:
1.243e+06 + 7.582e+05
# + 0.1213(buyout)
Out[370]:
2001200.0

MODEL 4

RESPONSE: school_pay | PREDICTORS: conf

RESULT: 87.7% (adj. R-Squared)

In [365]:
my_model = str('school_pay ~ conf')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.652
Method:                 Least Squares   F-statistic:                     20.57
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           6.49e-18
Time:                        16:19:54   Log-Likelihood:                -1451.3
No. Observations:                  95   AIC:                             2923.
Df Residuals:                      85   BIC:                             2948.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         1.767e+06    4.5e+05      3.924      0.000    8.72e+05    2.66e+06
conf[T.ACC]       1.838e+06   5.44e+05      3.376      0.001    7.55e+05    2.92e+06
conf[T.Big 12]    2.697e+06   5.96e+05      4.528      0.000    1.51e+06    3.88e+06
conf[T.Big Ten]   2.707e+06   5.51e+05      4.910      0.000    1.61e+06     3.8e+06
conf[T.Ind.]     -7.675e+05   7.12e+05     -1.078      0.284   -2.18e+06    6.48e+05
conf[T.MAC]      -1.171e+06   5.81e+05     -2.014      0.047   -2.33e+06    -1.5e+04
conf[T.Mt. West] -6.408e+05   5.81e+05     -1.103      0.273    -1.8e+06    5.15e+05
conf[T.Pac-12]    1.153e+06    5.6e+05      2.061      0.042    4.06e+04    2.27e+06
conf[T.SEC]       2.424e+06   5.51e+05      4.397      0.000    1.33e+06    3.52e+06
conf[T.Sun Belt] -1.106e+06    5.6e+05     -1.975      0.051   -2.22e+06    7186.614
==============================================================================
Omnibus:                       27.553   Durbin-Watson:                   1.947
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               57.186
Skew:                           1.090   Prob(JB):                     3.82e-13
Kurtosis:                       6.114   Cond. No.                         13.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [373]:
1.767e+06 + 1.838e+06
Out[373]:
3605000.0

MODEL 5

RESPONSE: school_pay | PREDICTORS: conf, ratio

In [377]:
my_model = str('school_pay ~ conf + ratio')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.738
Model:                            OLS   Adj. R-squared:                  0.707
Method:                 Least Squares   F-statistic:                     23.71
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           1.75e-20
Time:                        16:25:25   Log-Likelihood:                -1442.5
No. Observations:                  95   AIC:                             2907.
Df Residuals:                      84   BIC:                             2935.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         6.742e+05    4.9e+05      1.375      0.173   -3.01e+05    1.65e+06
conf[T.ACC]       1.765e+06      5e+05      3.534      0.001    7.72e+05    2.76e+06
conf[T.Big 12]    2.467e+06   5.49e+05      4.494      0.000    1.38e+06    3.56e+06
conf[T.Big Ten]    2.66e+06   5.06e+05      5.259      0.000    1.65e+06    3.67e+06
conf[T.Ind.]     -5.567e+05   6.55e+05     -0.850      0.398   -1.86e+06    7.46e+05
conf[T.MAC]       -9.97e+05   5.35e+05     -1.864      0.066   -2.06e+06    6.64e+04
conf[T.Mt. West] -5.329e+05   5.34e+05     -0.998      0.321   -1.59e+06    5.28e+05
conf[T.Pac-12]    1.109e+06   5.13e+05      2.161      0.034    8.83e+04    2.13e+06
conf[T.SEC]       2.322e+06   5.06e+05      4.586      0.000    1.32e+06    3.33e+06
conf[T.Sun Belt] -1.274e+06   5.15e+05     -2.475      0.015    -2.3e+06    -2.5e+05
ratio             2.061e+06   4.99e+05      4.127      0.000    1.07e+06    3.05e+06
==============================================================================
Omnibus:                       18.217   Durbin-Watson:                   2.055
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               32.977
Skew:                           0.754   Prob(JB):                     6.90e-08
Kurtosis:                       5.461   Cond. No.                         15.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [379]:
6.742e+05 +  1.765e+06 + (2.061e+06*0.417)
Out[379]:
3298637.0
In [383]:
my_model = str('school_pay ~ conf + rank')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.733
Model:                            OLS   Adj. R-squared:                  0.701
Method:                 Least Squares   F-statistic:                     23.03
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           4.16e-20
Time:                        16:29:07   Log-Likelihood:                -1443.5
No. Observations:                  95   AIC:                             2909.
Df Residuals:                      84   BIC:                             2937.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         2.467e+06   4.55e+05      5.420      0.000    1.56e+06    3.37e+06
conf[T.ACC]        1.72e+06   5.05e+05      3.402      0.001    7.15e+05    2.73e+06
conf[T.Big 12]    2.406e+06   5.57e+05      4.318      0.000     1.3e+06    3.51e+06
conf[T.Big Ten]   2.663e+06   5.11e+05      5.208      0.000    1.65e+06    3.68e+06
conf[T.Ind.]     -6.251e+05   6.61e+05     -0.946      0.347   -1.94e+06    6.89e+05
conf[T.MAC]      -1.041e+06    5.4e+05     -1.928      0.057   -2.11e+06    3.25e+04
conf[T.Mt. West]  -5.69e+05   5.39e+05     -1.055      0.294   -1.64e+06    5.03e+05
conf[T.Pac-12]    1.065e+06   5.19e+05      2.050      0.043    3.21e+04     2.1e+06
conf[T.SEC]        2.34e+06   5.12e+05      4.573      0.000    1.32e+06    3.36e+06
conf[T.Sun Belt] -1.269e+06   5.21e+05     -2.437      0.017    -2.3e+06   -2.33e+05
rank             -1.154e+04   2990.706     -3.859      0.000   -1.75e+04   -5594.947
==============================================================================
Omnibus:                       19.207   Durbin-Watson:                   2.055
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               36.686
Skew:                           0.773   Prob(JB):                     1.08e-08
Kurtosis:                       5.623   Cond. No.                         862.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [384]:
2.467e+06 + 1.72e+06 + (-1.154e+04*81)
Out[384]:
3252260.0
In [387]:
my_model = str('school_pay ~ conf + stad_size + grad_rate')
train_model_fit = smf.ols(my_model, data = train_set).fit()
print(train_model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             school_pay   R-squared:                       0.787
Model:                            OLS   Adj. R-squared:                  0.761
Method:                 Least Squares   F-statistic:                     31.00
Date:                Fri, 24 Jan 2020   Prob (F-statistic):           4.11e-24
Time:                        16:33:38   Log-Likelihood:                -1432.8
No. Observations:                  95   AIC:                             2888.
Df Residuals:                      84   BIC:                             2916.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         5.201e+04   4.61e+05      0.113      0.910   -8.65e+05    9.69e+05
conf[T.ACC]       1.061e+06   4.67e+05      2.271      0.026    1.32e+05    1.99e+06
conf[T.Big 12]    1.638e+06   5.21e+05      3.145      0.002    6.02e+05    2.67e+06
conf[T.Big Ten]   1.502e+06   4.95e+05      3.037      0.003    5.18e+05    2.49e+06
conf[T.Ind.]     -1.038e+06   5.91e+05     -1.756      0.083   -2.21e+06    1.37e+05
conf[T.MAC]      -5.696e+05   4.91e+05     -1.161      0.249   -1.55e+06    4.06e+05
conf[T.Mt. West] -3.397e+05   4.84e+05     -0.702      0.484    -1.3e+06    6.22e+05
conf[T.Pac-12]    6.669e+05    4.7e+05      1.420      0.159   -2.67e+05     1.6e+06
conf[T.SEC]       9.958e+05   5.09e+05      1.955      0.054   -1.71e+04    2.01e+06
conf[T.Sun Belt] -5.234e+05   4.72e+05     -1.108      0.271   -1.46e+06    4.16e+05
stad_size           39.8866      6.307      6.324      0.000      27.344      52.429
==============================================================================
Omnibus:                       10.884   Durbin-Watson:                   1.840
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               13.431
Skew:                           0.588   Prob(JB):                      0.00121
Kurtosis:                       4.417   Cond. No.                     7.69e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [392]:
train_model_fit.params[0] + train_model_fit.params[1] + (train_model_fit.params[-1] * syracuse['stad_size'])
# train_model_fit.params[-1]
Out[392]:
92    3.077273e+06
Name: stad_size, dtype: float64
In [367]:
syracuse = data[data['school'] == 'Syracuse']
In [334]:
syracuse
Out[334]:
school conf coach bonus buyout rank W L ratio stad_size gsr med_conf_sal seat_rank combo_rank true_rank gs_rank
92 Syracuse ACC Dino Babers 775000.0 4000000.0 81 5 7 0.417 49250.0 80 2458033.0 63 73 49 87

I : INTERPRET

  1. Drop Syracuse from the training data
  2. Rerun our model
  3. Apply our linear regression equation
In [380]:
data = data_for_m.copy()
In [381]:
data[data['school'] == 'Syracuse']
Out[381]:
school conf coach school_pay bonus buyout rank W L ratio stad_size gsr med_conf_sal seat_rank combo_rank true_rank gs_rank
92 Syracuse ACC Dino Babers 2401206.0 775000.0 4000000.0 81 5 7 0.417 49250.0 80 2458033.0 63 73 49 87
In [186]:
len(data)
Out[186]:
130
In [ ]: