IST718 | BREAKOUT WEEK 3

Reading

What is your asessment of the sustainability journey?

How do consumers make choices? Yesterday? Today? Tomorrow?

What do we need to evaluate a sustainability journey?

What techniques can we use to make a recommendation engine?

Practice

Pick a country - any country - what does the data tell? 

Could we build a model for predicting sustainable claim?

Should we build a model for predicting whether a product will result in a sustainable claim?

Develop a graphic to support your team's answer.

In [67]:
import pandas as pd
import numpy as np
from scipy.stats import uniform
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

O: OBTAIN

In [3]:
url = "https://raw.githubusercontent.com/2SUBDA/Breakouts/Week3/Case3SalesProducts.csv"
items = pd.read_csv(url, error_bad_lines=False)
In [4]:
items.head()
Out[4]:
Country OrderMethod RetailerType ProductLine ProductType Product Year Quarter SustainableClaim SustainableMarketing Revenue Quantity GrossMargin
0 United States Fax Outdoors Shop Camping Equipment Cooking Gear TrailChef Deluxe Cook Set 2012 Q1 2012 1 0 59628.66 489 0.347548
1 United States Fax Outdoors Shop Camping Equipment Cooking Gear TrailChef Double Flame 2012 Q1 2012 0 1 35950.32 252 0.474274
2 United States Fax Outdoors Shop Camping Equipment Tents Star Dome 2012 Q1 2012 1 0 89940.48 147 0.352772
3 United States Fax Outdoors Shop Camping Equipment Tents Star Gazer 2 2012 Q1 2012 1 0 165883.41 303 0.282938
4 United States Fax Outdoors Shop Camping Equipment Sleeping Bags Hibernator Lite 2012 Q1 2012 1 0 119822.20 1415 0.291450
In [7]:
items.dtypes
Out[7]:
Country                  object
OrderMethod              object
RetailerType             object
ProductLine              object
ProductType              object
Product                  object
Year                      int64
Quarter                  object
SustainableClaim          int64
SustainableMarketing      int64
Revenue                 float64
Quantity                  int64
GrossMargin             float64
dtype: object
In [16]:
np.random.seed(1234)
items['runiform'] = uniform.rvs(loc = 0, scale = 1, size = len(items))
items_train = items[items['runiform'] >= 0.33]
items_test = items[items['runiform'] < 0.33]

my_model = str('Revenue ~ Quantity + SustainableClaim + SustainableMarketing')
train_model_fit = smf.ols(my_model, data = items_train).fit()
print(train_model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Revenue   R-squared:                       0.136
Model:                            OLS   Adj. R-squared:                  0.136
Method:                 Least Squares   F-statistic:                     3112.
Date:                Wed, 22 Jan 2020   Prob (F-statistic):               0.00
Time:                        09:30:38   Log-Likelihood:            -7.3728e+05
No. Observations:               59273   AIC:                         1.475e+06
Df Residuals:                   59269   BIC:                         1.475e+06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept             2.546e+04    872.335     29.184      0.000    2.37e+04    2.72e+04
Quantity                14.6617      0.166     88.080      0.000      14.335      14.988
SustainableClaim      1.741e+04    749.616     23.228      0.000    1.59e+04    1.89e+04
SustainableMarketing -4625.3628    781.817     -5.916      0.000   -6157.727   -3092.998
==============================================================================
Omnibus:                    61303.368   Durbin-Watson:                   1.235
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          7368770.263
Skew:                           4.961   Prob(JB):                         0.00
Kurtosis:                      56.714   Cond. No.                     8.91e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [24]:
items['Country'].value_counts()
Out[24]:
United States     7482
Canada            5923
France            5779
Germany           5397
Japan             5359
United Kingdom    5102
Netherlands       4199
Switzerland       4103
Italy             4018
Austria           3862
Mexico            3845
Belgium           3710
Australia         3665
China             3652
Spain             3557
Singapore         3443
Finland           3409
Korea             3399
Brazil            3288
Sweden            2925
Denmark           2358
Name: Country, dtype: int64

Testing with Denmark

In [61]:
denmark = items[items['Country'] == 'Denmark']
In [62]:
denmark['SustainableClaim'].value_counts()
Out[62]:
1    1300
0    1058
Name: SustainableClaim, dtype: int64
In [64]:
df = denmark.groupby(['Year','SustainableClaim']).sum()
In [65]:
df.reset_index(inplace=True)
In [56]:
df
Out[56]:
Year SustainableClaim SustainableMarketing Revenue Quantity GrossMargin runiform
0 2012 0 383 4678059.55 170989 204.590129 190.908413
1 2012 1 147 11215207.01 180199 182.929616 246.010735
2 2013 0 439 6881559.00 165780 234.224984 207.095384
3 2013 1 176 13168906.71 211958 187.305995 246.407795
4 2014 0 236 4399289.58 100666 124.233882 123.766425
5 2014 1 102 7555510.08 132507 104.625091 135.537813
In [58]:
df.pivot('Year', 'SustainableClaim', 'Revenue').plot(kind="bar")
plt.title('denmark')
Out[58]:
Text(0.5, 1.0, 'denmark')

NOW IN A LOOP!

In [60]:
for country in set(items['Country'].values):
    country_df = items[items['Country'] == country]
    df = country_df.groupby(['Year','SustainableClaim']).sum()
    df.reset_index(inplace=True)
    df.pivot('Year', 'SustainableClaim', 'Revenue').plot(kind="bar")
    plt.title(country)
    plt.show()
In [ ]: