IST718 WK2


WEEK 2: Describing and Modeling

2.1 Scenario Introduction

OSEMN– obtain, scrub, explore, model, and interpret. These five steps can guide us through both basic and advanced analytical techniques, from linear regression to neural networks in natural language processing. Today, we will focus on explore, model, and interpret.

Explore– how do we describe the problem and the data we have available? Model– what relationships exist between the variables and the response variable we’re targeting? And interpret– based on the model, our data, and our analysis, what action do we recommend? What steps should we take to move from our current state to our desired state? In Modeling Techniques in Predictive Analytics, Professor Miller begins with a simple problem that will let us walk through several basic and advanced visualizations and begin to model some of the relationships between variables.

2.2 Data Review

2.3 Model Review

2.4 Recommendation

2.5 Code Review


2.6 Describing

2.7 Describing

2.8 Modeling

2.8.1

So we just finished talking about describing. Describing gives us the exploration. We’ve obtained some data, we’ve scrubbed some data, and we’ve explored some data.

And we start to think, OK, what are the relationships that we’ve identified inside this data that we can model? What are the independent and dependent variables? And how can we design a model that will define or describe that relationship mathematically?

2.8.2

This is a tough question. First, what metric are our stakeholders using to define “prefer”? Do they want us to answer which we “prefer” aesthetically? Which we “prefer” in terms of interpretability? Which our color printers would “prefer”? Which would translate best across cultures? Which would make the most sense to aliens? Without knowing what the metric is, it is hard to “prefer” one over the other. If they (amorphous stakeholders in the ether) are entrusting this entirely to my own non-scientifically backed “gut feeling,” my answers will differ depending on my mood, time of day, energy levels, and overall cynicism toward the human race. Did I find a lot of urine on the seat in the last public restroom I went to? If so, I will likely prefer model 2, as I like the implication that humans are shallow, two-dimensional beings. If I’ve just finished a riveting book sequel, a gripping podcast, or an eye-opening conversation with a new acquaintance, I will likely prefer model 1, a direct reflection of how interpreting and “preferring” everything from art to our own basic humanity lies entirely in the eyes of the beholder.

2.8.4

In this case, we’ve got the number of friends that you or I might have and, alongside that number of friends, the time we spend online. If we plot this data out, we get a slightly more complex view than the nice straight line we had with our first toy problem. If we try to fit a line to this set of data, we get an R-squared of 0.43. In this case, our model only accounts for about 43% of the variability that is present in our data.

And we get an estimate that we start with a baseline of 199 minutes online, plus 20 extra minutes for every friend we have. As we start to look deeper into model building and model evaluation, we’ll explain how this model is built by trying to minimize the distance between the actual data points and what we estimate they should be.
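To make that concrete, here is a minimal sketch of fitting that kind of line with Python’s statsmodels library. The friends and minutes values below are made-up stand-ins, not the course data set, so the fitted intercept and slope will only roughly resemble the 199 and 20 quoted above.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in data: number of friends and minutes spent online.
# These are not the course's actual values.
friends = np.array([2, 5, 8, 12, 15, 20, 25, 30])
minutes = np.array([230, 310, 340, 450, 480, 610, 700, 790])

X = sm.add_constant(friends)        # add the intercept column
fit = sm.OLS(minutes, X).fit()      # ordinary least squares fit

print(fit.params)                   # [intercept, slope]: base minutes plus minutes per friend
print(fit.rsquared)                 # fraction of the variability the line accounts for
```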

2.8.5

So now that we’ve seen what a basic model is, the next question is, how do we evaluate this model? How do we know if one model is better than another?

For ordinary least squares regression– the simple linear model that many of you are familiar with– one of the standard metrics we use is R-squared. R-squared is built from the residual sum of squares, which we get when we look at the differences between our actual y-values and our predicted y-values, measured against the total variability in the data.

In ordinary least squares regression, when we get this result back, it’s accounting for the difference between what we predicted would happen and what actually happened, and we’ll show this on a graph in just a second. The results that we get from either R or Python normally provide a set of numbers that give us some insight into how we can evaluate this model.
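As a rough sketch of how that number is computed, R-squared compares the residual sum of squares against the total sum of squares around the mean. The actual and predicted values below are hypothetical placeholders, not output from the course model.

```python
import numpy as np

# Hypothetical actual and predicted y-values (placeholders, not course output).
y_actual = np.array([230, 310, 340, 450, 480, 610])
y_pred   = np.array([250, 300, 370, 430, 500, 590])

ss_res = np.sum((y_actual - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares around the mean
r_squared = 1 - ss_res / ss_tot

print(r_squared)   # closer to 1 means the model accounts for more of the variability
```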

Looking at the adjusted R-squared for this model, we can see that we’re basically accounting for about 53% of the variability– the uncertainty– in the data.

So what does this look like on the graph, and what does it look like going back to the output from your statistical software? If you recall from our toy problem, we had a simple formula in which, for every customer, we increased revenue by $25.

In this case, when we model this, we get a perfect line that matches our number of customers and our revenue. And the R-squared for this is– magically enough– one, because there is no uncertainty in the model, and there’s no uncertainty in the actual generative process that underlies that model.

So in this case, for every customer we have, we get $25 more in revenue. It’s pretty easy to build a model that matches that generative process.
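A quick sketch of that noiseless toy problem: because revenue is exactly $25 per customer, a fitted line recovers the generative process and R-squared comes out at (essentially) one. The customer counts below are arbitrary.

```python
import numpy as np

customers = np.arange(1, 11)
revenue = 25 * customers                   # noiseless generative process: $25 per customer

slope, intercept = np.polyfit(customers, revenue, 1)
predicted = intercept + slope * customers

ss_res = np.sum((revenue - predicted) ** 2)
ss_tot = np.sum((revenue - revenue.mean()) ** 2)
print(1 - ss_res / ss_tot)                 # ~1.0: no uncertainty left to explain
```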

If we go to the slightly more complex model, we can start to look at how much time we spend online based on the number of friends we have. The generative process that built this data set has a little bit more uncertainty and a little bit more randomness than the simple toy problem of $25 of extra revenue for every customer.

So as we start to look at this model, we’re trying to draw the line that best fits all the different data points in the data set. We’re trying to minimize the residual sum of squares, so we’re trying to minimize the distance between the actual values and what we predict.
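To show that minimization framing directly, here is a sketch that defines the residual sum of squares as a function of the intercept and slope and hands it to a general-purpose optimizer; the data values are hypothetical stand-ins.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical friends / minutes-online values (stand-ins, not the course data).
x = np.array([2, 5, 8, 12, 15, 20, 25, 30])
y = np.array([260, 250, 400, 420, 530, 560, 660, 830])

def rss(params):
    intercept, slope = params
    return np.sum((y - (intercept + slope * x)) ** 2)   # the distance we are trying to minimize

result = minimize(rss, x0=[0.0, 0.0])   # search for the intercept and slope with the smallest RSS
print(result.x)                         # should agree with the closed-form least squares solution
```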

In this case, for data that’s this dispersed, or has this much range, it’s hard to get a model with a very large R-squared. With this particular model, we’re only accounting for about 43% of the variability within the data.

At this point, as we start to evaluate this model, we might conclude that a linear model is not the best way to fit this data, or that there’s another piece of information we need, besides just the number of friends, in order to determine how much time someone spends online.

So let’s take a quick minute and talk about p-values. P-values are often used to make a statement regarding the impact of a particular variable on the response. A low p-value means the estimated effect is highly unlikely to occur by chance alone, so that particular predictor variable is probably significant.

So these two values right here are very low p-values; it’s highly unlikely they occurred randomly, so those predictors are probably significant. If a p-value is high, the coefficient might actually be zero, and we should consider removing that variable from the model.
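If the model is fit with statsmodels in Python, those per-coefficient p-values can be read straight off the fitted result, as in this sketch (the data again being hypothetical). A common, though arbitrary, convention is to treat p-values below 0.05 as significant.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with one predictor (stand-in values).
x = np.array([2, 5, 8, 12, 15, 20, 25, 30])
y = np.array([260, 250, 400, 420, 530, 560, 660, 830])

fit = sm.OLS(y, sm.add_constant(x)).fit()

print(fit.pvalues)     # one p-value per coefficient: intercept first, then the slope
print(fit.summary())   # full table with coefficients, standard errors, t-statistics, p-values
```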

As we look at how we validate the model, there are three different techniques we can try. One is to collect new data and see how well our model performs against it. Were we able to predict what the outcome would be?

Another is comparison. We could compare the results with theoretical expectations or with earlier empirical studies, or we could use simulation, as we’ll see later in the course when we try to pick winners and losers from games.

And finally, the most commonly used technique is splitting the data. We split the original data into a training set and a testing set: using the training set, we build the model, and then, using the testing set, we validate the model.

While the statistical side focuses heavily on R-squared and other statistical measures, the machine learning side focuses on whether we can predict the actual outcome given our data and our model.
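A minimal sketch of that train/test workflow with scikit-learn, assuming an illustrative 70/30 split and made-up data: build the line on the training rows, then see how well it predicts the held-out rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data (stand-ins, not the course data set).
X = np.array([2, 5, 8, 12, 15, 20, 25, 30, 35, 40]).reshape(-1, 1)
y = np.array([260, 250, 400, 420, 530, 560, 660, 830, 900, 1010])

# Hold out 30% of the rows as a testing set; the rest becomes the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # build the model on the training set
predictions = model.predict(X_test)                # predict the held-out outcomes

print(r2_score(y_test, predictions))               # variability explained on unseen data
print(mean_squared_error(y_test, predictions))     # average squared prediction error
```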

2.9 Modeling

2.10 Introduction to Case Study
