Video 6 of 6 of the Understanding Statistics guide

Multiple Regression 27:59

Antony Davies shows how statisticians use multiple regression to estimate the relationship between an outcome variable and more than one factor variable, and concludes the lecture series by explaining why understanding statistics is a crucial part of establishing a good working knowledge of the world.

Transcript

Antony Davies: Let me show you something else. What we are doing there goes by the name simple regression. What I’m going to show you here is called multiple regression. Works the same way except that your relationship can involve more than two variables. We can have several things here. So, in this example you’re a trucking company, and your goal is to schedule the trucks, and as part [00:00:30] of your scheduling the trucks it’s necessary for you to be able to estimate how long the trucks will be gone.

So on the one hand I don’t want a bunch of idle trucks sitting around. On the other hand, I don’t want someone to come in and say I’d like to move a shipment from A to B, and I have no trucks to give him. So, this becomes important to me to be able to estimate how long trucks will be gone. And I walk into the room assuming that there are really two things that influence the travel time. Maybe there are other things as well, but they [00:01:00] fall out into the noise. The two important things, the things that I can control or have some vision into are these. Miles traveled, and number of deliveries.

So I figure the more miles a truck travels on average, the more time it’s going to take to leave and come back. The more deliveries it makes, the more time it’s going to take, right? Because this goes out and makes lots of deliveries, it’s going to take more time before it comes back to the shipping yard. So what we see here is data on [00:01:30] travel time for a handful of deliveries that our trucks have made, miles that the trucks have traveled, and deliveries that they’ve made. And you can see in this data set we’ve got a total of 87 hours of travel time. These trucks traveled a combined 4000 miles, and made a combined 29 deliveries. So the question is this. How can I use this information to predict travel time?

Let’s try a couple of straightforward things. So the way [00:02:00] I’m going to use this analysis is to answer the question I have a truck that’s going to travel 325 miles, and make two deliveries, and I’d like to know how long is this truck going to be gone? I’m going to use this data to try and come up with an estimate for how long the truck will be gone, and let’s start off by doing something that seems pretty straightforward. I have here in my data set a total of 87 hours of travel time, and these trucks that traveled 87 hours traveled a combined [00:02:30] 4000 miles. So if I simply do the division, my trucks on average are taking .022 hours for every mile they travel. So, if I’ve got a truck that’s going to travel 325 miles, I can do 325 times .022, and I get 7.2 hours is my estimate for how long the truck will be gone. That’s not bad. It’s a straightforward thing to do. I’ve calculated average hours per mile, and done the multiplication.

Here’s the problem. [00:03:00] I have data not just on average miles, on the number of miles traveled. I also have data on number of deliveries. And I could use number of deliveries to estimate how long my truck will be gone, so in this data set my trucks were gone a total of 87 hours, and combined they made 29 stops. That’s 3 hours per delivery. So if I have a truck that’s going to be gone making two deliveries, three hours per delivery times two deliveries, that’s 6 hours. [00:03:30] Notice my problem now. I’ve got two conflicting estimates for how long my truck will be gone.

On the one hand, if I look at hours per mile traveled, I’m estimating that my truck is going to be gone 7.2 hours. On the other hand if I look at deliveries per, or hours per delivery, I find that I estimate my truck is going to be gone six hours. So, which is it? Is it 7.2 hours, or is it six hours? You might be tempted to just take the average of the two [00:04:00] and say look, I figure I’m estimating how long they’re going to be gone. If I estimated according to miles, it’s 7.2 hours. If I estimate according to deliveries it’s 6 hours, just average the two together. All right. It’s not very satisfying because it’s kind of just ad hoc. Why would you necessarily average these together, you know? Why not add them? Why not do powers or something? Right?
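The two conflicting back-of-the-envelope estimates can be reproduced in a few lines. This is only a sketch of the arithmetic described above; the variable names are illustrative, and the hours-per-mile rate is rounded to .022 as in the lecture, which gives about 7.15 hours (quoted in the lecture as 7.2).

```python
# Totals quoted in the lecture for the trucking data set.
total_hours = 87
total_miles = 4000
total_deliveries = 29

# The planned trip from the lecture.
trip_miles = 325
trip_deliveries = 2

# Naive estimate 1: average hours per mile, rounded to .022 as in the lecture.
hours_per_mile = round(total_hours / total_miles, 3)
estimate_by_miles = hours_per_mile * trip_miles

# Naive estimate 2: average hours per delivery.
hours_per_delivery = total_hours / total_deliveries
estimate_by_deliveries = hours_per_delivery * trip_deliveries

print(f"by miles:      {estimate_by_miles:.2f} hours")
print(f"by deliveries: {estimate_by_deliveries:.2f} hours")
```

Each estimate uses only one of the two factors, which is why they disagree.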

A much better approach here is to use what’s called a multiple regression analysis, and in the multiple regression analysis [00:04:30] we walk into the room, and we say, “Look. I believe there’s a relationship between hours of travel time, miles traveled, and deliveries made. And furthermore, I believe the relationship looks like this. Hours is some number A plus some other number B times miles, plus some other number C times deliveries, plus noise.”
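Written as an equation, the assumed relationship looks roughly like this. The symbol u for the noise term is a notational assumption; A, B, and C are the numbers the regression will estimate.

```latex
\text{hours} = A + B \cdot \text{miles} + C \cdot \text{deliveries} + u
```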

Now, the computer can tell me what A, B, and C are. U, the noise, these are things [00:05:00] that affect the hours, the travel time, other than miles and deliveries. So things like my driver got pulled over and inspected, or he spilled a cup of coffee on himself and had to stop. He had a fight with his wife and he’s distracted, and took a wrong turn. All these little pieces of noise that will influence hours. All that stuff gets lumped into U, and I want to blow all of that away so that I’m looking at this pristine relationship with all the noise gone. [00:05:30] Just show me the relationship between miles, deliveries, and hours.

So if we run a regression of this, we feed the whole thing through the computer, and the computer comes back and says, “Okay. You’ve got this cloud of data, that’s miles, deliveries, and hours. And the line, or in this case a plane because it’s three dimensional, that fits the data most closely is this. Your estimated hours that your truck is going to eat up is 1.13 plus .01 times miles, plus .92 times deliveries.” [00:06:00] So let’s look at these numbers separately. The .01 times miles, what is that? The .01 remember measures the magnitude of the relationship between miles and hours. So this tells me on average, traveling an additional mile will add .01 hours to your trip. On average, traveling an additional mile will add .01 hours to the trip.
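What "feeding the data through the computer" might look like is sketched below. The transcript does not include the lecture's actual data set, so this example generates synthetic trips from the lecture's coefficients (1.13, .01, .92) plus random noise and checks that least squares recovers numbers close to them. The data-generating choices (trip lengths, how deliveries depend on miles, the size of the noise) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic trips (NOT the lecture's data): longer trips tend to have more deliveries.
miles = rng.uniform(50, 600, size=n)
deliveries = 1 + rng.poisson(miles / 150)

# Generate hours from the lecture's estimated coefficients plus random noise.
true_a, true_b, true_c = 1.13, 0.01, 0.92
noise = rng.normal(0, 0.3, size=n)
hours = true_a + true_b * miles + true_c * deliveries + noise

# Design matrix: a column of ones for A, then miles, then deliveries.
X = np.column_stack([np.ones(n), miles, deliveries])

# Ordinary least squares: the plane that fits the cloud of points most closely.
coef, *_ = np.linalg.lstsq(X, hours, rcond=None)
a_hat, b_hat, c_hat = coef

print(f"A = {a_hat:.2f}, B = {b_hat:.3f}, C = {c_hat:.2f}")
```

The estimates will not match the true values exactly, because the noise differs from trip to trip, but they should land close to them.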

.92 is the coefficient, is [00:06:30] the parameter that’s attached to deliveries, and that measures the magnitude of the relationship between deliveries and hours. So this says on average, on average, making an additional delivery will add .92 hours to your travel time. Making an additional delivery will add .92 hours to your travel time. And here’s where life gets fascinating. Regression analysis, when you have more than one factor in here like we have miles and deliveries, [00:07:00] we’ve got two factors here trying to explain hours. When you put more than one factor into a regression analysis, what the regression analysis gives you back is what’s called the marginal effect. So the .01 technically speaking, we would call the marginal effect of miles on hours. The .92 we would call the marginal effect of deliveries on hours.

What does that mean? [00:07:30] It means that .01 is the effect of an additional mile on hours after filtering out the effect of deliveries on hours. .01 is the effect of an additional mile on hours after filtering out the effect of deliveries on hours. Similarly, .92 is the [00:08:00] effect of an additional delivery on hours after filtering out the effect of miles on hours, and if you start to think about it like this, you’ll notice where we would have gone wrong by taking our simple averages that we had and putting them together. So we had when we looked at just miles separately and deliveries separately, with miles we had an estimate of 7.2 [00:08:30] hours to go 325 miles, and with deliveries we had an estimate of 6 hours to make two deliveries. And so our knee-jerk reaction was well, just average those two numbers together, and you get a nice estimate for how many hours it’s going to take you.

Here’s the problem. Deliveries and miles are going to be related. On average, on average, the further he travels the more opportunities he has to deliver stuff, so I would expect him to be doing more deliveries. [00:09:00] And for shorter trips, I would expect fewer deliveries to be happening, because he’s not going that far. So because these two things are related, if I calculate my estimate is 7.2 hours based on miles, and my estimate is 6 hours based on deliveries, and then somehow combine them, I end up double counting. I end up double counting the effect of miles and deliveries because miles and deliveries are themselves interrelated. And this is the beauty [00:09:30] of the multiple regression. When you see these effects, the .01 marginal effect of miles on hours, and the .92 marginal effect of deliveries on hours, these are the effects of miles on hours after filtering out the effect of delivery on hours. So we use this term marginal effect to describe this phenomenon.
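The double-counting problem can be illustrated with a small simulation. This is a sketch on synthetic data, not the lecture's data set: deliveries are made to rise with miles, and hours are generated with a true marginal effect of .01 per mile. A regression on miles alone then overstates the per-mile effect, because the miles slope also picks up the time spent on the extra deliveries that longer trips tend to include.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic trips (an assumption for illustration): deliveries rise with miles.
miles = rng.uniform(50, 600, size=n)
deliveries = 1 + rng.poisson(miles / 150)
hours = 1.13 + 0.01 * miles + 0.92 * deliveries + rng.normal(0, 0.3, size=n)


def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Simple regression: hours on miles alone. The slope also absorbs the
# time spent on the extra deliveries that longer trips tend to include.
simple_miles_slope = ols([miles], hours)[1]

# Multiple regression: hours on miles and deliveries together. The miles
# coefficient is now the marginal effect, with deliveries held fixed.
marginal_miles_slope = ols([miles, deliveries], hours)[1]

print(f"simple slope on miles:    {simple_miles_slope:.4f}")
print(f"marginal effect of miles: {marginal_miles_slope:.4f}")
```

Under these assumptions the simple slope comes out noticeably larger than the marginal effect, which stays close to .01.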

One other thing that we see here, now this is going to your earlier question, what’s the A mean? The A here does have an interesting interpretation. [00:10:00] So A, 1.13 is our estimate for hours when miles are zero, and deliveries are zero. So imagine a truck that goes nowhere, and delivers nothing. According to my model, it’s going to take 1.13 hours to do that, and you might wonder why it should take any time at all? What is this thing measuring?

What it’s measuring might be the fixed cost of my driver [00:10:30] going to the dispatch office, getting the keys, getting the map of where it is he’s going, whatever it is, going to the truck, checking his load, the air pressure in his tires, and the oil, and the fuel, and all of that stuff, backing the thing out of wherever it is, and turning onto the road. All of that involved traveling nowhere, and delivering nothing, but it contributed to the hours involved, the time. [00:11:00] So we think of it as a fixed cost. No matter what you do, no matter how many hours you, no matter how many miles you drive, no matter how many deliveries you make, you’re going to have this overhead of 1.13 hours, so that’s an interesting interpretation for A in this example.

So now, we’ll come to my point. My point was I’ve got a truck that’s going to travel 325 miles, and make two deliveries. I can plug it into my estimated regression model. The 325 for miles, the two for deliveries, and the thing comes back and tells me 6.2 [00:11:30] hours. Does this mean that my truck will take exactly 6.2 hours? No, it probably won’t take 6.2 hours. What it says is, on average a truck that travels 325 miles, and makes 2 deliveries, I can expect to take 6.2 hours. And when it doesn’t, it might be more, it might be less, that’s random noise. Random things happen that have nothing to do with miles, nothing to do with deliveries, to influence the actual hours.
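Plugging the trip into the estimated model is a one-line calculation. The sketch below uses the coefficients quoted in the lecture; the function name `predict_hours` is made up for illustration.

```python
def predict_hours(miles, deliveries, a=1.13, b=0.01, c=0.92):
    """Expected travel time from the lecture's estimated regression plane."""
    return a + b * miles + c * deliveries


# The trip from the lecture: 325 miles and 2 deliveries.
expected_hours = predict_hours(miles=325, deliveries=2)
print(f"{expected_hours:.2f} hours")
```

The result, 1.13 + 3.25 + 1.84 = 6.22, is the 6.2 hours quoted above. It is an average for trips of this kind, not a promise for any single trip.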

[00:12:00] So, we can turn now, we were talking about magnitude effects. We can turn now to the P value, so I have this hypothesis after accounting for deliveries, my walk-in hypothesis is after accounting for deliveries, miles have no effect on hours, right? So my walking in hypothesis is there’s no effect here. So I look at the P value that goes with my miles parameter, and I see a very low P value. Very low P value means the data contradicts me. [00:12:30] The data says, “No, you are wrong. There does appear to be a relationship between miles and hours even after filtering out the effect of deliveries.”

Similarly, I can say, “Look, my walking in hypothesis is after I account for miles, deliveries have no effect on hours.” And again I can look at the P value. I see a very low P value, meaning that no, the data contradict me. There does appear to be a relationship between deliveries and hours. And then finally for this data set I’m getting R squared of .9, [00:13:00] which says what? Of all the things that influence the travel time, of all the things that influence travel time, miles and deliveries account for about 90%. The other 10% is due to the noise, the spilling of the coffee, the argument with the wife, the getting pulled over by the police. The point of regression is to filter out the noise, and to find the underlying real effects.
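For readers who want to see where a number like R squared comes from, here is a minimal sketch using the standard definition: one minus the ratio of leftover (residual) variation to total variation. The travel times and predictions below are hypothetical values chosen for illustration, not the lecture's data.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical travel times (hours) and model predictions, for illustration only.
actual_hours = [5.0, 7.5, 6.0, 9.0, 4.5]
predicted_hours = [5.4, 7.1, 6.3, 8.6, 4.6]

fit_quality = r_squared(actual_hours, predicted_hours)
print(f"R squared = {fit_quality:.2f}")
```

A value near 1 means the factors in the model account for most of the variation; a value near 0 means most of it is noise.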

And the reason this becomes very important in [00:13:30] social sciences, particularly in economics is because it’s so difficult for us to conduct experiments. When you can conduct an experiment, you control for things. So I want to know if I feed a plant coffee versus water, will the plant grow better? So what I do is I control for everything I can control for. I have two plants. I put them in the same temperature, the same humidity, the same light level, all of this. I feed them the same quantity of liquid. The only thing I change is what the liquid is. This one gets coffee. This [00:14:00] one gets water. And then I observe. I measure how much they grow over time. In a controlled experiment, the point is to control everything except for the one thing you’re testing for. That thing we’re going to vary.

And so when I see a difference in the plants, I conclude it must be due to the coffee versus the water, because everything else was the same. In economics, you rarely can do that. I can’t control for things. I have to just take the data that’s shown to me. So when I take [00:14:30] the data that’s shown to me, if I put it into a multiple regression model, I get the same sort of things. So I say, for example, I wonder if, I wonder if increasing the income tax rate would cause people to work less? And so I look around and I have data on how many hours people work, and I have data on their marginal income tax rates. And I can run a simple regression and I can show some results, and someone is going to put their hand [00:15:00] up and say, “But wait a minute. There’s lots of things that affect people’s willingness to work other than their income tax rate. How do I know that this difference I’m seeing is due to the income tax rate, not something else?”

If the person were a plant, I’d put one of them in this box, and one of them in this box, and everything would be the same except for the tax rates, and I’d watch how much they work. I can’t do that, so what I do is I run a multiple regression model, and on this side I have the thing I’m trying to explain. How many hours do you work? [00:15:30] Over here I have the thing that I’m asking, does this affect that? And that’s the income tax rate. And then I have a whole bunch of other stuff, and all these other things are the things I’d like to control for, but can’t. So, things like the household income. The household he lives in, his education level, his age. How much money he has in his savings account. All these other things that might affect also his willingness to work. And when I run the regression [00:16:00] model, what I get over here is the marginal effect of the tax rate on his willingness to work. That is, it’s the effect of the tax rate on his willingness to work, after you filter out the impacts of all these other things.

So multiple regression is an attempt to get what I would get if I had run a controlled experiment, but I can’t run a controlled experiment.

Student: Do you take any issue with that sometimes? [00:16:30] I know that one example we always learned in school was that’s how they got the value of a park, but what are the kind of things that are unseen if that park wasn’t there, and is that an appropriate way to value a park?

Antony Davies: Right. So first off, in many cases when it comes to economic data, there are problems with running regression analysis, but it beats the alternative. And the alternative is throwing up your hands, and walking away and doing nothing, right? [00:17:00] What’s important is that we be aware of what the problems are. So one problem for example is I’m assuming that when I make this list of things that affect your willingness to work, that I’ve identified all of the important ones. I may have left something out, and if I’ve left something out, then the results I get are not meaningful anymore. The technical term is they’re biased, right? So I’m going to get numbers over here that might look, and smell, and taste good, but actually they’re meaningless. Right? They’re wrong.

[00:17:30] Another problem is I’ve assumed that the relationship is linear, right? This whole thing when we do regression, it’s all linear relationships. A one unit change in this causes some fixed change in this thing over here. It’s possible the relationship isn’t linear. Maybe it’s the case that at low levels of tax rate, raising the tax rate a bit has a big impact on your willingness to work, but at high tax rates maybe raising your tax rate doesn’t have much of an effect. It’s a non-linear relationship. [00:18:00] If it’s a non-linear relationship and I put together a linear regression, then I’m again going to get results that aren’t meaningful. But I have to be aware that that’s a possibility, and there are all kinds of tests in the background that I can run to verify if indeed I’ve set this thing up correctly, right?
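One simple background check, sketched here under assumptions, is to look at what the straight line leaves behind. The example below invents a curved relationship (a square root, with no noise) of the kind described above, where the effect is large at low values and small at high values, then fits a straight line to it. The specific function and range are illustrative choices, not anything from the lecture.

```python
import numpy as np

# Hypothetical curved relationship with no noise: big effect at low x, small at high x.
x = np.linspace(0, 100, 101)
y = np.sqrt(x)

# Fit a straight line by least squares (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# A well-specified linear model leaves residuals with no pattern. Here they
# are negative at both ends and positive in the middle: a sign of curvature.
print(f"left end:  {residuals[0]:+.2f}")
print(f"middle:    {residuals[50]:+.2f}")
print(f"right end: {residuals[-1]:+.2f}")
```

A systematic pattern like this in the residuals suggests the linear form is the wrong shape for the data.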

So it’s not as simple as simply throwing the things into the pot and seeing what comes out. In fact, I tell my students who have reached this level that they are now [00:18:30] officially dangerous. They know enough to be able to put data into the machine, and to run the thing, and to get results, and to interpret the results, but they don’t know enough to be aware of where they might have gone wrong, such that the results, good as they look, are actually meaningless.

Student: So how much can we trust stats that we encounter kind of in media or in a newspaper, and what are some red flags to look for to kind of see bad stats?

Antony Davies: That’s a good question. So [00:19:00] this goes to the thing people like to say, there are lies, damn lies, and statistics, to which I like to respond there are liars, damn liars, and people who don’t understand statistics but repeat them anyway. I think the problem with statistics … now, there are stats out there that are just wrong. Usually you find them in memes, but if you’re looking at actual stats from reputable places, government statistics, Gallup, these kinds of things, [00:19:30] the problem isn’t the statistics. Statistics don’t lie, but the humans who are presenting them can.

For example, there was a survey of something, a hundred, hundred and fifty economists not too long ago asking about the minimum wage. And a large number of the economists reported that they didn’t think that the minimum wage caused unemployment. Or at least this is the way it was presented, so what you saw for [00:20:00] people who are pro minimum wage who were presenting this research, they said, “Look at all these economists. One third of them,” or whatever the number was, it was a large number, “conclude that increasing the minimum wage does not cause unemployment.”

And you look at that, and you say, “Wow, the economists think minimum wage doesn’t cause unemployment.” The problem wasn’t the statistic, the problem was the person who was repeating it. If you dig into the question that the economists were asked, they were not asked does [00:20:30] increasing minimum wage cause unemployment. They were asked does increasing the minimum wage cause significant unemployment? And that’s where you’ve got one third of them saying, “No.” They weren’t saying no to unemployment causing, or to minimum wage causing unemployment, they were saying no to unemployment, minimum wage causing significant unemployment.

Did some of them maybe think that they had no effect at all? Possibly, but the question wasn’t asking what the person who was repeating [00:21:00] the statistic said that the question was asking. And that one little word, significant maybe didn’t … I’m not saying that the person who was repeating the statistic was deliberately lying. The person may not have understood the significance of that one little word, but that one little word makes a big difference to how the people who are reading the question answer, and can make a big difference to what the statistic actually means. So, in summary I would say the thing to be careful of isn’t so much the numbers as it is the person who is [00:21:30] telling you what the numbers mean.

Student: So, what you’ve described is a very empirical approach to economics and social science. How does this contrast with the more a priori approach of deducing from some basic assumptions and principles about purposeful human action that’s employed by Austrian economists?

Antony Davies: Yeah, it’s an interesting question. Actually, I’ve written what I thought was a pretty good paper, and lots of people write me even still [00:22:00] about it for Cato on exactly that topic. And the question is, how do I as an economist who loves the Austrian school, but also loves statistical analysis, rectify those two viewpoints?

And I don’t see them in opposition, right? Remember, as we talked about statistics and P values, and regression, at each step, at each example, what I said is [00:22:30] you walk into the room with an assumption as to how the world works, and you ask the data, “Do you, data, confirm or deny this thing that I’m assuming?”

And left unsaid here is how do you come up with the thing that you’re assuming, that you’re walking in with? And the Austrian approach gives us a lot of interesting things here, principally because they start from first principles. Right? This is how we believe humans behave, and if humans behave this way, then the following things [00:23:00] result from that. That’s the hypo that we walk in the door with.

One of the things that we haven’t discussed, and is true and it’s a problem, is data mining. And this is becoming more of a problem as we get more and more data and computing power that, you know, on the phone in my pocket I can do stuff that NASA couldn’t do with its roomfuls of computers 30, 40 years ago. One of the things that means is it becomes very easy to search [00:23:30] lots, and lots of data very quickly. And if you have the ability to search lots and lots of data very quickly, you will find stuff that looks real. It looks like sales of keyboards are influenced by colors of pens that people have. And you look at the data, and lo and behold, every time when we sell more red pens, keyboard sales go down. When we sell more blue pens, keyboard sales go up. We found this wonderful thing. What you found is a spurious [00:24:00] relationship. By random chance, that happens to be the case.

Now, the real problem with data mining is you multiply by orders of magnitude the likelihood of you finding spurious relationships because you’re looking at all kinds of things, right? You’re guaranteed if you look hard enough, you’ll find these spurious relationships. What’s needed to help guard against finding spurious relationships is connecting the analysis [00:24:30] you’re performing to some defensible hypothesis. That is, there is no defensible hypothesis that says colors of pen sales influence keyboard sales, so I shouldn’t even be looking at that relationship. What the Austrians give us is the reminder that we must be rooted in first principles, and that guides the sort of things that we analyze. And to that extent, I think the two, data analysis and Austrian [00:25:00] economics actually are complementary.
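A small simulation can make the data-mining danger concrete. In this sketch everything is random noise by construction: an outcome with 30 observations and 1000 candidate "explanations" that have nothing to do with it. The counts (30 and 1000) are arbitrary choices for illustration, and the cutoff of 0.361 is roughly the correlation needed for a P value below .05 with 30 observations.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_candidates = 30, 1000

# Pure noise: an outcome and 1000 candidate "explanations" that are all
# unrelated to it by construction (hypothetical data, for illustration).
outcome = rng.normal(size=n_obs)
candidates = rng.normal(size=(n_candidates, n_obs))

# Standardize, then compute the correlation of each candidate with the outcome.
z_outcome = (outcome - outcome.mean()) / outcome.std()
row_means = candidates.mean(axis=1, keepdims=True)
row_stds = candidates.std(axis=1, keepdims=True)
z_candidates = (candidates - row_means) / row_stds
correlations = z_candidates @ z_outcome / n_obs

# The strongest relationship found by searching through all the candidates.
strongest = np.abs(correlations).max()

# |r| > 0.361 is roughly the 5% significance cutoff for 30 observations.
share_significant = np.mean(np.abs(correlations) > 0.361)

print(f"strongest correlation found: {strongest:.2f}")
print(f"share passing the 5% cutoff: {share_significant:.1%}")
```

Around one candidate in twenty clears the cutoff by chance alone, and the strongest of them can look like a substantial relationship even though none is real.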

Humans are story tellers. We like stories. From the time we’re little, people tell us things, and when we’re older it’s anecdotes. Anecdotes catch our imaginations. They stick in our memories, and I think humans have evolved to be anecdotal creatures because prior to the invention of writing, and widespread literacy, which is a very recent thing, [00:25:30] that’s the way we pass down wisdom from generation to generation. What plant not to eat, what animal to stay away from because it’s going to eat you. These sorts of things you tell stories about Uncle Joe who got too close to that big cat, and it just bit his leg off, right? And I’m not going to do that, right? And we embellish it with all kinds of other things to make it more scary so the kids don’t go anywhere near the thing, right?

And we’re anecdotal creatures. Anecdotes give us on top of that, an entertaining, colorful way [00:26:00] to look at, to grapple with the world around us. The problem is good decisions are more often made by statistics, or analysis of statistics than by consulting anecdotes. And the problem with statistics is they’re not colorful. They’re not interesting, right? We have to force kids to sit through stats classes because nobody likes this stuff.

[00:26:30] So I think one of the takeaway messages here is while there’s no way to make statistics more palatable, it is at least necessary to communicate to people who maybe aren’t interested in knowing about statistics that there are great drawbacks to using anecdotes as your map for the world as you go about making decisions. And at least help people to understand things like [00:27:00] when you make a decision to ban a drug or to ban a particular type of weapon, or to subsidize this thing or tax that thing, that it’s important that we withhold our colorful, our joyful, our desire to help others to do the right thing, to behave well, to hold that back a little bit and remember that the data are the things that we should consult in making decisions.

Your heart is [00:27:30] a wonderful thing to consult in asking the question what is it we should be making decisions about, but the decision itself needs to be driven by data.