# Multiple linear regression: What to do if I have a low Durbin-Watson value?

#### Miz123

##### New Member
I am investigating to what extent Temperature, Rainfall, GDP per capita, Population, Livestock and Arable Land (IVs) predict Food Calories (kcal/cap/day) (DV) in 6 countries.

So I assume I will need to carry out 6 separate multiple linear regression models (one per country)?

My first country is Niger. My sample size (N) is 52 (1961-2013).

My Durbin-Watson value is 0.894. I understand that this indicates positive autocorrelation within my data. How do I overcome this problem?

I originally had more than 6 IVs, but I removed the extra ones because their data were incomplete, with a lot of missing values, or their time frame started in 1991 instead of 1961. Can anyone help?

#### ondansetron

##### TS Contributor
Can you explain your variables a little more? For example, is temperature the average temperature for the year in the country? This kind of explanation for the rest will help make sure you definitely need to investigate autocorrelation (although it sounds like it's possible).

It's also probably possible to just use 5 dummy variables for the 6 countries (so one MLR), but it'll depend on issues that pop up if you need a different or new kind of model.

Did your software report an approximate p-value on the DW statistic?

#### Miz123

##### New Member
Yes, for temperature and rainfall, it is the average per year. I'm using SPSS and I believe the p-value for the DW statistic is 0.004. Any suggestions on what I should do next?

#### noetsi

##### Fortran must die
Why do you need to do one per country? That essentially assumes that these variables have entirely different impacts in each country, which seems strange to me. Why would rainfall, controlling for other variables, behave entirely differently in one country than in another (to use one variable as an example)?

It would seem you could run a single multiple linear regression with country as a dummy variable, so that you could, for example, compare Niger to all the other countries. Or use five dummies, one for each of five countries, with the last country serving as the reference group.

Durbin-Watson is not normally an issue unless you are running time series data, which does not seem to be the case here. If you suspect serial correlation, there are better tests than Durbin-Watson, which captures only first-order AR and, I believe, is not valid when lagged values of the dependent variable are among the predictors. It is possible to have serial correlation when you are not running time series, but my sense is that this is pretty rare.

There are various ways to correct for serial correlation, including robust SEs and regression with AR errors. But they are usually only used when you have time series data.
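
For reference, a minimal sketch of what the Durbin-Watson statistic measures, assuming you already have the residuals from a fitted regression in time order. The residual lists below are made up purely for illustration, and the relation DW ≈ 2(1 − ρ) is only a rough approximation:

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def implied_rho(dw):
    """Rough lag-1 autocorrelation implied by DW, using DW approximately 2 * (1 - rho)."""
    return 1 - dw / 2


# Made-up residuals that drift slowly, so neighbours resemble each other
smooth = [1.0, 1.2, 0.9, 0.5, -0.2, -0.8, -1.1, -0.9, -0.4, 0.3]
# Made-up residuals that flip sign at every step
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_smooth = durbin_watson(smooth)       # well below 2: positive autocorrelation
dw_alt = durbin_watson(alternating)     # well above 2: negative autocorrelation
rho_reported = implied_rho(0.894)       # the DW value reported in this thread

print(dw_smooth, dw_alt, rho_reported)
```

Values near 2 suggest no first-order autocorrelation. By the same approximation, the reported 0.894 corresponds to a lag-1 residual correlation of roughly 0.55.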

#### ondansetron

##### TS Contributor
This is what I was getting at with my post also. As for the statement that this does not seem to be time series data, though, I disagree that time series is ruled out (based on the post). I think it's possible the OP has time series data because the experimental unit seems to be a country-year (i.e. he is looking at a country in a specific year and collecting data specific to that country-year, then taking another country-year and getting data, and so on, which may well fit the bill). However, since it's not entirely clear from his or her post, I think the OP should elaborate on the variables as I asked; this will allow us to better identify the experimental unit (which, to me, seems like it could be a unit of time, making a DW test appropriate). If we find the experimental unit is not a unit of time, then I agree that checking for autocorrelated errors isn't really appropriate.

OP: if you can detail the list of variables better than you have that might be useful.

Edit: I will say that I wouldn't call time series my strong suit, and the OP's post lacks a bit of information to make a few important things clear. It just appears that the OP took a country, obtained variable values over the specified time period (which have a natural ordering), and then did the same for the other countries.

I'll also add this since it seems like a similar application: https://onlinecourses.science.psu.edu/stat510/node/77

#### noetsi

##### Fortran must die
If they are going to treat this as time series, their modeling is going to be very complex. They would have to run regression with autoregressive errors, ARDL, or vector autoregressive models. That is difficult to do if you don't have a lot of background in time series [it's difficult to do even if you are familiar with it].

They could, I think, eliminate the serial autocorrelation by using robust standard errors. That is not ideal because their model will not truly reflect the driving forces over time. But if they don't know time series they may not have much choice.

I would use one of the more recent tests for autocorrelated errors rather than Durbin-Watson, which has many limitations.

#### ondansetron

##### TS Contributor
I'm vaguely familiar with some of those ideas but the topics have never been more than a peripheral glance in any courses I took which is disappointing. It always seemed like a big can of worms, but then again, I'd imagine that's how something like MLR or logistic regression is for the casual observer.

This is why I was hoping to clarify some of their items, to make sure we knew which direction they actually needed to go. I think the robust SEs won't actually remove the autocorrelation but rather just appropriately account for it in the calculation of the standard errors (as far as I'm aware/can tell from a few resources). As you said, though, if they need to do more complex time series modeling, I think it'll be outside my ability to offer much help aside from what we've bounced around so far.

#### Miz123

##### New Member
I didn't even realise I was able to include all 6 countries in one multiple linear regression model. I have just had a look in SPSS and I think I am now able to do this. So I can essentially input my 6 independent variables (av. temp, av. rainfall, total population, GDP per capita, livestock production and arable land %) for each of my 6 countries and compare them against each other?

#### Miz123

##### New Member
These are the variables I am using: av. temp, av. rainfall, total population, GDP per capita, livestock production and arable land %

My time period is between 1961-2013.

I have available data for these variables for all 6 countries.

I also have a few other variables, but I thought I needed to exclude them from my model because their time periods started in 1991 instead of 1961. Was I correct in doing this?

#### Miz123

##### New Member
Does my data need to be standardised before I do the regression model?

#### noetsi

##### Fortran must die
You can; you have to make five dummy variables, one per country. The sixth country will not have a dummy; it will be the reference level. I strongly suggest looking at an online source on dummy variables before you do this. What this is going to tell you is the impact of country, controlling for all the other variables in the model such as rainfall. Similarly, it will tell you the impact of rain, controlling for country. What that means in practice, that is, what country is actually measuring, is not clear to me. You will have to think carefully about that.

A critical question is whether rainfall etc. has the same impact on your dependent variable, essentially the same slope, in each country. You need to test for interaction between the country dummies and variables such as rainfall. If there is interaction, then things get a lot more interesting in your interpretation....
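
To make the dummy and interaction idea concrete, here is a rough sketch of how one row of predictors might be built by hand. Only Niger is named in this thread, so the other country labels, the choice of reference country, and the function name `design_row` are placeholders I made up; a stats package would normally build these columns for you:

```python
# Only "Niger" comes from the thread; the other labels are hypothetical placeholders.
COUNTRIES = ["Niger", "B", "C", "D", "E", "F"]
REFERENCE = "F"  # arbitrary choice of reference country (gets no dummy)


def design_row(country, rainfall, interact=False):
    """One row of predictors: intercept, rainfall, 5 country dummies, optional interactions."""
    dummies = [1.0 if country == c else 0.0 for c in COUNTRIES if c != REFERENCE]
    row = [1.0, rainfall] + dummies
    if interact:
        # rainfall x dummy columns let the rainfall slope differ by country
        row += [rainfall * d for d in dummies]
    return row


row_niger = design_row("Niger", 150.0, interact=True)
row_ref = design_row("F", 150.0, interact=True)

print(row_niger)
print(row_ref)
```

For the reference country every dummy and interaction column is zero, so its rainfall slope is the plain rainfall coefficient. Each other country's slope is that coefficient plus its own interaction coefficient, which is what the interaction test examines.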

#### noetsi

##### Fortran must die
hi,
using the gls function from the nlme package you can specify an autocorrelation structure for the errors. That should take care of the problem.

https://www.r-bloggers.com/linear-regression-with-correlated-data/
It will take care of the serial correlation. It won't deal with bias due to specifying the model incorrectly. If lags of the dependent variable or of the independent variables are left out when they should be in, you have the same bias you would have from excluding any other variable. That tends to be forgotten in time series...
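
For anyone following along, a sketch of what "an autocorrelation structure for the errors" means in the simplest case, assuming first-order autoregressive (AR(1)) errors. The notation is mine, not from the posts above:

$$
y_t = \mathbf{x}_t^\top \boldsymbol{\beta} + u_t, \qquad u_t = \rho\, u_{t-1} + \varepsilon_t, \qquad |\rho| < 1,
$$

where the $\varepsilon_t$ are uncorrelated with constant variance. If $\rho$ were known, quasi-differencing would give a regression with uncorrelated errors:

$$
y_t - \rho\, y_{t-1} = (\mathbf{x}_t - \rho\, \mathbf{x}_{t-1})^\top \boldsymbol{\beta} + \varepsilon_t .
$$

This fixes the error correlation only. If $y_{t-1}$ or $\mathbf{x}_{t-1}$ belong in the model as predictors with their own coefficients, they are still omitted here, which is the misspecification bias described above.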

#### rogojel

##### TS Contributor
Yes, this is true.
However, if necessary, you can build a hierarchy of different models with different structures and see which is best. If the problem is only a high autocorrelation of the residuals, some simple model will probably be sufficient.

regards

#### noetsi

##### Fortran must die
How would you define best? Obviously not R squared. AIC or BIC?

Even if you know which model is best, defined this way, it does not mean that the individual slopes are unbiased. The criterion would apply to the model as a whole, not to the individual predictors.
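
As a rough illustration of comparing models by AIC or BIC, here is a sketch using the Gaussian-error forms computed from the residual sum of squares (dropping constants that are the same for every model fitted to the same data). The RSS values and parameter counts are made up; only n = 52 comes from the thread:

```python
import math


def aic_gaussian(rss, n, k):
    """AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    return n * math.log(rss / n) + 2 * k


def bic_gaussian(rss, n, k):
    """BIC up to an additive constant: n * ln(RSS / n) + k * ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)


n = 52  # sample size mentioned in the thread
# Hypothetical fits: the bigger model reduces RSS slightly but adds 2 parameters
simple = {"rss": 1000.0, "k": 7}
bigger = {"rss": 980.0, "k": 9}

aic_simple = aic_gaussian(simple["rss"], n, simple["k"])
aic_bigger = aic_gaussian(bigger["rss"], n, bigger["k"])
bic_simple = bic_gaussian(simple["rss"], n, simple["k"])
bic_bigger = bic_gaussian(bigger["rss"], n, bigger["k"])

print(aic_simple, aic_bigger)
print(bic_simple, bic_bigger)
```

Lower is better for both. With these made-up numbers the small drop in RSS does not pay for the two extra parameters, and BIC penalises them more heavily than AIC because ln(52) > 2. As noted above, neither criterion says anything about whether an individual slope is biased.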

#### rogojel

##### TS Contributor
It is difficult to reason about this without actual data, but maybe I could just look at the correlation structure of the residuals first, as this seems to be the problem?

regards
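
To make "look at the correlation structure of the residuals" concrete, a minimal sketch of the sample autocorrelation at a few lags, assuming the residuals are in time order. The residual values are invented for illustration:

```python
def autocorr(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    num = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    den = sum(d * d for d in dev)
    return num / den


# Made-up residuals that flip sign at every step
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
acf = [autocorr(alternating, k) for k in range(0, 4)]

print(acf)
```

A large value at lag 1 that dies away at higher lags would point towards a simple AR(1) structure, while sizeable values at several lags would suggest something more elaborate is needed.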