which regression model for time series?

#1
Hi guys, please help. I need to do regression analysis for demographic data. I am not sure whether linear or logistic regression would be appropriate... Also I would like to be able to extrapolate trends...

I can use SPSS or STATA.

thanks,
S
 

bugman

Super Moderator
#2
I am not sure either, since you haven't given much detail.

Linear regression is appropriate when the response is continuous and the residuals are approximately normal. If your response is binary, then consider logistic regression.
 
#3
Thanks. Basically, I am trying to do population projections. I have data from 1980 to 2010 and I want to project them into the future. I keep reading on the internet that autocorrelation could be something to consider, but I am not sure. And what if I wanted to add other variables to the model, like economic growth? Thanks for helping with the model.

Sylvia
 
#4
It's not quite that simple ... :shakehead

You could just do an ordinary regression. It probably wouldn't be very wrong. The problem is that for most time series the residuals will be autocorrelated (each value correlated with the previous one), so they are not independent, and the standard errors of the regression coefficients will typically be underestimated (the Durbin-Watson test can be used to check for this condition, if I remember correctly).

So, one solution is to first do a regression,
then calculate the residuals,
then estimate their AR(1) coefficient rho (in essence the degree of autocorrelation) by regressing all residuals onto their previous values,
then use this value to remove the autocorrelation (this is called pre-whitening),
then regress again (couple of extra tricks there)
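
The five steps above are essentially what is usually called the Cochrane-Orcutt procedure. Here is a minimal sketch in plain Python, only to make the steps concrete. The data are synthetic (a straight line plus a smooth wiggle that stands in for autocorrelated errors), the function names are made up for illustration, and the "extra tricks" (iterating until rho settles, treating the first observation specially) are left out.

```python
import math

def ols(x, y):
    """Least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def cochrane_orcutt(t, y):
    # Step 1: ordinary regression of y on time.
    a, b = ols(t, y)
    # Step 2: residuals of that regression.
    e = [yi - (a + b * ti) for ti, yi in zip(t, y)]
    # Durbin-Watson statistic: close to 2 means little lag-1 autocorrelation,
    # well below 2 means positive autocorrelation.
    dw = (sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
          / sum(ei ** 2 for ei in e))
    # Step 3: rho from regressing e[i] on e[i-1] (no intercept; an assumption).
    rho = (sum(e[i] * e[i - 1] for i in range(1, len(e)))
           / sum(e[i - 1] ** 2 for i in range(1, len(e))))
    # Step 4: pre-whiten both series; the first observation is lost.
    ys = [y[i] - rho * y[i - 1] for i in range(1, len(y))]
    ts = [t[i] - rho * t[i - 1] for i in range(1, len(t))]
    # Step 5: regress again. The transformed intercept equals a*(1 - rho).
    a_star, slope = ols(ts, ys)
    return {"rho": rho, "dw": dw,
            "intercept": a_star / (1.0 - rho), "slope": slope}

# Synthetic example: t is "years since the first observation".
t = list(range(31))
y = [2.0 + 3.0 * ti + math.sin(ti / 3.0) for ti in t]
result = cochrane_orcutt(t, y)
print(result)
```

The slope from step 5 is on the original scale, so it can be read the same way as the slope of the plain regression.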

All this has to be done with rather meticulous attention to detail, so unless you feel heroic it would be best to use a standard package for it ... don't know if it's in SPSS or STATA but surely there is a package for R!
 
#5
Thanks for responding Ohammer. I will attempt that tomorrow with SPSS and STATA. Seems a bit complex, so will probably revert with more questions...
 
#6
Dear Ohammer, dear all,

This indeed is not easy. Is there a way to do something simpler, like doing a normal regression and then extrapolating the trend? (If so, what do I do with the years: can year be treated as an independent variable, or does it need to be transformed?)

And how about a logistic curve? Does it make sense to do logistic regression and then extrapolate (how?)

THANKS
 
#7
Yes forget the pre-whitening, I don't think it would make any practical difference, and people are regressing time-series without it all the time. (Only, if you are going to publish this, and get one of those pedantic reviewers we all hate, it might be criticized).

So, all you need then is an ordinary least squares regression with time as independent variable (no transformation). Write down the equation for the modeled function (linear, logit, whatever), plug in the year 2020, and tell us our future :rolleyes:
 
#8
This is giving me a strange result

Coefficients(a)
                Unstandardized Coefficients    Standardized Coefficients
Model           B             Std. Error       Beta      t          Sig.
1 (Constant)    -1.631E8      4801028.350                -33.973    .000
  Year          83544.071     2407.111         .989      34.707     .000
a. Dependent Variable: Singapore



B for Year is 83,544.
How should I treat the variable "Year" in the equation?
 
#9
You have done a linear regression with year t as independent variable? It looks like you have slope 83544 and intercept -163100000:

y=83544*t-163100000

Plug t=2010 into this, and get about 4.8 million, which is (Wikipedia ... Singapore ... hang on ...) quite close to the 2010 population of Singapore?

BUT: The population before ca. 1952 was negative according to this model - maybe you need to consider e.g. an exponential instead ...
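
The arithmetic above can be checked with a few lines of Python. This is only a sketch that plugs years into the fitted line, using the rounded slope and intercept quoted in this post. The function name is made up for illustration.

```python
SLOPE = 83544            # people per year, rounded value from the post
INTERCEPT = -163100000   # rounded intercept from the post

def predict(year):
    """Population predicted by the fitted straight line."""
    return SLOPE * year + INTERCEPT

# Year in which the fitted line crosses zero.
zero_year = -INTERCEPT / SLOPE

print(predict(2010))        # 4823440
print(predict(2025))        # 6076600
print(round(zero_year, 1))  # 1952.3
```

The zero crossing is a reminder that a straight line is only a local description of the data and should not be pushed far outside the observed years.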
 
#10
Bless you Ohammer. So it gives around 6.1 mln in 2025 which is very reasonable. But I will add x square to the model as suggested.

Another way of looking at it would be that for each year we would get 83,544 extra people, correct?

I will also try logistic reg to compare.
 

Dason

Ambassador to the humans
#11
I've sort of been wondering this the entire thread... How are you going to use logistic regression here? What outcome are you modeling?
 
#12
Yes, logistic regression doesn't seem to work, as it is for binary variables only. Linear regression yielded plausible results. This makes me wonder how demographers make their logistic curves...

If you have a suggestion for a not very complex model, please do let me know.
 
#13
I think maybe there is a confusion about the word "logistic" here: logistic regression usually refers to (GLM) regression of binary data using a logit link, maybe this is what you tried? You may instead be thinking of fitting a logistic (sigmoid) function, often used for population growth, something like

y=a/(1+b*exp(-cx))

for parameters a, b and c ?

This model is difficult to linearize by transformation (at least if all three parameters have to be estimated), so you may have to use a nonlinear regression method :( .
 
#14
Yes.

This is the output:

Logistic

Model Summary
R       R Square    Adjusted R Square    Std. Error of the Estimate
.995    .990        .990                 .014
The independent variable is Year.

ANOVA
              Sum of Squares    df    Mean Square    F           Sig.
Regression    .569              1     .569           2771.558    .000
Residual      .006              27    .000
Total         .575              28
The independent variable is Year.

Coefficients
              Unstandardized Coefficients    Standardized Coefficients
              B             Std. Error       Beta      t           Sig.
Year          .984          .000             .370      3166.533    .000
(Constant)    537904.449    338891.324                 1.587       .124
The dependent variable is ln(1 / SEAsia).


Is A the constant and B the coefficient for Year? Where is C?
 
#15
Yet again, it seems not easy. The book says:

"we used NCSS statistical package to estimate the parameters of the logistic curve because - unlike SPSS - its algorithm does not require a user defined value for the upper asymptote" (A)....
 
#16
After further reading it seems that I will have to estimate A (the upper asymptote, i.e. the population ceiling) myself, but which is B and which is C in the above output? Please help.
 
#17
Correct, if you want to fit the logistic function using linearization, you must estimate the parameter a independently. Honestly I don't quite understand what transformation SPSS did for you (it says the dependent variable is ln(1/SEAsia), which I can't quite fit in with the logistic model?).

If you use a nonlinear procedure instead, you can fit all three parameters simultaneously.

(The program Past does this automatically by first setting a to the max value of the data as an initial guess, then estimating b and c by linearization and regression, then optimizing all the parameters with the Levenberg-Marquardt method).
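
To make the linearization concrete: with a fixed, y=a/(1+b*exp(-cx)) rearranges to ln(a/y - 1) = ln(b) - c*x, which is a straight line in x. Below is a rough sketch in plain Python, assuming noise-free made-up data. Note that it is not what Past does and it is not a Levenberg-Marquardt fit: it simply tries a list of candidate ceilings and keeps the one with the smallest squared error. All function names are hypothetical.

```python
import math

def logistic(x, a, b, c):
    """Logistic growth curve y = a / (1 + b*exp(-c*x))."""
    return a / (1.0 + b * math.exp(-c * x))

def fit_b_c_given_a(x, y, a):
    """With a fixed, regress z = ln(a/y - 1) on x: z = ln(b) - c*x."""
    z = [math.log(a / yi - 1.0) for yi in y]
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    slope = (sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = mz - slope * mx
    return math.exp(intercept), -slope   # (b, c)

def scan_ceiling(x, y, candidates):
    """Try each candidate ceiling a; keep the fit with the smallest
    sum of squared errors on the original (untransformed) scale."""
    best = None
    for a in candidates:
        if a <= max(y):      # a/y - 1 must stay positive for the log
            continue
        b, c = fit_b_c_given_a(x, y, a)
        sse = sum((yi - logistic(xi, a, b, c)) ** 2 for xi, yi in zip(x, y))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best

# Made-up, noise-free data generated with a=100, b=50, c=0.2.
x = list(range(31))
y = [logistic(xi, 100.0, 50.0, 0.2) for xi in x]
sse, a, b, c = scan_ceiling(x, y, [float(v) for v in range(90, 201, 5)])
print(a, b, c)
```

With calendar years as x, it is probably wise to use years since the first observation instead, otherwise b becomes astronomically large. A proper nonlinear least squares routine would refine all three parameters together from a starting point like this one.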
 
#18
:confused:
Thanks. I think I got it. It's the curve estimation function.

Here is the area and the equation:

y=1 / ( 0 + 13677.83972385804 * 0.9853271471417606**x )

The graph doesn't want to copy...

But I think it's ok. It is South East Asia. Based on the above formula, I can announce that the population of South East Asia in 2025 will be 730,423,364 :D
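
For anyone who wants to reproduce the number, here is a small sketch in Python that evaluates the formula exactly as posted. The two constants are copied from the equation above; the function name and the inv_ceiling argument (standing for the 0 in the formula) are made up for illustration. One thing worth noticing: because that term is 0, the curve has no upper asymptote at all and is simply exponential growth at a constant yearly rate of 1/0.98533 - 1.

```python
B0 = 13677.83972385804      # constant from the posted equation
B1 = 0.9853271471417606     # base raised to the power of the year

def seasia_population(year, inv_ceiling=0.0):
    """Evaluate y = 1 / (inv_ceiling + B0 * B1**year).
    inv_ceiling is the 0 in the posted formula."""
    return 1.0 / (inv_ceiling + B0 * B1 ** year)

# With inv_ceiling = 0 the curve is y = (1/B0) * (1/B1)**year,
# i.e. plain exponential growth at this yearly rate:
growth_rate = 1.0 / B1 - 1.0

pop_2025 = seasia_population(2025)
print(round(pop_2025))            # roughly 730 million
print(round(100 * growth_rate, 2), "% per year")
```

So this projection assumes growth of about 1.5% every year indefinitely; a curve that actually levels off would need a non-zero value in place of the 0.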

Next ARIMA, but that will be a long process..........