Seeking advice on predictive modeling approach

Hello dear forum members!

Currently, I am working on a project that aims to predict a certain cancer-related outcome (y) using a number of control (c) and predictor (X) variables:

y(i) = a + c(it) + X(it) + u (1)

In Equation (1): y(i) is continuous and is available only as a mean of values aggregated over 2009 to 2013; c(it) is a vector of several longitudinal (yearly) control variables available from 2009 through 2013; and X(it) is a vector of several longitudinal (yearly) predictor variables available from 2010 through 2013.

As you can see, the outcome does not vary over time because it is available only as an aggregated mean, whereas the controls and predictors are in panel form. Given this limitation, panel models do not seem applicable. Therefore, my approach is to first estimate:

y(i) = a + c(i) + X(i) + u (2), where c(i) and X(i) are aggregated as means
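
To make the aggregation step in Equation (2) concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the data are simulated, there is one control and one predictor instead of full vectors, and the "true" coefficients (1.0, 0.5, 2.0) are made up. It only shows the mechanics of collapsing the panel to unit means and fitting a cross-sectional OLS.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 500
years = np.arange(2009, 2014)  # 2009..2013

# Hypothetical panel: one control c and one predictor x per unit-year.
c = rng.normal(size=(n_units, years.size))
x = rng.normal(size=(n_units, years.size))
x[:, years < 2010] = np.nan  # predictors are only observed from 2010 on

# Collapse each panel variable to its per-unit mean over the available years.
c_bar = c.mean(axis=1)
x_bar = np.nanmean(x, axis=1)

# Simulated outcome, observed only as a period mean (coefficients are invented).
y = 1.0 + 0.5 * c_bar + 2.0 * x_bar + rng.normal(scale=0.1, size=n_units)

# Equation (2): cross-sectional OLS of y(i) on the unit means.
design = np.column_stack([np.ones(n_units), c_bar, x_bar])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print(dict(zip(["a", "c_mean", "x_mean"], beta.round(2))))
```

Note that the controls are averaged over five years and the predictors over four, which mirrors the data availability described above.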

Second, to (a) check the consistency of the coefficients and (b) test for lagged effects, I would estimate:

y(i) = a + c(it-1) + X(it-1) + u (3), where c(it-1) and X(it-1) are from 2012 only
y(i) = a + c(it-2) + X(it-2) + u (4), where c(it-2) and X(it-2) are from 2011 only
y(i) = a + c(it-3) + X(it-3) + u (5), where c(it-3) and X(it-3) are from 2010 only
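
A rough sketch of how Equations (3) to (5) could be run and compared, again in Python with NumPy. The data are simulated and the setup is an assumption: each variable is a persistent unit-level component plus yearly noise (so it is autocorrelated), and the outcome is built from the period means. The helper name `ols` is mine, not from any library.

```python
import numpy as np

def ols(design, y):
    """Least-squares coefficients for y = design @ beta + u."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
n_units = 400
years = [2010, 2011, 2012, 2013]

# Hypothetical autocorrelated variables: unit-level component + yearly noise.
unit_x = rng.normal(size=n_units)
unit_c = rng.normal(size=n_units)
x = {yr: unit_x + 0.3 * rng.normal(size=n_units) for yr in years}
c = {yr: unit_c + 0.3 * rng.normal(size=n_units) for yr in years}

# Outcome is generated from the period means, mimicking an aggregated DV.
x_bar = np.mean([x[yr] for yr in years], axis=0)
c_bar = np.mean([c[yr] for yr in years], axis=0)
y = 1.0 + 0.5 * c_bar + 2.0 * x_bar + rng.normal(scale=0.2, size=n_units)

# Equations (3)-(5): one cross-sectional fit per single-year regressor set.
coefs = {}
for yr in (2012, 2011, 2010):
    design = np.column_stack([np.ones(n_units), c[yr], x[yr]])
    coefs[yr] = ols(design, y)
    print(yr, coefs[yr].round(2))
```

In this simulated setup the three fits come out close to one another simply because the regressors are persistent across years, so similar coefficients across Equations (3) to (5) would not by themselves say much about which lag matters.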

Please advise whether my modeling approach seems plausible (considering the limitation of the DV data).


Less is more. Stay pure. Stay poor.
How does the variability of the parameters get into the model? If it does not, I would imagine the SE values may be understated and you risk type I errors. In the back of my mind, your approach seems like what an economist might do. Perhaps one of them can chime in on the pros/cons. Using the means would also not control for trends within the period, so you wouldn't know if a variable went up one year and then down the next; you would capture only the level changes, not the trend changes. But I understand you are trying to do the best with what you have.

I didn't understand what you were alluding to in the second part. Are you looking for autocorrelation?
Dear hlsmith,

Thank you for your response and for the issues you raised. Perhaps some "pooled" model could be used, e.g.:

y(i) = a + c(it) + c(it-1) + c(it-2) + c(it-3) + X(it) + X(it-1) + X(it-2) + X(it-3) + u (6)

In Equation (6): N(c) = 13, N(X) = 46, and N(obs) = 2,779

Actual estimation results (obtained via OLS with robust SEs) are quite intriguing, accepting the limitation that y = Mean[2009-2013]. E.g., consider the attached plot of the quantiles of the residuals against the quantiles of the normal distribution. Evidently, up to a point the model fit is very good (as indicated by the residual points forming a straight line). Also, note the attached plot of the residuals against the DV. Testing for heteroskedasticity, the test fails to reject the null of constant variance (i.e., the data give no evidence against homoskedasticity). I think further quantile regression analysis seems appropriate.
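
For readers who want to see what "OLS with robust SEs" plus a constant-variance test involves, here is a small self-contained sketch in Python with NumPy. It is not the estimation above: the data are simulated with deliberately non-constant error variance, there is a single regressor, HC1 is just one common robust variant, and the studentized Breusch-Pagan statistic is one common choice of test (the specific test used above is not stated).

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical data: the error spread increases with x (heteroskedastic).
x = rng.normal(size=n)
u = rng.normal(size=n) * np.exp(0.5 * x)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical SEs assume constant error variance.
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC1 sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k).
meat = (X * resid[:, None] ** 2).T @ X
se_hc1 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv) * n / (n - k))

# Studentized Breusch-Pagan: LM = n * R^2 from regressing e^2 on X.
e2 = resid ** 2
fitted = X @ np.linalg.lstsq(X, e2, rcond=None)[0]
r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
lm = n * r2

# Chi-square(1) upper tail; this shortcut is valid for one regressor only.
p_value = math.erfc(math.sqrt(lm / 2))

print("slope SE classical:", se_classical[1].round(4))
print("slope SE HC1:      ", se_hc1[1].round(4))
print("BP LM statistic:   ", round(lm, 1), "p-value:", p_value)
```

With many regressors the LM statistic is compared against a chi-square distribution whose degrees of freedom equal the number of non-constant regressors, so the one-line p-value here would need to be replaced.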

As for my initial lagged effects approach, yes, the goal was to ensure robustness of the coefficients, as autocorrelation is present in some controls and predictors.


Just a side note: I read a position piece on why quantile regression is limited in the published research arena. It said that if you use quantiles, external validity becomes limited, in that others will have different quantiles than you, so generalizing your results is hindered. That is unlike, say, OLS, where the results can possibly be interpolated to any pseudo-realization of the population (just plug in values). I am not doing the article justice, and I have no reference, sorry.