# Solar energy prediction

#### omar

##### New Member
Hi,
I am predicting one-day-ahead solar energy output using 30 days of historical data. The data sets are hourly, so the prediction is done hourly from sunrise to sunset.
I have done the prediction using a sliding window technique. When I am predicting 01/06, I use 30 days of historical data (from 02/05 to 31/05) as the training dataset to build the model. The training dataset includes weather variables (global horizontal irradiance, direct normal irradiance, temperature and humidity) as inputs to the model and the actual solar energy output (not the predicted) as the dependent variable. I do this for 360 days.
I got this comment: "So many lagged dependent variables will raise a multicollinearity issue in a linear regression model. Did you check it?"

What is the answer to this?

I hope someone can help me with this.

Regards,
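
For readers following along, the sliding-window setup described in this post can be sketched roughly as below. This is only a minimal sketch on synthetic data, not the actual model from the post: the 24-hour day length, the four random "weather" columns, the coefficients, and the function name `sliding_window_forecast` are all assumptions made for illustration. It also assumes the weather inputs for the target day are available (measured or forecast) at prediction time.

```python
import numpy as np

HOURS_PER_DAY = 24   # assumption: the post only uses sunrise-to-sunset hours
WINDOW_DAYS = 30     # training window length from the post


def sliding_window_forecast(X, y, window_days=WINDOW_DAYS, hours_per_day=HOURS_PER_DAY):
    """For each day after the first window, fit OLS on the previous
    `window_days` days and predict every hour of that day.

    X: (n_hours, n_features) weather inputs, y: (n_hours,) measured energy.
    Returns the concatenated hourly predictions.
    """
    n_days = len(y) // hours_per_day
    preds = []
    for day in range(window_days, n_days):
        train = slice((day - window_days) * hours_per_day, day * hours_per_day)
        test = slice(day * hours_per_day, (day + 1) * hours_per_day)
        # Intercept column plus weather inputs, solved by least squares.
        A_train = np.column_stack([np.ones(train.stop - train.start), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([np.ones(hours_per_day), X[test]])
        preds.append(A_test @ coef)
    return np.concatenate(preds)


# Synthetic stand-in data (not real measurements): 4 inputs, 40 days.
rng = np.random.default_rng(0)
n = 40 * HOURS_PER_DAY
X = rng.normal(size=(n, 4))
y = X @ np.array([2.0, 1.0, 0.5, -0.3]) + rng.normal(scale=0.1, size=n)

y_pred = sliding_window_forecast(X, y)
actual = y[WINDOW_DAYS * HOURS_PER_DAY:]
mae = float(np.mean(np.abs(actual - y_pred)))
print(f"predicted hours: {y_pred.size}, MAE: {mae:.3f}")
```

Note that the model is refitted for every target day, so each forecast only sees the 30 days immediately before it.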

#### obh

##### Member
Hi Omar,

Generally, you should think about which variables to put into the model; don't just insert every possible variable.
You should use some theoretical knowledge when choosing the predictors.

Multicollinearity happens when some of the predictors are highly correlated.

For example, suppose X1 causes Y and X2 doesn't cause Y,
but X1 is highly correlated with X2.
X1 may then come out as an insignificant predictor, despite the fact that it should be significant.

http://blog.minitab.com/blog/unders...ling-multicollinearity-in-regression-analysis
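
One common way to check for multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below is a minimal NumPy version on made-up data; the helper name `vif` and the three synthetic predictors are assumptions for illustration only. A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a warning sign, but that threshold is a convention, not a hard rule.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

In this toy example the two nearly identical predictors get large VIFs while the unrelated one stays close to 1.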

#### omar

##### New Member
Hi,
But I am talking about the dependent variable and not the independent variables, so I am talking about correlation between the Y's and not the X's.
If I have autocorrelation between the Y's (found using the acf function), what is the solution? Is it just to include the lagged Y's as inputs to the model?
Regards,
Omar

#### obh

##### Member
Okay ...

How exactly did you calculate the predicted Y based on the 30 days of historical data?
Did you use regression over the last 30 days?

#### omar

##### New Member
Yes, the prediction is done hourly for 24 hours ahead, using 30 days of historical data.
I am using MLR (multiple linear regression); the input variables are the weather variables, and the output is the energy.

#### obh

##### Member
Hi Omar,

One of the regression assumptions is independence of errors.
Since the Y values of the last 30 days are probably correlated with each other, you should check this assumption.
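
A quick numeric check of this assumption is the Durbin-Watson statistic on the residuals, together with their lag-1 autocorrelation. The sketch below is a minimal NumPy version on synthetic data with deliberately autocorrelated errors; the helper names and the AR(1) coefficient of 0.8 are assumptions for illustration. A Durbin-Watson value near 2 suggests no first-order autocorrelation, while values well below 2 point to positive autocorrelation.

```python
import numpy as np


def durbin_watson(resid):
    """Sum of squared successive differences divided by sum of squared residuals."""
    diff = np.diff(resid)
    return float(np.sum(diff ** 2) / np.sum(resid ** 2))


def lag1_autocorr(resid):
    """Sample autocorrelation of the residuals at lag 1."""
    r = resid - resid.mean()
    return float(np.sum(r[1:] * r[:-1]) / np.sum(r ** 2))


# Synthetic regression with AR(1) errors (made-up data, not solar measurements).
rng = np.random.default_rng(2)
n = 720                      # e.g. 30 days of hourly observations
x = rng.normal(size=n)
shocks = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + shocks[t]
y = 1.0 + 2.0 * x + e

# Ordinary least squares fit and residuals.
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

dw = durbin_watson(resid)
rho = lag1_autocorr(resid)
print(f"Durbin-Watson: {dw:.2f}, lag-1 autocorrelation: {rho:.2f}")
```

For real data, the residuals must be kept in time order before computing either quantity.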

#### omar

##### New Member
Hi,
I did, and the errors appear to be related, as I can see from the Q-Q plot. Is there another way to check that? And if they are related, what is the solution?
Many thanks.

#### obh

##### Member
Hi Omar,

A Q-Q plot is a graphical check for the residual normality assumption, not for independence.
There are several methods for testing normality; you should combine a statistical test with a graphical method.

For the "independence of errors" assumption, I think it is better to use a residuals plot, with the residuals plotted in time order.

https://www.ics.uci.edu/~jutts/110/Lecture3.pdf

If you try the following, it will also check the residual normality assumption using the Shapiro-Wilk test, as well as the homoscedasticity assumption:
http://www.statskingdom.com/410multi_linear_regression.html
Or you can run only the Shapiro-Wilk test on the residuals:
http://www.statskingdom.com/320ShapiroWilk.html
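
If you prefer to run the Shapiro-Wilk test locally, SciPy provides it as `scipy.stats.shapiro`. The sketch below assumes SciPy is installed and uses two made-up residual vectors (one drawn from a normal distribution, one from a shifted exponential) purely for illustration; in practice you would pass in the residuals of your own fitted model.

```python
import numpy as np
from scipy import stats

# Made-up residuals: one roughly normal, one clearly skewed.
rng = np.random.default_rng(3)
normal_resid = rng.normal(size=200)
skewed_resid = rng.exponential(size=200) - 1.0

results = {}
for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    # shapiro returns the W statistic and the p-value of the test.
    w_stat, p_value = stats.shapiro(resid)
    results[name] = (float(w_stat), float(p_value))
    print(f"{name}: W = {w_stat:.3f}, p = {p_value:.4g}")
```

A small p-value is evidence against normality of the residuals. It says nothing about whether the residuals are independent, which is a separate assumption.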

#### omar

##### New Member
Hi,
I really appreciate all your help, but one more question.
What is the solution if there is correlation between the Y's, that is, if the residuals are related to each other? Do I need to use lagged Y, that is, Yt-1, as an independent variable?
Note: my data is 30 days of history and one day ahead. I am using a sliding window technique to predict one day ahead for 365 days.
Regards,

#### obh

##### Member
Hi Omar,

Do you mean you have data for 365 days, but you try to predict each day based on the previous 30 days, and then compare the prediction to the actual result?

Did you check and find that a 30-day window produces a better prediction than a longer period, such as 60 days or 360 days?
Using only 30 days seems to rely on the correlation between the predicted day and the previous days...

I think you may find some explanation in the following link:
http://people.duke.edu/~rnau/testing.htm
Look for the section "Violations of independence".

#### omar

##### New Member
Hi,
"Do you mean you have data of 365 days but you try to predict each day based on the previous 30 days, and then compare the prediction to the actual result?" Yes.
Each day means 24 observations of energy and weather variables.
30 days is the best: when I increase or decrease the window, the MAE gets higher.

After reading what you sent me, it seems the only way to solve the problem is to use lagged Y. Is that correct?

#### obh

##### Member
I guess so... but let me know if you really get a better result. I guess you should recheck the window when using lags; maybe with lags the optimum won't be 30 days.
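
To make that comparison concrete, here is a rough sketch that adds a lagged Y as an extra regressor and re-runs the sliding-window forecast for several window lengths. Everything in it is an assumption for illustration: the synthetic series with a slowly drifting daily level, the helper names, the candidate windows (15, 30, 60 days), and the choice of a 24-hour lag. The 24-hour lag is used instead of Yt-1 because, for a full day-ahead forecast, the previous hour's actual output would not yet be known for most hours of the target day.

```python
import numpy as np

H = 24  # hours per day (assumption)


def add_lagged_y(X, y, lag=H):
    """Append y shifted back by `lag` steps as an extra column; drop the first `lag` rows."""
    return np.column_stack([X[lag:], y[:-lag]]), y[lag:]


def rolling_mae(X, y, window_days, first_day, hours_per_day=H):
    """MAE of one-day-ahead OLS forecasts, each fitted on the previous `window_days` days.

    Scoring starts at `first_day` so that different windows are compared on the same days.
    """
    n_days = len(y) // hours_per_day
    errors = []
    for day in range(first_day, n_days):
        tr = slice((day - window_days) * hours_per_day, day * hours_per_day)
        te = slice(day * hours_per_day, (day + 1) * hours_per_day)
        A = np.column_stack([np.ones(tr.stop - tr.start), X[tr]])
        coef, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
        pred = np.column_stack([np.ones(hours_per_day), X[te]]) @ coef
        errors.append(np.abs(y[te] - pred))
    return float(np.mean(np.concatenate(errors)))


# Synthetic data (not real measurements): one weather input plus a daily level
# that drifts slowly, so consecutive days are correlated.
rng = np.random.default_rng(4)
n_days = 400
level = np.zeros(n_days)
for d in range(1, n_days):
    level[d] = 0.95 * level[d - 1] + rng.normal()
n = n_days * H
x = rng.normal(scale=0.5, size=n)
y = 2.0 * x + np.repeat(level, H) + rng.normal(scale=0.1, size=n)
X = x.reshape(-1, 1)

X_lag, y_lag = add_lagged_y(X, y)   # weather + Y from 24 hours earlier
X_nolag = X[H:]                     # same rows, weather only

results = {}
for window in (15, 30, 60):
    results[window] = {
        "no_lag": rolling_mae(X_nolag, y_lag, window, first_day=60),
        "with_lag": rolling_mae(X_lag, y_lag, window, first_day=60),
    }
    print(window, results[window])
```

The synthetic series was built so that consecutive days are related, so the lagged term has something to pick up here; whether it helps on the real solar data, and which window is then best, can only be settled by running the same comparison on that data.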