Hi,
I have read into the subject of finding good estimators to determine the goodness of fit when the regression on a trainingset is projected on a testset (unseen data). I have found a lot of scientific papers but I get completely lost in terminologie and very complex equations I do not understand.
My own approach, that has suited me well so far, is like this:
1) Train the training dataset
2) Apply the equation derived from step 1 on the test dataset
The problem is that the last step is very expensive in terms of computitive power. I am trying to find another approach.
I will illustrate my dilemma in an extensive example.
We have a total of 20 observations like this:
I have split the data in two subsets. We want to predict the income based on IQ.
Now we do some queries on the datasets:
Where:
The outcome of all these values (except maybe not RMSE) correspond with the values that R calculates.
So for subset one and two we come up with the following equations to predict the income based on IQ:
What I normally do now is project equation subset 1 on dataset 2 and equation subset 2 on dataset 1, so in this case a 2-fold cross-validation.
When I query this in Sql it would look something like this:
I get the following Residual Sum Squares when I cross validate:
Note that the cross-validated residuals for subset 1 (23968.2996788169) is way higher then the not cross validated residuals (3401.8329971).
Although the difference is very high this does not matter for this example.
Now comes my question.
We have used a very small dataset here of a total of 20 observations. Assume the dataset contains millions of observations. Then the cross-validated query would be very expensive. Is there a more efficient way for this last step of cross validation without the need to query the whole dataset again?
I have read this question:
http://stats.stackexchange.com/questions/85507/what-is-the-rmse-of-k-fold-cross-validation
But this is not what I am looking for. It is still not applying the outcome of a regression on unseen data. Also I know R has some great packages that deal with cross validation but I am trying to find an algorithm that I can use in my own environment.
I am sorry if I mixed any terminology in this post. I am not a scientist in the field of linear or matrix algebra at all.
Any help is appreciated.
I have read into the subject of finding good estimators to determine the goodness of fit when the regression on a trainingset is projected on a testset (unseen data). I have found a lot of scientific papers but I get completely lost in terminologie and very complex equations I do not understand.
My own approach, that has suited me well so far, is like this:
1) Train the training dataset
2) Apply the equation derived from step 1 on the test dataset
The problem is that the last step is very expensive in terms of computitive power. I am trying to find another approach.
I will illustrate my dilemma in an extensive example.
We have a total of 20 observations like this:
HTML:
Subset IQ Income
1 82 30
1 87 75
1 93 25
1 99 45
1 105 50
1 110 40
1 122 70
1 122 65
1 126 110
1 140 95
2 83 45
2 86 30
2 86 50
2 92 45
2 95 120
2 101 75
2 111 65
2 118 85
2 119 120
2 133 200
Now we do some queries on the datasets:
HTML:
Subset Count Sum(IQ) Sum(IQ^2) Sum(Income) sum(Iq*Income) Variance(Income)
Both 20 2110 228578 1440 160900 1745.78947
1 10 1086 121112 605 69045 769.16667
2 10 1024 107466 835 91855 2622.50000
From this we can populate the matrices to calculate the predictors (coefficients) and some basic information of the regression:
Subset DesignMatrix ScoreMatrix (Design Matrix)-1 (Design Matrix)-1 x ScoreMatrix ESS TSS RSS RMSE R Squared
All 20 2110 1440 1.9134270886 -0.0176628160 -86.6120877281 13500.820358279 33170 19669.179641721 983.4589820861 0.4070190039
2110 228578 160900 -0.0176628160 0.0001674201 1.5034321112
1 10 1086 605 3.8176774682 -0.0342327575 -53.9058756777 3520.6670029 6922.5 3401.8329971 170.091649855 0.5085831712
1086 121112 69045 -0.0342327575 0.0003152188 1.0534611020
2 10 1024 835 4.1199969330 -0.0392577825 -165.8261769667 15463.5795890201 23602.5 8138.9204109799 406.946020549 0.65516702
1024 107466 91855 -0.0392577825 0.0003833768 2.4348259469
Where:
HTML:
ESS=Explained Sum Squared. We derive this value from using the formula: CoefficientPredictor^2 * Row(1,1) From (Design Matrix)-1. In the case of Subset 1 this would be 1.0534611020^2 * 0.0003152188^-1
TSS=Total Sum Squared=Variance(Income) * (Count-1)
RSS=Residual Sum Squared=TSS-ESS
RMSE=Root Mean Squared Error=RSS/Count. I am not sure if this is the correct way to calculate this. Maybe we need to take the square root of this value.
R2=ESS/TSS
The outcome of all these values (except maybe not RMSE) correspond with the values that R calculates.
So for subset one and two we come up with the following equations to predict the income based on IQ:
HTML:
For subset 1: -53.9058756777 + 1.0534611020 * IQ
For subset 2: -165.8261769667 + 2.4348259469 * IQ
When I query this in Sql it would look something like this:
HTML:
....
Case When SubSet=2 Then -53.9058756777199 + (1.05346110200479 * [IQ]) When SubSet=1 Then -165.826176966724 + (2.43482594694066 * [IQ]
....
HTML:
Subset 1 23968.2996788169
SubSet 2 21837.2586074513
Total 45805.5582862682
Although the difference is very high this does not matter for this example.
Now comes my question.
We have used a very small dataset here of a total of 20 observations. Assume the dataset contains millions of observations. Then the cross-validated query would be very expensive. Is there a more efficient way for this last step of cross validation without the need to query the whole dataset again?
I have read this question:
http://stats.stackexchange.com/questions/85507/what-is-the-rmse-of-k-fold-cross-validation
But this is not what I am looking for. It is still not applying the outcome of a regression on unseen data. Also I know R has some great packages that deal with cross validation but I am trying to find an algorithm that I can use in my own environment.
I am sorry if I mixed any terminology in this post. I am not a scientist in the field of linear or matrix algebra at all.
Any help is appreciated.