Hi,
I have read into the subject of finding good estimators to determine the goodness of fit when the regression on a trainingset is projected on a testset (unseen data). I have found a lot of scientific papers but I get completely lost in terminologie and very complex equations I do not understand.
My own approach, that has suited me well so far, is like this:
1) Train the training dataset
2) Apply the equation derived from step 1 on the test dataset
The problem is that the last step is very expensive in terms of computitive power. I am trying to find another approach.
I will illustrate my dilemma in an extensive example.
We have a total of 20 observations like this:
I have split the data in two subsets. We want to predict the income based on IQ.HTML Code:Subset IQ Income 1 82 30 1 87 75 1 93 25 1 99 45 1 105 50 1 110 40 1 122 70 1 122 65 1 126 110 1 140 95 2 83 45 2 86 30 2 86 50 2 92 45 2 95 120 2 101 75 2 111 65 2 118 85 2 119 120 2 133 200
Now we do some queries on the datasets:
Where:HTML Code:Subset Count Sum(IQ) Sum(IQ^2) Sum(Income) sum(Iq*Income) Variance(Income) Both 20 2110 228578 1440 160900 1745.78947 1 10 1086 121112 605 69045 769.16667 2 10 1024 107466 835 91855 2622.50000 From this we can populate the matrices to calculate the predictors (coefficients) and some basic information of the regression: Subset DesignMatrix ScoreMatrix (Design Matrix)-1 (Design Matrix)-1 x ScoreMatrix ESS TSS RSS RMSE R Squared All 20 2110 1440 1.9134270886 -0.0176628160 -86.6120877281 13500.820358279 33170 19669.179641721 983.4589820861 0.4070190039 2110 228578 160900 -0.0176628160 0.0001674201 1.5034321112 1 10 1086 605 3.8176774682 -0.0342327575 -53.9058756777 3520.6670029 6922.5 3401.8329971 170.091649855 0.5085831712 1086 121112 69045 -0.0342327575 0.0003152188 1.0534611020 2 10 1024 835 4.1199969330 -0.0392577825 -165.8261769667 15463.5795890201 23602.5 8138.9204109799 406.946020549 0.65516702 1024 107466 91855 -0.0392577825 0.0003833768 2.4348259469
The outcome of all these values (except maybe not RMSE) correspond with the values that R calculates.HTML Code:ESS=Explained Sum Squared. We derive this value from using the formula: CoefficientPredictor^2 * Row(1,1) From (Design Matrix)-1. In the case of Subset 1 this would be 1.0534611020^2 * 0.0003152188^-1 TSS=Total Sum Squared=Variance(Income) * (Count-1) RSS=Residual Sum Squared=TSS-ESS RMSE=Root Mean Squared Error=RSS/Count. I am not sure if this is the correct way to calculate this. Maybe we need to take the square root of this value. R2=ESS/TSS
So for subset one and two we come up with the following equations to predict the income based on IQ:What I normally do now is project equation subset 1 on dataset 2 and equation subset 2 on dataset 1, so in this case a 2-fold cross-validation.HTML Code:For subset 1: -53.9058756777 + 1.0534611020 * IQ For subset 2: -165.8261769667 + 2.4348259469 * IQ
When I query this in Sql it would look something like this:I get the following Residual Sum Squares when I cross validate:HTML Code:.... Case When SubSet=2 Then -53.9058756777199 + (1.05346110200479 * [IQ]) When SubSet=1 Then -165.826176966724 + (2.43482594694066 * [IQ] ....
Note that the cross-validated residuals for subset 1 (23968.2996788169) is way higher then the not cross validated residuals (3401.8329971).HTML Code:Subset 1 23968.2996788169 SubSet 2 21837.2586074513 Total 45805.5582862682
Although the difference is very high this does not matter for this example.
Now comes my question.
We have used a very small dataset here of a total of 20 observations. Assume the dataset contains millions of observations. Then the cross-validated query would be very expensive. Is there a more efficient way for this last step of cross validation without the need to query the whole dataset again?
I have read this question:
http://stats.stackexchange.com/quest...oss-validation
But this is not what I am looking for. It is still not applying the outcome of a regression on unseen data. Also I know R has some great packages that deal with cross validation but I am trying to find an algorithm that I can use in my own environment.
I am sorry if I mixed any terminology in this post. I am not a scientist in the field of linear or matrix algebra at all.
Any help is appreciated.
Tweet |