Hi guys, I am a bit confused about what k-fold cross-validation does. Is it better to remove non-significant variables first and then tune with k-fold cross-validation, or the other way around? Or does CV have nothing to do with tuning the model?
CV is a method to estimate the test error of your model, i.e. its performance on a new data set. IMO the best choice is to tune your model using CV; that is equivalent to picking the model with the best chance of having a low test error.
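For example, a minimal sketch of what I mean by tuning with CV, assuming scikit-learn and using a made-up dataset and hyperparameter grid just for illustration:

```python
# Sketch: tuning a hyperparameter with 5-fold CV (synthetic data, placeholder grid)
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# 5-fold CV over a small grid of regularization strengths
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)

print(grid.best_params_)   # hyperparameters with the lowest mean CV error
print(-grid.best_score_)   # the corresponding mean CV RMSE
```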
I was reviewing K-fold CV for my own use and thought I would update this thread. I still agree with my statements above, and it is also common to run multiple candidate models through the K-fold process and compare their mean RMSE or accuracy.
I have not seen a good description of criteria for selecting between the candidate models beyond the lowest error and, say, parsimony. However, I feel like the former approach does not fully address the dispersion around the means unless you run a higher number of folds. Perhaps you can compare the lower bounds of the 95% confidence intervals?
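Something like the following rough sketch is what I have in mind, assuming scikit-learn, a synthetic dataset, and two arbitrary candidate models; the interval is a crude normal approximation around the mean fold RMSE:

```python
# Sketch: comparing candidate models by mean k-fold RMSE and its spread across folds
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    # negate because scikit-learn reports errors as negative scores
    scores = -cross_val_score(model, X, y, cv=10,
                              scoring="neg_root_mean_squared_error")
    mean = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    # crude 95% interval around the mean fold RMSE (normal approximation)
    print(f"{name}: RMSE {mean:.2f} +/- {1.96 * se:.2f}")
```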
The criteria for model selection are varied and depend on business context. Do you need the model to be comprehensible? Then using components from PCA to build a better-fitting model may not be the best choice, even if test (out-of-sample) accuracy and estimation speed improve. Maybe you just want the best (lowest) test error, or you may accept a little more potential error in exchange for fewer variables (e.g., a smaller k in k-means clustering). Often the criteria don't clearly determine a single best model. Sure, numerically there is a lowest, but you may have, say, 5 models very near it and then a jump in test error that clearly separates them from the rest. Which do you choose? That's where parsimony may step in and say the one with the fewest parameters among those 5 is the better option.
Agreed, BG. I have not used PCA yet; I was under the impression it was reserved for continuous covariates? I have also seen the process (CV, not PCA) presented as a graph of error versus the number of predictors. Have you ever also interpreted the KS values from K-fold CV?
To me the whole thing just seems like another subjective/contextual statistical decision.
Yeah, PCA is typically used for continuous variables that you want to reduce down to a few new variables that capture most of the same information (dimensionality reduction). In the context of model selection, if you're building image classification models, for instance, you're absolutely not going to know the meaning of the "variables", and it's even further confounded if, as you probably would in this case, you use PCA to reduce the many columns of your image bitmap to a few columns that capture the main information. Interpretability is thrown out the window!
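As a quick illustration of the kind of reduction I mean, here's a small sketch assuming scikit-learn, with its digits dataset standing in for the "bitmap columns" (the choice of 10 components is arbitrary):

```python
# Sketch: PCA dimensionality reduction on image-like data
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64 pixel columns per image

pca = PCA(n_components=10)             # keep 10 components instead of 64
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (1797, 10)
print(pca.explained_variance_ratio_.sum())    # share of variance retained
```

The reduced columns are linear combinations of all the original pixels, which is exactly why interpretability goes out the window.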
Not sure what the KS is you're referring to.
There are typically 2 CV charts you might want to explore: training vs. test error as the number of parameters (predictors) increases, and training vs. test error as the sample size n increases. These will help you understand the bias vs. variance trade-off, and whether more data is needed to improve your model fit (or when it won't help) vs. when you may need to complicate your model with more parameters (e.g., a larger k in k-means).
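If it helps, a hedged sketch of how you might compute the numbers behind those two charts, assuming scikit-learn; the estimator (a decision tree) and the complexity parameter (max_depth) are just placeholders:

```python
# Sketch: data behind the two CV charts (error vs. sample size, error vs. complexity)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# training vs. test accuracy as the training set grows
sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
print(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1))

# training vs. test accuracy as model complexity (tree depth) increases
train_scores, test_scores = validation_curve(
    model, X, y, param_name="max_depth", param_range=range(1, 11), cv=5)
print(train_scores.mean(axis=1), test_scores.mean(axis=1))
```

Plot the mean train and test scores against sample size or depth and the gap between the two curves shows you the bias/variance picture directly.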