
Thread: Cross validation question

  1. #1
    Hi guys, I am a bit confused about what k-fold cross validation does. Is it better to remove non-significant variables first and then tune with k-fold cross validation, or the other way around? Or does CV have nothing to do with tuning the model?

  2. #2 — rogojel

    Re: Cross validation question

    hi,
    CV is a method to estimate the test error of your model, i.e. its performance on a new data set. IMO the best choice is to tune your model using CV; that is equivalent to picking the model with the best chance of having a low test error.

    regards
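    rogojel's point (tune using CV, because CV estimates test error) can be sketched in plain Python. The toy data, the two candidate models (a mean-only fit vs. a least-squares line), and the 5-fold split are all illustrative assumptions, not anything from the thread:

```python
import random
import statistics

def kfold(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Intercept-only model: predict the training mean everywhere."""
    m = statistics.fmean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple least-squares line fitted to the training data."""
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return lambda x: ybar + slope * (x - xbar)

def cv_mse(xs, ys, fit, k=5):
    """Estimate test error: mean squared error averaged over k held-out folds."""
    errs = []
    for fold in kfold(len(xs), k):
        held = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in held]
        tr_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(tr_x, tr_y)
        errs.append(statistics.fmean((ys[i] - model(xs[i])) ** 2 for i in fold))
    return statistics.fmean(errs)

# Toy data (an assumption for illustration): y = 2x + noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 * x + rng.gauss(0, 0.3) for x in xs]

# Tuning = picking the candidate with the lower estimated test error.
print(cv_mse(xs, ys, fit_mean), cv_mse(xs, ys, fit_line))
```

    Here the line wins by a wide margin, so CV would select it; with real data the same loop would compare whatever candidate models (or tuning-parameter values) you care about.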

  3. #3 — hlsmith

    Re: Cross validation question

    I believe you can use CV in many different ways. The way I traditionally think of k-fold is that the same model gets fitted on the different folds, and then the per-fold performance estimates are averaged.


  5. #5 — hlsmith

    Re: Cross validation question

    I was reviewing k-fold CV for my own use and thought I would update this thread. I still agree with my statements above, and it is also common to run multiple candidate models through the k-fold process and compare their mean RMSE or accuracy values.


    I have not seen a good description of criteria for selecting between the candidate models beyond the lowest error and, say, parsimony. However, I feel the former approach does not fully address the dispersion around those means, unless you are able to run a higher number of folds. Perhaps you could compare the lower 95% confidence interval values?
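    hlsmith's idea of looking at dispersion, not just the mean, could be sketched like this. The per-fold residuals are made up, and the normal-approximation interval is an illustrative assumption (with only 10 folds a t-quantile would be more defensible):

```python
import random
import statistics

def per_fold_rmse(residual_sets):
    """RMSE within each fold, given per-fold lists of residuals."""
    return [statistics.fmean(r * r for r in fold) ** 0.5 for fold in residual_sets]

# Hypothetical per-fold residuals for one candidate model (10 folds of 30 points).
rng = random.Random(0)
folds = [[rng.gauss(0, 1.0) for _ in range(30)] for _ in range(10)]

rmses = per_fold_rmse(folds)
mean = statistics.fmean(rmses)
sd = statistics.stdev(rmses)
k = len(rmses)
# Normal-approximation 95% interval for the mean CV RMSE; comparing the
# lower bounds across candidate models is one way to act on the dispersion.
lo, hi = mean - 1.96 * sd / k ** 0.5, mean + 1.96 * sd / k ** 0.5
print(round(mean, 3), round(lo, 3), round(hi, 3))
```

    The same mean with a wider interval is weaker evidence that one candidate really beats another, which is the concern raised above.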

  6. #6 — bryangoodrich

    Re: Cross validation question

    The criteria for model selection vary and depend on business context. Do you need the model to be comprehensible? Then building a model on components from PCA may not be the best choice, even if test (out-of-sample) accuracy and estimation speed improve. Maybe you just want the lowest test error, or you may accept a little more error in exchange for fewer variables (e.g., a smaller k in k-means clustering). Often the criteria don't clearly single out one best model. Sure, numerically there is a lowest error, but you may have, say, 5 models very near it and then a jump in test error that clearly separates the rest. Which do you choose? That's where parsimony may step in: the model with the fewest parameters among those 5 is the better option.
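    The pattern described here (several models within noise of the best, then prefer the simplest) is close to what is often called the one-standard-error rule. A sketch with made-up (num_params, mean_cv_error, se_of_mean) triples:

```python
# Hypothetical CV summaries: (num_params, mean_cv_error, se_of_mean).
candidates = [(2, 0.91, 0.04), (3, 0.84, 0.04), (5, 0.82, 0.03),
              (8, 0.81, 0.03), (20, 0.80, 0.05)]

best_err, best_se = min((e, s) for _, e, s in candidates)
# Keep every model whose mean error is within one SE of the best...
within = [c for c in candidates if c[1] <= best_err + best_se]
# ...and among those, prefer the fewest parameters (parsimony).
chosen = min(within, key=lambda c: c[0])
print(chosen)  # -> (3, 0.84, 0.04)
```

    The 20-parameter model has the lowest mean error, but the 3-parameter model is statistically indistinguishable from it, so parsimony picks the smaller one.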

  7. The Following User Says Thank You to bryangoodrich For This Useful Post:

    hlsmith (11-30-2016)

  8. #7 — hlsmith

    Re: Cross validation question

    Agreed, BG. I have not used PCA yet; I was thinking it was more reserved for continuous covariates? I have also seen the CV process (not PCA) presented with the number of predictors plotted against error. Have you ever interpreted the KS values from k-fold CV?


    To me the whole thing just seems like another subjective/contextual statistical decision.

  9. #8 — bryangoodrich

    Re: Cross validation question

    Yeah, PCA is typically used for continuous variables that you want to reduce to a few components capturing the same information (dimensionality reduction). In the context of model selection, if you're doing image classification, for instance, you're absolutely not going to know the meaning of the "variables", and it's even further confounded if, as you probably would in that case, you use PCA to reduce the many columns of your image bitmap to a few columns carrying the main information. Interpretability is thrown out the window!

    Not sure what the KS is you're referring to.

    There are typically 2 CV charts you might want to explore: training vs test error across an increasing number of parameters (predictors), and training vs test error across sample size (as n increases). These will help you understand the bias vs variance trade-off, and whether more data might improve your model fit (or when it won't help) vs when you may need to complicate your model with more parameters (e.g., a larger k in k-means).
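    The first chart mentioned above (training vs test error as model flexibility grows) can be sketched with a toy 1-D k-nearest-neighbours regression, where a smaller k means a more flexible model. The data and the choice of kNN are illustrative assumptions:

```python
import random
import statistics

def knn_predict(train, x, k):
    """Average y of the k training points nearest to x (1-D kNN regression)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean(y for _, y in nearest)

def mse(train, eval_set, k):
    """Mean squared error of k-NN predictions over eval_set."""
    return statistics.fmean((y - knn_predict(train, x, k)) ** 2
                            for x, y in eval_set)

# Toy data (an assumption): y = 2x + noise, split into train and test.
rng = random.Random(0)
data = [(x := rng.uniform(0, 6), 2 * x + rng.gauss(0, 0.5)) for _ in range(120)]
train, test = data[:80], data[80:]

# Smaller k = more flexible model: training error shrinks toward zero,
# while test error eventually rises again (the bias-variance trade-off).
for k in (1, 5, 20, 60):
    print(k, round(mse(train, train, k), 3), round(mse(train, test, k), 3))
```

    At k = 1 the training error is exactly zero (each point predicts itself) while the test error is not, which is the gap the chart makes visible.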

  10. #9 — hlsmith

    Re: Cross validation question


    I wonder if there may be any merit in using a multiple imputation pooling algorithm on the statistics generated across the CV folds?
