
Thread: model selection - keeping binned and continuous version of variables in same model

  1. #1

    model selection - keeping binned and continuous version of variables in same model


    I have a dataset of about 160,000 observations and I'm trying to build a binary classifier. Let's call the variable to be classified Y and my predictors X1 through X10. As you can see, I don't have a very onerous model selection process, because I am only looking at about 10 variables and they are largely uncorrelated.

    Most of the predictor variables are continuous, but some of them look like they could/should be "binned", either because they have a lot of repeating values and not much in between, or because they only seem to matter above a certain value. For example, say X1 has no effect on Y for 0<X1<10, but for X1>=10 there seems to be a linear relationship. I'd like to keep X1 in the model as a continuous variable while also including a dummy variable for whether X1 is at least 10.

    My colleagues said they would not do that - they would choose one or the other, not both. But isn't the categorical version of X1 essentially just a transformation of X1? And isn't it commonplace to keep a transformation of X in a model WITH X, for example X^2?

    I did some initial tests keeping both the binned and the continuous variables in the model, and both came out significant. There's a bit of correlation between them but nothing too bad, maybe around 0.5. (Plus, going back to my example, wouldn't X^2 necessarily be somewhat correlated with X, and yet we still keep them both in the model, right?)
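The correlation comparison raised here can be checked directly. The following is only a minimal sketch on made-up data (the grid 0..19 for X1 and the helper name `pearson` are assumptions for illustration, not the poster's dataset): it computes the correlation of X1 with its threshold indicator and with X1^2.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy data (an assumption for illustration): X1 takes the values 0..19.
x1 = [float(v) for v in range(20)]
above_10 = [1.0 if v >= 10 else 0.0 for v in x1]  # dummy for X1 >= 10
x1_squared = [v ** 2 for v in x1]                 # the usual polynomial term

r_indicator = pearson(x1, above_10)
r_square = pearson(x1, x1_squared)
print(f"corr(X1, indicator) = {r_indicator:.3f}")
print(f"corr(X1, X1^2)      = {r_square:.3f}")
```

On this toy grid both correlations are high (roughly 0.87 and 0.97), and the squared term is the more correlated of the two. Real data with a different distribution of X1 will give different numbers, such as the 0.5 or so reported above.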

    Please let me know if I'm completely off here. My goal is to build the model that predicts best, i.e. with high recall and precision. While this implies that the coefficients have to be stable and accurate enough to perform well on a test set, I'm less concerned with interpreting what each coefficient means.


  2. #2
    staassis

    Re: model selection - keeping binned and continuous version of variables in same model

    As long as you have enough data, you are perfectly sane in using both variables in the model: the original continuous variable and the threshold indicator. Together they represent a simple non-linear, discontinuous dependence of the response variable on the original continuous variable. If both parameters are statistically significant, you can keep both variables in the model. Otherwise, either the dependency structure is simpler than you thought or you do not have enough data to estimate a small effect.
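To make the "discontinuous dependence" concrete, here is a minimal sketch assuming a logistic model with an intercept, X1, and the indicator for X1 >= 10. The coefficient values and the function names `features` and `predicted_probability` are hypothetical, chosen only to show the shape of the fitted curve; they are not estimates from the poster's data.

```python
import math

THRESHOLD = 10.0  # cut point taken from the example in the question

def features(x1, threshold=THRESHOLD):
    """Return [intercept, continuous X1, indicator for X1 >= threshold]."""
    return [1.0, x1, 1.0 if x1 >= threshold else 0.0]

def predicted_probability(x1, coefs, threshold=THRESHOLD):
    """Logistic-model probability of Y = 1 for a single value of X1."""
    eta = sum(b * f for b, f in zip(coefs, features(x1, threshold)))
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients: intercept, slope on X1, jump at the threshold.
coefs = [-2.0, 0.05, 1.5]

just_below = predicted_probability(9.999, coefs)
at_threshold = predicted_probability(10.0, coefs)
print(f"P(Y=1 | X1=9.999) = {just_below:.3f}")
print(f"P(Y=1 | X1=10)    = {at_threshold:.3f}")
```

The slope term moves the prediction smoothly with X1, while the indicator adds a one-time jump at the cut point. If the jump coefficient were close to zero, the indicator would add nothing beyond the continuous term, which corresponds to the "simpler dependency structure" case.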

  3. The Following User Says Thank You to staassis For This Useful Post:

    Ctiger06 (04-22-2014)
