
Thread: Machine Learning Classification (R)

  #1 hlsmith

    Machine Learning Classification (R)




    I have a continuous variable that I want to dichotomize as a classifier for a binary dependent variable. Last year I chose the split using logic based on the receiver operating characteristic (ROC) curve. I have run decision trees on these problems before, and they find the same split as the ROC approach (which takes me some time to work through by hand). Both are probably applying a Gini or accuracy criterion in much the same way.
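
    For reference, here is the kind of thing I mean by the ROC approach, as a minimal sketch. The pROC package and the Youden criterion are one standard way to pick such a cutoff (not necessarily exactly what I did), and the data frame dat is made up purely for illustration.

    Code:
    library(pROC)

    ## toy data: binary outcome y driven by continuous predictor x
    set.seed(42)
    dat <- data.frame(x = rnorm(200))
    dat$y <- rbinom(200, 1, plogis(1.5 * dat$x))

    roc_obj <- roc(response = dat$y, predictor = dat$x)

    ## "best" cutoff by Youden's J (sensitivity + specificity - 1)
    coords(roc_obj, x = "best", best.method = "youden",
           ret = c("threshold", "sensitivity", "specificity"))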


    I was curious whether anybody has derived such splits from, say, random forests, K-means clusters (with bootstrap resampling), or maybe even SVM, which I haven't used. To maximize out-of-sample success, I thought it might be useful to average the cutoff values found by these approaches rather than rely on my standard ROC-style method alone, in effect creating a type of ensemble.


    I would be interested in hearing people's ideas as well as example code (which can be in R).


    Thanks.


  #2 rogojel

    Re: Machine Learning Classification (R)

    hi,
    why would you want to do this? I would say you will lose information through the dichotomisation; it can't possibly be better than using the original values. Or am I missing something?

  #3 hlsmith

    Re: Machine Learning Classification (R)

    In medicine we constantly create thresholds for clinical decision making: blood pressure for hypertension, blood sugar for diabetes, cholesterol for dyslipidemia, etc. Some thresholds mark risk, while others separate two underlying distributions that the variable of interest is a mixture of, say white blood cell counts for people with or without HIV, where the disease shifts the variable. Another example is age and heart disease, where age is a proxy for other latent factors; the same logic is behind guidelines for older men to get prostate exams. These thresholds help clinicians make recommendations.


    I am just establishing a cutoff point that best classifies patient risk for an event.


  #4 rogojel

    Re: Machine Learning Classification (R)

    I see,
    I guess I would try all the candidate cutoff points and use CV to see which one gives the best prediction, kind of like a regression tree does.
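
    A minimal base-R sketch of that grid search, assuming 10-fold CV scored on out-of-fold accuracy (the data frame dat with outcome y and predictor x is made up for illustration):

    Code:
    set.seed(42)
    dat <- data.frame(x = rnorm(200))
    dat$y <- rbinom(200, 1, plogis(1.5 * dat$x))

    k     <- 10
    folds <- sample(rep(1:k, length.out = nrow(dat)))
    grid  <- quantile(dat$x, probs = seq(0.05, 0.95, by = 0.05))

    ## for each candidate cutoff, fit on k-1 folds and score the held-out fold
    cv_acc <- sapply(grid, function(cut) {
      mean(sapply(1:k, function(i) {
        train <- dat[folds != i, ]
        test  <- dat[folds == i, ]
        fit   <- glm(y ~ I(x > cut), data = train, family = binomial)
        pred  <- predict(fit, newdata = test, type = "response") > 0.5
        mean(pred == (test$y == 1))   # out-of-fold accuracy
      }))
    })

    grid[which.max(cv_acc)]           # cutoff with the best CV accuracy

    Accuracy is just one choice of score here; AUC or log-loss would slot into the same place.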

    regards


  #5 bryangoodrich

    Re: Machine Learning Classification (R)

    Yeah, I agree with rogojel. Your best bet is probably to use CV and search (loop) across many different cutoff options. The one that gives the best out-of-sample prediction (via CV) would be your optimal choice for predictive success, at least given the model and data you're using. You might even compare that to the split a regression tree (or forest) chooses, as a pre-modeling effort to pick the best features and the feature engineering (in this case, discretizing the continuous variable). That way you can see whether the parameter search plus CV and the tree/forest arrive at the same cutoff. This is not an uncommon problem in ML model development: feature engineering and feature selection often involve ML modeling themselves, to inform the model you ultimately want to operationalize.
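
    As a sketch of that comparison, here is where a one-split classification tree puts the cutoff. rpart is my choice of package, and dat, y, and x are the same sort of made-up placeholders as in the earlier sketches:

    Code:
    library(rpart)

    set.seed(42)
    dat <- data.frame(x = rnorm(200))
    dat$y <- rbinom(200, 1, plogis(1.5 * dat$x))

    ## a depth-1 tree (a "stump") forces a single split on x
    stump <- rpart(y ~ x, data = dat, method = "class",
                   control = rpart.control(maxdepth = 1, cp = 0))

    ## for a continuous predictor, the chosen split point sits in the
    ## "index" column of the splits matrix
    stump$splits[1, "index"]

    If the CV grid search and the stump land on roughly the same value, that is some reassurance the cutoff is not an artifact of one method.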


  #6 hlsmith

    Re: Machine Learning Classification (R)

    There is a group at Berkeley that runs all models, parametric or not, with CV, weights them, and then uses the weighted ensemble. I have seen them do this for many inferential statistics, but not for a cutoff problem. The approach outperforms everything else and rests on the logic that all models are wrong, but an average that leans on models with minimal assumptions may come closest. This is what is driving my question. I have wanted to explore it for about a year (at least in the back of my mind), but it requires knowing how to run many types of approaches, and currently I am only savvy with logistic regression.
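
    If this is the Berkeley group I think it is (van der Laan and colleagues; that identification is my guess, not stated above), the method is implemented in the SuperLearner package. A minimal sketch with a small library of candidate learners and made-up data:

    Code:
    library(SuperLearner)

    set.seed(42)
    dat <- data.frame(x = rnorm(200))
    dat$y <- rbinom(200, 1, plogis(1.5 * dat$x))

    ## CV-weighted ensemble over a few candidate learners
    sl <- SuperLearner(Y = dat$y, X = dat["x"], family = binomial(),
                       SL.library = c("SL.mean", "SL.glm", "SL.rpart"),
                       cvControl  = list(V = 10))

    sl$coef               # CV-derived weight given to each candidate
    head(sl$SL.predict)   # ensemble predicted probabilities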

  #7 bryangoodrich

    Re: Machine Learning Classification (R)


    Yeah, ensembles are generally supposed to do at least as well as any single model. It's not unlike doing GAMs. You just need a way to pull the models together; typically that is bagging or boosting, and I always mix up the exact difference between the two. They're very cool to put into use, but you also move very far from explanatory models toward purely predictive ones, and you need to keep tuning and updating as conditions change to make sure the model still applies to the business problem it was designed for.
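
    A rough sketch of the two flavors: randomForest for bagging-style averaging of trees grown on bootstrap resamples, and gbm for boosting, where shallow trees are added sequentially to correct the previous ones' errors. Both package choices and the data are just for illustration:

    Code:
    library(randomForest)   # bagging-style: average trees grown on resamples
    library(gbm)            # boosting: add shallow trees fit to prior errors

    set.seed(42)
    dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
    dat$y <- rbinom(300, 1, plogis(dat$x1 - 0.5 * dat$x2))

    rf <- randomForest(factor(y) ~ x1 + x2, data = dat, ntree = 500)

    gb <- gbm(y ~ x1 + x2, data = dat, distribution = "bernoulli",
              n.trees = 500, interaction.depth = 2, shrinkage = 0.05)

    head(predict(rf, type = "prob")[, "1"])             # RF out-of-bag probabilities
    head(predict(gb, newdata = dat,
                 n.trees = 500, type = "response"))     # GBM fitted probabilities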
