I am running a test logistic regression to predict whether employees will stay with the company for more than 3 years.
After the model is trained, its predictions are only ever the probabilities "1" and "2.2204E-16" (essentially 0).
I thought the probabilities would normally lie somewhere between 0 and 1. Is this due to a lack of training data, or to a model convergence problem? Are there ways to solve this?
How many predictors are you using? What are your sample sizes?
I don't have emotions and sometimes that makes me very sad.
Can you post your full output?
Stop cowardice, ban guns!
Thanks for helping
I used the Matlab function "fitglm" to implement the logistic regression, setting the 'Distribution' parameter to 'binomial':
Logi_COE_P = fitglm(training_data_matrix, result_data_matrix, 'linear', 'CategoricalVars', CategorialVariables, 'Distribution', 'binomial', 'Link', 'logit', 'BinomialSize', 1, 'DispersionFlag', true, 'Weights', OverllDataWeight);
During the training process, it gives the warnings:
Warning: Removing terms where categorical variables
appear in powers higher than linear.
> In FormulaProcessor>FormulaProcessor.removeCategoricalPowers at 510
In TermsRegression>TermsRegression.removeCategoricalPowers at 396
In GeneralizedLinearModel>GeneralizedLinearModel.fit at 1244
In fitglm at 133
In Forecast at 248
Warning: Iteration limit reached.
> In glmfit at 368
In GeneralizedLinearModel>GeneralizedLinearModel.fitter at 919
In FitObject>FitObject.doFit at 220
In GeneralizedLinearModel>GeneralizedLinearModel.fit at 1245
In fitglm at 133
In Forecast at 248
Warning: Regression design matrix is rank deficient
to within machine precision.
> In TermsRegression>TermsRegression.checkDesignRank at 98
In GeneralizedLinearModel>GeneralizedLinearModel.fit at 1262
In fitglm at 133
In Forecast at 248
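A minimal Python sketch (not the MATLAB call above) of one generic way a design matrix ends up rank deficient: an intercept column plus a dummy column for every level of a categorical variable. The matrix `design` and the helper `matrix_rank` are hypothetical illustrations; the thread does not establish that this is what happened in the fit above.

```python
def matrix_rank(rows, tol=1e-10):
    """Rank of a small matrix via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the row (at or below the current rank) with the largest entry.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

# Hypothetical design matrix: intercept, then one dummy per level of a
# three-level categorical variable. The three dummies sum to the intercept.
design = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
reduced = [row[:3] for row in design]  # drop the last dummy (reference level)

print(matrix_rank(design), "of", len(design[0]), "columns")    # rank deficient
print(matrix_rank(reduced), "of", len(reduced[0]), "columns")  # full column rank
```

When the rank is lower than the number of columns, some coefficients cannot be estimated separately, which is what the rank-deficiency warning is reporting.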
For the predictions given by the trained model, it gives:
Probability of employee staying more than 3 years: 1 1 1 1 1 1 2.22E-16 2.22E-16 2.22E-16 2.22E-16 2.22E-16 2.22E-16 2.22E-16 2.22E-16
Employee number: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Any idea what all these mean?
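One possible explanation, offered as an assumption rather than a diagnosis of this data set, is (quasi-)complete separation: when some combination of predictors splits the 1s from the 0s perfectly, the likelihood has no finite maximum, the coefficients keep growing until the iteration limit is hit, and the fitted probabilities are pushed to 0 and 1. Note also that 2.2204E-16 is double-precision machine epsilon, which suggests the software clipped the probability rather than estimated it. The Python sketch below uses hypothetical toy data and plain gradient ascent (not the algorithm `fitglm` uses) to show the effect.

```python
import math
import sys

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, iterations, step=0.5):
    """Gradient ascent on the log-likelihood of P(y=1) = sigmoid(b0 + b1*x)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(iterations):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += step * sum(residuals) / n
        b1 += step * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1

# Hypothetical toy data: every negative x is a 0, every positive x is a 1,
# so a single threshold separates the two outcomes perfectly.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]

short_fit = fit_logistic(xs, ys, iterations=200)
long_fit = fit_logistic(xs, ys, iterations=20000)
probs = [sigmoid(long_fit[0] + long_fit[1] * x) for x in xs]

print("slope after 200 iterations:  ", short_fit[1])
print("slope after 20000 iterations:", long_fit[1])  # still growing
print("fitted probabilities:", probs)
print("machine epsilon:", sys.float_info.epsilon)
```

The slope never settles: more iterations only give a larger value and more extreme probabilities. If this is what is happening, raising the iteration limit will not help; fewer predictors or a penalized fit would be the usual things to try.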
No idea, I'm not familiar enough with MATLAB or the procedure. I would consult the documentation for the procedure. Is this a cross-validation procedure? Sometimes you can change the number of iterations in programs, but you seem to have other issues as well.
Why do you have so many predictors?
Thanks for the effort
Yes, there are a few issues, and I'm not sure which of them causes the unwanted results...
Of the 180,000 observations, how many have the outcome of interest? The general rule is that you take the smaller of the two outcome groups (e.g., if the split were 50%, that is 90,000) and you may be able to support one predictor for every 10 to 20 observations in that group (so 4,500 to 9,000 predictors).
Though, big picture, you seem to be fishing for results instead of making advances based on prior knowledge. You should work on building the model up. Can you get your model to run with a few predictors?
The observations with the desired outcome make up about 1/10 of the sample.
By taking the smaller proportion group of the outcome, do you mean I should pick a portion which contains a similar number of desired and undesired outcomes?
I think you are right. I shouldn't be fishing for results and should try to use a few predictors first, then improve upon that.
Thanks for your suggestions
So if you had 18,000 observations, with 1,800 1s and 16,200 0s, then you may be powered for 90 to 180 predictors. That is a pseudo-generality.
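The arithmetic behind that rule of thumb can be sketched in a few lines of Python. The function name `supported_predictors` and the default of 10 to 20 observations per predictor are illustrative choices taken from the rule as described here, not a fixed standard.

```python
def supported_predictors(n_ones, n_zeros, per_predictor=(10, 20)):
    """Rough range of predictors the rarer outcome group can support.

    per_predictor is the (low, high) number of observations from the smaller
    outcome group that the rule of thumb asks for per predictor.
    """
    smaller_group = min(n_ones, n_zeros)
    low_ratio, high_ratio = per_predictor
    # The stricter ratio (20 per predictor) gives the lower bound.
    return smaller_group // high_ratio, smaller_group // low_ratio

# The example above: 18,000 observations, 1,800 ones and 16,200 zeros.
print(supported_predictors(1800, 16200))  # -> (90, 180)
```

Only the smaller group matters here, so adding more 0s to the sample would not raise the number of predictors the data can support.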