Hi all, I'm doing a project to predict the probability of delinquency for individual loans. The model I fit doesn't seem good and I want to improve it. However, I'm confused by the results I got and don't know what to do next. Can anyone kindly give me some guidance on the project? I'd really appreciate it!
Here's a brief description of the model.
The outcome is binary, where 1 stands for failing to pay in the next month and 0 stands for successfully making the payment. There are both categorical and continuous variables in the model.
The data I have include loans originated between 1998 and 2007, with their origination information (loan size, zip code, credit score, interest rate, purpose code (primary residence or investment), etc.) and dynamic information (payment history, loan age, current FICO score, current loan-to-value ratio, cumulative home price appreciation, etc.). The entire data table is too large to work with, so I drew a random sample stratified by origination year to get a sample dataset of 1 million loans.
Here is how I treat the data: although loan data is a time series, I treat the observations as individual data points. Say loan A has a perfect payment history as of 1/1/2005 (that is, the borrower has made every payment from loan origination until now). I assume the probability of the borrower failing to make the payment on 2/1/2005 depends solely on the variables I mentioned above.
To estimate the coefficients of the variables, I ran a logistic regression. Since the model includes continuous variables, I think it is logistic regression with ungrouped data, so overdispersion wouldn't be a concern here.
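As a rough illustration of this data treatment (not the poster's actual SAS pipeline), the Python sketch below turns one loan's payment record into one row per loan-month, with the next month's outcome as the target. The `LoanMonth` fields and the `to_loan_months` helper are hypothetical names invented for the example, and the real model would carry the other covariates on each row as well.

```python
from dataclasses import dataclass


@dataclass
class LoanMonth:
    loan_id: str
    month_index: int      # loan age in months at the observation date
    paid_so_far: bool     # True if every payment up to this month was made
    dlq_next_month: int   # target: 1 = misses the next payment, 0 = pays


def to_loan_months(loan_id, payments):
    """Expand a payment record into loan-month rows.

    payments[i] is True if the payment due in month i was made.
    The last month has no known next-month outcome, so it yields no row.
    """
    rows = []
    for i in range(len(payments) - 1):
        rows.append(
            LoanMonth(
                loan_id=loan_id,
                month_index=i,
                paid_so_far=all(payments[: i + 1]),
                dlq_next_month=0 if payments[i + 1] else 1,
            )
        )
    return rows


# Hypothetical loan "A": pays months 0 and 1, misses month 2, pays month 3.
rows = to_loan_months("A", [True, True, False, True])
for row in rows:
    print(row)
```

One thing to keep in mind with this layout: rows that come from the same loan are not independent of each other, so standard errors from a plain logistic regression on the stacked rows are likely to be too optimistic.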
The following is what I get by analyzing the result in SAS.
Actual weighted-average delinquency rate vs. predicted weighted-average delinquency rate (see the attached chart).
The chart above looks fine; however, the pseudo R-square is only 0.0826.
The likelihood ratio test shows we should reject the null hypothesis.
The Hosmer-Lemeshow test is significant, indicating the model doesn't fit well.
The area under the ROC curve is 0.64, so the predictive power seems quite poor.
I also used proc glm with the tolerance option to check for multicollinearity; the code is below.
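To make the last two statistics concrete, here is a minimal Python sketch of how the area under the ROC curve and the Hosmer-Lemeshow chi-square can be computed from outcomes and predicted probabilities. It is not the SAS implementation; the function names and the six-observation toy data are made up for illustration, and the code assumes predicted probabilities strictly between 0 and 1.

```python
def auc(y, p):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen event (y = 1) gets a
    higher predicted probability than a randomly chosen non-event (y = 0),
    counting ties as one half.
    """
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic.

    Sorts observations by predicted probability, splits them into
    roughly equal groups, and compares observed with expected events.
    The usual reference distribution is chi-square with groups - 2 df.
    """
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(pi for pi, _ in chunk)
        observed = sum(yi for _, yi in chunk)
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return stat


# Toy data, purely illustrative.
y = [0, 0, 1, 0, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.7, 0.8]
toy_auc = auc(y, p)
toy_hl = hosmer_lemeshow(y, p, groups=2)
print(toy_auc, toy_hl)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so 0.64 is weak discrimination. The Hosmer-Lemeshow test, by contrast, is known to be sensitive to sample size, so with a sample of this size a significant result may reflect small calibration errors rather than a badly wrong model.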
proc glm data = ltm.smp_PA_fixed_amort_3;
class purpose_code prop_type_desc fmonth FICO_bucket year_range(...);
model DLQ = orig_amt purpose_code FICO_bucket months_in_DLQ
loan_age prop_type_desc cumu_HPA fmonth year_range(...)/tolerance;
run;
However, I'm not sure how to interpret the VIF of categorical variables, since each level of a categorical variable has its own tolerance.
Last, I also checked for specification error, using the code below:
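One option sometimes used for this situation is the generalized VIF (GVIF) of Fox and Monette, which gives a single number for all dummy columns of a categorical term. The Python sketch below computes it from a predictor correlation matrix as det(R11) * det(R22) / det(R), where R11 covers the term's own columns and R22 the remaining columns. The correlation values and the helper names are hypothetical, chosen only to show the arithmetic; this is not output from the model above.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def submatrix(m, idx):
    """Rows and columns of m restricted to the indices in idx."""
    return [[m[i][j] for j in idx] for i in idx]


def gvif(corr, term_cols):
    """Generalized VIF for the term whose design columns are term_cols."""
    others = [i for i in range(len(corr)) if i not in term_cols]
    return (det(submatrix(corr, term_cols))
            * det(submatrix(corr, others))
            / det(corr))


# Hypothetical correlation matrix: columns 0 and 1 are the two dummy
# columns of a three-level factor, column 2 is a continuous predictor.
corr = [
    [1.0, -0.5, 0.3],
    [-0.5, 1.0, 0.1],
    [0.3, 0.1, 1.0],
]
factor_gvif = gvif(corr, [0, 1])
df = 2  # number of dummy columns for the factor
comparable = factor_gvif ** (1.0 / (2 * df))
print(factor_gvif, comparable)
```

For a term with a single column the GVIF reduces to the ordinary VIF, that is 1 / tolerance. To compare terms with different numbers of levels, GVIF^(1/(2*df)) is usually reported, which is on the same scale as the square root of an ordinary VIF.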
proc logistic data = temp.PA_fixed_amort_3_dlq_5;
model dlq = pred pred2;
output out = temp.speci_test;
ods output parameterestimates = temp.speci_test_coeff;
run;
Here pred2 = pred**2, where pred is the linear predictor from the logistic regression. Both pred and pred2 are significant; the significant pred2 term indicates misspecification.
Summary:
The fitted chart looks good, and a forward projection using Monte Carlo simulation also shows okay results for now (we project from 2011 to 2020 and compare with the actual rates from 2011 to 2014). But I doubt the robustness of the model, given that many statistics indicate a bad fit…
Can anyone give me some thoughts about where to go next?
Is there any issue with my approach (such as how I check for multicollinearity and specification error) or with my interpretation of the results?
How can I use the tolerance data to decide whether a categorical variable is correlated with other variables and should be removed or transformed?
Too many questions..
Great great thanks if someone can help me out..
Thanks.
hi,
maybe you could try discriminant analysis or regression trees to see whether you get a better model?
regards
rogojel
sufeipopo (08-25-2014)
Thanks rogojel, I'll look into that!