I have 15 time series of returns on classic asset-class investments (stocks, bonds, etc.) and some more advanced strategies that are less correlated with the classic ones. My PCA shows that the first PC mostly loads on the classic investments, whereas the loadings of the other PCs are split among the different strategies, so I do not have a clear-cut picture. In the next step I regress the first three components on macro variables such as market volatility and inflation to test how sensitive they are to different macro regimes.

So the question is whether it is legitimate to apply a varimax rotation to the three components and use the rotated scores in the subsequent regression analysis. On the one hand it would facilitate a better interpretation of the components (which is what I need), but on the other hand it also affects the significance levels in the subsequent analysis.
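For reference, here is a minimal sketch in Python of what that would look like: rotate the loadings of the first three PCs with varimax and carry the rotation over to the scores. The data are simulated stand-ins for the 15 return series, and the varimax routine is a generic textbook implementation, not any particular library's. Because varimax is an orthogonal rotation, the rotated scores are simply the original scores times the rotation matrix:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Varimax rotation; returns rotated loadings and the rotation matrix R."""
    p, k = loadings.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion, rotated via its SVD
        grad = loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt
        if s.sum() < crit * (1 + tol):
            break
        crit = s.sum()
    return loadings @ R, R

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 15))   # stand-in for the 15 return series
X = X - X.mean(axis=0)               # center before PCA
U, S, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt[:3].T                  # 15 x 3 eigenvector loadings (unscaled)
scores = X @ loadings                # unrotated PC scores

rot_loadings, R = varimax(loadings)
rot_scores = scores @ R              # rotated scores for the macro regression
```

Note that since R is orthogonal, the three rotated components span the same subspace as the original three, so a regression on all three scores jointly is unaffected as a whole; only the attribution across individual components (and hence per-component significance) shifts.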

As I’m new to this topic, I would very much appreciate any comments.

Thanks a lot!

Please help me with the following issue. Consider the attached model.

All the IVs at both stages (Xs and Ys) are at the organizational level. The DV (Z) was collected at the individual level and then aggregated to the organizational level in the form of "% of the respondents who A,B,C" (it came that way in the data set).

I tried 2SLS, OLS, and SEM estimation for systems of simultaneous equations, but they all yield a rather low R-squared of about 5% for Z. I believe there is a multilevel issue with the DV, and that is exactly why I am getting a low R-squared.

My initial exploration of this problem points me to "mixed-effects linear regression".

Please let me know whether my thoughts on all this are correct. Also, how would you recommend estimating this model with such a DV?

The outcome is binary: 1 stands for failing to pay in the next month and 0 for successfully making the payment. The model includes both categorical and continuous variables.

The data include loans originated between 1998 and 2007, with their origination information (loan size, zip code, credit score, interest rate, purpose code (primary residence or investment), etc.) and dynamic information (payment history, loan age, current FICO score, current loan-to-value ratio, cumulative home price appreciation, etc.). The full data table is far too large, so I drew a random sample, stratified by origination year, of 1 million loans.

My approach to the data is as follows: although loan data are time series, I treat each loan-month as an individual data point. Say loan A has a perfect payment history as of 1/1/2005 (i.e., the borrower has made every payment since origination); I assume the probability of the borrower failing to make the payment due 2/1/2005 depends solely on the variables mentioned above.

To estimate the coefficients, I ran a logistic regression. Since the model includes continuous variables, the data are ungrouped, so I think overdispersion shouldn't be a concern here.
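The loan-month setup described above is essentially a discrete-time hazard model estimated by logistic regression. A toy sketch on simulated data; the variable names and coefficient values here are invented for illustration (the real model would use the origination and dynamic covariates listed earlier):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5000
# Each row is one loan-month at risk of delinquency
panel = pd.DataFrame({
    "loan_age": rng.integers(1, 120, n),       # months since origination
    "fico":     rng.normal(700, 50, n),        # current FICO score
    "cltv":     rng.uniform(0.3, 1.2, n),      # current loan-to-value ratio
})
# Simulated delinquency indicator: higher CLTV and lower FICO raise the hazard
logit_p = -5 + 0.01 * panel["loan_age"] - 0.005 * (panel["fico"] - 700) + 3 * panel["cltv"]
panel["dlq"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

fit = smf.logit("dlq ~ loan_age + fico + cltv", data=panel).fit(disp=0)
print(fit.params)
```

One caveat with this framing: loan-months from the same loan share unobserved borrower characteristics, so standard errors that ignore the clustering are likely too small even when the coefficients are fine.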

Actual weighted-average delinquency rate vs. predicted weighted-average delinquency rate (see the attached chart).

The chart above looks fine; however, the pseudo R-squared is only 0.0826.

The likelihood-ratio test says we should reject the null hypothesis.

The Hosmer–Lemeshow test is significant, indicating the model doesn't fit well.

The area under the ROC curve is 0.64, so the predictive power seems quite poor.
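A side note on these diagnostics: with about a million loan-months, the Hosmer–Lemeshow test will flag even economically negligible calibration errors, so a decile calibration table read as effect sizes is often more informative than its p-value. A sketch of computing the AUC (via the Mann–Whitney rank formulation, to stay numpy-only) and such a table, with simulated probabilities standing in for the model's predictions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
p_hat = rng.uniform(0.01, 0.30, 100_000)  # stand-in predicted delinquency probabilities
y = rng.binomial(1, p_hat)                # outcomes generated to match p_hat exactly

# AUC via the Mann-Whitney statistic: mean rank of the positives
ranks = p_hat.argsort().argsort() + 1
n1 = y.sum()
n0 = len(y) - n1
auc = (ranks[y == 1].mean() - (n1 + 1) / 2) / n0

# Decile calibration table: the idea behind Hosmer-Lemeshow, but read as
# predicted-vs-observed rates per decile rather than a single p-value
calib = (pd.DataFrame({"y": y, "p": p_hat})
         .assign(decile=lambda d: pd.qcut(d["p"], 10, labels=False))
         .groupby("decile")
         .agg(mean_pred=("p", "mean"), obs_rate=("y", "mean")))
print(auc)
print(calib)
```

In this simulation the predictions are perfectly calibrated by construction, so the deciles line up; on real data, large per-decile gaps localize where the model mis-fits.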

I also used proc glm with the tolerance option to test for multicollinearity; the code is below.

proc glm data = ltm.smp_PA_fixed_amort_3;
    class purpose_code prop_type_desc fmonth FICO_bucket year_range(...);
    model DLQ = orig_amt purpose_code FICO_bucket months_in_DLQ
                loan_age prop_type_desc cumu_HPA fmonth year_range(...) / tolerance;
run;

However, I'm not sure how to interpret the VIF of a categorical variable, since each level of a categorical variable has its own tolerance.
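On that question: one common approach is the generalized VIF of Fox & Monette (1992), computed for the whole block of dummy columns belonging to one categorical variable rather than level by level; GVIF^(1/(2·df)) is then comparable to the square root of an ordinary VIF. A sketch in Python (the column names and the toy design are placeholders, not the actual loan variables):

```python
import numpy as np
import pandas as pd

def gvif(design: pd.DataFrame, block):
    """Generalized VIF for the columns in `block` (e.g. all dummies of one
    categorical predictor) relative to the rest of the design matrix.
    `design` should exclude the intercept."""
    R = design.corr().to_numpy()
    names = list(design.columns)
    i1 = [names.index(c) for c in block]
    i2 = [j for j in range(len(names)) if j not in i1]
    det = np.linalg.det
    # GVIF = det(R11) * det(R22) / det(R)  (Fox & Monette, 1992)
    return det(R[np.ix_(i1, i1)]) * det(R[np.ix_(i2, i2)]) / det(R)

# Toy design: one 3-level categorical (two dummies) plus two continuous IVs
rng = np.random.default_rng(5)
n = 10_000
cat = rng.integers(0, 3, n)
design = pd.DataFrame({
    "cat_b": (cat == 1).astype(float),
    "cat_c": (cat == 2).astype(float),
    "x1": rng.standard_normal(n),
    "x2": rng.standard_normal(n),
})
g = gvif(design, ["cat_b", "cat_c"])
df_block = 2                              # dummies in the block
comparable = g ** (1 / (2 * df_block))    # compare to sqrt(VIF) rules of thumb
```

The GVIF is invariant to how the categorical variable is coded, which is exactly what the per-level tolerances from proc glm are not.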

Last, I also checked for specification error, using the code below:

proc logistic data = temp.PA_fixed_amort_3_dlq_5;
    model dlq = pred pred2;
    output out = temp.speci_test;
    ods output parameterestimates = temp.speci_test_coeff;
run;

Here pred is the linear predictor from the logistic regression and pred2 = pred**2. Both pred and pred2 are significant, indicating misspecification.

The fitted chart looks good, and a forward projection using Monte Carlo simulation also gives reasonable results so far (we project from 2011 to 2020 and compare with the actual rates from 2011 to 2014). But I doubt the robustness of the model, given the many statistics indicating a poor fit…

Can anyone give me some thoughts about where to go next?

Is there any issue with my approach (e.g., identifying multicollinearity or specification error) or with my interpretation of the results?

How can I use the tolerance output to decide whether a categorical variable is correlated with the other variables and should be removed or transformed?

Too many questions..

Many, many thanks if someone can help me out..

Thanks.