Confusions regarding Regression, Multicollinearity and Factor Analysis

I am currently working with a data set that contains about 26 IVs measured on almost every scale of measurement (binary, nominal, ordinal and interval). There are strong reasons to suspect that some of these variables are highly correlated, while others may be only weakly related to any other IV.

I came across some great suggestions on this site for resolving this problem (the useful advice was to use optimally scaled variables in the FA procedure and to use the derived factor scores as the IVs). But because I am inexperienced in this field, I need some expert advice on the following issues:

How should I check if Multicollinearity really exists?

I am not sure how to check for multicollinearity with such heterogeneous data. I could calculate a heterogeneous correlation matrix (or Spearman's rank correlations) by somehow forcing myself to treat the nominal variables as ordinal, but even if I do, below what value of the correlation coefficient can multicollinearity be ignored? I am also not sure whether this would give any insight at all, since it would not give me anything like a VIF measure!

Should I take only those variables for a FA which are highly correlated?

Say I can find two sets of variables (one containing 8 IVs and another containing 4 IVs) that are quite highly correlated within each set. Should I then use only those 12 variables in the FA and derive factor scores for the two factors to use as IVs? My intention is to use the other 14 variables separately as IVs along with the two derived scores. I am unsure whether I should instead use all 26 variables in the FA in this scenario. Remember that in that case my factor scores would also be weighted by the other 14 unrelated variables!

Is there any problem with categorizing a proportion-type DV for an ordinal logistic regression?

I've actually found people using logistic regression instead. But I want to mention that, unfortunately, I don't know the number of cases (or trials) from which each proportion was calculated, so I cannot use the number of trials as weights in the logistic regression. In that case a logistic regression may not be accurate enough. So, as I only know the proportions, would it be reasonable to categorize them by a median split or a quartile split, so that I can use the result as the DV in a logistic or an ordinal logistic regression?

Thank you for reading this thread patiently; I am hoping for some expert advice.



Less is more. Stay pure. Stay poor.
You can generate the VIF and tolerance statistics using regression. There are some standard cutoffs people use, and yes, some people convert the categorical data into numeric form to check variables, when appropriate.
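
A minimal sketch of the regression route, assuming the predictors have already been coded numerically (nominal variables as dummy columns, for instance). Each column is regressed on all the others, and then VIF = 1 / (1 - R²) and tolerance = 1 / VIF. The function name `vif_and_tolerance` and the simulated data are illustrative only and do not come from this thread. Commonly quoted rules of thumb flag a VIF above 5 or 10, but these are conventions rather than hard thresholds.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return (vif, tolerance) arrays with one entry per column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vif = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vif[j] = 1.0 / (1.0 - r2)
    return vif, 1.0 / vif

# Simulated (hypothetical) predictors: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

vif, tol = vif_and_tolerance(np.column_stack([x1, x2, x3]))
print("VIF:", vif.round(2))
print("Tolerance:", tol.round(3))
```

In this toy setup the first two columns should show large VIFs and the third a VIF close to 1. With real mixed-scale data, how the categorical variables are coded will affect the numbers.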

Factor analysis (it can be complicated and daunting, but keep investigating) and Cronbach's alpha can help you understand the groupings or constructs represented by the variables. Looking at these together with the VIF may help you decide which variables to remove or analyze. It depends on what your goals and aims are.
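
For reference, Cronbach's alpha for a candidate group of k items is k / (k - 1) × (1 - sum of item variances / variance of the summed score). Here is a rough sketch, assuming the items are numeric and scored in the same direction; `cronbach_alpha` and the toy data are hypothetical names and values used only for illustration.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_cases, k_items) array of numeric item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # sample variance of the summed score
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)

# Toy data: three items driven by one shared signal plus a little noise.
rng = np.random.default_rng(1)
signal = rng.normal(size=300)
related = np.column_stack([signal + 0.3 * rng.normal(size=300) for _ in range(3)])

alpha_related = cronbach_alpha(related)
print(round(alpha_related, 3))
```

A high alpha for a set of variables suggests they move together as one construct, which is the kind of grouping that factor analysis would also try to pick up.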

Your last idea could possibly work; it comes back to what you are truly trying to do.
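
If you do go the categorizing route, a quartile split could be sketched as below using only the Python standard library. The helper name `quartile_category` and the example proportions are made up for illustration; the cut points come from `statistics.quantiles` with its default method, which is only one of several quantile definitions.

```python
from bisect import bisect_right
from statistics import quantiles

def quartile_category(proportions):
    """Map each proportion to an ordinal code 0..3 based on the sample quartiles."""
    cuts = quantiles(proportions, n=4)  # three cut points: Q1, median, Q3
    # bisect_right counts how many cut points are <= the value.
    return [bisect_right(cuts, p) for p in proportions]

# Hypothetical proportions, not data from this thread.
props = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
codes = quartile_category(props)
print(codes)
```

The resulting codes can serve as the DV in an ordinal logistic regression, and a median split is the same idea with a single cut point. Keep in mind that any such binning discards the variation within each category.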


No cake for spunky
Generally you want as many variables as you can get in factor analysis; the method works better as the number of variables increases. Also, excluding variables can cause you to miss latent factors. I am not sure how well EFA actually works with just 14 or 12 variables. However, you appear to be using EFA for a purpose other than the ones I have commonly seen, so my comments may not apply in your case.