# Thread: Factor analysis, what type of data can I use?

1. ## Factor analysis, what type of data can I use?

Hello everyone,

I have a social science dataset composed of circa 40 variables.

The data types I am using are quite diverse. The variables (briefly put) are: age, gender, educational background (1-7 scale), 7-point perception scales measuring the degree of trust in institutions (Likert-like), and binary variables.

It is a given that I will need multivariate analysis, and I am looking at factor analysis methods. I know that the best method to use for binary vars is logistic regression, but for the non-binary ones, I would like to use factor analysis.

My worries currently are:

1) Can I use factor analysis for all types of data?
2) Can I use binary data in factor analysis?

Any feedback or insight is much appreciated.

Ramon

2. ## Re: Factor analysis, what type of data can I use?

I spent years working to figure out that answer, including lots of time here bothering people discussing it. While I don't believe there is a consensus on this issue, in general I believe most feel that you cannot use factor analysis with binary or categorical data (probably including Likert scales, which are formally ordinal) if you use the software defaults. That is because the correlation matrix EFA relies on is built by default from Pearson correlations, which assume interval data and can give misleading results if the data are not interval.

The best option as far as I have been able to tell is to use polytomous correlations. SAS will do this with a macro, MPLUS will do it automatically (SPSS uses R to do this).

3. ## The Following User Says Thank You to noetsi For This Useful Post:

parsec2011 (07-02-2013)

4. ## Re: Factor analysis, what type of data can I use?

noetsi,

The polytomous correlation concept is new to me.
I have read about it and how to get it in R and will be using it in my research.

Thanks

5. ## Re: Factor analysis, what type of data can I use?

It was new to me before I took SEM courses two years ago. It's scary how much data I have run without knowing things like that.

Good luck in your research. If you want some articles that review the EFA literature I can send you some. Of course, as always in research, different writers disagree on these issues, but they still might be useful if you have not worked a lot with EFA or CFA.

6. ## Re: Factor analysis, what type of data can I use?

Originally Posted by parsec2011
It is a given that I will need multivariate analysis, and I am looking at factor analysis methods. I know that the best method to use for binary vars is logistic regression, but for the non-binary ones, I would like to use factor analysis.
Logistic regression and factor analysis have completely different goals though. It isn't quite clear from your post what you're trying to substantively achieve with your analysis.

1) Can I use factor analysis for all types of data?
2) Can I use binary data in factor analysis?
There are quite flexible implementations for factor analysis nowadays, but like noetsi says you may not be able to use default settings (esp. in something like SPSS). I think by "polytomous correlation" noetsi means polychoric correlation though?

If you use MPLUS and specify each variable's structure (e.g. binary, continuous, categorical, nominal), it'll automatically calculate the appropriate type of correlation coefficient for each pairwise correlation (e.g., polychoric, biserial, Pearson's, etc.). It then has a variety of estimation methods suitable for situations when you're not just dealing with continuous data.
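As a rough illustration of the pairing idea described above, here is a minimal Python sketch that names the correlation conventionally used for each pair of measurement levels. The function name `correlation_type`, the three level labels, and the example variables are all assumptions made for illustration; this is not MPLUS's actual logic or API.

```python
def correlation_type(type_a: str, type_b: str) -> str:
    """Name the correlation conventionally paired with two measurement levels."""
    allowed = {"binary", "ordinal", "continuous"}
    if type_a not in allowed or type_b not in allowed:
        raise ValueError(f"levels must be one of {sorted(allowed)}")
    pair = frozenset((type_a, type_b))
    if pair == {"continuous"}:
        return "pearson"
    if pair == {"binary"}:
        return "tetrachoric"
    if "continuous" in pair:
        # One continuous variable and one categorical variable.
        return "biserial" if "binary" in pair else "polyserial"
    # ordinal-ordinal or binary-ordinal
    return "polychoric"


# Hypothetical variables loosely modelled on the survey in this thread.
levels = {"age": "continuous", "trust": "ordinal", "protest": "binary"}
names = list(levels)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, correlation_type(levels[a], levels[b]))
```

Note that every non-Pearson entry in this table relies on the assumption of a continuous, normally distributed latent variable behind each categorical item.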

7. ## Re: Factor analysis, what type of data can I use?

Yeah, I meant polychoric. Boy, that was dumb on my part. I read an article on polytomous yesterday.

MPLUS is awesome, if you are going to work with latent factors you should use it.

8. ## Re: Factor analysis, what type of data can I use?

Originally Posted by CowboyBear
Logistic regression and factor analysis have completely different goals though. It isn't quite clear from your post what you're trying to substantively achieve with your analysis.
Hello,

The main purpose of my study is to examine the influence that a range of social and political variables have on the likelihood of citizens turning up to a ballot box during elections. I have a comprehensive survey including many variables that can be categorized into three main groups:

One response variable
The act of voting, which is binarily coded, 1 for participants who voted in the last elections and 0 for those who abstained.

Independent variables (two examples)

-(x1) is the extent to which a survey participant trusts his/her government. The survey question simply asks: "To what extent do you trust your government?" The results are coded according to a 0 to 7 perceptions scale, ranging from "no trust at all" to "neutral" (midpoint) to "absolute trust".
In my view, this variable is ordinal, but there is much debate about exactly how Likert scales should be classified.

-(x2) measures whether a survey respondent has participated in a public protest in the last three years. The responses are also coded binarily; score 1 for those who have, else 0.

There are many more independent variables in the study, but the vast majority are either binary or perception-based (Likert-like) scales.

My approach for the analysis has been to use logistic regression to examine how good a predictor x is of y in the following way:

x1(ordinal) on y(binary)
x2(binary) on y(binary)
x1 and x2 (together) on y(binary)

I am unsure if mixing data types in this way in the context of logistic regression could compromise the accuracy of the results.

Originally Posted by CowboyBear
There are quite flexible implementations for factor analysis nowadays, but like noetsi says you may not be able to use default settings (esp. in something like SPSS). I think by "polytomous correlation" noetsi means polychoric correlation though?
Fancy names they give to these concepts. Yesterday I was googling it under polychloric correlation and kept wondering what my search had to do with chemistry. lol

9. ## Re: Factor analysis, what type of data can I use?

The type of data for a predictor variable does not matter in logistic regression at all. So you can mix interval, dummy, and ordinal variables as predictors.
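To make the point about mixed predictors concrete, here is a minimal standard-library Python sketch that fits a logistic regression with one binary and one ordinal-scored predictor. Everything in it is an assumption for illustration: the data are simulated, the "true" coefficients are made up, and the helper names `solve` and `fit_logit` are hypothetical. In practice you would use a statistics package rather than this hand-rolled Newton-Raphson loop.

```python
import math
import random


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logit(rows, y, iterations=15):
    """Newton-Raphson fit; each row must already include the intercept term."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(rows, y):
            eta = sum(b * x for b, x in zip(beta, row))
            eta = max(-30.0, min(30.0, eta))  # guard against overflow in exp
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * row[i] * row[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta


# Simulated survey: all numbers below are invented for this sketch.
random.seed(1)
true_beta = [-1.5, 0.8, 0.3]  # intercept, protest (binary), trust (0 to 7 score)
rows, voted = [], []
for _ in range(2000):
    protest = 1.0 if random.random() < 0.3 else 0.0
    trust = float(random.randint(0, 7))
    x = [1.0, protest, trust]
    p_vote = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(true_beta, x))))
    rows.append(x)
    voted.append(1 if random.random() < p_vote else 0)

beta = fit_logit(rows, voted)
print("estimated intercept, protest, trust:", [round(b, 2) for b in beta])
```

One design choice worth flagging: entering the trust score as a single number assumes each step on the scale shifts the log-odds by the same amount. If that seems too strong, the ordinal variable can be dummy coded instead, at the cost of more parameters.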

I have always wondered why exactly they call it polychoric. I think that binary variables are actually addressed by tetrachoric correlations although the same software runs both.

In statistics, polychoric correlation is a technique for estimating the correlation between two theorised normally distributed continuous latent variables, from two observed ordinal variables. Tetrachoric correlation is a special case of the polychoric correlation applicable when both observed variables are dichotomous. These names derive from the polychoric and tetrachoric series, mathematical expansions once, but no longer, used for estimation of these correlations.
http://en.wikipedia.org/wiki/Polychoric_correlation

Sorry about the confusion in the wording. John Uebersax has a really good discussion of the issue on his website.

10. ## The Following User Says Thank You to noetsi For This Useful Post:

parsec2011 (07-03-2013)

11. ## Re: Factor analysis, what type of data can I use?

Originally Posted by noetsi
It was new to me before I took SEM courses two years ago. It's scary how much data I have run without knowing things like that.

Good luck in your research. If you want some articles that review the EFA literature I can send you some. Of course, as always in research, different writers disagree on these issues, but they still might be useful if you have not worked a lot with EFA or CFA.
Unfortunately, in social and political science we don't get very advanced statistical training, and its limitations show when we try to go one step further.
It would be great if you could share that literature. Thanks a lot.

12. ## Re: Factor analysis, what type of data can I use?

I was in public administration research and I agree entirely.

The document needs some clean up so it will probably be early next week before I can send it.

13. ## Re: Factor analysis, what type of data can I use?

Originally Posted by noetsi
I was in public administration research and I agree entirely.

The document needs some clean up so it will probably be early next week before I can send it.
Sure thanks. I also studied public admin.

14. ## Re: Factor analysis, what type of data can I use?

You have my sympathy

15. ## Re: Factor analysis, what type of data can I use?

Originally Posted by noetsi
I have always wondered why exactly they call it polychoric. I think that binary variables are actually addressed by tetrachoric correlations although the same software runs both
from greek. poly = many
choric = to cut or to divide

(my thesis was on this thing so i feel like i have an advanced degree in anything polychoric)

that's the main reason why 'tetrachoric' is used with binary data, because of the 4-cuts done on the continuous, latent bivariate distribution to obtain the binary responses. if you were to tabulate your responses in a contingency table, you would have 4 cells: (0,0), (0,1), (1,0) and (1,1)
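The four-cell picture above can be turned into a small worked example. The following is a minimal standard-library Python sketch, not the routine any particular package uses: it takes the two thresholds from the table margins, then finds by bisection the latent correlation whose bivariate normal probability for the (0,0) cell matches the observed proportion. The function names and the example table are assumptions for illustration, and the sketch needs both categories of each variable to be observed.

```python
import math
from statistics import NormalDist


def bvn_cdf(h, k, rho, steps=400):
    """P(X <= h, Y <= k) for a standard bivariate normal with correlation rho.

    Uses Phi(h) * Phi(k) plus the integral of the bivariate density over the
    correlation from 0 to rho, evaluated with Simpson's rule (steps is even).
    """
    nd = NormalDist()

    def density(r):
        q = 1.0 - r * r
        expo = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * q)
        return math.exp(expo) / (2.0 * math.pi * math.sqrt(q))

    width = rho / steps
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * width)
    return nd.cdf(h) * nd.cdf(k) + total * width / 3.0


def tetrachoric(table):
    """Tetrachoric correlation for table[i][j] = count of (x = i, y = j)."""
    (n00, n01), (n10, n11) = table
    n = n00 + n01 + n10 + n11
    nd = NormalDist()
    h = nd.inv_cdf((n00 + n01) / n)  # cut point on the latent x: P(x = 0)
    k = nd.inv_cdf((n00 + n10) / n)  # cut point on the latent y: P(y = 0)
    target = n00 / n
    lo, hi = -0.999, 0.999
    for _ in range(40):  # the (0,0) probability increases with rho
        mid = (lo + hi) / 2.0
        if bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def phi(table):
    """Pearson correlation computed directly on the 0/1 codes."""
    (n00, n01), (n10, n11) = table
    den = math.sqrt((n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11))
    return (n00 * n11 - n01 * n10) / den


example = [[40, 10], [10, 40]]  # made-up counts
print("phi:", round(phi(example), 3))
print("tetrachoric:", round(tetrachoric(example), 3))
```

For this made-up table the Pearson (phi) value is 0.6 while the tetrachoric estimate is about 0.81, which illustrates why default Pearson correlations tend to understate the association between the latent variables when the items are binary.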

16. ## Re: Factor analysis, what type of data can I use?

I should have guessed

Do you know how the software calculates these correlations differently with binary data as compared to say a four point Likert scale?

One of the philosophical problems with polychoric correlations, which I naturally ignore, is that you have to assume there is a latent continuous variable behind it which you cannot know. In some cases this would not make much sense to assume, such as gender. Yet you can still run polychoric/tetrachoric correlations with these - and how much does that impact the accuracy of the results?

Thankfully I am a data analyst, not a statistician, so I don't have to worry about such things.

I just have to worry about staying out of the coming cataclysm between bots and raptors

17. ## Re: Factor analysis, what type of data can I use?

Originally Posted by noetsi
I should have guessed
at this point in my life i have come to realize that if a word is very weird and has lots of vowels, it comes from the greek

Originally Posted by noetsi
Do you know how the software calculates these correlations differently with binary data as compared to say a four point Likert scale?
most of the time they get estimated through maximum likelihood. Olsson (1979) (<-- yes, i can't believe i know this from memory) published the closed-form expressions in Psychometrika for the likelihood equations of this which, as long as you can assume a latent, bivariate normal distribution, can be easily extended to any arbitrarily large number of cut points. Jöreskog extended them to the case of the multivariate normal distribution and, with that, got all the fame and glory for being able to develop both the theory and implementation (through LISREL) of categorical data analysis for Structural Equation Modelling.
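For readers who want to see what is being maximized, here is a sketch of the usual log-likelihood for two ordinal items with r and s categories, written in my own notation rather than copied from any paper. The symbols are assumptions: n_ij is the count in cell (i, j), a_i and b_j are the cut points on the two latent variables (with a_0 = b_0 = -∞ and a_r = b_s = +∞), and Φ₂ is the standard bivariate normal CDF with correlation ρ.

```latex
\ell(\rho, a, b) = \sum_{i=1}^{r} \sum_{j=1}^{s} n_{ij} \,\log \pi_{ij},
\qquad
\pi_{ij} = \Phi_2(a_i, b_j; \rho) - \Phi_2(a_{i-1}, b_j; \rho)
         - \Phi_2(a_i, b_{j-1}; \rho) + \Phi_2(a_{i-1}, b_{j-1}; \rho)
```

With r = s = 2 there is a single cut point per variable and four cells, which is the tetrachoric case; a four point Likert item simply adds more cut points to the same expression.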

you *COULD*, however, quote my thesis where i worked on the estimation of this correlation coefficient using Markov Chain Monte Carlo methods (so not Maximum Likelihood) and with bivariate log-normal latent distributions (so they can handle skewed data which tends to mess up the polychoric correlation) and get up my citation points through the use of the spunky method

Originally Posted by noetsi
One of the philosophical problems with polychoric correlations, which I naturally ignore, is that you have to assume there is a latent continuous variable behind it which you cannot know. In some cases this would not make much sense to assume, such as gender. Yet you can still run polychoric/tetrachoric correlations with these - and how much does that impact the accuracy of the results?
well... i guess it impacts it in its totality? the computer does not have a mind of its own. if you give it vectors of 0's and 1's and ask for the tetrachoric correlation it is gonna give you an estimate of it. if you give it more and more data so that the computer has more information to work this out, it's going to give you a better estimate of the correlation between the underlying, continuous variables. however, if your 0's and 1's mean, i dunno, gender and pets then god knows how you would interpret "oh, women and cats are correlated at 0.8". i think you could maybe try and make the argument that women are more likely to have cats as pets (or something like that) but, in reality, this is more of a conceptual issue, not one of statistics or algorithms. if you're only concerned with accuracy as 'there is no bias in my parameter estimate' it works the same way it works with all basic maximum likelihood implementations: the larger the sample size, the more accurate (i.e. less biased) your estimate will be.

18. ## The Following User Says Thank You to spunky For This Useful Post:

parsec2011 (07-04-2013)
