Factor analysis, what type of data can I use?


parsec2011

Guest
#1
Hello everyone,

I have a social science dataset composed of circa 40 variables.

The data types I am using are quite diverse. Briefly put, the variables are: age, gender, educational background (1-7 scale), 7-point perception scales measuring the degree of trust in institutions (Likert-type), and binary variables.

It is a given that I will need multivariate analysis, and I am looking at factor analysis methods. I know that the best method to use for binary vars is logistic regression, but for the non-binary ones, I would like to use factor analysis.

My worries currently are:

1) Can I use factor analysis for all types of data?
2) Can I use binary data in factor analysis?

Any feedback or insight is much appreciated.

Ramon
 

noetsi

Fortran must die
#2
I spent years working to figure out that answer :p Including lots of time here bothering people discussing it. While I don't believe there is a consensus on this issue, in general I believe most feel that you cannot use factor analysis with binary or categorical data (probably including Likert scales, which are formally ordinal) if you use the software defaults. That is because the defaults build the correlation matrix EFA uses from Pearson correlations, which assume interval data and will generate errors if the data is not interval.

The best option as far as I have been able to tell is to use polytomous correlations. SAS will do this with a macro, MPLUS will do it automatically (SPSS uses R to do this).
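One way to see the problem with the defaults is a toy simulation (sketched here in Python/NumPy rather than any of the packages above; the latent correlation and cut points are invented): Pearson's r computed on coarsely categorized versions of two continuous variables comes out attenuated relative to the latent correlation, which is what polychoric-type correlations try to undo.

```python
import numpy as np

rng = np.random.default_rng(42)
rho = 0.7                                   # true latent correlation
xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=100_000)
x, y = xy.T

# chop each continuous variable into a crude 3-category "Likert" item
xc = np.digitize(x, [-0.5, 0.5])
yc = np.digitize(y, [-0.5, 0.5])

r_latent = np.corrcoef(x, y)[0, 1]          # close to 0.7
r_ordinal = np.corrcoef(xc, yc)[0, 1]       # noticeably smaller
print(round(r_latent, 2), round(r_ordinal, 2))
```

Running this, the categorized r lands well below the latent 0.7, which is exactly the attenuation a polychoric estimate is designed to correct.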
 

parsec2011

Guest
#3
noetsi,

The polytomous correlation concept is new to me.
I have read about it and how to get it in R and will be using it in my research.

Thanks
 

noetsi

Fortran must die
#4
It was new to me before I took SEM courses two years ago. It's scary how much data I have run without knowing things like that :(

Good luck in your research. If you want some articles that review the EFA literature I can send you some. Of course, as always in research, different writers disagree on these issues, but they still might be useful if you have not worked a lot with EFA or CFA.
 

CowboyBear

Super Moderator
#5
It is a given that I will need multivariate analysis, and I am looking at factor analysis methods. I know that the best method to use for binary vars is logistic regression, but for the non-binary ones, I would like to use factor analysis.
Logistic regression and factor analysis have completely different goals though. It isn't quite clear from your post what you're trying to substantively achieve with your analysis.

1) Can I use factor analysis for all types of data?
2) Can I use binary data in factor analysis?
There are quite flexible implementations for factor analysis nowadays, but like noetsi says you may not be able to use default settings (esp. in something like SPSS). I think by "polytomous correlation" noetsi means polychoric correlation though?

If you use MPLUS and specify each variable's structure (e.g. binary, continuous, categorical, nominal), it'll automatically calculate the appropriate type of correlation coefficient for each pairwise correlation (e.g., polychoric, biserial, Pearson's, etc.). It then has a variety of estimation methods suitable for situations where you're not just dealing with continuous data.
 

noetsi

Fortran must die
#6
yeah I meant polychoric. Boy that was dumb on my part. Read an article on polytomous yesterday :(

MPLUS is awesome, if you are going to work with latent factors you should use it.
 

parsec2011

Guest
#7
Logistic regression and factor analysis have completely different goals though. It isn't quite clear from your post what you're trying to substantively achieve with your analysis.
Hello,

The main purpose of my study is to examine the influence that a range of social and political variables have on the likelihood of citizens turning up to a ballot box during elections. I have a comprehensive survey including many variables that can be categorized into three main groups:

One response variable
The act of voting, which is binarily coded, 1 for participants who voted in the last elections and 0 for those who abstained.

Independent variables (two examples)

-(x1) is the extent to which a survey participant trusts his/her government. The survey question simply asks: "To what extent do you trust your government?" The results are coded on a 0-to-7 perception scale, ranging from "no trust at all" through "neutral" (midpoint) to "absolute trust".
In my view this variable is ordinal, but there is much debate about where exactly we should place Likert scales. :)

-(x2) measures whether a survey respondent has participated in a public protest in the last three years. The responses are also binary: 1 for those who have, 0 otherwise.

There are many more independent variables in the study, but the vast majority are either binary, or perceptions based (Likert-like) scaled.

My approach for the analysis has been to use logistic regression to examine how well x predicts y, in the following ways:

x1(binary) on y(binary)
x2(ordinal) on y(binary)
x1 and x2 (together) on y(binary)

I am unsure if mixing data types in this way in the context of logistic regression could compromise the accuracy of the results.

There are quite flexible implementations for factor analysis nowadays, but like noetsi says you may not be able to use default settings (esp. in something like SPSS). I think by "polytomous correlation" noetsi means polychoric correlation though?
Fancy names they give to these concepts. Yesterday I was googling it under polychloric correlation and kept on wondering what my search had anything to do with Chemistry. lol
 

noetsi

Fortran must die
#8
The type of data for a predictor variable does not matter in logistic regression at all. So you can mix interval, dummy, and ordinal variables as predictors.
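This is easy to convince yourself of with a simulation. A minimal sketch in Python/NumPy (invented coefficients, and plain gradient ascent rather than any packaged routine), with one 0-7 ordinal "trust" predictor and one binary "protested" predictor, echoing the x1/x2 setup above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.integers(0, 8, n)                  # ordinal 0-7 "trust" item
x2 = rng.integers(0, 2, n)                  # binary "protested" indicator
X = np.column_stack([np.ones(n), x1, x2]).astype(float)

beta_true = np.array([-1.5, 0.3, 0.8])      # invented coefficients
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.random(n) < p).astype(float)       # simulated voted / abstained

# fit the logistic regression by plain gradient ascent on the log-likelihood
beta = np.zeros(3)
for _ in range(20_000):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += 0.05 * X.T @ (y - mu) / n
print(np.round(beta, 2))                    # close to beta_true
```

The recovered coefficients land near the invented true values even though the predictors mix binary and ordinal codes; the likelihood makes no distributional assumption about the x's.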

I have always wondered why exactly they call it polychoric. I think that binary variables are actually addressed by tetrachoric correlations although the same software runs both.

In statistics, polychoric correlation is a technique for estimating the correlation between two theorised normally distributed continuous latent variables, from two observed ordinal variables. Tetrachoric correlation is a special case of the polychoric correlation applicable when both observed variables are dichotomous. These names derive from the polychoric and tetrachoric series, mathematical expansions once, but no longer, used for estimation of these correlations.
http://en.wikipedia.org/wiki/Polychoric_correlation

Sorry about the confusion in the wording. John Uebersax has a really good discussion of the issue on his website.
 

parsec2011

Guest
#9
It was new to me before I took SEM courses two years ago. It's scary how much data I have run without knowing things like that :(

Good luck in your research. If you want some articles that review the EFA literature I can send you some. Of course, as always in research, different writers disagree on these issues, but they still might be useful if you have not worked a lot with EFA or CFA.
As I learn more about statistics, I wonder how I didn't know these things long before.
Unfortunately, in social and political science we don't get very advanced statistical training, and its limitations show when you try to go one step further.
It would be great if you could share that literature. Thanks a lot :)
 

noetsi

Fortran must die
#10
I was in public administration research and I agree entirely.

The document needs some clean up so it will probably be early next week before I can send it.
 

spunky

Super Moderator
#13
I have always wondered why exactly they call it polychoric. I think that binary variables are actually addressed by tetrachoric correlations although the same software runs both
from greek. poly = many
choric = to cut or to divide

(my thesis was on this thing so i feel like i have an advanced degree in anything polychoric)

that's the main reason why 'tetrachoric' is used with binary data: because of the 4 cuts done on the continuous, latent bivariate distribution to obtain the binary responses. if you were to tabulate your responses in a contingency table, you would have 4 cells: (0,0), (0,1), (1,0) and (1,1)
 

noetsi

Fortran must die
#14
I should have guessed :p

Do you know how the software calculates these correlations differently with binary data as compared to, say, a four-point Likert scale?

One of the philosophical problems with polychoric correlations, which I naturally ignore, is that you have to assume there is a latent continuous variable behind the observed one, which you cannot know. In some cases, such as gender, this would not make much sense to assume. Yet you can still run polychoric/tetrachoric correlations with these - and how much does that impact the accuracy of the results?

Thankfully I am a data analyst, not a statistician, so I don't have to worry about such things :)

I just have to worry about staying out of the coming cataclysm between bots and raptors :(
 

spunky

Super Moderator
#15
I should have guessed :p
at this point in my life i have come to realize that if a word is very weird and has lots of vowels, it comes from the greek :D

Do you know how the software calculates these correlations differently with binary data as compared to say a four point likert scale?
most of the time they get estimated through maximum likelihood. Olsson (1979) (<-- yes, i can't believe i know this from memory) published the closed-form expressions for the likelihood equations in psychometrika, which, as long as you can assume a latent bivariate normal distribution, can be easily extended to any arbitrarily large number of cut points. Joreskog extended them to the case of the multivariate normal distribution and, with that, got all the fame and glory for being able to develop both the theory and the implementation (through LISREL) of categorical data analysis for Structural Equation Modelling.
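For the tetrachoric (binary) special case the likelihood is simple enough to sketch from scratch. Here is a rough stdlib-Python illustration of the idea, not Olsson's actual procedure: fix the thresholds at the normal quantiles of the observed margins, then pick the correlation that maximizes the multinomial likelihood of the four cells (a grid search and naive numerical integration stand in for a proper optimizer).

```python
from math import log
from statistics import NormalDist

N01 = NormalDist()  # standard normal

def bvn_cdf(a, b, rho, lo=-8.0, dx=0.01):
    """P(X < a, Y < b) for a standard bivariate normal with correlation rho,
    by integrating phi(x) * Phi((b - rho*x)/sqrt(1-rho^2)) over x in (lo, a)."""
    s = (1 - rho * rho) ** 0.5
    total, x = 0.0, lo
    while x < a:
        total += N01.pdf(x) * N01.cdf((b - rho * x) / s) * dx
        x += dx
    return total

def tetrachoric(n00, n01, n10, n11):
    """ML estimate of the latent correlation for a 2x2 table, by grid search.
    Thresholds are fixed at the normal quantiles of the observed margins."""
    n = n00 + n01 + n10 + n11
    tau_x = N01.inv_cdf((n00 + n01) / n)   # cut point on the row variable
    tau_y = N01.inv_cdf((n00 + n10) / n)   # cut point on the column variable
    best_rho, best_ll = 0.0, float("-inf")
    for k in range(-98, 99):
        rho = k / 100
        p00 = bvn_cdf(tau_x, tau_y, rho)   # cell probabilities under rho
        p01 = N01.cdf(tau_x) - p00
        p10 = N01.cdf(tau_y) - p00
        p11 = 1 - p00 - p01 - p10
        if min(p00, p01, p10, p11) <= 0:
            continue
        ll = n00 * log(p00) + n01 * log(p01) + n10 * log(p10) + n11 * log(p11)
        if ll > best_ll:
            best_rho, best_ll = rho, ll
    return best_rho

print(tetrachoric(40, 10, 10, 40))
```

For a symmetric table like (40, 10, 10, 40) this lands around 0.8, well above the Pearson (phi) correlation of 0.6 on the same table, which is the attenuation point again.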

you *COULD*, however, quote my thesis, where i worked on the estimation of this correlation coefficient using Markov Chain Monte Carlo methods (so not Maximum Likelihood) and with bivariate log-normal latent distributions (so they can handle skewed data, which tends to mess up the polychoric correlation), and get up my citation count through the use of the spunky method :D

One of the philisophical problems with polychoric correlations, which I naturally ignore, is that you have to assume there is a latent continuous variable behind it which you can not know. In some cases this would not make much sense to assume, such as gender. Yet you still can run polychoric/tetronic correlations with these - and how much does that impact the accuracy of the results?
well... i guess it impacts it in its totality? the computer does not have a mind of its own. if you give it vectors of 0's and 1's and ask for the tetrachoric correlation it is gonna give you an estimate of it. if you give it more and more data so that the computer has more information to work this out, it's going to give you a better estimate of the correlation between the underlying, continuous variables. however, if your 0's and 1's mean, i dunno, gender and pets then god knows how you would interpret "oh, women and cats are correlated at 0.8". i think you could maybe try and make the argument that women are more likely to have cats as pets (or something like that) but, in reality, this is more of a conceptual issue, not one of statistics or algorithms. if you're only concerned with accuracy as 'there is no bias in my parameter estimate' it works the same way it works with all basic maximum likelihood implementations: the larger the sample size, the more accurate (i.e. less biased) your estimate will be.
 

parsec2011

Guest
#16
I just read an article about the polychoric correlation coefficient. The author, J. Ekström, used Karl Pearson's smallpox recovery data (a 2x2 table of counts crossing recovered vs. died, coded 1/0, with vaccinated vs. unvaccinated, also coded 1/0) to illustrate the limitations of polychoric correlations applied to non-normally distributed data.

Using the chi-square method, it is well known that the vaccine is effective, with a p-value < 0.0001 (n = 2081).
However, the polychoric correlation coefficient, while significant, is far from 1: using R, the data yielded a coefficient of 0.60. I believe this divergence is due to the skewed nature of the sample, as it is composed of highly polarized, and thereby skewed, sets of binary data.

Any further articles, ideas, thesis arguments are welcome.

PS I am including a brief summary of the R script I used as it took a while to get the results due to a small misspelling issue in the R interface.

packages you need to install:
mvtnorm, sfsmisc, polycor

library(polycor)
# read the 2x2 table of counts; polychor() wants a matrix or two vectors,
# not a data frame
x <- as.matrix(read.csv("file_name.csv", header = FALSE))
polychor(x)
# with two columns of raw responses instead, use: polychor(x[, 1], x[, 2])


Reference:

(1)
http://www.google.nl/url?sa=t&rct=j...=7PkhzZ9rW0uOdvZmGhPf5w&bvm=bv.48705608,d.ZWU
 

noetsi

Fortran must die
#17
I don't understand how you can use logistic regression in the context of EFA. I have never seen logistic regression used for data reduction; it does not generate latent factors. If you can do that, that's amazing.

I would think M or S estimators would be even better for skewed data, but they cannot be used with a non-interval dependent variable.
 

parsec2011

Guest
#18
I don't understand how you can use logistic regression in the context of EFA. I have never seen logistic regression used for data reduction, it does not generate latent factors. If you can do that, that's amazing.
You're totally right. Sorry for the misunderstanding. The way I view this is that Exploratory Factor Analysis helps researchers identify relationships among variables, much as a simple correlation matrix does. It is a kind of first-step exploratory method. In my research its implementation is important, all the more so because I have 45 variables. I must say this is the first time I have conducted EFA; I just downloaded an article that will help me understand the method in depth.

Based on the results I get from EFA, I can then use multivariate techniques (logistic regression in my case) as a second research step, in order to ascertain the relationship and predictive power of the independent variables over a response variable. For instance, I could test how well an ordinal variable (i.e., a respondent's perception of their national government) predicts voting or not voting.
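That two-step idea can be sketched as follows (in Python with scikit-learn rather than SPSS/Mplus; note that sklearn's FactorAnalysis is a plain maximum-likelihood factor model on the raw numbers, so it does not use polychoric correlations for ordinal items, and all names, loadings, and dimensions here are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
# two latent traits (say, "institutional trust" and "political engagement")
latent = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, 10))
# ten observed 0-7 "survey items" driven by the latent traits plus noise
items = np.clip(np.round(latent @ loadings + rng.normal(size=(n, 10)) + 3.5), 0, 7)
# a binary outcome ("voted") driven by the first latent trait
voted = (latent[:, 0] + 0.5 * rng.normal(size=n)) > 0

# step 1: exploratory factor analysis, keeping two factors
scores = FactorAnalysis(n_components=2, random_state=0).fit_transform(items)
# step 2: logistic regression of the outcome on the factor scores
model = LogisticRegression().fit(scores, voted)
accuracy = model.score(scores, voted)
print(round(accuracy, 2))
```

The factor scores compress the ten items into two predictors, and the logistic regression then quantifies how well those compressed traits predict turnout, which is the data-reduction-then-regression sequence described above.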
 

noetsi

Fortran must die
#19
I still have not figured out how to add papers here. Here are some useful links.

This shows how to do polychoric correlations in SAS.

http://support.sas.com/kb/25/010.html

A brief discussion of this (the use of polychoric correlations) process including other software that uses it.

http://www.john-uebersax.com/stat/sem.htm

This stakes out an important difference between EFA and PCA (you will do the latter, I believe, if you use the SAS default for EFA, and probably other software's defaults as well). Not all agree with this view.

http://www2.sas.com/proceedings/sugi30/203-30.pdf

A list of assumptions.
http://en.wikiversity.org/wiki/Exploratory_factor_analysis/Assumptions

http://sites.stat.psu.edu/~ajw13/stat505/fa06/17_factor/03_factor_assump.html

General articles on EFA methods (summaries of the state of the art to some extent).

http://pareonline.net/pdf/v10n7.pdf

http://mvint.usbmed.edu.co:8002/ojs/index.php/web/article/viewFile/464/605

http://www.bama.ua.edu/~jcsenkbeil/gy523/Factor Analysis.pdf

http://www.cob.unt.edu/slides/paswan/busi6280/Z-Conway_Huffcutt.pdf

http://psych.unl.edu/psycrs/948_2011/2b_EFA_PCA.pdf