Thread: Factor analysis, what type of data can I use?

  1. #1
    Points: 4,711, Level: 43
    Level completed: 81%, Points required for next Level: 39

    Location
    Amsterdam, The Netherlands
    Posts
    70
    Thanks
    20
    Thanked 9 Times in 9 Posts

    Factor analysis, what type of data can I use?




    Hello everyone,

    I have a social science dataset composed of circa 40 variables.

    The data types I am using are quite diverse. The variables (briefly put) are: age, gender, educational background (1-7 scale), 7-point perception scales measuring the degree of trust in institutions (Likert-like), and binary variables.

    It is a given that I will need multivariate analysis, and I am looking at factor analysis methods. I know that the best method to use for binary vars is logistic regression, but for the non-binary ones, I would like to use factor analysis.

    My worries currently are:

    1) Can I use factor analysis for all types of data?
    2) Can I use binary data in factor analysis?

    Any feedback or insight is much appreciated.

    Ramon

  2. #2
    Fortran must die
    Points: 58,790, Level: 100
    Level completed: 0%, Points required for next Level: 0
    noetsi's Avatar
    Posts
    6,532
    Thanks
    692
    Thanked 915 Times in 874 Posts

    Re: Factor analysis, what type of data can I use?

    I spent years working to figure out that answer, including lots of time here bothering people discussing it. While I don't believe there is a consensus on this issue, in general I believe most feel that you cannot use factor analysis with binary or categorical data (probably including Likert scales, which are formally ordinal) if you use the software defaults. That is because the defaults build the correlation matrix that EFA works from out of Pearson correlations, which assume interval data and will give distorted estimates if the data are not interval.

    The best option as far as I have been able to tell is to use polytomous correlations. SAS will do this with a macro, MPLUS will do it automatically (SPSS uses R to do this).
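
    To make the Pearson point concrete, here is a minimal sketch in plain Python (a simulation written for illustration, not output from SAS, MPLUS, or SPSS). It assumes two latent standard-normal variables correlated at 0.6, a made-up value, cuts each at its median to get 0/1 items, and compares Pearson's r before and after the cut. The function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_latent(rho, n, seed=1):
    """Draw n pairs from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return xs, ys


RHO = 0.6  # assumed latent correlation, chosen only for the demo
latent_x, latent_y = simulate_latent(RHO, 20000)

# Dichotomize at the median (0 for a standard normal) to mimic binary items.
bin_x = [1 if v > 0 else 0 for v in latent_x]
bin_y = [1 if v > 0 else 0 for v in latent_y]

r_latent = pearson(latent_x, latent_y)
r_binary = pearson(bin_x, bin_y)
expected = 2 / math.pi * math.asin(RHO)  # theory for a median split

print("Pearson on latent scores:", round(r_latent, 3))
print("Pearson on 0/1 items:    ", round(r_binary, 3))
print("Theory for 0/1 items:    ", round(expected, 3))
```

    For a median split the Pearson correlation of the 0/1 items should land near (2/pi)*asin(0.6), roughly 0.41, well below the latent 0.6. That shrinkage is the kind of distortion that tends to carry over into the loadings when the defaults are used on coarse items.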
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  3. The Following User Says Thank You to noetsi For This Useful Post:

    parsec2011 (07-02-2013)

  4. #3
    Location
    Amsterdam, The Netherlands

    Re: Factor analysis, what type of data can I use?

    noetsi,

    The polytomous correlation concept is new to me.
    I have read about it and how to get it in R and will be using it in my research.

    Thanks

  5. #4
    Fortran must die
    noetsi's Avatar

    Re: Factor analysis, what type of data can I use?

    It was new to me before I took SEM courses two years ago. It's scary how much data I have run without knowing things like that.

    Good luck in your research. If you want some articles that review the EFA literature I can send you some. Of course, as always in research, different writers disagree on these issues, but they still might be useful if you have not worked a lot with EFA or CFA.

  6. #5
    TS Contributor
    Points: 18,889, Level: 87
    Level completed: 8%, Points required for next Level: 461
    CowboyBear's Avatar
    Location
    New Zealand
    Posts
    2,062
    Thanks
    121
    Thanked 427 Times in 328 Posts

    Re: Factor analysis, what type of data can I use?

    Quote Originally Posted by parsec2011 View Post
    It is a given that I will need multivariate analysis, and I am looking at factor analysis methods. I know that the best method to use for binary vars is logistic regression, but for the non-binary ones, I would like to use factor analysis.
    Logistic regression and factor analysis have completely different goals though. It isn't quite clear from your post what you're trying to substantively achieve with your analysis.

    1) Can I use factor analysis for all types of data?
    2) Can I use binary data in factor analysis?
    There are quite flexible implementations for factor analysis nowadays, but like noetsi says you may not be able to use default settings (esp. in something like SPSS). I think by "polytomous correlation" noetsi means polychoric correlation though?

    If you use MPLUS and specify each variable's structure (e.g. binary, continuous, categorical, nominal), it'll automatically calculate the appropriate type of correlation coefficient for each pairwise correlation (e.g., polychoric, biserial, Pearson's, etc.). It then has a variety of estimation methods suitable for situations when you're not just dealing with continuous data.

  7. #6
    Fortran must die
    noetsi's Avatar

    Re: Factor analysis, what type of data can I use?

    Yeah, I meant polychoric. Boy, that was dumb on my part. I read an article on polytomous yesterday.

    MPLUS is awesome, if you are going to work with latent factors you should use it.

  8. #7
    Location
    Amsterdam, The Netherlands

    Re: Factor analysis, what type of data can I use?

    Quote Originally Posted by CowboyBear View Post
    Logistic regression and factor analysis have completely different goals though. It isn't quite clear from your post what you're trying to substantively achieve with your analysis.
    Hello,

    The main purpose of my study is to examine the influence that a range of social and political variables have on the likelihood of citizens turning up to a ballot box during elections. I have a comprehensive survey including many variables that can be categorized into three main groups:

    One response variable
    The act of voting, which is binarily coded, 1 for participants who voted in the last elections and 0 for those who abstained.

    Independent variables (two examples)

    -(x1) is the extent to which a survey participant trusts his/her government. The survey question simply asks: "To what extent do you trust your government?" The results are coded according to a 0 to 7 perceptions scale, ranging from "no trust at all" to "neutral" (midpoint) to "absolute trust".
    In my view, this variable is ordinal, but there is much debate about exactly where Likert scales belong.

    -(x2) measures whether a survey respondent has participated in a public protest in the last three years. The responses are also coded binarily; score 1 for those who have, else 0.

    There are many more independent variables in the study, but the vast majority are either binary, or perceptions based (Likert-like) scaled.

    My approach for the analysis has been to use logistic regression to examine how good a predictor x is of y, in the following way:

    x1 (ordinal) on y (binary)
    x2 (binary) on y (binary)
    x1 and x2 (together) on y (binary)

    I am unsure if mixing data types in this way in the context of logistic regression could compromise the accuracy of the results.

    Quote Originally Posted by CowboyBear View Post
    There are quite flexible implementations for factor analysis nowadays, but like noetsi says you may not be able to use default settings (esp. in something like SPSS). I think by "polytomous correlation" noetsi means polychoric correlation though?.
    Fancy names they give to these concepts. Yesterday I was googling it as "polychloric correlation" and kept wondering what my search had to do with chemistry. lol

  9. #8
    Fortran must die
    noetsi's Avatar

    Re: Factor analysis, what type of data can I use?

    The type of data for a predictor variable does not matter in logistic regression at all. So you can mix interval, dummy, and ordinal variables as predictors.
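
    As a rough illustration of mixing predictor types, here is a self-contained sketch that fits a logistic regression by Newton-Raphson on simulated data. Everything in it is an assumption made for the example: the variable names, the coefficients used to simulate turnout, and the choice to enter the 0-7 trust item as a single numeric predictor.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def fit_logit(rows, y, iterations=25):
    """Maximum-likelihood logistic regression via Newton-Raphson steps."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Hypothetical survey: trust on a 0-7 scale, protest as 0/1, voted as 0/1.
rng = random.Random(42)
TRUE_BETA = [-1.5, 0.3, 0.8]  # made-up values, used only to simulate data
rows, voted = [], []
for _ in range(2000):
    trust = rng.randint(0, 7)                 # ordinal predictor, entered as a number
    protest = 1 if rng.random() < 0.3 else 0  # binary (dummy) predictor
    x = [1.0, float(trust), float(protest)]   # leading 1.0 is the intercept
    p = sigmoid(sum(b * v for b, v in zip(TRUE_BETA, x)))
    rows.append(x)
    voted.append(1 if rng.random() < p else 0)

beta = fit_logit(rows, voted)
print("intercept, trust, protest:", [round(b, 2) for b in beta])
```

    Entering trust as one numeric column assumes each scale step shifts the log-odds by the same amount; dummy-coding the categories is the alternative if that seems too strong. In practice you would use a package routine rather than hand-rolled Newton steps.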

    I have always wondered why exactly they call it polychoric. I think that binary variables are actually addressed by tetrachoric correlations although the same software runs both.

    In statistics, polychoric correlation is a technique for estimating the correlation between two theorised normally distributed continuous latent variables, from two observed ordinal variables. Tetrachoric correlation is a special case of the polychoric correlation applicable when both observed variables are dichotomous. These names derive from the polychoric and tetrachoric series, mathematical expansions once, but no longer, used for estimation of these correlations.
    http://en.wikipedia.org/wiki/Polychoric_correlation

    Sorry about the confusion in the wording. John Uebersax has a really good discussion of the issue on his website.

  10. The Following User Says Thank You to noetsi For This Useful Post:

    parsec2011 (07-03-2013)

  11. #9
    Location
    Amsterdam, The Netherlands

    Re: Factor analysis, what type of data can I use?

    Quote Originally Posted by noetsi View Post
    It was new to me before I took SEM courses two years ago. It's scary how much data I have run without knowing things like that.

    Good luck in your research. If you want some articles that review the EFA literature I can send you some. Of course, as always in research, different writers disagree on these issues, but they still might be useful if you have not worked a lot with EFA or CFA.
    As I learn more about statistics, I wonder how I didn't know these things way before.
    Unfortunately, in social and political science we don't get very advanced statistical training, and the limitations show when you try to go one step further.
    It would be great if you could share that literature. Thanks a lot

  12. #10
    Fortran must die
    noetsi's Avatar

    Re: Factor analysis, what type of data can I use?

    I was in public administration research and I agree entirely.

    The document needs some clean up so it will probably be early next week before I can send it.

  13. #11
    Location
    Amsterdam, The Netherlands

    Re: Factor analysis, what type of data can I use?

    Quote Originally Posted by noetsi View Post
    I was in public administration research and I agree entirely.

    The document needs some clean up so it will probably be early next week before I can send it.
    Sure thanks. I also studied public admin.

  14. #12
    Fortran must die
    noetsi's Avatar

    Re: Factor analysis, what type of data can I use?

    You have my sympathy

  15. #13
    TS Contributor
    Points: 22,344, Level: 92
    Level completed: 99%, Points required for next Level: 6
    spunky's Avatar
    Location
    vancouver, canada
    Posts
    2,135
    Thanks
    166
    Thanked 537 Times in 431 Posts

    Re: Factor analysis, what type of data can I use?

    Quote Originally Posted by noetsi View Post
    I have always wondered why exactly they call it polychoric. I think that binary variables are actually addressed by tetrachoric correlations although the same software runs both
    from greek. poly = many
    choric = to cut or to divide

    (my thesis was on this thing so i feel like i have an advanced degree in anything polychoric)

    that's the main reason why 'tetrachoric' is used with binary data, because of the 4-cuts done on the continuous, latent bivariate distribution to obtain the binary responses. if you were to tabulate your responses in a contingency table, you would have 4 cells: (0,0), (0,1), (1,0) and (1,1)
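
    a small sketch of that tabulation, using only the Python standard library. the ten responses are invented for the example and the function names are hypothetical. it builds the 2x2 table and then reads off where each latent variable would have to be cut, assuming a standard normal latent scale, to reproduce the observed share of 0's.

```python
from collections import Counter
from statistics import NormalDist


def two_by_two(x, y):
    """Cross-tabulate two 0/1 lists into [[n00, n01], [n10, n11]]."""
    counts = Counter(zip(x, y))
    return [[counts[(0, 0)], counts[(0, 1)]],
            [counts[(1, 0)], counts[(1, 1)]]]


def thresholds(table):
    """Latent cut points implied by the marginal proportion of 0's.

    Assumes each latent variable is standard normal and that both
    response categories were actually observed.
    """
    n = sum(sum(row) for row in table)
    p_x0 = (table[0][0] + table[0][1]) / n  # share of x == 0
    p_y0 = (table[0][0] + table[1][0]) / n  # share of y == 0
    std = NormalDist()
    return std.inv_cdf(p_x0), std.inv_cdf(p_y0)


# Invented responses to two binary items.
x = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

table = two_by_two(x, y)
tau_x, tau_y = thresholds(table)
print("table:", table)
print("cut on x:", round(tau_x, 3), "cut on y:", round(tau_y, 3))
```

    the two cuts split the latent plane into the four regions that correspond to the four cells. the tetrachoric correlation is then the latent correlation that makes the probability mass of those regions match the cell counts as closely as possible.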
    for all your psychometric needs! https://psychometroscar.wordpress.com/about/

  16. #14
    Fortran must die
    noetsi's Avatar

    Re: Factor analysis, what type of data can I use?

    I should have guessed

    Do you know how the software calculates these correlations differently with binary data as compared to say a four point likert scale?

    One of the philosophical problems with polychoric correlations, which I naturally ignore, is that you have to assume there is a latent continuous variable behind it which you cannot know. In some cases this would not make much sense to assume, such as gender. Yet you can still run polychoric/tetrachoric correlations with these - and how much does that impact the accuracy of the results?

    Thankfully I am a data analyst, not a statistician, so I don't have to worry about such things.

    I just have to worry about staying out of the coming cataclysm between bots and raptors

  17. #15
    TS Contributor
    spunky's Avatar

    Re: Factor analysis, what type of data can I use?


    Quote Originally Posted by noetsi View Post
    I should have guessed
    at this point in my life i have come to realize that if a word is very weird and has lots of vowels, it comes from the greek

    Quote Originally Posted by noetsi View Post
    Do you know how the software calculates these correlations differently with binary data as compared to say a four point likert scale?
    most of the time they get estimated through maximum likelihood. Olsson (1979) (<-- yes, i can't believe i know this from memory) published the closed-form expressions in Psychometrika for the likelihood equations of this which, as long as you can assume a latent, bivariate normal distribution, can be easily extended to any arbitrarily large number of cut points. Jöreskog extended them to the case of the multivariate normal distribution and, with such, got all the fame and glory for being able to develop both the theory and implementation (through LISREL) of categorical data analysis for Structural Equation Modelling.
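
    for the curious, a minimal sketch of the maximum-likelihood idea in plain Python. it is a simplified two-step version written for illustration, not the code any package actually runs: the cut points are fixed from the margins first, then a one-dimensional search finds the latent correlation that best reproduces the cell counts. the cell probabilities are obtained by numerical integration rather than closed-form expressions, and all function names and example tables are made up. note that the binary case and the likert case go through exactly the same routine; only the number of cut points changes.

```python
import math
from statistics import NormalDist

STD = NormalDist()
LIMIT = 8.0  # finite stand-in for +/- infinity on the latent scale


def cuts(margin_counts):
    """Cut points implied by one variable's category counts.

    Assumes every category was observed at least once.
    """
    n = sum(margin_counts)
    cum, out = 0, [-LIMIT]
    for c in margin_counts[:-1]:
        cum += c
        out.append(STD.inv_cdf(cum / n))
    return out + [LIMIT]


def cell_prob(a1, a2, b1, b2, rho, steps=200):
    """P(a1 < X <= a2, b1 < Y <= b2) for a standard bivariate normal.

    Integrates over x with Simpson's rule, using the conditional
    normal distribution of Y given X = x.
    """
    s = math.sqrt(1.0 - rho * rho)

    def f(x):
        upper = STD.cdf((b2 - rho * x) / s)
        lower = STD.cdf((b1 - rho * x) / s)
        return STD.pdf(x) * (upper - lower)

    h = (a2 - a1) / steps
    total = f(a1) + f(a2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a1 + i * h)
    return total * h / 3.0


def log_lik(table, rho):
    """Multinomial log-likelihood of the table for a given rho."""
    row_cuts = cuts([sum(r) for r in table])
    col_cuts = cuts([sum(c) for c in zip(*table)])
    ll = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij:
                p = cell_prob(row_cuts[i], row_cuts[i + 1],
                              col_cuts[j], col_cuts[j + 1], rho)
                ll += n_ij * math.log(max(p, 1e-300))
    return ll


def polychoric(table, lo=-0.99, hi=0.99, tol=1e-4):
    """Golden-section search for the rho that maximizes log_lik."""
    g = (math.sqrt(5) - 1) / 2
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc, fd = log_lik(table, c), log_lik(table, d)
    while hi - lo > tol:
        if fc > fd:  # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = log_lik(table, c)
        else:        # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = log_lik(table, d)
    return (lo + hi) / 2


# Invented tables: a 2x2 (tetrachoric case) and a 3x3 (ordinal case).
tetra = polychoric([[200, 100], [100, 200]])
likert_r = polychoric([[30, 15, 5], [15, 30, 15], [5, 15, 30]])
print("2x2 estimate:", round(tetra, 3))
print("3x3 estimate:", round(likert_r, 3))
```

    the 2x2 table was picked so the answer can be checked by hand: both margins split 50/50, so both cuts sit at 0, and a latent correlation of 0.5 gives exactly one third of the mass in each of the (0,0) and (1,1) cells. the search should therefore return a value very close to 0.5.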

    you *COULD*, however, quote my thesis where i worked on the estimation of this correlation coefficient using Markov Chain Monte Carlo methods (so not Maximum Likelihood) and with bivariate log-normal latent distributions (so they can handle skewed data which tends to mess up the polychoric correlation) and get up my citation points through the use of the spunky method

    Quote Originally Posted by noetsi View Post
    One of the philosophical problems with polychoric correlations, which I naturally ignore, is that you have to assume there is a latent continuous variable behind it which you cannot know. In some cases this would not make much sense to assume, such as gender. Yet you can still run polychoric/tetrachoric correlations with these - and how much does that impact the accuracy of the results?
    well... i guess it impacts it in its totality? the computer does not have a mind of its own. if you give it vectors of 0's and 1's and ask for the tetrachoric correlation it is gonna give you an estimate of it. if you give it more and more data so that the computer has more information to work this out, it's going to give you a better estimate of the correlation between the underlying, continuous variables. however, if your 0's and 1's mean, i dunno, gender and pets then god knows how you would interpret "oh, women and cats are correlated at 0.8". i think you could maybe try and make the argument that women are more likely to have cats as pets (or something like that) but, in reality, this is more of a conceptual issue, not one of statistics or algorithms. if you're only concerned with accuracy as 'there is no bias in my parameter estimate' it works the same way it works with all basic maximum likelihood implementations: the larger the sample size, the more accurate (i.e. less biased) your estimate will be.

  18. The Following User Says Thank You to spunky For This Useful Post:

    parsec2011 (07-04-2013)
