Polychoric correlation (assumptions)

Martin Marko

Member
Hi, are there any assumptions for estimating a polychoric (tetrachoric) correlation between 0/1 dichotomous variables (for example, about the ratio of 0s to 1s)?

Thank you!
M.

spunky

Can't make spagetti
the main assumption is that a bivariate (or multivariate) normal distribution underlies the discrete observations. so the variable that you are studying is in reality continuous, but it becomes discretized by the very act of measuring it.

the more that latent distribution deviates from normality, the more biased the estimates of the polychoric correlation become.
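
to make the assumption concrete for the 0/1 case, here is a minimal Python sketch (standard library only) of a tetrachoric estimate from a 2x2 table of counts. the function names `bvn_cdf` and `tetrachoric` and the example counts are made up for illustration. the sketch fixes the two thresholds from the marginal proportions and then searches for the latent correlation whose bivariate normal probability matches the observed (0, 0) cell, using the identity Phi2(h, k; rho) = Phi(h) * Phi(k) + integral from 0 to rho of the bivariate normal density at (h, k). it assumes the latent normality discussed above and is not a substitute for a tested routine.

```python
import math
from statistics import NormalDist


def bvn_cdf(tau_x, tau_y, rho, steps=200):
    """P(X <= tau_x, Y <= tau_y) for a standard bivariate normal with correlation rho."""

    def density(r):
        # Bivariate normal density at (tau_x, tau_y) when the correlation is r.
        one_minus = 1.0 - r * r
        expo = -(tau_x ** 2 - 2.0 * r * tau_x * tau_y + tau_y ** 2) / (2.0 * one_minus)
        return math.exp(expo) / (2.0 * math.pi * math.sqrt(one_minus))

    # Simpson's rule on [0, rho]; steps must be even. A negative rho gives a
    # negative step width, which handles the sign of the integral.
    width = rho / steps
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * width)

    nd = NormalDist()
    return nd.cdf(tau_x) * nd.cdf(tau_y) + total * width / 3.0


def tetrachoric(n00, n01, n10, n11, tol=1e-8):
    """Tetrachoric correlation from 2x2 counts (n_ij: first item = i, second item = j).

    Assumes every marginal proportion is strictly between 0 and 1.
    """
    n = n00 + n01 + n10 + n11
    nd = NormalDist()
    tau_x = nd.inv_cdf((n00 + n01) / n)  # threshold so that P(X <= tau_x) = P(first item = 0)
    tau_y = nd.inv_cdf((n00 + n10) / n)  # same for the second item
    target = n00 / n                     # observed proportion in the (0, 0) cell

    # bvn_cdf is increasing in rho, so plain bisection is enough.
    lo, hi = -0.999, 0.999
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bvn_cdf(tau_x, tau_y, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


print(round(tetrachoric(200, 100, 100, 200), 3))
```

for these counts both items are split at the median, where P(0, 0) = 1/4 + arcsin(rho) / (2 * pi), so a (0, 0) proportion of 1/3 corresponds to a latent correlation of 0.5 and the estimate should come out very close to that. the ordinary Pearson (phi) correlation on the same table is 1/3.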

noetsi

No cake for spunky
I think that assumption is a central difference between point biserial and polychoric correlations, as the former does not assume a latent continuous variable. It is unclear what difference this assumption makes. Some treatments of logistic regression assume the same and others do not, but the analysis and results are the same. Moreover, this assumption of a latent continuous variable can never be verified in practice, and it is not clear to me if it would actually influence the results if it were not true. It seems as much philosophical as empirical.

spunky

Can't make spagetti
Moreover, this assumption of a latent continuous variable can never be verified in practice
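i guess it depends on what you mean by "this assumption", because there exists a chi-square test that you can run on the contingency table of observed probabilities of the items to see if the assumption of underlying normality is tenable or not. so yes, there are ways to check that aspect of it. (one caveat: for two 0/1 items the 2x2 table has three free cell probabilities and the model has three parameters, two thresholds and one correlation, so there are no degrees of freedom left for the test. it only works when at least one item has three or more categories.)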

if it would actually influence the results if it were not true.
i think i mentioned already that the degree to which the latent distribution departs from normality tends to bias the estimates of the polychoric correlation coefficient, which is understandable. if the probabilities in the contingency table do not match those obtained from integrating over the bi- (or multi-) variate normal distribution function, you're not going to end up with a polychoric correlation estimate that matches the one between the latent, continuous variables.
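
written out for two 0/1 items, the matching condition looks roughly like this. the symbols are my own notation, not from any particular textbook: \tau_x and \tau_y are the thresholds, \pi_{ij} the cell probabilities, and \rho the latent correlation.

```latex
\begin{align*}
\tau_x &= \Phi^{-1}(\pi_{00} + \pi_{01}), \qquad
\tau_y = \Phi^{-1}(\pi_{00} + \pi_{10}), \\
\pi_{00} &= \Phi_2(\tau_x, \tau_y; \rho)
  = \int_{-\infty}^{\tau_x} \int_{-\infty}^{\tau_y}
    \frac{1}{2\pi\sqrt{1-\rho^2}}
    \exp\!\left( -\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)} \right) dy \, dx .
\end{align*}
```

the tetrachoric estimate is the value of \rho that makes the right-hand side equal to the observed proportion in the (0, 0) cell. if the latent variables are not bivariate normal, the true cell probabilities come from a different integral, so solving this equation returns a \rho that need not equal the actual latent correlation.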

noetsi

No cake for spunky
It was always stressed to me that it was impossible to ever know for sure that a latent variable existed, or whether it had a specific distribution. Certain elements can suggest it, and so can SEM, but that is not proof it exists. But that may simply be a quibble.

It is interesting that there is such a test.

Dason

It was always stressed to me that it was impossible to ever know for sure that a latent variable existed.

What about cases where the binary variable really is the result of transforming a continuous response? Something like:

Obese = 1 if BMI > 30, 0 otherwise

Clearly each observation came from looking at a person's BMI and then just categorizing it.
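
A rough simulation of that situation, as a sketch only: two continuous variables with a known correlation are cut at a threshold, and the ordinary Pearson correlation is computed on the resulting 0/1 codes. Everything here is hypothetical (the latent correlation of 0.5, the standardized cutoffs of 0.0 and 1.5 standing in for something like "BMI > 30", the seed, and the helper names); it uses only the Python standard library.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation; on 0/1 data this is the phi coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(rho=0.5, n=50000, seed=1):
    """Draw n pairs from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return xs, ys


def dichotomize(values, cutoff):
    """1 if the continuous value exceeds the cutoff, 0 otherwise."""
    return [1 if v > cutoff else 0 for v in values]


xs, ys = simulate()
r_latent = pearson(xs, ys)
r_median = pearson(dichotomize(xs, 0.0), dichotomize(ys, 0.0))
r_extreme = pearson(dichotomize(xs, 1.5), dichotomize(ys, 1.5))

print("continuous variables:", round(r_latent, 3))
print("cut at the median:   ", round(r_median, 3))
print("cut at 1.5 SD:       ", round(r_extreme, 3))
```

For a median split of bivariate normal data the correlation between the 0/1 codes is (2/pi) * arcsin(rho) in theory, which is 1/3 when rho is 0.5, and it shrinks further as the cutoff moves into the tail. So the Pearson correlation on the binary codes depends on the ratio of 0s to 1s, which is the gap the tetrachoric correlation tries to close when the normality assumption holds.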

spunky

Can't make spagetti
It was always stressed to me that it was impossible to ever know for sure that a latent variable existed.
well, Dason ninja'd my post but it was essentially something along those lines. the difference lies in how the latent variable comes to be. the latent variables you and i are used to are completely theory-driven, made-up, socially-agreed constructs like "attitudes" or "personality". but there are many other ones. for instance, in epidemiology, you can have a latent variable that is a 0/1 indicator of whether a certain group of people carry a disease or not. you can model the probability of carrying the disease and, eventually, the "latent" variable will become "manifest", which helps show whether or not your epidemiological model was correct.

Or if it had a specific distribution. Certain elements can suggest it, so can SEM, but that is not proof it exists. But that may simply be a quibble.
well, for your and my cases, the characteristics of a latent variable depend on whether or not it is tenable to assume it exists. but (and i think you've said this over and over again, and i totally agree with you) you cannot prove anything with Statistics in the Social Sciences. i mean, you can prove plenty of stuff within the mathematical framework where Statistics exists (lots of theorems out there) but you can never be sure with absolute certainty that something exists or causes something else. best example that comes to mind is the tobacco industry. how many years of collecting evidence have gone by and you can still get sued if you go on record saying "smoking causes cancer"? you can say it's strongly associated with cancer or it's linked to cancer or that it may cause cancer. but you cannot say it causes cancer. years and years have gone by where we've seen the evidence piling up towards "proving" that smoking causes cancer, but i don't think you'll ever be able to say it (without getting a big fat lawsuit on your hands).

if i were to tell you that the 1-Factor model of intelligence is true, would you believe me? maybe after years and years of mounting evidence favouring the 1-Factor model, you would be more inclined to think that maybe, possibly it is true... but still, i don't think you'll ever be able to stand on a soapbox and declare the 1-Factor model of intelligence to be the absolute truth.

now, if you believe in a latent variable like the ones you and i work with (which is something you simply need to take a leap of faith on) then yes, you can test a lot of things about that latent variable to see whether your assumptions are tenable or not. but you have to *believe* in it first.

noetsi

No cake for spunky
What about cases where the binary variable really is the result of transforming a continuous response? Something like:

Obese = 1 if BMI > 30, 0 otherwise

Clearly each observation came from looking at a person's BMI and then just categorizing it.
True, Dason, but my sense is that most analyses do not involve this. They involve naturally occurring data that are measured dichotomously, not data that were measured on an interval scale and then recoded. Indeed, turning interval data into a dichotomy is normally discouraged.

noetsi

No cake for spunky
Intelligence is the classic example of a latent variable that is broadly assumed to exist, but which no researcher has ever been able to observe directly (we observe actions tied to that characteristic). Indeed, what intelligence really is physically is unclear, and different models of intelligence assume different realities.

Pretty much every statistical text I have ever read includes somewhere the statement that statistics cannot prove causality. Hume's black swans are one element of this, but there are many others. Some argue that experiments where you deliberately manipulate levels can prove causality, but that assumes of course that you can generalize from the research setting to the larger environment, which may or may not be true.