Phi and Cramer's V

#1
There must be something I'm not getting with these measures of association for nominal variables. Please help me understand.

What I've got so far: (1) they are based on chi-squared, and the X^2 value is calculated under the expectation that the variables are independent of each other (i.e., not associated); (2) phi (for 2x2 tables) is the square root of X^2 divided by the total number of observations (Cramer's V, for bigger tables, is slightly more complicated, with a further division by the lesser of rows or columns minus one); and (3) the possible range is 0 to 1. So how can I get phi values > 1?

Say the question is association between a particular surname and Y-DNA matches. The surname has a frequency within the population of 0.3% (making it one of the more common); the random-chance expected frequency would be 0.003. Observations of whether Y-DNA matches agree or disagree with the surname are 147 agree and 648 disagree, for a total of 795 matches. That gives us a table like this:
             Observed    Expected    (O-E)^2/E
Agree             147           3         6912
Disagree          648         792           26
Total             795         795         6938

Phi = SQRT(X^2/n) = SQRT(6938/795) = 2.95 ???
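
In case it helps to show my working, here is the same arithmetic as a quick Python sketch (just my own check; the counts and the rounded expected values are the ones in the table above):

Code:
from math import sqrt

observed = [147, 648]   # agree, disagree (from the table above)
expected = [3, 792]     # 795 * 0.003 rounded to 3, and the remainder
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
n = sum(observed)
print(x2)               # ~ 6938
print(sqrt(x2 / n))     # ~ 2.95 -- the "phi" that has me puzzled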

Thanks in advance for helping me see where I've gone wrong.
-rt_/)
 

Dragan

Super Moderator
#2
So how can I get phi values > 1?

Your fundamental problem is that you are not computing a chi-square test of independence based on a 2x2 contingency table. Rather, you are computing a basic chi-square goodness-of-fit test with 2 categories.
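
To make the distinction concrete, here is a rough Python sketch (assuming scipy is available): the statistic you computed is exactly what the one-way goodness-of-fit routine returns.

Code:
from scipy.stats import chisquare

# Goodness of fit of 2 observed counts to 2 expected counts -- this is what you computed
stat, p = chisquare([147, 648], f_exp=[3, 792])
print(stat)   # ~ 6938, matching your X^2

A test of independence, by contrast, starts from a full table of observed counts and derives the expected frequencies from the row and column totals.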
 
#3
I would appreciate it if you could please explain how a "chi-square test of independence" differs from a "basic chi-square goodness-of-fit test" when the hypothesis is that the variables are independent.

If the variables are independent, one would expect the name to occur in DNA matches no more frequently than in the population. Part of the problem may be that the expected frequency for the surname variable is so low (<5). But that's built into the situation: the most common surname in the US (Smith) is held by less than one percent of the population (880 per 100,000), and including variants (Smythe, etc.) adds only a little.
 

Dragan

Super Moderator
#4
I would appreciate it if you could please explain how a "chi-square test of independence" differs from a "basic chi-square goodness-of-fit test" when the hypothesis is that the variables are independent.
No, based on your calculations, you are not testing that 2 variables are independent of each other. Rather, you're testing the hypothesis of how the observed data "fits" the 2 expected frequencies you have provided - which is different.

I would suggest that you find a textbook and review the differences between these two chi-square statistics.

That said, the basic difference you need to understand is that a test of independence requires a contingency table, e.g. a 2x2 table with 4 observed counts and 4 expected frequencies. You don't have that.

You have 2 observations and 2 expected frequencies.
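
As a rough sketch of the shape you would need (Python with scipy; the second row is made up purely for illustration, since those counts are not in your post):

Code:
from math import sqrt
from scipy.stats import chi2_contingency

# Rows: carries the surname / does not; columns: Y-DNA agree / disagree.
# The second row is hypothetical -- you would need the real counts.
table = [[147, 648],
         [300, 99700]]
chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = 147 + 648 + 300 + 99700
print(sqrt(chi2 / n))   # phi from a genuine 2x2 table -- cannot exceed 1
# For a bigger table, Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))

With all 4 observed cells, and expected frequencies computed from the margins, SQRT(X^2/n) stays in the 0 to 1 range.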
 
#5
I figured it out from an analytic geometry viewpoint:
With only one degree of freedom, the chi-squared distribution is hyperbolic in shape and asymptotic to the X and Y axes. Therefore, the X^2 value can exceed N. When X^2 > N, X^2/N > 1 and its square root is also > 1.

(The statistics texts didn't help much, except for the graphics of the distributions.)
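
A quick numeric check of that (my own sketch, reusing my observed counts and just varying the expected "agree" count):

Code:
n = 795
observed = [147, 648]
for expected_agree in (100, 10, 3, 1):
    expected = [expected_agree, n - expected_agree]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    print(expected_agree, round(x2, 1), x2 > n)

The smaller the expected "agree" count relative to the observed one, the larger X^2 gets, and it passes N easily -- at which point SQRT(X^2/N) > 1.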