Re:Today I Learned: calculate the significance of Correspondence Analysis' dimensions
Even though in
Correspondence Analysis (CA) exploratory stage and testing for significance are two different sides of the coin (at least from the French/European scholarly tradition tracing back to the early work of Benzecri), I habe been trying to calculate the significance of the CA's dimensions following the procedure described by
Greenacre M.,
Correspondence Analysis in Practice (2007) (
Amazon link) (pages. 198-199).
Fortunately enough, I came accross an R package, named '
InPosition', which has finally allowed me to put Greenacre's method to work. Given a crosstabulation, the package is able to calculate the dimensions, as well as the total inertia, of n simulated (i.e., permutated) tables.
In what follows, I will describe:
-how to plot the permutated explained inertia of CA's first two dimensions and calculate their significance (Steps 1-3)
-how plot and calcultate the significance of the CA total inertia (Steps 4-5).
For comparability's sake, I will use the dataset used by Greenacre (p. 199, Exhibit 25.4). It is shown below:
Code:
C1 C2 C3
E1 5 7 2
E2 18 46 20
E3 19 29 39
E4 12 40 49
E5 3 7 16
Step 1 (entering the data)
Code:
[I]#load the InPosition package[/I]
library(InPosition)
[I]#read the table selecting the file saved on the PC[/I]
dat <- read.table(file=file.choose(), header=TRUE)
[I]# use the apposite command to simulate [B]999 (permutated) contingency tables[/B]. It takes about 15 secs[/I]
res<-epCA.inference.battery(dat, test.iters=999)
Step 2 (plotting permutated inertia of dimensions' 1 and 2)
Code:
[I]#plot a scatterplot of the permutated inertia of the first two dimensions[/I]
plot(res$Inference.Data$components$eigs.perm[,1], res$Inference.Data$components$eigs.perm[,2], main=" Scatterplot of permutated Dimensions' inertia", sub="solid line: obs. inertia; dashed line: 95 percentile of the permut. distrib. (=alpha 0.05 threshold)", xlab="inertia of permutated 1 Dim", ylab="inertia of permutated 2 Dim")
[I]##add reference lines[/I]
[I]#reference lines indicating the 95 percentile of the permutated distribution of the first and second dimensions' inertia[/I]
abline(v=quantile(res$Inference.Data$components$eigs.perm[,1], c(0.95)), lty=2, col="blue")
abline(h=quantile(res$Inference.Data$components$eigs.perm[,2], c(0.95)), lty=2, col="blue")
[I]#reference lines indicating the observed inertia of the first two dimensions[/I]
abline(v=res$Inference.Data$components$eigs[1])
abline(h=res$Inference.Data$components$eigs[2])
The above code produces the attached Plot_n1. The dashed references lines indicate the threshold of the 95 percentile of the permutated distribution of the first two dimensions' inertia. Observed inertia lying beyond that threshold is signifcant at alpha 0.05
Step 3 (calculating the significance of the 1 and 2 dimensions)
Code:
[I]#count the number of permutated eigenvalues of the first dimensions that are greater than the observed eigenvalue of the same dimension[/I]
first.dim.ratio<-length(which(res$Inference.Data$components$eigs.perm[,1] > res$Inference.Data$components$eigs[1]))
[I]#compute the p value of the first dimension (p=0.003) [/I]
pvalue.first.dim<-first.dim.ratio/1000
[I]#the same applies for the second dimension (p=0.07)[/I]
second.dim.ratio<-length(which(res$Inference.Data$components$eigs.perm[,2] > res$Inference.Data$components$eigs[2]))
pvalue.second.dim<-second.dim.ratio/1000
Step 4 (plotting the distribution of the permutated total inertia)
Code:
[I]#plot the distribution of the permutated total inertia[/I]
d <- density(res$Inference.Data$omni$inertia.perm)
plot(d, main="Kernel density of CA permutated total inertia", sub="solid line: obs. inertia; dashed line: 95 percentile of the permut. distrib. (=alpha 0.05 threshold)")
polygon(d, col="red", border="blue")
[I]##add reference lines[/I]
[I]#add reference line indicating the observed total inertia[/I]
abline(v=sum(res$Fixed.Data$ExPosition.Data$eigs))
[I]#add reference line indicating the 95 percentile of the permutated total inertia[/I]
abline(v=quantile(res$Inference.Data$omni$inertia.perm, c(0.95)), lty=2, col="blue")
The above code produces the attached Plot_n2.
Step 5 (calculating the significance of the the total inertia)
Code:
[I]##calculate the significance of the total inertia[/I]
[I]#count the number of time the permutated total inertia is greater than the observed total inertia[/I]
inertia.ratio<-length(which(res$Inference.Data$omni$inertia.perm> sum(res$Fixed.Data$ExPosition.Data$eigs)))
[I]#compute the significance of the observed total inertia (p=0.002)[/I]
pvalue.inertia<-inertia.ratio/1000
Final notes:
-for the time being, I use the basic plotting facility of R.
-InPosition natively returns the significance of the dimensions and of the total inertia. However, I am not particularly happy with it since the way in which those figures are calculated is not obvious and is not explained in the accompanying documentation. So, I believe that tackling the issue from the long way as done above can be more useful, for one thing, from a didactical standpoint.
Hope this helps.
Regards
Gm