Correlation multivariate trinomials

mrod

New Member
#1
I have a dataset of 6 multiple choice questions with each question having 3 mutually exclusive choices.

The dataset is split into a number of time frames, where each time frame holds, say, 1000 questionnaire results. Each of the 18 data points in a time frame is the fraction of the 1000 participants who ticked that specific question/choice combination.

Now, to calculate the correlation matrix for the dataset, can I just calculate it between the 18 data points, or do I need to account for the fact that I am dealing with 6 trinomial variables?
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Explain the dataset part in greater detail; I am not following. Do you want to compare the same question between different time points (say 18 total)? Or are you talking about comparing different questions?

I am sure you can probably do what you want, provided that you use the right correlation; however, I am not following what you want to compare with correlations.
 

mrod

New Member
#3
The comparison is of question/choice combinations between time frames. The resulting correlation matrix is 18x18, where 18 comes from multiplying the 6 questions by the 3 choices.

I suspect that the outcomes within a trinomial variable need to be dependent somehow. This suspicion comes from the fact that, knowing the number of respondents for two of the choices, the third can be backed out by subtracting these from the total (1000 in the example I gave earlier).

When constructing the pairwise correlation matrix assuming no structure (i.e. ignoring the fact that it is 6 trinomial variables), perhaps something goes missing.
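To make the suspicion concrete, here is a minimal sketch. The fractions in `frames` are made up purely for illustration (they are not from my dataset), and `sample_cov` is just a helper name. Because the three fractions of one question sum to 1 in every time frame, each row of their sample covariance matrix sums to zero, so the three columns cannot vary freely.

```python
from statistics import mean

def sample_cov(x, y):
    """Unbiased sample covariance of two equally long sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical fractions for ONE question over five time frames.
# Each row is (choice 1, choice 2, choice 3) and sums to 1.
frames = [
    (0.50, 0.30, 0.20),
    (0.45, 0.33, 0.22),
    (0.55, 0.27, 0.18),
    (0.48, 0.30, 0.22),
    (0.52, 0.31, 0.17),
]

# Transpose so that each entry of `columns` is one choice over time.
columns = list(zip(*frames))

# 3x3 sample covariance matrix of the three choices.
cov = [[sample_cov(ci, cj) for cj in columns] for ci in columns]

# Cov(c_i, c_1 + c_2 + c_3) = Cov(c_i, constant) = 0, so every row sums to ~0.
row_sums = [sum(row) for row in cov]
print(row_sums)
```

So each 3x3 within-question block is singular, which suggests one of the three choices per question carries no extra information for the correlation matrix.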
 

BGM

TS Contributor
#4
We know that if \( (X_1, X_2, X_3) \) follows a trinomial \( (1; p_1, p_2, p_3) \), then

\( Cov[X_1, X_2] = -p_1p_2 \) and \( Corr[X_1, X_2] = \frac {-p_1p_2} {\sqrt{p_1(1 - p_1)p_2(1 - p_2)}} \)

which can be estimated by its MLE (by replacing the probabilities with the sample proportions).

So this would be the "structure" of the correlation within one trinomial vector.

Whether you add any structural assumption on the correlations between random variables in different trinomial vectors is another issue. But since you are talking about the correlation between two Bernoulli trials, it can also be expressed in terms of the proportions when you construct the 2 by 2 contingency table.
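As a rough sketch of the within-vector formula, the snippet below plugs sample proportions into \( Corr[X_i, X_j] \). The function name `trinomial_corr` and the proportions in `p` are hypothetical, chosen only for illustration.

```python
import math

def trinomial_corr(p_i, p_j):
    """Correlation between two different categories of one trinomial vector.

    Implements -p_i * p_j / sqrt(p_i * (1 - p_i) * p_j * (1 - p_j)).
    """
    return -p_i * p_j / math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))

# Hypothetical sample proportions for the three choices of one question.
p = (0.5, 0.3, 0.2)

# 3x3 within-question correlation matrix: 1 on the diagonal,
# the trinomial formula everywhere else.
within = [
    [1.0 if i == j else trinomial_corr(p[i], p[j]) for j in range(3)]
    for i in range(3)
]

for row in within:
    print([round(v, 4) for v in row])
```

Note that every off-diagonal entry is negative: a respondent who ticks one choice cannot tick another, so within a question the choices can only compete.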
 

mrod

New Member
#5
Correlated multivariate trinomials - can time series data improve estimation?

Attached is a simplified data sample with only two questions and a couple of time frames. The average reply rate for each question is calculated on row 21.

What I would like is to arrive at a better estimate for each question/answer combination than just multiplying the average reply rates. In cell D23 I have done this calculation, resulting in an estimate of 7,635 for question 1/alternative 1 and question 2/alternative 1 replies.

Given that I have the time series data, how can I use this to my advantage to arrive at a better estimate? In the attached file I have calculated the resulting correlation matrix: the formula provided by BGM has been used to calculate the correlation between answers within each question (gray background), and the sample correlation from Excel's CORREL function has been used to calculate the correlation of the answers between the questions.
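For anyone without the attachment, here is a minimal sketch of how such a mixed matrix can be put together. The fractions in `q1` and `q2` are invented for illustration and are not the values in the file; `pearson` is a hand-written stand-in for CORREL, and `trinomial_corr` is my own name for BGM's formula evaluated at the average reply rates.

```python
import math
from statistics import mean

def trinomial_corr(p_i, p_j):
    """Within-question correlation from the trinomial formula."""
    return -p_i * p_j / math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))

def pearson(x, y):
    """Sample Pearson correlation (the same quantity CORREL returns)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical fractions per time frame: (alt 1, alt 2, alt 3) per question.
q1 = [(0.50, 0.30, 0.20), (0.45, 0.33, 0.22), (0.55, 0.27, 0.18),
      (0.48, 0.30, 0.22), (0.52, 0.31, 0.17)]
q2 = [(0.40, 0.40, 0.20), (0.42, 0.36, 0.22), (0.38, 0.43, 0.19),
      (0.41, 0.38, 0.21), (0.39, 0.41, 0.20)]

# One column per question/alternative combination, in order Q1A1..Q2A3.
columns = [list(c) for c in zip(*q1)] + [list(c) for c in zip(*q2)]
question_of = [0, 0, 0, 1, 1, 1]      # which question each column belongs to
means = [mean(c) for c in columns]    # average reply rates

n = len(columns)
corr = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i == j:
            corr[i][j] = 1.0
        elif question_of[i] == question_of[j]:
            # Same question: trinomial formula at the average reply rates.
            corr[i][j] = trinomial_corr(means[i], means[j])
        else:
            # Different questions: sample correlation across time frames.
            corr[i][j] = pearson(columns[i], columns[j])

for row in corr:
    print([round(v, 3) for v in row])
```

One caveat with this construction: because the blocks come from two different estimators, the assembled matrix is not guaranteed to be positive semi-definite, so it may be worth checking before using it as a correlation matrix downstream.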
 