# Joint distribution from multiple marginals

#### Angelorf

##### New Member
Consider an experiment consisting of repeated trials involving two Bernoulli (i.e. binary) random variables, A and B. Each trial consists of multiple outcomes for both A and B. Every trial has the same number of samples, and the underlying joint distribution of A and B is the same throughout.

Of course it's not possible to infer the joint probability distribution from the marginals of a single trial, since we don't assume A and B to be independent.

But is it possible to estimate the joint probability distribution from the set of marginals we obtained across trials?

Just to make things clearer: here's an example.
Trial 1 gives:
p(a=0) = .3 (so p(a=1)=.7)
p(b=0) = .2 (so p(b=1)=.8)
Trial 2 gives:
p(a=0) = .1
p(b=0) = .5
Trial 3 gives:
p(a=0) = .7
p(b=0) = .9
Trial 4 gives:
p(a=0) = .4
p(b=0) = .6

My question is: how can I estimate p(a=0,b=0), p(a=0,b=1), p(a=1,b=0) and p(a=1,b=1) ?

Remember that I don't have access to the individual results of each sample in a trial - I only know the marginals of each trial.
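
To make the setup concrete, here is a small Python sketch (the joint probabilities, function name, and seed are all illustrative, not part of the question): it draws N trials of n paired Bernoulli samples from a known joint and reports only the per-trial marginals, which is exactly the data available here.

```python
import random

def simulate_trials(p, n, N, seed=0):
    """Draw N trials of n paired samples from the joint
    p = (p00, p01, p10, p11) over (A, B), and report only the
    per-trial marginal frequencies of A = 0 and B = 0."""
    rng = random.Random(seed)
    outcomes = [(0, 0), (0, 1), (1, 0), (1, 1)]
    marginals = []
    for _ in range(N):
        draws = rng.choices(outcomes, weights=p, k=n)
        p_a0 = sum(1 for a, b in draws if a == 0) / n  # observed P(a=0)
        p_b0 = sum(1 for a, b in draws if b == 0) / n  # observed P(b=0)
        marginals.append((p_a0, p_b0))
    return marginals

marginals = simulate_trials((0.1, 0.2, 0.3, 0.4), n=100, N=4)
```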

#### BGM

##### TS Contributor
Well, this is a very interesting question. Here are my two cents:

Let $$p_{ab} = \Pr\{A = a, B = b\}, a,b \in \{0, 1\}$$

Usually we collect the frequency data and put it in a two-by-two table like this:

$$\begin{array}{c|c|c} & A = 0 & A = 1 \\ \hline B = 0 & W & Y \\ \hline B = 1 & X & Z \end{array}$$

and we know that the counts $$(W, X, Y, Z) \sim \text{Multinomial}(n;p_{00}, p_{01}, p_{10}, p_{11})$$

So the MLE is $$(\hat{p}_{00}, \hat{p}_{01}, \hat{p}_{10}, \hat{p}_{11}) = \left(\frac {W} {n}, \frac {X} {n}, \frac {Y} {n}, \frac {Z} {n}\right)$$
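
As a quick sanity check, the full-table MLE is just the vector of empirical cell proportions; a one-line Python sketch (the function name is mine):

```python
def mle_full_table(W, X, Y, Z):
    """MLE of (p00, p01, p10, p11) from a fully observed 2x2 table:
    simply the empirical cell proportions."""
    n = W + X + Y + Z
    return (W / n, X / n, Y / n, Z / n)
```

For example, `mle_full_table(1, 2, 3, 4)` returns `(0.1, 0.2, 0.3, 0.4)`.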

Note that the parameters satisfy $$p_{00} + p_{01} + p_{10} + p_{11} = 1$$ and the counts satisfy $$W + X + Y + Z = n$$. They both have 3 degrees of freedom.

Now when you only observe the marginal counts $$(W + X, W + Y)$$ (and the sample size $$n$$, of course), the dimension of the statistic is smaller than the number of parameters, so you have certainly lost some information. It is somewhat like censored/missing data, as you will see below.

To find the MLE in this case, I first write down the likelihood:

$$L(p_{00}, p_{01}, p_{10}; a, b) = \Pr\{W + X = a, W + Y = b\}$$

$$= \sum_{w = 0}^{\min\{a, b\}} \Pr\{W = w, X = a - w, Y = b - w\}$$

$$= \sum_{w = 0}^{\min\{a, b\}} \frac {n!} {w!(a-w)!(b-w)!(n-a-b+w)!} p_{00}^w p_{01}^{a-w} p_{10}^{b-w} (1 - p_{00} - p_{01} - p_{10})^{n - a - b + w}$$

So you have a likelihood, and you may try to find the MLE by maximizing this function numerically. I am not sure whether it is easy to maximize numerically, as it is a high-degree polynomial. To combine the information from different independent trials, we just multiply the likelihoods together as before.
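
A sketch of this likelihood in Python, using the convention above (a = W + X is the number of A = 0 outcomes, b = W + Y the number of B = 0 outcomes); the helper names are mine, and the sum over the unobserved w is done in log space for numerical stability:

```python
from math import lgamma, log, exp, inf

def log_lik_one(n, a, b, p00, p01, p10):
    """Log-likelihood of one trial where only the marginal counts
    a = W + X and b = W + Y are observed; sums over the unobserved
    cell count W (the latent variable)."""
    p11 = 1.0 - p00 - p01 - p10
    if min(p00, p01, p10, p11) <= 0.0:
        return -inf
    log_n_fact = lgamma(n + 1)
    terms = []
    for w in range(max(0, a + b - n), min(a, b) + 1):
        terms.append(
            log_n_fact
            - lgamma(w + 1) - lgamma(a - w + 1)
            - lgamma(b - w + 1) - lgamma(n - a - b + w + 1)
            + w * log(p00) + (a - w) * log(p01)
            + (b - w) * log(p10) + (n - a - b + w) * log(p11)
        )
    m = max(terms)  # log-sum-exp trick to avoid underflow
    return m + log(sum(exp(t - m) for t in terms))

def total_log_lik(trials, p00, p01, p10):
    """Independent trials multiply, so their log-likelihoods add.
    trials is a list of (n, a, b) triples."""
    return sum(log_lik_one(n, a, b, p00, p01, p10) for n, a, b in trials)
```

The resulting `total_log_lik` can then be handed to any numerical optimizer over the probability simplex; as a sanity check, the likelihood sums to one over all possible marginal counts (a, b).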

If $$n = 10, a = 3, b = 2$$, the above likelihood should be like the objective function I posted here:

For further reference, you may also look at the EM algorithm for missing data and the MLE of survival curves for censored data.

#### Angelorf

##### New Member
$$= \sum_{w = 0}^{\min\{a, b\}} \frac {n!} {w!(a-w)!(b-w)!(n-a-b+w)!} p_{00}^w p_{01}^{a-w} p_{10}^{b-w} (1 - p_{00} - p_{01} - p_{10})^{n - a - b + w}$$
Why do you take the sum over all possible $$w$$? I would say you don't have this sum; you should just have:
$$L = \frac {n!} {w!(a-w)!(b-w)!(n-a-b+w)!} p_{00}^w p_{01}^{a-w} p_{10}^{b-w} (1 - p_{00} - p_{01} - p_{10})^{n - a - b + w}$$
And then maximize this likelihood expression w.r.t. $$w$$ somehow.

#### BGM

##### TS Contributor
Because $$W$$ is not observed (it is somewhat like a latent variable); this is exactly where the information loss occurs. You can only infer that $$W$$ lies in a certain range, but cannot tell its exact realized value.

#### Angelorf

##### New Member
Dear BGM,

I've been working on a totally different method for a while now, which is why it took so long for me to answer. (It is also why I was confused about your method in the first place.) I will not go into my method here. Instead, let me reply to the method you proposed.

I wanted to add that we can get a fair estimate of the true marginal probabilities by averaging the marginals over all trials. Given these, we can infer all four joint probabilities from any single one of them. This reduces the complexity of the problem severely: now we only have to estimate one parameter, e.g. $$p_{11}$$.
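
That observation can be written down directly: given estimated marginals $$\Pr\{A=1\}$$, $$\Pr\{B=1\}$$ and a value for $$p_{11}$$, the other three cells follow by subtraction (a sketch with a hypothetical function name, using BGM's $$p_{ab}$$ indexing):

```python
def joints_from_p11(pA1, pB1, p11):
    """Recover all four joint probabilities from the two marginals
    P(A=1), P(B=1) and the single free parameter p11 = P(A=1, B=1)."""
    p10 = pA1 - p11              # P(A=1, B=0)
    p01 = pB1 - p11              # P(A=0, B=1)
    p00 = 1.0 - pA1 - pB1 + p11  # P(A=0, B=0)
    return (p00, p01, p10, p11)
```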

Note, however, that the total likelihood over all trials grows linearly in the number of trials N. Since in my case N will generally be large, the method becomes computationally too intensive.

The problem as I stated it was an oversimplification. In fact, the number of samples per trial, n, is not given; I want the most likely joints in the limit $$n \to \infty$$. This justifies using Stirling's approximation of the factorial function, so that we can differentiate and/or integrate where needed.

Note that taking this limit makes testing the method somewhat strange: as $$n \to \infty$$, the variance of the observed marginals shrinks to zero, so every trial reports the same marginals and I get no more information than from a single trial, in which case we know I cannot infer the joints.

A totally different and computationally cheap way to estimate the joints arises if we suppose that the correlation between the observed marginals is the same as the correlation between A and B. We can then compute the correlation between the observed marginals across trials, assume it coincides with the correlation between A and B, and finally infer which joint probabilities would give rise to that correlation (again using the average of the observed marginals as an estimate of the true marginals).
For two Bernoulli variables, we have in general:
Correlation: $$\rho_{A,B} = \frac {p_{11} - p_{1?}p_{?1}} {\sqrt{p_{?1}(1-p_{?1})p_{1?}(1-p_{1?})}}$$, where $$p_{1?}, p_{?1}$$ are the estimated marginal probabilities of $$A=1$$ and $$B=1$$.
We can rewrite this to:
$$p_{11} = p_{1?}p_{?1}+ \rho_{A,B} \sqrt{p_{?1}(1-p_{?1})p_{1?}(1-p_{1?})}$$
We can now use this to estimate $$p_{11}$$, from which we can infer the other joint probabilities, given that we have estimates for the marginals.
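
A sketch of this heuristic in Python (function names are mine; it takes the per-trial observed frequencies of A = 1 and B = 1, and bakes in the assumption that the correlation of the marginals equals $$\rho_{A,B}$$):

```python
from math import sqrt

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(vx * vy)

def estimate_p11(pa1_per_trial, pb1_per_trial):
    """Heuristic estimate of p11: average the per-trial marginals,
    take the correlation between them as a stand-in for rho(A, B),
    then invert p11 = pA*pB + rho * sqrt(pA(1-pA) pB(1-pB))."""
    pa = sum(pa1_per_trial) / len(pa1_per_trial)  # estimate of P(A=1)
    pb = sum(pb1_per_trial) / len(pb1_per_trial)  # estimate of P(B=1)
    rho = corr(pa1_per_trial, pb1_per_trial)
    return pa * pb + rho * sqrt(pa * (1 - pa) * pb * (1 - pb))
```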

This method is very simple, and it does give good results; the average error on $$p_{11}$$ was .003, on data sets with $$n=100, N=5000$$.

However, the assumption that the two correlations coincide is unjustified. This is very problematic for me.
