Joint distribution from multiple marginals

Consider an experiment consisting of repeated trials involving two Bernoulli (i.e. binary) random variables, A and B. Each trial consists of multiple samples, each yielding an outcome for both A and B. Every trial has the same number of samples, and the underlying joint distribution of A and B is the same throughout.

Of course it's not possible to infer the joint probability distribution from the marginals of a single trial, since we don't assume A and B to be independent.

But is it possible to estimate the joint probability distribution from the set of marginals obtained across trials?

To make things clearer, here's an example.
Trial 1 gives:
p(a=0) = .3 (so p(a=1)=.7)
p(b=0) = .2 (so p(b=1)=.8)
Trial 2 gives:
p(a=0) = .1
p(b=0) = .5
Trial 3 gives:
p(a=0) = .7
p(b=0) = .9
Trial 4 gives:
p(a=0) = .4
p(b=0) = .6

My question is: how can I estimate p(a=0,b=0), p(a=0,b=1), p(a=1,b=0) and p(a=1,b=1) ?

Remember that I don't have access to the individual results of each sample in a trial - I only know the marginals of each trial.
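To make the setup concrete, here is a small Python sketch of the data-generating process (the joint distribution, trial sizes, and seed below are illustrative assumptions, not part of the question) that simulates trials and keeps only the per-trial marginals:

```python
import random

def simulate_trials(p, n, N, seed=0):
    """Simulate N trials of n samples each from a joint Bernoulli
    distribution p = (p00, p01, p10, p11); return only the per-trial
    empirical marginals (p(a=0), p(b=0)), as in the problem setup."""
    rng = random.Random(seed)
    p00, p01, p10, p11 = p
    marginals = []
    for _ in range(N):
        a0 = b0 = 0
        for _ in range(n):
            u = rng.random()
            if u < p00:                 # outcome (a=0, b=0)
                a0 += 1
                b0 += 1
            elif u < p00 + p01:         # outcome (a=0, b=1)
                a0 += 1
            elif u < p00 + p01 + p10:   # outcome (a=1, b=0)
                b0 += 1
            # else: outcome (a=1, b=1), contributes to neither count
        marginals.append((a0 / n, b0 / n))
    return marginals
```

Each tuple in the returned list plays the role of one "trial" in the example above; the individual (A, B) samples are thrown away, which is exactly the information loss in the question.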


TS Contributor
Well, this is a very interesting question. Here are my two cents:

Let \( p_{ab} = \Pr\{A = a, B = b\}, a,b \in \{0, 1\} \)

Usually we collect the frequency data and put it in a two-by-two table like this:

\( \begin{tabular}{|c|c|c|}
\hline & $A = 0$ & $A = 1$ \\ \hline
$B = 0$ & $W$ & $Y$ \\ \hline
$B = 1$ & $X$ & $Z$ \\ \hline
\end{tabular} \)

and we know that the counts \( (W, X, Y, Z) \sim \text{Multinomial}(n;p_{00}, p_{01}, p_{10}, p_{11}) \)

So the MLE is \( (\hat{p}_{00}, \hat{p}_{01}, \hat{p}_{10}, \hat{p}_{11})
= \left(\frac {W} {n}, \frac {X} {n}, \frac {Y} {n}, \frac {Z} {n}\right) \)

Note that the parameters satisfy \( p_{00} + p_{01} + p_{10} + p_{11} = 1 \) and the counts satisfy \( W + X + Y + Z = n \). They both have 3 degrees of freedom.
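In this fully observed case the MLE is just the table of relative frequencies; a minimal sketch (the function name is illustrative):

```python
def mle_full_table(W, X, Y, Z):
    """MLE of (p00, p01, p10, p11) when the full 2x2 cell counts
    are observed: the relative frequencies (W, X, Y, Z) / n."""
    n = W + X + Y + Z
    return (W / n, X / n, Y / n, Z / n)
```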

Now, when you only observe the marginal counts \( (W + X, W + Y) \) (and the sample size \( n \), of course), the dimension of the statistic is less than the number of parameters, so surely you have lost some information. I think it is somewhat like censored/missing data, which you will see below.

To find the MLE in this case, first I try to write the likelihood:

\( L(p_{00}, p_{01}, p_{10}; a, b) = \Pr\{W + X = a, W + Y = b\} \)

\( = \sum_{w = \max\{0,\, a + b - n\}}^{\min\{a, b\}} \Pr\{W = w, X = a - w, Y = b - w\} \)

\( = \sum_{w = \max\{0,\, a + b - n\}}^{\min\{a, b\}} \frac {n!} {w!(a-w)!(b-w)!(n-a-b+w)!}
p_{00}^w p_{01}^{a-w} p_{10}^{b-w} (1 - p_{00} - p_{01} - p_{10})^{n - a - b + w} \)

(the lower summation limit ensures that all four cell counts, including \( n - a - b + w \), are non-negative)

So somehow you have a likelihood here, and you may try to find the MLE by maximizing this function numerically. I am not sure whether it is easy to maximize numerically, as it is a high-degree polynomial. To combine the information from different independent trials, we just multiply the likelihoods together as before.
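This likelihood can be evaluated stably in log space, which avoids overflow in the factorials; here is a sketch in pure Python (the function name is ours), with the lower summation limit chosen so that all four cell counts stay non-negative:

```python
import math

def log_lik(p00, p01, p10, a, b, n):
    """Log-likelihood of observing the marginal counts a = W + X and
    b = W + Y out of n samples, marginalizing over the unobserved
    cell count W = w via a log-sum-exp over the feasible range."""
    p11 = 1.0 - p00 - p01 - p10
    terms = []
    # w must keep all four cell counts (w, a-w, b-w, n-a-b+w) >= 0
    for w in range(max(0, a + b - n), min(a, b) + 1):
        t = (math.lgamma(n + 1)                       # log n!
             - math.lgamma(w + 1) - math.lgamma(a - w + 1)
             - math.lgamma(b - w + 1) - math.lgamma(n - a - b + w + 1)
             + w * math.log(p00) + (a - w) * math.log(p01)
             + (b - w) * math.log(p10) + (n - a - b + w) * math.log(p11))
        terms.append(t)
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Maximizing this over the probability simplex (with a grid search or any generic optimizer) gives the MLE; for several independent trials, sum the per-trial log-likelihoods.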

If \( n = 10, a = 3, b = 2 \), the above likelihood should look like the objective function I posted here.

For other references, you may also look at the EM algorithm for missing data and the MLE of survival curves for censored data.
\( = \sum_{w = \max\{0,\, a + b - n\}}^{\min\{a, b\}} \frac {n!} {w!(a-w)!(b-w)!(n-a-b+w)!}
p_{00}^w p_{01}^{a-w} p_{10}^{b-w} (1 - p_{00} - p_{01} - p_{10})^{n - a - b + w} \)
Why do you take the sum over all possible \( w \)? I would say you don't have this sum; you should just have:
\( L = \frac {n!} {w!(a-w)!(b-w)!(n-a-b+w)!}
p_{00}^w p_{01}^{a-w} p_{10}^{b-w} (1 - p_{00} - p_{01} - p_{10})^{n - a - b + w} \)
and then maximize this likelihood expression w.r.t. \( w \) somehow.


TS Contributor
Because \( W \) is not observed (it is somewhat like a latent variable), and this actually indicates the information loss. You can only infer that \( W \) lies in a certain range, but you cannot tell its exact realized value.
Dear BGM,

I've been working on a totally different method for a while now, which is why it took so long for me to answer. (It is also why I was confused about your method in the first place.) I will not go into my method here. Instead, let me reply to the method you proposed.

I wanted to add that we can get a fair estimate of the true marginal probabilities by averaging the marginals over all the trials. Given these, we can infer all four joint probabilities from any one of them. This reduces the complexity of the problem severely: we only have to estimate one parameter, \( p_{11} \) for example.
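This reduction can be sketched as follows (function names are illustrative): once the averaged marginals fix \( \Pr\{A=0\} \) and \( \Pr\{B=0\} \), the single free parameter \( p_{11} \) determines all four cells.

```python
def marginal_estimates(marginals):
    """Estimate the true marginals p(A=0), p(B=0) by averaging
    the observed per-trial marginals."""
    N = len(marginals)
    pa0 = sum(m[0] for m in marginals) / N
    pb0 = sum(m[1] for m in marginals) / N
    return pa0, pb0

def joints_from_p11(pa0, pb0, p11):
    """Recover all four joint probabilities from the marginals and
    the single free parameter p11 = Pr{A=1, B=1}."""
    pa1, pb1 = 1 - pa0, 1 - pb0
    p10 = pa1 - p11             # Pr{A=1, B=0}
    p01 = pb1 - p11             # Pr{A=0, B=1}
    p00 = pa0 + pb0 - 1 + p11   # Pr{A=0, B=0}
    return p00, p01, p10, p11
```

For instance, the four example trials from the question average to the marginal estimates \( \Pr\{A=0\} \approx .375 \) and \( \Pr\{B=0\} \approx .55 \).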

Note, however, that the total likelihood for all trials grows linearly in the number of trials \( N \). Given that in my case \( N \) will generally be large, the method becomes computationally too intensive.

The problem as I stated it was an oversimplification. Actually, the number of samples per trial, \( n \), is not given; I want the most likely joints in the limit \( n \to \infty \). This justifies using Stirling's approximation of the factorial function, so that we can differentiate and/or integrate if we need to.
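For reference, the approximation in question is
\( \ln n! = n \ln n - n + \tfrac{1}{2} \ln (2 \pi n) + O(1/n) \)
which turns the log of each multinomial coefficient into a smooth function of the (scaled) counts, so the log-likelihood can be differentiated with respect to them.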

Note that taking this limit makes testing the method kind of weird: as \( n \to \infty \), the variance of the marginals shrinks to zero, giving me no more information than I would get from a single trial, in which case we know I cannot infer the joints.

A totally different and computationally cheap way to estimate the joints is obtained by supposing that the correlation of the per-trial marginals equals the correlation of A and B. We can then compute the correlation between the observed marginals, assume it coincides with the correlation between A and B, and finally infer which joint probabilities would give rise to that correlation (again using the averaged marginals as estimates of the true marginals).
For two Bernoulli variables we have in general:
Correlation: \( \rho_{A,B} = \frac {p_{11} - p_{1?}p_{?1}} {\sqrt{p_{?1}(1-p_{?1})p_{1?}(1-p_{1?})}} \), where \( p_{1?} = \Pr\{A=1\} \) and \( p_{?1} = \Pr\{B=1\} \) are the (estimated) marginal probabilities.
We can rewrite this as:
\( p_{11} = p_{1?}p_{?1} + \rho_{A,B} \sqrt{p_{?1}(1-p_{?1})p_{1?}(1-p_{1?})} \)
We can now use this to estimate \( p_{11} \), from which the other joint probabilities follow, given estimates of the marginals.
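The plug-in computation can be sketched in pure Python (`corr_based_p11` is an illustrative name; step (3) below is the assumption under discussion, not an established fact):

```python
import math

def corr_based_p11(marginals):
    """Correlation plug-in estimate of p11 = Pr{A=1, B=1}:
    (1) average the per-trial marginals, (2) compute the correlation
    of the per-trial marginal series, and (3) assume that correlation
    equals the correlation of A and B."""
    N = len(marginals)
    # per-trial estimates of Pr{A=1} and Pr{B=1}
    xs = [1 - pa0 for pa0, _ in marginals]
    ys = [1 - pb0 for _, pb0 in marginals]
    mx, my = sum(xs) / N, sum(ys) / N
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / N
    vx = sum((x - mx) ** 2 for x in xs) / N
    vy = sum((y - my) ** 2 for y in ys) / N
    rho = cov / math.sqrt(vx * vy)
    # p11 = p1? p?1 + rho * sqrt(p?1 (1-p?1) p1? (1-p1?))
    return mx * my + rho * math.sqrt(mx * (1 - mx) * my * (1 - my))
```

Feeding it the four example trials from the question gives a \( p_{11} \) estimate of about .45.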

This method is very simple, and it does give good results; the average error on \( p_{11} \) was .003, on data sets with \( n=100, N=5000 \).

However, the assumption that the two correlations coincide is unjustified. This is very problematic for me.