# Proper Bayesian scaling of data

#### Yannick

I am a Ph.D. student in structural bioinformatics, and I build Bayesian models for our data. I ran into the problem described below quite a while ago, and it comes up in many of my models, regardless of the data source. I suspect it stems from a misunderstanding on my part, but I could not find where to look in the literature to correct it.
Often in physics, a quantity is measured, and mock data can be generated from a model so that measured and expected data can be compared in the likelihood function, but only up to a multiplicative constant.
Assume you measure $$N$$ data points $$d_i$$ and are trying to estimate the mean of some quantity. Given the current estimate of the mean, you can compute mock data $$\tilde{d}_i$$, for which you know there exists a scale parameter $$\gamma$$ such that
$$\forall i, d_i = \gamma \tilde{d}_i$$
when the data and the model agree perfectly. Changing the experimental conditions changes $$\gamma$$. In the end, though, the value of $$\gamma$$ has no meaning, and we treat it as a nuisance parameter. I should say that we want a Bayesian approach because the data are very sparse and the priors overwhelming; sorry to the frequentists out there.
Using Bayes' theorem and some independence assumptions, our model can be written as
$$p(X,\gamma,\sigma|D,I)\propto p(D|\gamma,\sigma,X,I)p(X|I)p(\gamma|I)p(\sigma|I)$$
where $$X$$ is the mean, $$D=\{d_i\}_i$$ is the dataset, $$I$$ is the background information (everything we know apart from the measured data points, i.e. the remaining expert knowledge), $$\gamma$$ is the scale factor discussed above, and $$\sigma$$ is a spread parameter (the standard deviation of $$\log d_i$$) such that, as a concrete example (irrelevant to my problem),
$$p(D|\gamma,\sigma,X,I) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma d_i}\exp \left( -\frac{1}{2\sigma^2}\log^2\left(\frac{d_i}{\gamma \tilde{d_i}}\right)\right)$$
i.e. a lognormal distribution on the $$d_i$$, which is what we want for a certain class of measurements. $$\tilde{d_i}$$ is a function of the mean, so I should write $$\tilde{d_i}(X)$$, but I omit the argument for readability. These are the back-calculated mock data points.
We assume a Jeffreys prior for $$\gamma$$ and $$\sigma$$, so $$p(\sigma,\gamma|I)\propto \frac{1}{\gamma \sigma}$$, and a much more complicated prior for $$X$$, given that it is a vector representing the positions of at least a thousand atoms.
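For concreteness, here is a minimal NumPy sketch of the log of this lognormal likelihood. The function name and the numerical values are hypothetical and chosen only for illustration; the mock data are passed in directly rather than back-calculated from $$X$$.

```python
import numpy as np

def lognormal_log_likelihood(d, d_mock, gamma, sigma):
    """Log of p(D | gamma, sigma, X, I) for the lognormal model.

    d      : measured data points d_i (all > 0)
    d_mock : back-calculated mock data, same length as d (all > 0)
    gamma  : scale factor (> 0)
    sigma  : standard deviation of log d_i (> 0)
    """
    d = np.asarray(d, dtype=float)
    d_mock = np.asarray(d_mock, dtype=float)
    log_ratio = np.log(d / (gamma * d_mock))
    return np.sum(
        -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - np.log(d)
        - log_ratio**2 / (2.0 * sigma**2)
    )

# Hypothetical values, for illustration only.
d_mock = np.array([1.0, 2.0, 4.0])
gamma_true = 3.0
d = gamma_true * d_mock  # perfect agreement between data and model

ll_true = lognormal_log_likelihood(d, d_mock, gamma_true, sigma=0.5)
ll_off = lognormal_log_likelihood(d, d_mock, 2.0, sigma=0.5)
print(ll_true > ll_off)  # the correct scale gives the higher likelihood
```

With perfectly scaled data the exponent vanishes at the true $$\gamma$$, so only the normalization terms remain.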
The posterior then is
$$p(X,\gamma,\sigma|D,I) \propto p(X|I)\frac{1}{\gamma}\frac{1}{\sigma^{N+1}}\prod_{i=1}^N \exp \left( -\frac{1}{2\sigma^2}\log^2\left(\frac{d_i}{\gamma \tilde{d_i}}\right)\right)$$
or equivalently
$$p(X,\gamma,\sigma|D,I) \propto p(X|I)\frac{1}{\gamma} \frac{1}{\sigma^{N+1}} \exp \left( -\frac{N}{2\sigma^2} \left[\log^2\left( \frac{\tilde{\gamma}}{\gamma} \right) + \log^2 (SS) \right] \right)$$
where I introduced the sufficient statistics
$$\tilde{\gamma} = \left(\prod_{i=1}^{N} \frac{d_i}{\tilde{d_i}} \right)^{\frac{1}{N}}$$
$$\log^2(SS) = \frac{1}{N}\sum_{i=1}^{N} \log^2\left( \frac{d_i/\tilde{d_i}}{\tilde{\gamma}} \right)$$
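The decomposition behind these statistics can be checked numerically. Since $$\log\tilde{\gamma}$$ is the mean of the log-ratios $$\log(d_i/\tilde{d_i})$$, the sum of squared log-ratios $$\sum_i \log^2\left(\frac{d_i}{\gamma\tilde{d_i}}\right)$$ equals $$N\left[\log^2\left(\frac{\tilde{\gamma}}{\gamma}\right) + \log^2(SS)\right]$$. The sketch below uses synthetic data; the function name, seed, and parameter values are assumptions made only for this check.

```python
import numpy as np

def sufficient_statistics(d, d_mock):
    """Return (gamma_tilde, log2_ss) as defined above."""
    log_r = np.log(np.asarray(d, dtype=float) / np.asarray(d_mock, dtype=float))
    log_gamma_tilde = log_r.mean()  # log of the geometric mean of d_i / d_mock_i
    log2_ss = np.mean((log_r - log_gamma_tilde) ** 2)
    return np.exp(log_gamma_tilde), log2_ss

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
d_mock = rng.uniform(1.0, 5.0, size=20)
d = 3.0 * d_mock * np.exp(rng.normal(0.0, 0.3, size=20))
gamma = 2.5  # an arbitrary trial value of the scale factor

gamma_tilde, log2_ss = sufficient_statistics(d, d_mock)
n = d.size

# Sum over data points versus the form using the sufficient statistics.
direct = np.sum(np.log(d / (gamma * d_mock)) ** 2)
decomposed = n * (np.log(gamma_tilde / gamma) ** 2 + log2_ss)
print(np.isclose(direct, decomposed))
```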

The problem arises because I know the range of values of my mock data:
$$\exists (a,b) \text{ such that } \forall i,\ \tilde{d_i} \in [a,b]$$
where $$a$$ and $$b$$ are independent of any experiment. Thus I am looking to express my knowledge that $$\gamma$$ is such that the observed data points fall back into that range, i.e. $$\gamma$$ is such that
$$\forall i, \frac{d_i}{\gamma} \in [a,b]$$. That would also allow me to use a more informative prior than the Jeffreys prior. However, if I try to do this properly, the statement amounts to expressing the prior for $$\gamma$$ as a probability distribution conditional on the data, which breaks the likelihood principle.
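To make the constraint concrete: for positive $$a$$, $$b$$ and $$\gamma$$, requiring $$d_i/\gamma \in [a,b]$$ for all $$i$$ is equivalent to $$\max_i d_i / b \le \gamma \le \min_i d_i / a$$. The sketch below only computes that interval for hypothetical numbers; it makes the data dependence of the bounds explicit rather than resolving it.

```python
import numpy as np

def gamma_interval(d, a, b):
    """Range of gamma for which every d_i / gamma lies in [a, b].

    Assumes d_i > 0 and 0 < a <= b.
    Returns (lower, upper), or None when no such gamma exists.
    """
    d = np.asarray(d, dtype=float)
    lower = d.max() / b  # ensures d_i / gamma <= b for all i
    upper = d.min() / a  # ensures d_i / gamma >= a for all i
    return (lower, upper) if lower <= upper else None

# Hypothetical values, for illustration only.
d = np.array([6.0, 9.0, 12.0])
interval = gamma_interval(d, a=1.0, b=5.0)
print(interval)
```

When the data span a wider relative range than $$[a,b]$$, that is when $$\max_i d_i / \min_i d_i > b/a$$, no admissible $$\gamma$$ exists and the function returns `None`.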
I've also tried treating $$\gamma$$ as a missing datum, but that did not succeed either, because it requires using $$p(\gamma|D)$$ and $$p(D|\gamma)$$ at the same time, which is forbidden.
The least ugly option I have found so far is to normalize the data points by dividing them by the smallest one, and to use only the normalized values in the modelling, so that a prior on $$\gamma$$ can be set accordingly, such that $$p(\gamma|I)$$ gets very small when $$\frac{\min(\{d_i\})}{\gamma} = \frac{1}{\gamma}$$ is smaller than $$a$$. That is of course quite ugly, which is why I'm posting on this forum.
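A small sketch of this normalization, with hypothetical data and a hypothetical value of $$a$$. It shows two consequences: the smallest normalized point is exactly 1, so the condition $$1/\gamma < a$$ becomes $$\gamma > 1/a$$, and the sufficient statistic $$\tilde{\gamma}$$ is simply divided by $$\min(\{d_i\})$$.

```python
import numpy as np

def normalize_by_minimum(d):
    """Divide all data points by the smallest one."""
    d = np.asarray(d, dtype=float)
    return d / d.min()

def gamma_tilde(d, d_mock):
    """Geometric mean of the ratios d_i / d_mock_i."""
    return np.exp(np.mean(np.log(np.asarray(d, dtype=float) / np.asarray(d_mock, dtype=float))))

# Hypothetical values, for illustration only.
d = np.array([6.0, 9.0, 12.0])
d_mock = np.array([2.0, 3.0, 4.0])
a = 1.5  # hypothetical lower bound of the mock-data range

d_norm = normalize_by_minimum(d)

# After normalization min(d_norm) / gamma = 1 / gamma, which drops
# below a as soon as gamma exceeds 1 / a.
gamma_threshold = 1.0 / a

# The normalization rescales gamma_tilde by 1 / min(d).
ratio = gamma_tilde(d_norm, d_mock) / gamma_tilde(d, d_mock)
print(d_norm, gamma_threshold, ratio)
```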

Thanks for your patience; I hope you'll find it interesting.