I have approximated two one-dimensional random variables, A and B, using Gaussian Mixture Models (3 Gaussians each). I used the MATLAB function gmdistribution.fit with 10,000 values. The resulting distributions are called A and B (shown in the attached Figure 1, A in red and B in blue).

Now I have a few values, e.g. v = [-1 0 0.5 1 5 6 6.5 7 14] (as in Figure 1). This vector produces a very sparse histogram, since it contains so few values.

How probable is it that these values were generated by distribution A? Or that they were generated by distribution B? I would like to obtain a probability value so that I can classify the set of values into category A (distribution A) or category B (distribution B).

I had several ideas:

-> Joint distribution (product of probabilities)... but what happens with outliers? Since A and B are approximations of the real distributions, an outlier might produce zero probability for some value of x (see attached Figure 2), so I am not sure about this approach.

-> Average probability: this is a trivial solution I thought of, and it is probably wrong.

-> Hypothesis tests (chi-squared or Kolmogorov–Smirnov): in the case of the chi-squared test, the data must be binned... what is the optimal size of these bins? In addition, these hypothesis tests produce a p-value, which is not the probability value I am looking for (as far as I understand).
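The first idea (product of probabilities) is usually workable if it is done in log space with a floor to guard against zeros; comparing the two set-level log-likelihoods is then a likelihood-ratio test, and Bayes' rule with equal priors turns it into a posterior probability. Below is a minimal numpy sketch: the mixture weights, means, and standard deviations are invented stand-ins for whatever gmdistribution.fit returned, and the 1e-300 floor is an assumed way to keep an outlier from producing log(0).

```python
import numpy as np

def gmm_logpdf(x, weights, means, sds):
    """Log density of a 1-D Gaussian mixture, evaluated at each point of x."""
    x = np.asarray(x, dtype=float)[:, None]        # shape (n, 1) against (k,) components
    comp = weights * np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    dens = comp.sum(axis=1)
    return np.log(np.maximum(dens, 1e-300))        # floor guards against log(0) at outliers

# Hypothetical 3-component fits standing in for the gmdistribution.fit output
wA, mA, sA = np.array([0.5, 0.3, 0.2]), np.array([0.0, 1.0, 2.0]), np.array([1.0, 1.0, 1.5])
wB, mB, sB = np.array([0.4, 0.4, 0.2]), np.array([5.0, 6.5, 8.0]), np.array([1.0, 1.0, 2.0])

v = np.array([-1, 0, 0.5, 1, 5, 6, 6.5, 7, 14], dtype=float)

# Log-likelihood of the whole set under each model: the "product of
# probabilities" idea, but summed in log space so outliers do not underflow
llA = gmm_logpdf(v, wA, mA, sA).sum()
llB = gmm_logpdf(v, wB, mB, sB).sum()

# Posterior probability of category A given equal priors, computed stably
pA = 1.0 / (1.0 + np.exp(llB - llA))
print(pA)
```

With real data the floor value (or a small uniform "background" component mixed into each density) controls how harshly a single outlier like 14 can veto a category.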

Thanks in advance for your advice.

Best regards,

Emilio.

I'm trying to analyze the results of my Master Thesis with SPSS.

I have 3 experimental conditions (factor: Group), and for variable A every subject in each condition has 2 measurements, one per hemisphere (factor: Hemisphere).

I want to test the effects of these 2 factors on my dependent variable A. What I would like to test is:

- effects of Group on A

- effects of Group x Hemisphere on A

- effects of Hemisphere on A between my 3 Groups

- effects of Hemisphere on A within my 3 Groups

I wanted to use a repeated-measures ANOVA, but then I came across mixed models. What is the difference between these two approaches? Which one is more suitable for the kind of analysis I want to perform?
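For concreteness, here is how a design like this is often expressed as a linear mixed model outside SPSS: a sketch in Python's statsmodels with simulated data. The column names ("A", "Group", "Hemisphere", "subject"), the group sizes, and the noise levels are all assumptions, not anything from the thesis. The random intercept per subject is what encodes the repeated (within-subject) Hemisphere factor, and Group * Hemisphere gives both main effects plus their interaction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for g in ["G1", "G2", "G3"]:             # 3 experimental groups (between-subjects)
    for s in range(20):                  # hypothetical 20 subjects per group
        subj_effect = rng.normal(0, 1)   # shared by both of this subject's rows
        for h in ["left", "right"]:      # 2 hemispheres, repeated within subject
            rows.append({"subject": f"{g}_{s}", "Group": g, "Hemisphere": h,
                         "A": subj_effect + rng.normal(0, 1)})
df = pd.DataFrame(rows)

# Random intercept for subject captures the within-subject correlation;
# the fixed part tests Group, Hemisphere, and Group x Hemisphere.
model = smf.mixedlm("A ~ Group * Hemisphere", df, groups=df["subject"])
fit = model.fit()
print(fit.summary())
```

With complete, balanced data like this, the mixed model and the classical repeated-measures ANOVA test the same effects; the mixed model additionally tolerates missing cells and lets you model the covariance structure explicitly.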

Thank you very much for your help!

First, I read an example in "Pseudoreplication is a pseudoproblem" where we wish to determine which of two urns contains the greater proportion of red to blue marbles. Each urn contains several thousand marbles. The authors propose sampling 10 marbles 10 times from each urn, computing the frequency in each sample, and performing a two-sample t-test with 18 degrees of freedom.

Why can't we sample 100 marbles once, code them in binary (1 for blue, 0 for red), and use a GLM with a binomial family?

If I study the diameter of the marbles, should I sample in the same way?

The authors wrote that the urn can be considered the experimental unit in a design without replication. Now suppose we have replication (4 urns, 2 per condition). How can I include the replication in my model? As a random effect?

Another example: I want to know whether fish mortality differs between 2 conditions. I have two water compartments, one per condition. Inside each compartment, the fish are raised in 3 different cages. I sample 10 times 10 individuals, or 100 individuals, in each cage.

Should the cages be a random effect? If I have replication (4 water compartments) as well as cages, how should I analyse the data? For information, I'm using R. Thank you.
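One simple option when a full binomial mixed model feels like overkill is to aggregate to the cage level and treat each cage's mortality rate as one observation, so the cage becomes the experimental unit. Here is a scipy sketch with invented death counts; since there is only one compartment per condition, compartment and condition remain confounded, and replicated compartments would be needed to separate them.

```python
import numpy as np
from scipy import stats

# Invented counts: deaths out of 100 fish in each of the 3 cages,
# for the two water compartments (one per condition)
dead_cond1 = np.array([12, 15, 10])
dead_cond2 = np.array([22, 25, 19])
rate1, rate2 = dead_cond1 / 100, dead_cond2 / 100

# Cage-level two-sample t-test: n = 3 cages per condition, 4 df
t, p = stats.ttest_ind(rate1, rate2)
print(t, p)
```

In R the mixed-model alternative on individual fish would look like `glmer(dead ~ condition + (1 | cage), family = binomial)`, with `(1 | cage)` absorbing the cage-to-cage variation instead of pretending the fish are independent.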