examine whether data sets come from the same distribution

#1
hi,
i want to check whether the two data sets can be drawn from the same distribution. can i do it with a chi-square test? what assumptions do i need for it?
if i can't use the chi-square test - what can i do instead?

notes about the data sets:
- discrete
- mostly increasing (in particular - not normally distributed)
- about 50-70 participants in each data set.


thanks!!
 
#4
thanks!
i wonder, why not chi-square?
Yes, you can use the chi-square.

(And maybe that is all the original poster wanted: just to get confirmation to go on with the method he had decided to use anyway. Maybe he didn't want any other ideas or suggestions.)

But for the rest of you:

i want to check whether the two data sets can be drawn from the same distribution

I was wondering whether it would be possible to take the two empirical distributions and do a QQ-plot of them?
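A minimal sketch of that idea, using hypothetical Poisson data to stand in for the discrete samples:

```r
# Two-sample QQ-plot: compare the empirical quantiles of the two samples.
set.seed(1)
a <- rpois(60, lambda = 3)   # hypothetical discrete sample 1
b <- rpois(70, lambda = 3)   # hypothetical discrete sample 2 (unequal n is fine)
qqplot(a, b, main = "Two-sample QQ-plot")
abline(0, 1)  # points close to this line suggest the same distribution
```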


But in this link Huber gives some very good advice: "Practical Issues in the Use of Probabilistic Risk Assessment".

Among other things Huber says: "The question is not 'what is the best fit?'"
(So don't go on a distribution hunt!)
 
#6
Hi Greta,
how would you use the chi-squared test for this?

regards
Cut the data into, say, 5 size classes and count the number of observations in each class or cell. The two distributions then give a 2×5 table to be tested with a chi-square test. (Example: one is normal and one is uniform. The normal data give many observations in the middle, but the uniform gives many in the tails.)
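A minimal sketch of that procedure in R, with hypothetical normal and uniform samples:

```r
# Cut the pooled data into 5 size classes and test the resulting 2 x 5 table.
set.seed(1)
a <- rnorm(60, mean = 2.5, sd = 1)  # hypothetical "normal" group
b <- runif(60, min = 0, max = 5)    # hypothetical "uniform" group
classes <- cut(c(a, b), breaks = 5)       # 5 size classes over the pooled range
grp <- rep(c("a", "b"), each = 60)
tab <- table(grp, classes)                # the 2 x 5 contingency table
chisq.test(tab)
```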

[There are many procedures; logit, probit, tobit, heckit. Now we expect this user, "hagit", to come up with his own :) ]
 

rogojel

TS Contributor
#8
Yes, I have some doubts as well - e.g. how to pick the blocks; should one do a sensitivity analysis on block sizes, etc.? Also, in principle, the number in each cell is a rough approximation of the distribution in that range - so we would compare one approximation (with possibly a small data size) to another approximate number, and repeat the procedure for each cell. This sounds rather risky to me, unless we have large amounts of data across the whole range.


regards
 
#9
from what i read, there is a test called "chi-square goodness of fit" that can be used for that.

and as for why i insist on performing chi-square - well, i preferred it as i had read a bit about it, but i won't insist on it of course. i just wanted to understand why i shouldn't use it, or why other tests are better for this situation.

however, it seems that i can't perform any chi-square test anyway, as only a small number of participants chose the first option out of 4 in one of the groups.
 
#11
I am wondering what the power of the chi-square approach would be compared to that of the Mann-Whitney test on the same data...
I don't know exactly what the Wilcoxon-Mann-Whitney (WMW) test is really testing, but I believe it is mainly a test of whether the location of the distribution is the same - a test where the null hypothesis is P(Y1 < Y2) = 0.5.

But the WMW test is said to be sensitive to differences in variance and skewness. Look at Fagerland and Sandvik and their publications.

Of course two distributions can differ in many ways. They can differ in location, and the usual tests (WMW and t-test etc.) check for that. But they can differ in shape and spread. And even if the mean/median is roughly the same and the skewness is the same, they can differ in how heavy the tails are, like in my example with the uniform distribution versus a normal distribution.

I believe that the chi-squared test is more powerful than the WMW test in this example.

Please improve this crude code:


Code:
# what power  chi-test vs  Mann-whitney?

set.seed(314)

y1 <- runif(n=100, min=0, max=5)
y2 <- rnorm(n=100, mean=2.5, sd=1)

mean(y1)
mean(y2)
hist(y1)
hist(y2)



# five classes via cut(): integer codes 1..5 for (0,1], (1,2], ..., (4,5]
y1_cat <- cut(y1, breaks = c(0, 1, 2, 3, 4, 5),
              include.lowest = TRUE, labels = FALSE)

y1_cat

table(y1_cat)

# wide outer breaks catch any normal values below 0 or above 5
y2_cat <- cut(y2, breaks = c(-100, 1, 2, 3, 4, 5000),
              include.lowest = TRUE, labels = FALSE)

y2_cat
table(y2_cat)


grp <-  c(rep(1, times=100), rep(2, times=100 ))

table(grp)

y1y2_cat <- c(y1_cat, y2_cat)
table(y1y2_cat)

table(y1y2_cat, grp)


chisq.test(grp, y1y2_cat)

#Pearson's Chi-squared test
#
#data:  grp and y1y2_cat
#X-squared = 14.684, df = 4, p-value = 0.005404

wilcox.test(y1, y2) 
#Wilcoxon rank sum test with continuity correction
#
#data:  y1 and y2
#W = 4141, p-value = 0.03594
#alternative hypothesis: true location shift is not equal to 0



#########################

As you can see, the chi-square test gives a clearly smaller p-value (0.005) than the WMW test (0.036) on the same data.
 

gianmarco

TS Contributor
#12
Of course two distributions can differ in many ways. They can differ in location and the usual tests (WMW and t-test etc) checks for that. But they can differ in shape and spread.
MW is not actually testing for a difference in central tendency; at least, this is not the way it was formulated by its originators. As you correctly point out in a section of your reply, MW tests whether the values of one group tend to score higher than the values of the other group. This is a broad formulation of the test's aim. If one is to use it as a test for a difference in central tendency (i.e., the median), other things must be controlled for (e.g., the shape of the distributions), as the article you were referring to seems to show (I have just read the abstract; thanks for pointing it out).

I performed MW on the data from your code (thanks for providing that); I used the function that I have put together to visually display MW test results (http://cainarchaeology.weebly.com/r-function-for-visually-displaying-mann-whitney-test.html). The result is attached here.

Indeed, there is a significant difference (p < 0.05). The size of the difference can be labelled as MEDIUM (as the r measure of effect size indicates). The probability that one (randomly picked) observation from group y2 is larger than one (randomly picked) observation from group y1 is about 0.59. By the way, the notches of the boxplots visually indicate that there is a significant difference between the two groups as far as the median is concerned.
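That 0.59 can in fact be read off directly from the W statistic reported earlier (W = 4141, n1 = n2 = 100), since W counts the pairs in which an observation from y1 exceeds one from y2:

```r
# Common-language effect size from the Mann-Whitney W statistic.
W <- 4141; n1 <- 100; n2 <- 100
p_y1_gt_y2 <- W / (n1 * n2)    # estimated P(y1 > y2), about 0.41
p_y2_gt_y1 <- 1 - p_y1_gt_y2   # estimated P(y2 > y1), about 0.59, as reported
```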

Side note: in this case, the difference in medians is significant, as indicated by the non-overlapping notches. But cases exist where the medians do not differ, and yet the MW test indicates a significant difference. This, as said, can be due to a difference in shape. That is why one must be clear about what he/she actually wants to test by means of MW. I am happier with the broader formulation of the test's aim, to which I was referring at the very beginning of this reply.


The dataframe I fed into my function is the following (again, data from your example):
Code:
mydata <- structure(list(value = c(0.49415975692682, 1.35738908196799, 
3.83264105534181, 1.12321213586256, 1.01221770513803, 1.51629222207703, 
1.20408563525416, 1.85561218066141, 2.75313992518932, 3.73526555951685, 
1.84417331242003, 0.578607870265841, 1.21186430915259, 1.0697199520655, 
2.52527725999244, 1.95613155607134, 3.61032884567976, 2.59941169177182, 
0.19261657493189, 1.68614125344902, 3.10916279326193, 2.23391792387702, 
2.75043166591786, 1.23272773111239, 0.0517203623894602, 2.97924610320479, 
1.65540873538703, 0.173643813468516, 1.00408993894234, 2.91829610941932, 
1.22469975962304, 3.00195821677335, 0.0896409782581031, 2.33311263262294, 
2.1964464627672, 0.545266319531947, 4.797239912441, 4.58531251177192, 
2.85365675459616, 2.91124314651825, 1.90528847509995, 3.89290294144303, 
2.18658119556494, 4.3173506797757, 3.37909942143597, 4.13641578983516, 
3.03081218153238, 1.38210228993557, 0.298064620001242, 1.20231541222893, 
3.81099749472924, 3.18866602610797, 2.85814372240566, 3.75863300636411, 
1.32571235881187, 2.0117944595404, 2.95205376925878, 4.4393412291538, 
1.92686007241718, 0.982481347164139, 2.01477478491142, 3.62341637839563, 
0.990822073072195, 1.84538009925745, 0.456970855593681, 1.8560487322975, 
2.52459095790982, 0.805563454050571, 0.51174248685129, 1.91950084059499, 
0.0954484159592539, 3.13854704843834, 4.85609867726453, 2.72998588276096, 
4.06391276046634, 3.04217693279497, 3.11609872849658, 4.6662034781184, 
0.0546058127656579, 0.145432930439711, 1.93257200997323, 2.22994481329806, 
4.56581284059212, 1.35950993048027, 2.44224405963905, 1.29508371814154, 
4.77594632655382, 3.5582553956192, 1.30110892117955, 0.00767110614106059, 
3.17102727363817, 1.82283379836008, 2.68468673457392, 2.64864213881083, 
4.42606373922899, 4.54540394828655, 3.47331723547541, 1.72135463566519, 
2.17623205040582, 1.01567885023542, 1.99941568383293, 2.98451584509997, 
3.17101849957833, 3.1591215045531, 2.76128507891288, 1.05868129730513, 
2.15240102594209, 3.76951834270496, 3.05710195252457, 2.05365252234727, 
2.479941774849, 3.68466070508499, 2.51981825833059, 2.96560182103384, 
1.79279153698334, 2.66531799823898, 3.366970469833, 3.23384671541812, 
4.21241685953909, 2.69532333183055, 1.67812387445298, 3.85399933038685, 
0.15733317423097, 1.3542164216967, 1.18759335858714, 4.01393147154314, 
3.78876518356748, 2.60096231658387, 2.59862828000746, 2.31102366475749, 
3.357290945633, 0.665226778083162, 1.65566669034893, 0.960988914911963, 
2.42395342736647, 1.79129363755436, 3.06766896844903, 2.79869476424354, 
2.53975462812585, 3.82426990510161, 1.39952226973854, 2.73816053527092, 
2.90109357786274, 4.53340997732308, 4.21368537436252, 2.41655986873568, 
3.61227535495299, 2.70145273302254, 2.41336488396397, 3.76609184332577, 
2.35630731190855, 3.95200702224909, 1.80149037221888, 1.40149171247123, 
2.66642063503226, 2.32807191565919, 1.75028726978513, 0.38488172627738, 
2.32595101542126, 3.06375470352961, 2.77619747873566, 0.895148272730549, 
2.49833923636023, 2.54088905526205, 2.96233599470822, 1.93858052605669, 
0.930314415355915, 1.90547951027016, 3.29253372059681, 2.66936974878119, 
2.30837437297596, 2.22908091064181, 2.84727231426355, 3.48255081859603, 
1.76700892224096, 3.82801590697395, 1.35166294701206, 3.16491403239094, 
2.7637551126881, 4.37766285232054, 3.22441180085583, 2.95511632568627, 
2.26521606047478, 2.8030322550147, 1.58097984662203, 3.60421942721949, 
3.81361000997226, 2.35284994654843, 2.95526349083674, 3.36312919851444, 
2.55412869787749, 3.1506075093913, 0.624599552742331, 4.77283755081317, 
1.44581016599028, 2.86482134076611, 1.85255294311563, 2.46133152121247, 
3.58785575445079, 1.55662161433941), group = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("y1", "y2"), class = "factor")), .Names = c("value", 
"group"), row.names = c(NA, 200L), class = "data.frame")
 

rogojel

TS Contributor
#13
from what i read, there is a test called "chi-square goodness of fit" that can be used for that.

and as for why i insist on performing chi-square - well, i preferred it as i had read a bit about it, but i won't insist on it of course. i just wanted to understand why i shouldn't use it, or why other tests are better for this situation.

however, it seems that i can't perform any chi-square test anyway, as only a small number of participants chose the first option out of 4 in one of the groups.
Hi,
the goodness-of-fit test is for the case where you want to decide whether a data set comes from a known distribution, such as a normal. This means you already know the theoretically expected percentages for that distribution, and you compare the percentages of the data set to them.

If you want to test whether two data sets come from the same distribution, you double the uncertainty, because you do not know the expected percentages for either of the distributions; you just check whether the two percentages in each cell are roughly similar or not. This does not seem to be a good use of the data imo, or, as gianmarco put it, the power of the test will probably not be too good.
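The contrast can be sketched in R, with hypothetical counts:

```r
# One-sample goodness of fit: the expected proportions are known in advance.
obs1 <- c(12, 18, 20, 10)
chisq.test(obs1, p = c(0.25, 0.25, 0.25, 0.25))  # test against a known distribution

# Two-sample case: no known proportions; the expectations are estimated
# from the pooled table itself, which costs information.
obs2 <- c(15, 14, 22, 9)
chisq.test(rbind(obs1, obs2))
```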

regards
 

rogojel

TS Contributor
#14
hi,
I put together a small script to test both tests with two beta distributions where I can change the shape of one and see when the tests start to see a difference.
Code:
comp = function(beta = 1, beta1) {
  par(mfrow = c(2, 1))
  bins = seq(0, 1, 0.1)
  x = rbeta(200, beta, 1)
  y = rbeta(200, beta1, 1)
  r = hist(x, breaks = bins)
  r1 = hist(y, breaks = bins)
  # compare the two sets of counts as a 2 x 10 table;
  # chisq.test(r$counts, r1$counts) would wrongly treat the counts as factor levels
  res = chisq.test(rbind(r$counts, r1$counts))
  print(res)
  res = wilcox.test(x, y)
  print(res)
}
try
comp(beta1=1.5)


The Mann-Whitney seems to be a lot more powerful - maybe I am not doing the chi-squared test right?
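One way to check is to turn the single comparison into a small power simulation (a sketch, assuming the chi-square is applied to the 2 x 10 table of bin counts and a 0.05 level):

```r
# Estimate the power of each test by repeated simulation.
power_sim <- function(beta1, n = 200, reps = 500, alpha = 0.05) {
  bins <- seq(0, 1, 0.1)
  hits <- replicate(reps, {
    x <- rbeta(n, 1, 1)          # beta(1,1) is the uniform reference sample
    y <- rbeta(n, beta1, 1)      # the sample whose shape we vary
    cx <- table(cut(x, breaks = bins))
    cy <- table(cut(y, breaks = bins))
    p_chi <- suppressWarnings(chisq.test(rbind(cx, cy))$p.value)
    p_wmw <- wilcox.test(x, y)$p.value
    c(chi = p_chi < alpha, wmw = p_wmw < alpha)
  })
  rowMeans(hits)  # estimated rejection rate (power) of each test
}

set.seed(314)
power_sim(beta1 = 1.5)
```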

regards