# Data analysis - which test is best

#### Justice!

##### Guest
I would like to know which test would be appropriate for testing my hypothesis.

I am doing a research project on parasitic infections in snail hosts. Snails are infected when coming into contact with bird faeces.

My study takes samples of snails from 5 sites and dissects them to check for prevalence of infection.
Site 5 is 130 metres from a known bird roosting site.
Site 4 is 300 metres from the roosting site.
Site 3 is 524 metres from the roosting site.
Site 2 is 664 metres from the roosting site.
Site 1 is 786 metres from the roosting site.

At each site I collected 50 snails.

Site 1 had 6 infected snails (out of 50)
Site 2 also had 6 infected snails (out of 50)
Site 3 had 7 infected snails

Hypothesis - More parasite infections would occur at site(s) closest to the known bird roosting site

Any thoughts on which test would be the best to check my hypothesis??

#### victorxstc

##### Pirate
There are several possible approaches; one is to correlate the distance (in metres) with the infection status (which is binary). You should use a Spearman correlation coefficient for that purpose.

#### Justice!

##### Guest
I have used Minitab to do a Spearman correlation coefficient. Would I be right in stating that the Pearson's r value is the 'p' value??

#### noetsi

##### Fortran must die
Pearson's r value is a correlation coefficient like Spearman's (but making different assumptions and calculated a different way). It is not the "p" value which is an assessment of how likely that the results you got were entirely due to random error. You will have a p value with both Spearman's and Pearson.

#### victorxstc

##### Pirate
As noetsi said, no. A correlation analysis (Pearson or Spearman) gives you two values: a correlation coefficient (Pearson's R or Spearman's Rho) and the P value.

The correlation coefficient shows the strength and direction of the correlation. For example, you might find that R = -0.34, P = 0.008. In this example, the correlation between distance and infection has a magnitude of 0.34 on a scale from 0 to 1. Note that the sign is negative. Therefore, there is a significant negative correlation of -0.34, meaning that the shorter the distance, the higher the chance of infection.

However, please note that you should use Spearman's coefficient instead of Pearson's. I don't know whether you have SPSS, but if you do, you can follow the steps below to run the Spearman test. If you have already done Pearson's test in Minitab, I think you won't have difficulty doing Spearman's in Minitab. Before that, however, make sure you are working with 250 rows in your spreadsheet file (one row per specimen), not with 5 rows (one row per site).

In your SPSS file, you have 250 cases, right? (5 sites, each with 50 cases, so a total of 250 cases.) In your raw data file (with at least 250 rows), write the distance value for each site in a new column, beside each of your 250 cases. So, for example, you need to write the number 524 (the third distance) 50 times, beside the corresponding rows. Then make sure the column holding "infection status" contains only 0 and 1. If not, create a new column that records the infection status of each of the 250 cases as 0 or 1. Now you have two columns, each with 250 cases, and each row describes a single snail: its infection status (0 or 1) and its distance. Now go to Analyze -> Correlate -> Bivariate, select the Spearman test, and select those two columns. The test is now ready to be run.
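For readers without SPSS, the one-row-per-snail layout described above can be sketched in Python. This is only an illustration under stated assumptions: SciPy's `spearmanr` stands in for the SPSS menu steps, the variable names are made up here, and only sites 1 to 3 are included because those are the only infection counts given in the thread (so 150 rows instead of 250).

```python
from scipy.stats import spearmanr

# (distance in metres, infected count) for the three sites whose counts
# appear in the thread; sites 4 and 5 are omitted because their counts
# were not posted.
sites = [(786, 6), (664, 6), (524, 7)]
snails_per_site = 50

distance = []  # one entry per snail
infected = []  # 1 = infected, 0 = not infected

for dist, n_infected in sites:
    distance += [dist] * snails_per_site
    infected += [1] * n_infected + [0] * (snails_per_site - n_infected)

# Spearman's rho and its p value on the raw, snail-level data
rho, p_value = spearmanr(distance, infected)
print(f"rows={len(distance)}  rho={rho:.3f}  p={p_value:.3f}")
```

The point is the shape of the data: a distance column and a 0/1 infection column, each with one entry per snail, not one entry per site.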

#### Justice!

##### Guest
I put my distance figures into column C1, then the number of infected snails into column C2, ran the test, and got these results (just checking I am on the right track so far!):

All 2 1 1 1 5
40 20 20 20 100

Cell Contents: Count
% of Total

Pearson's r -0.970143
Spearman's rho -0.974679

#### Justice!

##### Guest
Apologies, Victorxstc, I posted before seeing what you had written. I do not have SPSS (I have tried to download a trial version but keep getting an error message when trying to download). I will persevere with trying to get it.

#### noetsi

##### Fortran must die
If you are doing this work near a university, they commonly have SPSS on their computers these days.

The two values you noted (for Pearson's R and Spearman's Rho) are very close, effectively the same thing.

If either of your variables is coded as a dichotomy (that is for example infected/ not infected) then neither Pearson nor Spearman's will work correctly. You need to do polychoric correlations although I doubt Minitab will do this (even SPSS and SAS won't in the core code, they need special Macros or R code in the case of SPSS).

#### victorxstc

##### Pirate
No problem Justice

They are similar, and Minitab is efficient. However, before any analysis, please make sure you are working with your raw data, not a summary of your raw data. In your Minitab file, you should have at least 250 rows. If your file is like that, your correlation coefficients are very good, as the closer the absolute value of the coefficient is to 1, the stronger the correlation.
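To see why the raw-versus-summary distinction matters, here is a rough Python sketch comparing the two layouts. It is an illustration only: it uses SciPy's `pearsonr` in place of Minitab, the names are hypothetical, and it covers only sites 1 to 3, the sites whose counts were posted.

```python
from scipy.stats import pearsonr

# Site-level summary: one row per site (distance in metres, infected count)
site_distance = [786, 664, 524]
site_infected = [6, 6, 7]
snails_per_site = 50

# Correlation on the summary table (3 rows)
r_summary, _ = pearsonr(site_distance, site_infected)

# Expand to one row per snail: distance repeated, infection coded 0/1
raw_distance = []
raw_infected = []
for dist, n_inf in zip(site_distance, site_infected):
    raw_distance += [dist] * snails_per_site
    raw_infected += [1] * n_inf + [0] * (snails_per_site - n_inf)

# Correlation on the raw data (150 rows)
r_raw, p_raw = pearsonr(raw_distance, raw_infected)

print(f"summary rows: r = {r_summary:.3f}")
print(f"raw rows:     r = {r_raw:.3f} (p = {p_raw:.3f})")
```

A handful of site-level rows can give a coefficient close to -1 even when the snail-level association is weak, so a very large coefficient is a hint that the summary table was analysed.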

#### Justice!

##### Guest
My University does indeed have SPSS, I am off on Wednesday and intend to go in to use their computer. Just thought I would try and download now or try a different software (Minitab) so I could get crackin' instead of waiting until Wednesday
Thank you for the tip

#### GretaGarbo

##### Human
Site 5 is 130 metres from a known bird roosting site.
Site 4 is 300 metres from known roosting site
Site 3 is 524 metres from roosting site
Site 2 is 664 metres " " "
Site 1 is 786 metres " " "

Thank you Justice! Or what should I call you?

Now we know if distance is significant – or not.

Maybe you could cooperate with Palmer86, because he has got identical data as you!

Oh, maybe he has plagiarized your result? Or maybe you should be careful with him since I was told that he had not been the most polite person. Or maybe you could cooperate with Mmanuel, a person I tried to help a lot. You two – I mean, you three – seem to have a lot in common.

Justice, if you find a topic difficult, then you see, there is a search engine called Google, that can be very useful. For example I googled “logit model” and saw 690 000 links. You should not expect someone else to write a thesis for you when there already are 690 000 others for you to read before.

Hlsmith suggested Fisher's exact test. Karabiner pointed out that a chi-squared test could be used. Victorxstc literally did the test for you.

When someone is serving the results on a silver plate for you, do you find it embarrassing to say “thank you” then?

If you find it humiliating (“squat”) to say thank you, then I suggest that you don't do that!

I will withdraw from this subject. I have tried to help you in many posts. But please don't thank me!

#### Karabiner

##### TS Contributor
Each snail has 2 characteristics: a) infected yes/no and b) its distance from the
roosting site. You could try a Mann-Whitney U-test with infected yes/no as
grouping variable and distance as dependent variable. This will show you
whether in the infected group the distances are significantly higher or lower than
in the non-infected group.

With kind regards

K.
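A minimal Python sketch of the Mann-Whitney suggestion above, for anyone who wants to try it outside Minitab or SPSS. Assumptions: SciPy's `mannwhitneyu` is used, the variable names are invented for this example, and only sites 1 to 3 are included because the thread gives no infection counts for sites 4 and 5.

```python
from scipy.stats import mannwhitneyu

# (distance in metres, infected count) for the sites with posted counts
sites = [(786, 6), (664, 6), (524, 7)]
snails_per_site = 50

# Split the snail-level distances by infection status (the grouping variable)
dist_infected = []
dist_not_infected = []
for dist, n_inf in sites:
    dist_infected += [dist] * n_inf
    dist_not_infected += [dist] * (snails_per_site - n_inf)

# Two-sided test: are distances higher or lower in the infected group?
u_stat, p_value = mannwhitneyu(dist_infected, dist_not_infected,
                               alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```

Because the hypothesis is directional (more infection closer to the roost), `alternative="less"` would be the one-sided version; the two-sided form matches the "higher or lower" wording of the post.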

#### Karabiner

##### TS Contributor
noetsi said: the "p" value which is an assessment of how likely that the results you got were entirely due to random error

Beg your pardon, but wouldn't that mean p(Hypothesis|Data), i.e. Bayesian statistics?
With the frequentist approach, we obtain p(Data|Hypothesis).

With kind regards

K.

#### victorxstc

##### Pirate
I agree with Karabiner's Mann-Whitney suggestion, but doesn't a correlation coefficient suffice? Besides, I guess that before Mann-Whitney, Justice should run a Kruskal-Wallis test to see whether there is any overall difference between the 5 sites' infection rates. Admittedly, a Kruskal-Wallis test does not directly show the direction and extent of the "correlation" (further evaluation would be necessary), at least not as clearly as correlation coefficients show the extent and direction of the association.

Besides, with the Kruskal-Wallis and Mann-Whitney tests, the length of the distance is discarded, because distance would be used only as a grouping variable, whereas in correlation coefficients the distances (in metres) keep their meaning, which favors the accuracy of the results.

Kind regards

#### Karabiner

##### TS Contributor
victorxstc said: I agree on that, but doesn't a correlation coefficient suffice.

Perhaps. But I feel uneasy with Spearman on binary-versus-rank data. Maybe some forgotten childhood experience.

victorxstc said: Besides, I guess before Mann-Whitney, Justice should do a Kruskal-Wallis to see if there is any overall difference between the 5 sites' infection rates or not.

That is, treat infection yes/no as ordinal? I had rather assumed that this was categorical, in which case the Chi² test could apply (expected frequencies are all > 5, AFAICS).

victorxstc said: Besides, when doing Kruskal-Wallis and Mann-Whitney tests, the length of the distance is discarded, because it would be used Only as a grouping variable;

I would treat it as an ordinal DV, not as a grouping variable. I guessed that since there are 5 fixed distances and none in between, ordinal would be appropriate.

With kind regards

K.
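The Chi² option mentioned above can be sketched in Python as well. This is a hedged illustration, not anyone's actual analysis: it assumes SciPy's `chi2_contingency`, uses made-up variable names, and builds the table from sites 1 to 3 only, since those are the only counts posted.

```python
import numpy as np
from scipy.stats import chi2_contingency

snails_per_site = 50
infected_counts = [6, 6, 7]  # sites 1, 2, 3 (counts from the thread)

# Rows = sites, columns = [infected, not infected]
table = np.array([[k, snails_per_site - k] for k in infected_counts])

chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
print("smallest expected frequency:", expected.min())
```

The `expected` array makes it easy to check the "expected frequencies > 5" rule of thumb. Note that this test treats site as a purely categorical variable, so it ignores the ordering of the distances.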

#### Justice!

##### Guest
Fantastic thanks guys! You've made me a very happy girl

#### noetsi

##### Fortran must die
Karabiner said: Beg your pardon, but wouldn't that mean p(Hypothesis|Data), i.e. Bayes statistics? With the frequentist approach, we achieve p(Data|Hypothesis) .

I was giving a very general comment on what the p value tells you as I understand it. I am not enough of an expert in the theory of statistics to understand the distinctions you are raising. The way I interpret the p value of a test is that you either reject the null or you don't, and if you do not, it means that you cannot be sure that the effect size you found was not due to random error in your sample. That is, it would not exist in the population.

#### CowboyBear

##### Super Moderator
noetsi said: the "p" value which is an assessment of how likely that the results you got were entirely due to random error.

Hi noetsi,

I think Karabiner makes a good point about your comment here. The p value tells you the probability of observing a test statistic as extreme as, or more extreme than, the one you obtained, if the null hypothesis is true.

Note that the bit on the end of the definition means that the p value is conditional - it's a probability of observing something if the null hypothesis is true. We can fiddle with the wording of the definition above and still get across its essential meaning, but any definition of a p value that leaves off the conditional bit is always going to be wrong.

Some sources do describe the p value as the probability that the results were due to "random error" or "chance", but this interpretation leaves off the conditional part of the p value definition. If we changed it to "the probability of observing the results we have due to chance, if the null hypothesis was true", then it'd be better.

Without that conditional bit, the definition you've used could be interpreted as saying that the p value is the probability that the results were due to the null hypothesis being true; i.e. the probability that the null hypothesis is correct. But a p value absolutely can't tell you that information (unfortunately!)
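One way to make the "if the null hypothesis is true" condition concrete is a permutation test, sketched below in Python. This is an illustration under assumptions, not something used in the thread: the test statistic (difference in mean distance between infected and uninfected snails), the seed, the number of shuffles, and the variable names are all choices made for this example, and only sites 1 to 3 are included because only their counts were posted.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Snail-level data for sites 1-3: distance in metres and 0/1 infection status
distance = np.repeat([786, 664, 524], 50)
infected = np.concatenate([np.repeat([1, 0], [k, 50 - k]) for k in (6, 6, 7)])

def mean_distance_gap(labels):
    """Mean distance of infected snails minus mean distance of the rest."""
    return distance[labels == 1].mean() - distance[labels == 0].mean()

observed = mean_distance_gap(infected)

# Build the null world: if infection is unrelated to distance, the 0/1 labels
# are exchangeable, so shuffling them shows what chance alone produces.
n_shuffles = 2000
null_stats = np.array([mean_distance_gap(rng.permutation(infected))
                       for _ in range(n_shuffles)])

# p value: share of null-world statistics at least as extreme as the observed one
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"observed gap = {observed:.2f} m, permutation p = {p_value:.3f}")
```

Every shuffled data set is generated with the null hypothesis true by construction, so the resulting proportion is a probability about the data given the null, not a probability that the null itself is true.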

#### noetsi

##### Fortran must die
While I don't doubt that is correct, CowboyBear, I have no idea at all what it means in practice. That is, I don't know in practical terms what it means to say it is conditional on "if the null hypothesis is true", particularly since I was taught that the null hypothesis could never be determined to be true by statistics. You could (by rejecting the null) show the alternate hypothesis was true, but you could never show the null was true. You either rejected it or failed to.

Actually this is one of those areas that has always puzzled me. If the probability is only when the null hypothesis is true, why does the p value have any value at all when the null is rejected? Which is exactly what occurs when p is below a certain value. It seems we are using p to reject the null, then saying the p value only has meaning when the null is true....

I interpret the p value primarily as the probability that you can reject the null, although I also think of it as the chance that the results could be tied to random error. The latter might not be right.
