# Correlation

#### Tobee

##### New Member
I have a list of 100 observations. One variable describes the sex of an individual (male or female); the other variable describes the number of times the individual blinked during the observation. 30 of the 100 observations are of men and 70 are of women. A sample sheet is attached. How do I analyze/interpret/calculate the correlation between sex and number of blinks?

Thanks!

#### gianmarco

##### TS Contributor
Hello,
I had a couple of spare minutes and I played around with your dataset.
I believe that what you could test is whether the values for one gender tend to be higher than the values for the other. To test this broad hypothesis, I used the Mann-Whitney test. I have put together some R code that is described HERE, and whose output is attached as a .jpeg.

As you can see, two notched boxplots display the distribution of the number of blinks for males and females, while the dots represent the individual observations. At the bottom of the chart, you can see some useful information. The MW test is not significant, meaning that the data do not show a significant difference in the number of blinks between the two genders (i.e., there is no evidence that the values of one gender tend to be higher than those of the other).
This is also supported by the overlapping notches of the boxplots.
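For readers who do not use R, the test described above can be sketched in plain Python. This is only an illustration, not the R code referred to in this post: the function names `average_ranks` and `mann_whitney_u` and the blink counts are made up, and the p-value uses the tie-corrected normal approximation without a continuity correction, so it may differ slightly from what a statistics package reports.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value from a normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Tie correction: sum of (t^3 - t) over every group of t tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence of any difference
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided

# Hypothetical blink counts, invented purely for illustration.
men = [12, 15, 9, 20, 14, 11]
women = [13, 10, 16, 18, 12, 15, 9, 14]
u, p = mann_whitney_u(men, women)
print(f"U = {u}, two-sided p = {p:.3f}")
```

With small samples or many ties, an exact or package-computed p-value is preferable to this approximation.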

Hope this helps,
gm

#### Tobee

##### New Member
Very helpful and I'm sure sufficient to get a sense for the connection/correlation in my data.

I read through the article you referenced, and the process for running the analysis is beyond my limited capabilities.

I have attached my 2 actual data sets (each on a separate tab within the worksheet) here; I would be incredibly grateful if you'd be so kind as to run the analysis and post the results for me. Any additional help is greatly appreciated!

Thanks!

#### gianmarco

##### TS Contributor
Could you give me some details on the data, please?

#### Tobee

##### New Member
Sure. I canvassed random companies in the marketplace and logged their responses to a simple question. In one scenario, the answer was either "Brook only" or "Not Brook only". In the other scenario, the answer was either "Brook mentioned" or "Not Brook mentioned". The "# of orders" data indicates the # of orders I received from each of these companies in the past 6 months.

In Set 1, I'm looking to test the hypothesis that "Brook only" is positively correlated with # of orders.

In Set 2, I'm looking to test the hypothesis that "Brook mentioned" is positively correlated with # of orders.

Hope that makes sense. Thanks again.

#### gianmarco

##### TS Contributor
Hello,
I hope this helps. I do not quite follow the details you provided, sorry; I am not a native English speaker...
There is no significant difference in # of orders between the 2 categories in either set of data. See attached images.
I see that many of your observations are 0. Maybe someone else here in the forum can jump in and provide further feedback.

Gm

#### Tobee

##### New Member
Gm, thank you so much for running that analysis. Even after researching, I'm still not clear on how to interpret the test outputs (U value, etc.), but your note that "there is no significant difference in # of orders between the 2 categories in each set" makes sense.

Regarding the details of my data, it's really simple. Using "Set 1" as an example, I called a bunch of companies and asked them who they prefer to refer their employees to for medical care and then I recorded their answers. If they said they prefer to refer their employees only to the company named Brook, then I categorized them as a "Brook only" company. If they said anything else, then I categorized them as "not Brook only". I then looked up the last 6 months of Brook's sales data to employees working at each company to see if they actually purchased product from Brook. I'm trying to determine if it matters to Brook's sales whether or not these companies say they prefer to refer their employees only to Brook, or if they say anything other than that. Brook works really hard (and spends a lot of money) to get these companies to tell their employees to go to Brook for medical care and I'm trying to determine how much those efforts matter, if at all. If Brook can expect to produce the same amount of revenue to employees working at "not Brook only" companies as it can to "Brook only" companies, then there is an opportunity to make big changes and re-allocate significant \$'s.

I hope that makes better sense. If that explanation gives you a better idea of my situation and what I'm trying to accomplish, then do you still think the Mann-Whitney Test is most appropriate?

Thanks again! I'm really grateful for the assistance.

#### gianmarco

##### TS Contributor
Hello,
You should get familiar with the Mann-Whitney test (link HERE) as well as with hypothesis testing in general if you want to make sense of any statistical test.
In a nutshell, any statistical hypothesis test provides you with a test statistic (in the MW case, the U statistic) and an associated probability value (the p value). Usually, a test's result is considered statistically significant if the p value is equal to or smaller than a given threshold, which is 'traditionally' set at 0.05 (but note that this is context dependent, and there is a lot of discussion about it in the statistics community... too deep water for me...).
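To make the U statistic and the threshold rule a little more concrete, here is a minimal Python sketch. It relies on a standard fact about the Mann-Whitney test: U for a group equals the number of pairs (one value from each group) in which that group's value is the larger one, with ties counted as half. Dividing U by the total number of pairs gives the share of pairs "won" by that group, where 0.5 means neither group tends to be higher. The function names and the order counts are invented for illustration only.

```python
def u_by_pair_counting(x, y):
    """Count pairs (a from x, b from y) with a > b; each tie counts as 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

def probability_of_superiority(x, y):
    """Share of all (x, y) pairs in which the x value is larger (ties = half)."""
    return u_by_pair_counting(x, y) / (len(x) * len(y))

def is_significant(p_value, alpha=0.05):
    """Conventional decision rule: significant when p is at or below alpha."""
    return p_value <= alpha

# Hypothetical order counts, invented purely for illustration.
group_a = [0, 0, 3, 5, 1]
group_b = [0, 2, 0, 4, 0, 1]
print("U for group_a:", u_by_pair_counting(group_a, group_b))
print("Share of pairs won by group_a:", probability_of_superiority(group_a, group_b))
print("Is p = 0.20 significant at 0.05?", is_significant(0.20))
```

The p value itself still has to come from the test; this sketch only shows what U measures and how the threshold is applied once a p value is available.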

I cannot go deeper into the issue right now (I am about to leave for my holidays), but assuming that the MW test can be used with skewed data like yours, as I told you in my preceding posts, there does not seem to be a significant difference in the distribution of values between the two groups in each of your sets. As you can see from the attached density plots, the distributions look very similar.

Hope this helps a bit, and stimulates you to investigate some more.

Cheers
gm