# Understanding correlation and regression

#### Maggy

##### New Member
Hello all,

I need your help understanding a question that I got as homework. I really can't work out what I am supposed to do.

Question: The table gives different properties of two types of objects. Find the properties in which the objects are most similar and those in which they are most different.

| pa | pb | pc | object |
|---|---|---|---|
| 1.1 | 1.2 | 1.2 | A |
| 1.2 | 1.0 | 1.1 | B |
| 1.3 | 1.4 | 1.1 | A |
| 1.2 | 1.1 | 1.1 | B |

Please guide me on what I should do here. Find a correlation? Run a regression?

#### hlsmith

##### Less is more. Stay pure. Stay poor.
We don't know what you have covered in class. I would look at the pairwise differences between properties.

#### Omerikooo

##### New Member
You should run a t-test for every variable: the A-B difference in pa, pb and pc. That makes 3 different tests, each comparing the two groups (A and B).
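
A minimal sketch of the three comparisons described above, using only the four rows posted in the question. It computes Welch's t statistic by hand with the standard library; the choice of the Welch variant and the helper name `welch_t` are assumptions for illustration, not something the homework specifies. Turning the statistic into a p-value needs the t distribution, which a library routine such as SciPy's `scipy.stats.ttest_ind` (with `equal_var=False`) would provide.

```python
from math import sqrt
from statistics import mean, variance

# The four rows from the question: (pa, pb, pc, object).
rows = [
    (1.1, 1.2, 1.2, "A"),
    (1.2, 1.0, 1.1, "B"),
    (1.3, 1.4, 1.1, "A"),
    (1.2, 1.1, 1.1, "B"),
]
properties = ["pa", "pb", "pc"]


def welch_t(x, y):
    """Welch's t statistic for the difference in means of two samples.

    Fails with a division by zero if both samples have zero variance.
    """
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


t_stats = {}
for i, name in enumerate(properties):
    a = [r[i] for r in rows if r[3] == "A"]
    b = [r[i] for r in rows if r[3] == "B"]
    t_stats[name] = welch_t(a, b)
    print(f"{name}: mean A = {mean(a):.2f}, mean B = {mean(b):.2f}, t = {t_stats[name]:.2f}")
```

With two observations per group the statistics are only illustrative; the same loop applies unchanged to a larger table.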

#### Maggy

##### New Member
So we do not need any correlation or regression to find out whether they are related to each other and in which properties they are similar?
For example, how do I find out whether objects A and B are similar in property pa?

#### Omerikooo

##### New Member
Regression would show the predictive value of A and B on pa, pb and pc, but not necessarily the difference in pa, pb and pc with regard to the subject.

The question can be rewritten like this: which variables could differentiate A from B? A t-test would answer such a question.

How big is your data? Is the 4x4 table you presented the whole data set?

#### Maggy

##### New Member
No, there is much more data, around 500 rows. Don't you think I can use logistic regression?

#### Omerikooo

##### New Member
Logistic regression in your case would model the relationship between pa, pb, pc and the subject type (being an A or a B subject). For example, it could show that an increase in pa raises the odds of being object A rather than B.

Be careful! In this case A and B should be mutually exclusive, such as A being man and B being woman.

You can run 3 separate logistic regressions: (1) pa vs. A/B, (2) pb vs. A/B, (3) pc vs. A/B. This can show which variables (pa, pb, pc) have a significant association with A/B. Then you can compare which has the most significant relationship (the highest estimate or, in other words, the highest odds ratio).

#### hlsmith

##### Less is more. Stay pure. Stay poor.
This is just a poorly written question, if you wrote it exactly as it was presented to you. It also would have helped if you had stated at the start that you have 500 rows.

Rereading the question, to me it asks when A and B are most similar or dissimilar, but it gives no criteria for this. So if they have similar means but differing variances, how do we take that into account? And once again, we don't know what you have covered in class. A pseudo-permutation test would be interesting here. A Bland-Altman plot would be interesting as well.
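
As a rough illustration of the permutation idea, here is a sketch of a plain exact permutation test on the difference in means, applied to the pb column of the four-row table. This is one common form of permutation test and may not be exactly the "pseudo-permutation" variant meant above; the function name `exact_permutation_p` is made up for the example.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(x, y):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = x + y
    observed = abs(mean(x) - mean(y))
    n_total, n_x = len(pooled), len(x)
    extreme = 0
    total = 0
    # Enumerate every way of relabelling n_x of the pooled values as group x.
    for idx in combinations(range(n_total), n_x):
        chosen = set(idx)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in range(n_total) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(gx) - mean(gy)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# pb values from the four-row example table.
pb_a = [1.2, 1.4]
pb_b = [1.0, 1.1]
p_value = exact_permutation_p(pb_a, pb_b)
print(f"pb: exact permutation p = {p_value:.3f}")  # prints: pb: exact permutation p = 0.333
```

Enumerating every relabelling is only feasible for tiny samples. With around 500 rows one would instead draw a few thousand random relabellings and use the same comparison.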

#### Maggy

##### New Member
OK, here is the full question with data:
Selected properties of white and red wine were monitored. In which properties do they differ the most and, conversely, in which are they most similar? Build a model that describes the properties of red and white wines (classifies into red and white). Evaluate the suitability of the model.
Data:
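| Sug | Chl | Sulfat | pH | Alkl | Dens | Acid | Wine |
|---|---|---|---|---|---|---|---|
| 1.9 | 0.076 | 0.56 | 3.51 | 9.4 | 0.9978 | 0.7 | RED |
| 2.6 | 0.098 | 0.68 | 3.2 | 9.8 | 0.9968 | 0.88 | RED |
| 2.3 | 0.092 | 0.65 | 3.26 | 9.8 | 0.997 | 0.76 | WHITE |
| 1.9 | 0.075 | 0.58 | 3.16 | 9.8 | 0.998 | 0.28 | RED |
| 1.9 | 0.076 | 0.56 | 3.51 | 9.4 | 0.9978 | 0.7 | WHITE |
| 1.8 | 0.075 | 0.56 | 3.51 | 9.4 | 0.9978 | 0.66 | WHITE |
| 1.6 | 0.069 | 0.46 | 3.3 | 9.4 | 0.9964 | 0.6 | WHITE |
| 1.2 | 0.065 | 0.47 | 3.39 | 10.1 | 0.9946 | 0.65 | RED |
| 2 | 0.073 | 0.57 | 3.36 | 9.5 | 0.9968 | 0.58 | WHITE |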

and around 500 more rows...

What am I supposed to do in this question?
@hlsmith

#### Omerikooo

##### New Member
For the first part I would use t-tests. For the second part I would run 7 univariate logistic regressions and then a multivariate regression with the variables that were significant in those 7 univariate regressions.

#### Buckeye

##### Active Member
I disagree with Omerikooo. Even after restating the question, the first part seems very open-ended. You don't necessarily have to run t-tests (or any tests) to see whether properties are similar or different. I can't really glean a hypothesis from the first sentence. You might be able to get away with a data viz of the distributions, for example. That part seems like the EDA step in your analysis. For part two, I would do a multivariate logistic regression with all variables included.
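
A minimal sketch of a logistic regression with all seven properties included, fitted to the nine sample rows from the thread. It is hand-rolled with plain gradient descent so that it runs without extra packages; in practice a library routine (for example scikit-learn's `LogisticRegression`) would be the usual choice. The L2 penalty, learning rate and iteration count are arbitrary assumptions, and the penalty is there mainly because nine rows with seven predictors would otherwise fit almost perfectly and push the coefficients toward infinity.

```python
from math import exp, log
from statistics import mean, pstdev

features = ["Sug", "Chl", "Sulfat", "pH", "Alkl", "Dens", "Acid"]
# The nine sample rows posted in the thread; the real data has about 500 more.
data = [
    (1.9, 0.076, 0.56, 3.51, 9.4, 0.9978, 0.70, "RED"),
    (2.6, 0.098, 0.68, 3.20, 9.8, 0.9968, 0.88, "RED"),
    (2.3, 0.092, 0.65, 3.26, 9.8, 0.9970, 0.76, "WHITE"),
    (1.9, 0.075, 0.58, 3.16, 9.8, 0.9980, 0.28, "RED"),
    (1.9, 0.076, 0.56, 3.51, 9.4, 0.9978, 0.70, "WHITE"),
    (1.8, 0.075, 0.56, 3.51, 9.4, 0.9978, 0.66, "WHITE"),
    (1.6, 0.069, 0.46, 3.30, 9.4, 0.9964, 0.60, "WHITE"),
    (1.2, 0.065, 0.47, 3.39, 10.1, 0.9946, 0.65, "RED"),
    (2.0, 0.073, 0.57, 3.36, 9.5, 0.9968, 0.58, "WHITE"),
]
y = [1.0 if row[-1] == "RED" else 0.0 for row in data]

# Standardise each column so coefficient sizes are comparable across properties.
cols = list(zip(*[row[:-1] for row in data]))
mus = [mean(c) for c in cols]
sds = [pstdev(c) for c in cols]
X = [[(v - m) / s for v, m, s in zip(row[:-1], mus, sds)] for row in data]


def predict(w, b, x):
    """Predicted probability that a wine is RED."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + exp(-z))


# Plain gradient descent on the L2-penalised log-loss.
w, b = [0.0] * len(features), 0.0
lr, l2, n = 0.1, 0.1, len(X)
for _ in range(2000):
    errs = [predict(w, b, x) - t for x, t in zip(X, y)]
    grad_w = [sum(e * x[j] for e, x in zip(errs, X)) / n + l2 * w[j]
              for j in range(len(w))]
    grad_b = sum(errs) / n
    w = [wj - lr * g for wj, g in zip(w, grad_w)]
    b -= lr * grad_b

# Simple in-sample checks of the fitted model.
probs = [predict(w, b, x) for x in X]
log_loss = -mean(t * log(p) + (1 - t) * log(1 - p) for p, t in zip(probs, y))
preds = ["RED" if p >= 0.5 else "WHITE" for p in probs]
accuracy = mean(1.0 if p == row[-1] else 0.0 for p, row in zip(preds, data))

for name, wj in sorted(zip(features, w), key=lambda pair: -abs(pair[1])):
    print(f"{name:7s} {wj:+.3f}")
print(f"training log-loss: {log_loss:.3f}, training accuracy: {accuracy:.2f}")
```

Accuracy measured on the same rows used for fitting is optimistic. To evaluate the suitability of the model on the full data, hold out part of the rows or cross-validate.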

#### Omerikooo

##### New Member
A t-test would tell you whether the means of the given variables differ between the wine types. I agree that the question is not well posed, but a t-test would still be a good start.

#### css

##### Member
The question has some level of ambiguity. However, I think the clue is in this part: "Build a model that describes the properties of red and white wines (classifies into red and white)".
Given that you are asked to build a predictive model for a binary classification task (red/white wine), you have tons of possibilities (linear discriminant analysis, logistic regression, classification tree/random forest...).
I find it more difficult to answer the question "In which properties do they differ the most and, conversely, in which are they most similar", as the question does not specify how we should measure these differences (e.g., differences between means, in variance...). For many people, a suitable effect size measuring differences between means would suffice (e.g., Cohen's d + 95% CI). Yet, if you have built a predictive model, you could interpret this question as "which features best differentiate red/white wine", and then the answer should be based on comparing the relative importance of the predictors in the model. If you use LDA, you should obtain an almost identical answer regardless of whether you look at the variable weights of the discriminant function or calculate the univariate mean differences.
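
A minimal sketch of the effect-size approach, computing Cohen's d (pooled standard deviation) for each property on the nine sample rows from the thread and ranking the properties by absolute effect. The helper name `cohens_d` is illustrative, and the 95% CI is left out to keep the example short.

```python
from math import sqrt
from statistics import mean, variance

features = ["Sug", "Chl", "Sulfat", "pH", "Alkl", "Dens", "Acid"]
# The nine sample rows posted in the thread; the real data has about 500 more.
data = [
    (1.9, 0.076, 0.56, 3.51, 9.4, 0.9978, 0.70, "RED"),
    (2.6, 0.098, 0.68, 3.20, 9.8, 0.9968, 0.88, "RED"),
    (2.3, 0.092, 0.65, 3.26, 9.8, 0.9970, 0.76, "WHITE"),
    (1.9, 0.075, 0.58, 3.16, 9.8, 0.9980, 0.28, "RED"),
    (1.9, 0.076, 0.56, 3.51, 9.4, 0.9978, 0.70, "WHITE"),
    (1.8, 0.075, 0.56, 3.51, 9.4, 0.9978, 0.66, "WHITE"),
    (1.6, 0.069, 0.46, 3.30, 9.4, 0.9964, 0.60, "WHITE"),
    (1.2, 0.065, 0.47, 3.39, 10.1, 0.9946, 0.65, "RED"),
    (2.0, 0.073, 0.57, 3.36, 9.5, 0.9968, 0.58, "WHITE"),
]


def cohens_d(x, y):
    """Standardised mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


effects = {}
for j, name in enumerate(features):
    red = [r[j] for r in data if r[-1] == "RED"]
    white = [r[j] for r in data if r[-1] == "WHITE"]
    effects[name] = cohens_d(red, white)

# Largest |d| first: most different properties at the top, most similar at the bottom.
ranked = sorted(effects, key=lambda k: abs(effects[k]), reverse=True)
for name in ranked:
    print(f"{name:7s} d = {effects[name]:+.2f}")
```

Because d is unit-free, properties measured on very different scales (for example Dens and Alkl) can be compared directly, which raw mean differences do not allow.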