what is: correlation? regression? confidence?

#1
I need to learn statistics for a project I'm working on. I am totally lost and am spending more time than I thought I would trying to understand statistics. So I have a few questions. Any response, no matter how short, will be appreciated.

What is correlation?
My boss introduced it to me as a measure of similarity between two data sets. I have also heard that it is a measure of how well a linear regression fits a certain cluster of data.

What is cross correlation?
This stems from my question on correlation. It seems to be a procedure used to compare the similarity of the first half of the data to the second half.

What is the difference between multiple regression and multivariate analysis?
I thought I needed to do multiple regression to find out which independent variable affects the dependent variable the most. I then read about multivariate analysis and got even more confused.

How do I measure how good my regression fits my data?
I thought that if my linear regression fitted poorly, then the data could be non-linear. But I don't know how to figure out how well the regression fits my data. I was going to use correlation or confidence level testing to determine the strength of my regression. I just don't know if it is right to do so.
 

gianmarco

TS Contributor
#2
Hi,
there are many things to explain, and it is difficult to answer all the questions you posed.

My knowledge of multiple regression is not so extensive, so I will limit myself to correlation and regression. I will try to give some common-sense ideas, leaving it to others to provide you with deeper details.

Sometimes you need to establish the nature of the relationship between two or more variables, and sometimes you may also want to "predict" a value of one variable on the basis of the value of another "related" variable.
So, you are in a position to study:
a) "correlation", in order to assess the strength of the relation between the values of two variables;
b) "regression", in order to assess the nature of that relation and to make a prediction from it.

When you study correlation (and regression), you have two variables and you wish to see if small values of one variable are associated with small values of the other one, or, by the same token, if large values of one are associated with large values of the other.

Imagine that you wish to study the relation between the radius of a circle and its circumference. For a given circle A you have two values: radius and circumference; the same applies to circles B, C, D and so on.
So you can build a table that collects your values:

           radius   circumference
circle A   x1       y1
circle B   x2       y2
circle C   x3       y3
and so on.

You know from geometry that when the radius increases, the circumference increases as well, in a clear-cut proportion.
If you plot the various x against the various y, you will see that the points (representing circles A, B, C, D, and so on) in the Cartesian plane lie perfectly along a straight line. Since the line passes through all the points, you can understand that the variables (radius and circumference) vary together, that is, they are correlated. In this case the correlation is very good, since when one variable varies the other varies as well, and always by the same "proportion".

Now you can calculate the correlation coefficient (named "r"), a value that measures how well the two variables are correlated. This coefficient ranges from -1 to +1. For a moment, leave aside the + or -.
The closer the value is to 1, the stronger the relation: the correlation is "perfect" when the value is 1, and there is no correlation when it is 0.
So, to put it in a nutshell (as a rough rule of thumb):
from 0 to 0.5 the correlation goes from none to weak;
from 0.5 to 1 it goes from weak to perfect.

Keeping with the previous example, the closer the points are to the straight line passing among them (that is, the stronger the correlation), the more the correlation coefficient will tend to 1. The more the points are spread out around the line (that is, the weaker the correlation), the more the coefficient will tend to 0. When no correlation is present, the coefficient is 0 and the plotted points will be scattered in a cloud without any evident "direction".

As for the sign, - or +, it indicates the direction of the correlation.
The + indicates that the correlation is positive, that is, when one variable increases the other increases as well. The - indicates a negative correlation, that is, when one increases the other decreases.
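
To make the strength and the sign of r concrete, here is a minimal Python sketch that computes Pearson's r by hand. The function name pearson_r and all the numbers are made up for illustration: the "noisy" circumferences are hypothetical hand measurements of the same circles, and "remaining" is just an invented variable that falls as the radius grows.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

radii = [1.0, 2.0, 3.0, 4.0, 5.0]

# Exact geometry: every point lies on the line y = 2 * pi * x
circumferences = [2 * math.pi * r for r in radii]

# Hypothetical imprecise measurements of the same circles (invented values)
noisy = [6.0, 13.5, 17.9, 26.1, 30.2]

# Invented variable that decreases as the radius increases
remaining = [10.0 - r for r in radii]

r_exact = pearson_r(radii, circumferences)
r_noisy = pearson_r(radii, noisy)
r_negative = pearson_r(radii, remaining)

print("exact circles:    r =", r_exact)
print("noisy measures:   r =", r_noisy)
print("falling variable: r =", r_negative)
```

The exact circles should give an r of essentially +1, the noisy measurements a value a little below +1, and the falling variable an r of essentially -1.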

Now, once you have seen whether a correlation exists between your two variables, you could try to predict the value of one variable from a known value of the other, even when the value you want has not been measured. You do this by means of regression (and its regression equation).
Keeping with the example of the circle, if you have the value of the radius, by means of regression you can calculate the value of the circumference.
Note that the closer the correlation between the two variables (that is, the more the points tend to lie along the aforementioned straight line), the more precise the prediction will be.
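
As a rough illustration of the regression step, the sketch below fits a straight line by ordinary least squares to the circle data and uses it to predict a circumference. The helper name fit_line and the radius of 7.0 are assumptions chosen only for this example; real software will report the same slope and intercept along with more diagnostics.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

radii = [1.0, 2.0, 3.0, 4.0, 5.0]
circumferences = [2 * math.pi * r for r in radii]

slope, intercept = fit_line(radii, circumferences)

# Predict the circumference of a circle whose radius was not in the table
new_radius = 7.0  # hypothetical value, chosen for illustration
predicted = slope * new_radius + intercept

print("slope:", slope, "intercept:", intercept)
print("predicted circumference for radius 7:", predicted)
```

Because the circle data are perfectly linear, the fitted slope comes out at about 2 * pi and the intercept at about 0, so the prediction matches the geometric formula. With scattered real data the line would only approximate the points, and the prediction would be correspondingly less precise.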

You do not have to calculate the regression equation by hand, since it is provided by software.

Finally,
1) remember that there are two types of correlation coefficient (at least, as far as the ones most used by "beginners" are concerned):
a) Pearson's, which measures linear association and whose usual significance test assumes (among other things) that your variables are normally distributed;
b) Spearman's, which operates on the ranks of your data and does not assume a normal distribution (in stat packs you may find it among the "non-parametric" stats).

2) as for Pearson's r, if you square it you get the proportion of the variability of one variable that is accounted for by the other variable.
For example: r = +0.8 (that is, a strong positive correlation) squared gives 0.64. It means that 64% of the variation in one variable can be explained by the variation in the other one.

3) remember that doubling the correlation coefficient quadruples the amount of agreement between the two variables, because that agreement is measured by r squared.
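
The three points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data (y equal to x cubed) are invented, the helper names pearson_r, ranks and spearman_rho are made up for the example, and the simple ranking used here assumes there are no tied values (stat packs handle ties for you).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman_rho(xs, ys):
    """Spearman's coefficient computed as Pearson's r on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Invented data: y always increases with x, but not along a straight line
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [x ** 3 for x in xs]

pearson = pearson_r(xs, ys)
spearman = spearman_rho(xs, ys)
print("Pearson r:   ", pearson)
print("Spearman rho:", spearman)

# Points 2) and 3): r squared is the shared proportion of variability,
# so doubling r from 0.4 to 0.8 quadruples it
shared_small = 0.4 ** 2
shared_large = 0.8 ** 2
ratio = shared_large / shared_small
print("r = 0.4 explains", shared_small, "and r = 0.8 explains", shared_large)
print("ratio:", ratio)
```

Here Spearman's coefficient is essentially 1 because the ordering of the two variables agrees perfectly, while Pearson's r is lower because the points do not fall on a straight line.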


I think that is all, and I hope it helps you a bit.

Regards,
Gm