Univariate, Bivariate, Multivariate: Myths and facts

#1
This is not a question but a summary of what I have recently learned about common mistakes in naming different types of analyses, and about the proper names for them.

I now know that a multiple regression is not a multivariate analysis, although I have seen millions of references that mistakenly call it multivariate. The hint is the number of DVs, which determines whether an analysis is multivariate or not. Even a sophisticated regression model is still univariate, since it has only one DV.

Another myth is the wrong explanation given here about univariate analysis, which confuses it with descriptive analysis.

And an interesting point is that a sophisticated multiple regression is univariate, while a chi-square test or a correlation coefficient is bivariate!

Another interesting point is that a repeated-measures ANOVA is multivariate, since the DV is measured several times and so we actually have more than one DV.

I may add more detailed discussion here later, but for now I have this one available:

"victorxstc 20.8.2013: I received an SPSS result from a student. Apparently her statistician has done rmANOVA. There is a section which reads "Multivariate Tests" under which these tests appear: Pillai's trace, Wilks' lambda, Hotelling's trace, Roy's largest root... Beside the term "multivariate tests" there is a "b" indicator which refers to table legend "b". This table legend reads: "Design: Intercept + X + X2 + X1 * X2"... In this analysis, there is only one DV (the increase in tooth temperature). There are definitely not two or more DVs. However, the line reads "multivariate analysis". So I don't understand... Maybe since it is repeated-measures, there have somehow been two DVs and this is why it is multivariate. I don't know, but it is interesting
...

Spunky: @victor: repeated-measures ANOVA can be analysed either the usual, univariate way or as a MANOVA. The multiple DVs come from the fact that you took repeated measurements on the same subjects over time (hence multiple DVs because of the repeated measurements). That's an interesting question, actually."
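
To make the two views of repeated-measures data concrete, here is a minimal sketch in Python with NumPy. The numbers are invented for illustration (four hypothetical teeth, temperature rise at three time points) and are not from the SPSS output discussed above. It only shows how the same measurements can be laid out as several DV columns (the MANOVA view) or as one stacked DV with subject and time indicators (the univariate view); it does not run either test.

```python
import numpy as np

# Hypothetical data: 4 subjects (rows), temperature rise at 3 time points (columns).
wide = np.array([
    [1.2, 1.9, 2.4],
    [0.8, 1.5, 2.1],
    [1.0, 1.7, 2.6],
    [1.1, 2.0, 2.2],
])
n_subjects, n_times = wide.shape

# Multivariate view: one row per subject, one outcome column per time point.
Y_multivariate = wide

# Univariate view: a single stacked outcome column, with subject and time
# recorded as separate indicator variables (long format).
subject = np.repeat(np.arange(n_subjects), n_times)
time = np.tile(np.arange(n_times), n_subjects)
y_univariate = wide.reshape(-1)

print(Y_multivariate.shape)  # one DV column per time point
print(y_univariate.shape)    # one DV column in total
```

Both layouts hold exactly the same numbers; what changes is whether the time points count as separate DVs or as levels of a within-subject factor.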
 

noetsi

Fortran must die
#2
In fact, since there is no official body that makes these decisions I would have to disagree with this statement.

I now know that a multiple regression is not a multivariate analysis, although I have seen millions of references that mistakenly call it multivariate. The hint is the number of DVs, which determines whether an analysis is multivariate or not. Even a sophisticated regression model is still univariate, since it has only one DV.
That might be one definition of multivariate, but based on my observation it is a minority one. A more common usage is that multivariate means an analysis involving more than one independent variable. For example, in their well-known text, now in its 5th edition (which suggests it is highly popular), Tabachnick and Fidell use multivariate to cover regression, ANOVA, etc.

Ultimately, with no official body to make such decisions, what multivariate means is up to the author defining it. And most do not define it in terms of the DVs only (if they did, virtually no analysis would be multivariate, since only a few methods have more than one DV).
 
#3
I agree, noetsi; this is why I said I have seen this mistake a million times. Actually, as I said before, I searched my local journal database for the word "multivariate". Among 60 full articles I had from a very top journal, 22 had used the word "multivariate", and all 22 had used it only for multiple regressions with a single DV. So I see this is extremely common. I have recently had talks with reviewers and editors who asked me to write more about "multivariate analyses" when my analyses were single-DV multiple regressions.

However, I found this link, which made me suspicious, and afterwards Greta and Dason (and TE, CB, and vinux too) told me that multivariate refers to cases where we have more than one DV.

"Certain types of problem involving multivariate data, for example simple linear regression and multiple regression, are NOT usually considered as special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables."
Later, when I told the folks that 22 out of 60 articles referred to regression as multivariate, Dason also told me it is not a terrible mistake.

I personally have reached the conclusion that, at least in my field, I am allowed to use them interchangeably. In my own field, all I have seen is multivariate = multiple regression. But your comment is heartwarming too! (license to do it!)
 

noetsi

Fortran must die
#4
I guess my point would be that it is not a mistake, because that assumes there is an authoritative correct definition, which there is not. When there is no governing authority, it is up to the community to decide for itself. And for the most part, the community appears to define multivariate as involving multiple IVs, even if some authors don't agree with that definition.

There are so many different terms in use for the sums of squares that it can be infuriating/baffling to read a text, because you don't know if they mean the same thing. Does SSresid equal SSwithin or SSbetween :p
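
For the one-way ANOVA case, the naming question can be checked numerically. The following is a minimal sketch in Python with NumPy, using made-up data for three groups. It computes the usual decomposition SStotal = SSbetween + SSwithin and then fits the same data as a regression on group indicators, to see which of the two the residual sum of squares matches. The variable names are just illustrative.

```python
import numpy as np

# Made-up data: three groups of three observations each.
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]
y = np.concatenate(groups)
grand_mean = y.mean()

# Classical one-way ANOVA sums of squares.
ss_total = ((y - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# The same model as a regression: one indicator column per group
# (cell-means coding), fitted by ordinary least squares.
labels = np.repeat(np.arange(len(groups)), [len(g) for g in groups])
X = np.eye(len(groups))[labels]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_resid = ((y - X @ beta) ** 2).sum()

print(ss_total, ss_between, ss_within, ss_resid)
```

In this setup the regression residual sum of squares coincides with SSwithin, and SSbetween is what a regression text would call the model (or regression) sum of squares. In other designs the labels may not map onto each other this neatly.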
 
#5
I understood your point, but since there were comments on the reasoning underlying this naming, I thought it was already agreed upon by statisticians. They told me that what matters in regression is the "single distribution of the DV". So I thought that there *is* some consensus over it based on statistical reasoning. However, if you insist that there is no consensus over it, or that there is no body of authorities to decide on it (without others being able to refute them), then it is not a mistake. This issue is only getting better and better for me! Since, as I said before, the word "multivariate" is much more elegant than "multiple regression", and I would love to use it instead. :)
 

noetsi

Fortran must die
#6
One thing I am certain of is that there is no governing body in statistics for such things. I used to think the strong disagreement among statisticians (including on nomenclature) was just my own lack of knowledge in stats. That was until I had a brilliant Harvard stats prof last summer for a time series course (the type of person who might actually know Sir Box). He said that there was in fact no agreement at all on such things, and really no way to reach a consensus.

This should not surprise you. Academia is famous for disagreeing on everything (except maybe that academics are paid too little) and for giving entirely different meanings to the same concepts. This was certainly true in my fields (political science and public administration). I thought mathematicians would be different....but they aren't.
 
#7
One thing I am certain of is that there is no governing body in statistics for such things. I used to think the strong disagreement among statisticians (including on nomenclature) was just my own lack of knowledge in stats. That was until I had a brilliant Harvard stats prof last summer for a time series course (the type of person who might actually know Sir Box). He said that there was in fact no agreement at all on such things, and really no way to reach a consensus.
That was awesome indeed.

This should not surprise you.
No, I myself question the validity of everything (as far as my knowledge allows me) and keep telling my students too that their books are not a reflection of truth, and that they are just a compilation and discussion of previous (controversial) findings. We see what you describe even more in the experimental sciences. Interestingly, however, the students think of their books (and sometimes their teachers) as some kind of holy and irreplaceable "fact".

Regarding the naming, however, there are names first introduced by the developers of an idea that are generally accepted by the community. For example, nobody calls the eyes "the heart". OK, this was extreme. A better example in my field is that no physician will call an artery a vein. They are both vessels but with different names, and although there is no authority to make the rule, nobody refutes this unofficial naming, which first appeared in ancient medical books. The reason is simple: the direction of blood flow determines it, and it is useful and clear. I thought this "multivariate" thing was perhaps one of those generally accepted ideas and names, until you gave the example of the Harvard prof. :)
 

noetsi

Fortran must die
#8
Some fields, I suspect, have more accepted usage; medicine is likely one. My guess is that this is tied to it being really important that you get it right, so that you don't do an injection in a vein rather than an artery. This leads to more agreement and more regulation of the usage. Also, in some fields there are authoritative bodies such as the AMA (in the US), UN units, and the like. That is different from statistics.
 

vinux

Dark Knight
#9
Hi Victor,
Sorry, I didn't read all the conversations. To add one more point to what I said in the chatbox: "Multivariate regression and multiple regression are different. But both can be considered multivariate analysis. Multiple regression is a conditional expectation E[y|X], and (y,X) is a multivariate random variable."
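
A minimal sketch of the multiple vs. multivariate regression distinction, in Python with NumPy. The data are simulated and the coefficient values are arbitrary assumptions chosen only for illustration. The point is the shape of the outcome: a vector y (one DV) for multiple regression, a matrix Y (two DVs here) for multivariate regression, both regressed on the same IVs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Design matrix: intercept plus two IVs (simulated).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Multiple regression: ONE DV, several IVs.
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=n)
b_multiple, *_ = np.linalg.lstsq(X, y, rcond=None)

# Multivariate regression: SEVERAL DVs (two columns), same IVs.
B_true = np.array([[1.0, 0.5],
                   [2.0, 0.0],
                   [-1.0, 3.0]])
Y = X @ B_true + rng.normal(scale=0.1, size=(n, 2))
B_multivariate, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(b_multiple.shape)      # one coefficient per column of X
print(B_multivariate.shape)  # one column of coefficients per DV
```

With ordinary least squares, each column of B_multivariate equals the coefficients from regressing that DV on X by itself. What the multivariate framing adds is joint inference across the DVs (for example Pillai's trace or Wilks' lambda), which this sketch does not compute.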
 

CB

Super Moderator
#10
Technically I think you're right, victor. But like noetsi says this is a bit of a subjective issue about how we define words. In social sciences "multivariate data analysis" is usually used to refer to an analysis involving 3 or more variables. That could be everything from multiple regression to canonical correlation. I don't know if there's really any harm in that, though it's not the definition a statistician would necessarily follow. These broad umbrella terms are always a bit ambiguous - same thing with "parametric" and "non-parametric".

But we might need to be more picky when dodgy terminology actually causes confusion about what's meant. E.g., I don't really like to see the term "multivariate regression" when people actually mean multiple regression, because this can cause legitimate confusion about what the analysis is actually doing.
 

TheEcologist

Global Moderator
#12
Technically I think you're right, victor. But like noetsi says this is a bit of a subjective issue about how we define words. In social sciences "multivariate data analysis" is usually used to refer to an analysis involving 3 or more variables. That could be everything from multiple regression to canonical correlation. I don't know if there's really any harm in that, though it's not the definition a statistician would necessarily follow. These broad umbrella terms are always a bit ambiguous - same thing with "parametric" and "non-parametric".
Even though the misuse of statistical terms may be rampant in some fields, there is NO subjectivity in the meaning of these terms to those who are more informed. Multivariate has a clear and precise meaning, and so do parametric and non-parametric. Those who don't follow it are simply wrong. It really is that black and white.

But we might need to be more picky when dodgy terminology actually causes confusion about what's meant. E.g., I don't really like to see the term "multivariate regression" when people actually mean multiple regression, because this can cause legitimate confusion about what the analysis is actually doing.
Which is why this is not a subjective matter.

I'm not going to call a try a touchdown or a home run when I watch the Springboks, Cowboybear; you would rightfully laugh in my face.
Let's apply the same shame to those who misuse statistical terms. Basically I agree with you on this one. I just don't agree that this is, or should be treated as, a subjective matter.

Baie Dankie,

TE
 

noetsi

Fortran must die
#15
Multivariate has a clear and precise meaning
To whom? Among social science researchers, multivariate is almost certainly defined as involving more than two variables (even if only one is a DV). Indeed, as I noted, there are well-known texts such as Tabachnick and Fidell that specifically do define it that way. The vast majority of data analysts (most of whom are not academics) would almost certainly agree with that definition. Statisticians may disagree, but they are vastly outnumbered.:p

This is not, to me anyhow, a really important issue, but I think if you were to look at journals and published work you would find overwhelmingly that the way I have used multivariate is the norm.
 

TheEcologist

Global Moderator
#16
To whom? Among social science researchers, multivariate is almost certainly defined as involving more than two variables (even if only one is a DV). Indeed, as I noted, there are well-known texts such as Tabachnick and Fidell that specifically do define it that way. The vast majority of data analysts (most of whom are not academics) would almost certainly agree with that definition. Statisticians may disagree, but they are vastly outnumbered.:p
[I realize that you were not serious, noetsi, but] appealing to the bandwagon is a logical fallacy; popularity doesn't matter in science. If you are wrong, then you are wrong. Plus, didn't you get the memo? Never disagree with a statistician :p

Failing to understand statistical concepts is at best confusing to peers who want to reproduce your results. At worst it may cause substantial harm to science, causing papers to be retracted. I feel there is nothing subjective about this: if you publish in science, it is your responsibility to understand what the statisticians mean, and it is also your responsibility to ensure that your research is reproducible. Anything else is unethical, even if your peers participate.

This is not, to me anyhow, a really important issue, but I think if you were to look at journals and published work you would find overwhelmingly that the way I have used multivariate is the norm.
As long as you take care to do it correctly, nothing more can be expected of you. If we do that, we can at least lead by example. And in this case, Cowboybear can be our example: he led the way in helping clear up the misunderstandings about the normality assumption in linear regression (with that TS paper!).

Understanding statistics is important for those who are expected by society to translate results and theories to real-life cases. For instance, look at the first paragraph of this paper; the poor guy's wife really did cheat :). This is a non-trivial issue.
 

noetsi

Fortran must die
#17
If you were to submit an article to a social science journal where you used the term multivariate to refer only to those methods that have multiple dependent variables, no one would have a clue what you meant by that usage. The same is true, I suspect, among practitioners [well, in honesty few practitioners would use the term, period]. This is obviously an area where usage differs between social science research and statistics.
 

spunky

Doesn't actually exist
#18
If you were to submit an article to a social science journal where you used the term multivariate to refer only to those methods that have multiple dependent variables no one would have a clue what you meant by that usage.
well... i know of a couple of editors in social science journals who would whack you in the head for saying that.

and this is yet again... another sweeping, completely unfounded generalisation. i know you're trying to get an idea across and i do think you have a few valid points, but why do you always do this?!

ლY U GENERALISE SO SWEEPINGLY !?(ಠ益ಠლ)
 

noetsi

Fortran must die
#19
You are welcome to cite social science journals that restrict the usage of the term multivariate to those methods that have multiple dependent variables. I respectfully doubt that any editor of a social science journal would restrict such usage, but since you claim there are I would be interested in seeing evidence of this.

Personally I doubt, based on decades of reading such journals, that there are any rules, period, restricting such wording. Social science journals are notorious for allowing those who submit articles to use the language any way they personally choose, including inventing words and giving new meanings to existing ones.

This is all loosely coupled....:p
 

noetsi

Fortran must die
#20
Since there is no governing authority in statistics, who is "right" and who is "wrong" in the way words are used is impossible to determine. The fact that some writers use a certain usage does not make it correct.