Normality of the data in linear regression

zwergje

New Member
We have a data set with 11 columns: 3 species and 8 physicochemical indicators of water quality. However, two of the species are zooplankton that are well known as biological indicators of water quality. That is why it seemed likely that we have only one response variable (zebra mussel density in this case) and 10 independent variables (including these two zooplankton species). To check, we ran a correlation/regression analysis between zebra mussel density and each of the two zooplankton species; neither species on its own had a significant effect on zebra mussel density. We therefore believe it does not make much sense to include them as independent variables, and that we can treat them as response variables instead. So we have 3 response variables (the species) and 8 independent variables (water quality). What kind of analysis would be best here? As far as I know, multiple regression usually has only a single response variable. So three multiple regressions, one for each response variable? Or is there something better? Secondly, it is not clear to us whether the variables themselves should be normally distributed, or only the residuals of the outcome.

GretaGarbo

Human
So three times multiple regression, one for each response variable?
Yes, that seems reasonable.

only the residuals of the outcome.
Yes, only the residuals of the outcome are supposed to be normally distributed (not the explanatory variables).
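
To illustrate the point, here is a minimal Python sketch using only the standard library. The data are simulated and entirely hypothetical (a right-skewed predictor `x` and normally distributed errors), and the `skewness` helper is my own. It only shows that a clearly non-normal explanatory variable can still give roughly symmetric residuals; in practice a Q-Q plot or a formal normality test on the residuals would be the usual check.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: near 0 for symmetric data, near 2 for exponential data."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

random.seed(1)
n = 2000
# Hypothetical data: a strongly right-skewed predictor and normal errors.
x = [random.expovariate(1.0) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0.0, 1.0) for xi in x]

# Simple least-squares fit of y on x.
mx, my = statistics.fmean(x), statistics.fmean(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"skewness of x:         {skewness(x):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
```

The predictor is far from normal, yet the residuals are what the normality assumption is about.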

zwergje

New Member
Hi, thanks for your reply. Closer consideration made us change our idea of which variable is the predictor (independent variable) and which are the response variables. Since zebra mussels are an invasive species with a negative effect on aquatic ecosystems and water quality, and the other two (zooplankton) species are biological indicators of water quality, it seems likely that the question should be what effect zebra mussel density has on a variety of water quality parameters. So: one independent variable and 10 dependent variables. The next question is, which statistical analysis would be best in such a case?

GretaGarbo

Human
Let
y1 = zebra mussels
y2 = one zooplankton species
y3 = the other zooplankton species

x1, x2, ..., x8 = the other explanatory variables, determined outside the system (exogenous variables)

Of course it is possible to have an interdependent system, like this:

y1 = f(x1, x2, ..., x8) + error1
y2 = a1 + b1*y1 + g(x1, x2, ..., x8) + error2
y3 = a1 + b1*y1 + b2*y2 + h(x1, x2, ..., x8) + error3

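If the system works like this, then it is a Wold causal chain (a recursive system) and each equation can be estimated separately with OLS, provided the error terms are uncorrelated across equations.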

(Here the influence runs in one direction: y1 affects y2, and y1 and y2 both affect y3.)
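
As a rough illustration of such a recursive system, here is a small Python sketch (standard library only) that simulates a chain and fits each equation separately by OLS. Everything in it is an assumption made for the example: the coefficients are made up, a single exogenous variable `x` stands in for x1 to x8, f, g and h are taken to be linear, and the `ols` helper is a bare-bones normal-equations solver rather than a library routine.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(2)
n = 3000
rows = []
for _ in range(n):
    x = random.gauss(0, 1)  # one exogenous variable (hypothetical)
    # Recursive chain with independent errors; coefficients are made up.
    y1 = 1.0 + 0.8 * x + random.gauss(0, 1)
    y2 = 0.5 + 0.6 * y1 - 0.4 * x + random.gauss(0, 1)
    y3 = -1.0 + 0.3 * y1 + 0.7 * y2 + 0.2 * x + random.gauss(0, 1)
    rows.append((x, y1, y2, y3))

# One OLS fit per equation; each design matrix starts with an intercept column.
b1 = ols([[1.0, x] for x, _, _, _ in rows], [r[1] for r in rows])
b2 = ols([[1.0, y1, x] for x, y1, _, _ in rows], [r[2] for r in rows])
b3 = ols([[1.0, y1, y2, x] for x, y1, y2, _ in rows], [r[3] for r in rows])

print("y1 equation (intercept, x):        ", [round(b, 2) for b in b1])
print("y2 equation (intercept, y1, x):    ", [round(b, 2) for b in b2])
print("y3 equation (intercept, y1, y2, x):", [round(b, 2) for b in b3])
```

With independent errors, the equation-by-equation estimates should land close to the coefficients used in the simulation.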

But if it works like this:

y1 = f(x1, x2, ..., x8) + error1
y2 = a1 + b1*y1 + b2*y3 + g(x1, x2, ..., x8) + error2
y3 = a1 + b1*y1 + b2*y2 + h(x1, x2, ..., x8) + error3

then it is more complicated, because y2 and y3 now influence each other simultaneously. It could perhaps be estimated with 2SLS (two-stage least squares).

(The parameters a1, b1 and b2 should be different in each equation.)
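
For the simultaneous case, here is a hedged Python sketch of the two-stage idea, again with simulated data and only the standard library. It is a deliberately reduced, hypothetical version of the system: only the y2 and y3 equations, with one exogenous variable in each (`x1` in the y2 equation, `x2` in the y3 equation), so that `x2` can serve as an instrument for y3. The coefficients and the `ols` helper are assumptions made for the example.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(3)
n = 5000
B_TRUE, F_TRUE = 0.5, 0.4  # made-up structural coefficients
x1s, x2s, y2s, y3s = [], [], [], []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    e2, e3 = random.gauss(0, 1), random.gauss(0, 1)
    # Solved (reduced) form of: y2 = B*y3 + x1 + e2 and y3 = F*y2 + x2 + e3.
    y2 = (x1 + e2 + B_TRUE * (x2 + e3)) / (1 - B_TRUE * F_TRUE)
    y3 = F_TRUE * y2 + x2 + e3
    x1s.append(x1); x2s.append(x2); y2s.append(y2); y3s.append(y3)

# Naive OLS on the y2 equation: biased, because y3 is correlated with e2.
b_ols = ols([[1.0, y3, x1] for y3, x1 in zip(y3s, x1s)], y2s)[1]

# Stage 1: regress the endogenous y3 on all exogenous variables.
g = ols([[1.0, x1, x2] for x1, x2 in zip(x1s, x2s)], y3s)
y3_hat = [g[0] + g[1] * x1 + g[2] * x2 for x1, x2 in zip(x1s, x2s)]

# Stage 2: replace y3 by its fitted values in the y2 equation.
b_2sls = ols([[1.0, yh, x1] for yh, x1 in zip(y3_hat, x1s)], y2s)[1]

print(f"true effect of y3 on y2: {B_TRUE}")
print(f"naive OLS estimate:      {b_ols:.3f}")
print(f"2SLS estimate:           {b_2sls:.3f}")
```

The sketch only recovers the point estimate. The standard errors from a plain second-stage fit are not valid, so a dedicated 2SLS routine would be needed for inference.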

Which sea or lake is this about?

zwergje

New Member
Let
..........
Which sea or lake is this about?
Hi, the data set represents different freshwater reservoirs in Spain. I believe it is likely that zebra mussel density is the only explanatory variable and that all the others are response variables (water quality indicators), since the invasive zebra mussel is known for its negative effects on water quality. The two zooplankton species are widely known for their use as biological indicators of water quality. This gives us one explanatory variable and 10 response variables, all continuous. A MANOVA is not suitable, since the explanatory variable is continuous rather than categorical. Multiple regression is meant for more than one EXPLANATORY variable; we have only one, if our separation of response and explanatory variables is correct. The solution you mention was not taught in the course this assessment comes from, so I guess it must be something simpler.
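
One simple reading of this setup, in line with the earlier "one regression per response variable" suggestion, is a separate simple linear regression of each response on zebra mussel density. The Python sketch below (standard library only) shows the mechanics on simulated data. The response names, slopes and sample size are all hypothetical and say nothing about the real reservoirs, and this is not necessarily the method the course has in mind.

```python
import random
import statistics

def simple_regression(x, y):
    """Return (intercept, slope, r_squared) for a least-squares line of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)

random.seed(4)
n = 500
# Hypothetical explanatory variable: mussel density per reservoir sample.
mussel_density = [random.uniform(0, 10) for _ in range(n)]

# Hypothetical responses with made-up slopes (only three of the ten shown).
true_slopes = {"turbidity": -0.5, "chlorophyll_a": -0.3, "zooplankton_a": 0.0}
responses = {
    name: [5.0 + s * d + random.gauss(0, 1) for d in mussel_density]
    for name, s in true_slopes.items()
}

# One simple regression per response variable.
fits = {name: simple_regression(mussel_density, y) for name, y in responses.items()}
for name, (intercept, slope, r2) in fits.items():
    print(f"{name:14s} intercept={intercept:5.2f} slope={slope:6.3f} R^2={r2:.3f}")
```

With ten responses this means ten separate tests, so some adjustment for multiple comparisons may be worth considering.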