# Help! What analysis should be performed on a dataset like this?

#### unbiased95

##### New Member
I'm sorry but I really need help on how to perform an analysis on these data.
The dataset records the calories expended during exercise by 24 subjects and consists of three variables, all quantitative: ho (heat output), the dependent variable, is the number of calories burned during activity; wl (work level, i.e. the number of calories burned per hour) and bm (body mass in kg) are the regressors.
What kind of univariate and bivariate exploratory analyses should be performed?
I also tested all three variables with the Shapiro-Wilk test and none of them appears to follow a Gaussian distribution (p < 0.01).
I am attaching the dataset hoping it may prove useful.

#### Karabiner

##### TS Contributor
Univariate: exploratory data analysis, for example boxplots, stem-and-leaf plots, histograms. In addition, descriptive statistics (mean, median, minimum, maximum, variance, skewness, kurtosis).
Bivariate: X-Y scatterplots, Pearson correlations, Spearman correlations.
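
As a rough illustration of the summaries listed above, here is a minimal Python sketch that uses only the standard library. The numbers are hypothetical placeholders, not the attached dataset, and the helper names (`describe`, `pearson`, `ranks`, `spearman`) are made up for this example.

```python
import math
import statistics as st

# Hypothetical illustration values, NOT the attached dataset.
bm = [43.7, 49.9, 54.6, 58.1, 62.3, 66.0, 70.2, 76.4]  # body mass, kg
ho = [177, 197, 235, 260, 293, 304, 333, 366]          # heat output, calories


def describe(x):
    """Univariate descriptive statistics for one variable."""
    n = len(x)
    mean = st.fmean(x)
    sd_pop = st.pstdev(x)
    z = [(v - mean) / sd_pop for v in x]
    return {
        "n": n,
        "mean": mean,
        "median": st.median(x),
        "min": min(x),
        "max": max(x),
        "variance": st.variance(x),                   # sample variance (n - 1)
        "skewness": sum(v ** 3 for v in z) / n,       # moment-based
        "excess_kurtosis": sum(v ** 4 for v in z) / n - 3.0,
    }


def pearson(x, y):
    """Pearson product-moment correlation."""
    mx, my = st.fmean(x), st.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


summary = describe(ho)
r_pearson = pearson(bm, ho)
r_spearman = spearman(bm, ho)
print(summary)
print(round(r_pearson, 3), round(r_spearman, 3))
```

The same `describe` call would be repeated for each of the three variables, and the correlations for each pair. Boxplots, histograms and scatterplots are left to whatever plotting tool is at hand.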

#### unbiased95

##### New Member
Thank you very much Karabiner.

#### Karabiner

##### TS Contributor
In addition, you might perhaps make use of the BMI categories
(proportions of participants underweight/normal weight/overweight/... ;
mean or median work level, mean or median heat output in the respective
categories).

#### unbiased95

##### New Member
Thank you. The main goal of the assignment is actually to build a regression model in which the dependent variable is the number of calories burnt per hour, as a function of the other two. Since the Shapiro-Wilk test suggests this variable is not normally distributed (only slight evidence, since p = 0.03 at a significance level of alpha = 0.05), should I consider a GLM to be the best choice?

#### Karabiner

##### TS Contributor
For a linear regression, it is not necessary that the dependent variable is normally distributed. If your sample size is small
(n < 30 or so), then the residuals from the model should be normally distributed (in the population). If your sample size is
larger, then the tests are considered robust against nonnormal residuals.

#### unbiased95

##### New Member
Thanks, I'll make sure to test the residuals of the model for the normality hypothesis then. And yes, the sample is quite small (n = 24).

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Yes, you could use a GLM. As mentioned, you would want to review the model residuals visually rather than via formal tests. Given the small sample size, formal testing would be underpowered and is not really needed. You could also think about standardizing the variables. Remember that, in the end, given the small sample size, your generalizability will be highly limited. I would also stay away from categorizing BMI, since that would suck up degrees of freedom. Given that you have relatively healthy subjects, there will likely be a positive dose-response with BMI, but if you had subjects who were underweight due to an underlying condition, or other such scenarios, there could be heterogeneity effects (a non-linear relationship between the DV and BMI).
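
To make the "review the residuals visually" step concrete, here is a minimal Python sketch, assuming an ordinary least-squares fit of ho on wl and bm. It uses only the standard library, so the small solver `solve` and the wrapper `ols` are hypothetical helpers written for this example, and the data values are illustrative placeholders rather than the attached dataset. It produces the coordinates of a normal Q-Q plot (sorted residuals against standard-normal quantiles).

```python
import statistics as st

# Hypothetical illustration values, NOT the attached dataset.
wl = [19, 19, 56, 56, 37, 37, 74, 74]                  # work level, cal/hour
bm = [43.7, 76.4, 43.7, 76.4, 54.6, 66.0, 54.6, 66.0]  # body mass, kg
ho = [177, 212, 313, 402, 254, 268, 388, 427]          # heat output, calories


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(columns, y):
    """Least-squares fit via the normal equations; intercept comes first."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return beta, fitted, resid


beta, fitted, resid = ols([wl, bm], ho)

# Normal Q-Q coordinates: sorted residuals against standard-normal quantiles.
n = len(resid)
theo = [st.NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
qq = list(zip(theo, sorted(resid)))
for t, r in qq:
    print(f"{t:6.2f}  {r:8.2f}")
```

Plotting the `qq` pairs should give points close to a straight line if the residuals are roughly normal. Plotting `resid` against `fitted` is the usual companion check for curvature or unequal spread.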

#### unbiased95

##### New Member
Thank you for all the advice, hlsmith. I will let you know how it works out in the end.

#### unbiased95

##### New Member
Sorry to bother you again, guys. The assignment says that two models are considered for this dataset. One is a multiple linear regression model and isn't much of a problem. The other one, E(H)=b0+b1*M+(W/(b3+b4*M)), where H is the number of calories burnt, W is the number of calories burnt per hour and M is body mass in kg, really is a problem for me, since (and I quote) "b0,b1,b3 and b4 are all constants" obtained in an unspecified graphical way (the study dates from 1913). The equation estimated in the study is -138+4.5M+(W/(0.08+0.003M)), but there's no explanation of how they got it. Does this model ring a bell for any of you? I have no clue how to fit it in R either, or what to do with it given these constants.
Should I just go with the multiple linear model? But if so, I don't understand why the other one is in the assignment.
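
One possible way to handle this model, sketched here in Python with the standard library only: once b3 and b4 are fixed, H - W/(b3+b4*M) is linear in M, so b0 and b1 follow from a simple straight-line fit, and b3 and b4 can be searched on a grid around the study's values. This is a crude grid-search sketch, not how the 1913 study obtained its constants. The function names (`predict`, `line_fit`, `profile_fit`), the grid ranges and the data are all assumptions made for this example, and the data are generated noise-free from the published equation purely to show that the procedure recovers known constants.

```python
import statistics as st


def predict(w, m, b0, b1, b3, b4):
    """Model from the assignment: E(H) = b0 + b1*M + W / (b3 + b4*M)."""
    return b0 + b1 * m + w / (b3 + b4 * m)


def line_fit(x, y):
    """Simple least-squares line; returns (intercept, slope)."""
    mx, my = st.fmean(x), st.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def profile_fit(w, m, h, b3_grid, b4_grid):
    """Grid search over (b3, b4); b0 and b1 come from a line fit at each point."""
    best = None
    for b3 in b3_grid:
        for b4 in b4_grid:
            # With b3 and b4 fixed, H - W/(b3 + b4*M) is linear in M.
            y = [hi - wi / (b3 + b4 * mi) for wi, mi, hi in zip(w, m, h)]
            b0, b1 = line_fit(m, y)
            sse = sum((yi - b0 - b1 * mi) ** 2 for yi, mi in zip(y, m))
            if best is None or sse < best[0]:
                best = (sse, b0, b1, b3, b4)
    return best


# Prediction with the constants quoted from the study, for an arbitrary subject.
example = predict(w=40.0, m=60.0, b0=-138.0, b1=4.5, b3=0.08, b4=0.003)

# Hypothetical, noise-free data generated from the quoted equation (NOT the dataset).
wl = [19, 19, 56, 56, 37, 37, 74, 74]
bm = [43.7, 76.4, 43.7, 76.4, 54.6, 66.0, 54.6, 66.0]
ho = [predict(w, m, -138.0, 4.5, 0.08, 0.003) for w, m in zip(wl, bm)]

b3_grid = [k / 1000 for k in range(40, 121, 10)]   # 0.04 ... 0.12
b4_grid = [k / 10000 for k in range(10, 51, 5)]    # 0.001 ... 0.005
sse, b0, b1, b3, b4 = profile_fit(wl, bm, ho, b3_grid, b4_grid)
print(b0, b1, b3, b4)
```

In R, the usual tool for this kind of model would be `nls()`, with the study's constants as starting values. The grid search above just shows that the constants need not be a mystery: they can be estimated from the data like any other regression coefficients.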

#### katxt

##### Well-Known Member
This is just a thought .... If b4 is small compared with b3, which it is, then W/(b3+b4*M) can be re-expressed as W*(b6 + b7*M).
This is now a linear regression with an interaction term WxM.
All terms are significant, the residuals are now as normal as you could reasonably expect, and there is no obvious pattern in the residuals-versus-predicted plot.
I imagine that in 1913 they mucked about for days drawing experimental graphs.

#### katxt

##### Well-Known Member
Regarding my "which it is": on reflection, it isn't.
However, the regression with an interaction is in fact a very good fit with good residual diagnostics.
If you do a rotating 3D plot of the data, it forms a very definite sheet which is almost flat. The effect of the interaction is a slight twisting of the sheet.
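
A short worked derivation may help show where the interaction form comes from and why the approximation does not strictly hold. This is one reading of the argument, using the constants quoted from the study; the body mass of 60 kg is an illustrative value, not taken from the dataset.

$$
\frac{W}{b_3 + b_4 M}
= \frac{W}{b_3}\cdot\frac{1}{1 + \frac{b_4}{b_3} M}
\approx \frac{W}{b_3}\left(1 - \frac{b_4}{b_3} M\right)
= W\,(b_6 + b_7 M),
\qquad b_6 = \frac{1}{b_3},\quad b_7 = -\frac{b_4}{b_3^{2}}
$$

The first-order step requires $\frac{b_4}{b_3} M \ll 1$, so what matters is $b_4 M$ relative to $b_3$, not $b_4$ alone. With $b_3 = 0.08$, $b_4 = 0.003$ and $M = 60$:

$$
b_4 M = 0.003 \times 60 = 0.18 > 0.08 = b_3,
\qquad \frac{b_4}{b_3} M = 2.25
$$

So the expansion is not justified for realistic body masses, and the interaction model is best seen as an empirical alternative that happens to fit well rather than as an approximation to the 1913 equation.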

#### Koen Van de moortel

##### New Member
Karabiner wrote: "For a linear regression, it is not necessary that the dependent variable is normally distributed. If your sample size is small (n < 30 or so), then the residuals from the model should be normally distributed (in the population). If your sample size is larger, then the tests are considered robust against nonnormal residuals."
I would add to this: you can't, and shouldn't, force the residuals to be normally distributed. If they are not normally distributed, it's just an indication that another variable is influencing your dependent variable.

#### Koen Van de moortel

##### New Member
Maybe it's too late, but this is what I would do: a linear fit of heat output vs. work level/mass. What do you think? Does it make sense?
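
A minimal sketch of such a fit, assuming "work level/mass" means the ratio wl/bm used as a single predictor. It is written in Python with the standard library only; the data are hypothetical placeholders, not the attached dataset, and `fit_line` is a helper name made up for this example.

```python
import statistics as st

# Hypothetical illustration values, NOT the attached dataset.
wl = [19, 19, 56, 56, 37, 37, 74, 74]                  # work level, cal/hour
bm = [43.7, 76.4, 43.7, 76.4, 54.6, 66.0, 54.6, 66.0]  # body mass, kg
ho = [177, 212, 313, 402, 254, 268, 388, 427]          # heat output, calories


def fit_line(x, y):
    """Simple least-squares line; returns (intercept, slope)."""
    mx, my = st.fmean(x), st.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


# Single predictor: work level per kg of body mass.
ratio = [w / m for w, m in zip(wl, bm)]
intercept, slope = fit_line(ratio, ho)

fitted = [intercept + slope * r for r in ratio]
resid = [h - f for h, f in zip(ho, fitted)]

# Share of the variance in ho explained by the line.
mean_ho = st.fmean(ho)
r_squared = 1 - sum(e ** 2 for e in resid) / sum((h - mean_ho) ** 2 for h in ho)
print(intercept, slope, r_squared)
```

Comparing `r_squared` and the residual plot from this one-predictor fit with those of the two-regressor and interaction models would show whether collapsing wl and bm into a ratio loses much information.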
