Assessing covariates to include in model

#1
Hi all, I have a couple of basic questions that it would be great to get some help on.

I have a continuous outcome (brain volume: total, and divided into specific brain regions) and a categorical exposure (smoking, 4 categories).
I am using linear regression to analyse this relationship. So far I am not seeing very much in my adjusted models!

I have information on many covariates, which are coded in different formats/types: binary (e.g. sex), categorical (e.g. education level), and continuous (e.g. IQ). I want to check the association of each of these with both my exposure (categorical) and my outcome (continuous).

Should linear regression be used for all of these? Or do I need to use another statistical test when looking at categorical vs categorical data, etc.?

Thank you- any advice would be hugely appreciated, I am very new to epidemiology and biostatistics :)
 
#2
If the sample size is not very large, you should use linear regression where the candidate predictors are the following:

1) original numeric variables,
2) nonlinear transformations of the original numeric variables,
3) binary dummy variables representing each category of the nominal variables (except for the reference categories),
4) interactions of selected members of groups 1) - 3).
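
To make the four groups concrete, here is a minimal sketch using R's model.matrix(), which builds the design matrix from a formula. The data are simulated and the variable names (smoking, sex, iq) are made up for illustration; they are not from the original question.

```r
set.seed(1)
n <- 200

# Simulated covariates; names and values are purely illustrative
dat <- data.frame(
  smoking = factor(sample(c("never", "former", "light", "heavy"), n, replace = TRUE),
                   levels = c("never", "former", "light", "heavy")),
  sex     = factor(sample(c("F", "M"), n, replace = TRUE), levels = c("F", "M")),
  iq      = rnorm(n, mean = 100, sd = 15)
)

# 1) numeric variable (iq), 2) nonlinear transformation (iq^2),
# 3) dummy variables for smoking and sex, 4) smoking-by-sex interaction
X <- model.matrix(~ iq + I(iq^2) + smoking + sex + smoking:sex, data = dat)

# One column per coefficient; reference levels ("never", "F") get no dummy
colnames(X)
```

Each column of X corresponds to one coefficient to estimate, so ncol(X) is the number that goes into the observations-per-coefficient check.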

A common rule of thumb is that there should be at least 15 observations per coefficient to be estimated, so not all of the candidate predictors may find their way into any given model. You can use standard model selection procedures (forward stepwise selection, backward stepwise selection, lasso, etc.) to determine which predictors should be kept in the final model and which should be dropped.
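
As a rough sketch of both steps, the code below checks the observations-per-coefficient ratio and then runs backward stepwise selection with step(), which selects by AIC. The data are simulated, the variable names are hypothetical, and forcing the exposure to stay in every candidate model (via the lower scope) is my own assumption about what you would want here.

```r
set.seed(2)
n <- 300

# Simulated data; names and effect sizes are illustrative only
dat <- data.frame(
  smoking = factor(sample(c("never", "former", "light", "heavy"), n, replace = TRUE)),
  sex     = factor(sample(c("F", "M"), n, replace = TRUE)),
  iq      = rnorm(n, mean = 100, sd = 15),
  age     = runif(n, min = 40, max = 70)
)
dat$volume <- 1200 - 3 * dat$age + 40 * (dat$sex == "M") + rnorm(n, sd = 50)

full <- lm(volume ~ smoking + sex + iq + age, data = dat)

# Observations per estimated coefficient (intercept included)
obs_per_coef <- nobs(full) / length(coef(full))
obs_per_coef

# Backward stepwise selection by AIC; smoking is kept in every candidate model
selected <- step(full, scope = list(lower = ~ smoking),
                 direction = "backward", trace = 0)
formula(selected)
```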

You do not have to use ANOVA, since it is algebraically equivalent to linear regression with dummy-coded categories. However, choosing the ANOVA option in some statistical packages (SPSS, SAS, ...) produces extra, informative output.
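
A quick way to convince yourself of the equivalence is to fit the same model both ways and compare the overall F statistic. This is only a sketch on simulated data; the variable names are made up.

```r
set.seed(3)
n <- 120

# Simulated outcome and a 4-level exposure (illustrative only)
smoking <- factor(rep(c("never", "former", "light", "heavy"), each = n / 4))
volume  <- 1100 + rnorm(n, sd = 40)

fit_lm  <- lm(volume ~ smoking)    # linear regression with dummy variables
fit_aov <- aov(volume ~ smoking)   # one-way ANOVA

# Overall F statistic from each fit
f_lm  <- unname(summary(fit_lm)$fstatistic["value"])
f_aov <- summary(fit_aov)[[1]][["F value"]][1]

all.equal(f_lm, f_aov)
```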

If the sample size is very large and your focus is on the predictive accuracy (not interpretability), you can experiment with data mining methods (like boosted trees, SVM and such).
 
#3
Hi,

In order to adjust the volume for the influence of these additional covariates, you should include them in your regression analysis. The usual way is to build several reasonable models:

mod1 <- lm(volume ~ exposure)         # unadjusted
mod2 <- lm(volume ~ exposure + sex)   # adjusted for sex
mod3 <- lm(volume ~ exposure * sex)   # main effects plus exposure:sex interaction
mod4 <- lm(volume ~ exposure + IQ)    # adjusted for IQ
....

and then you can compare them via their AIC values with

AIC(mod1,mod2,mod3,mod4,...)

The model with the lowest AIC is the one to choose (fit all models to the same observations, otherwise the AIC values are not comparable). The exposure effect estimated from that model is then adjusted for the covariates it includes.
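
For a version you can run from start to finish, here is a sketch on simulated data. The data frame, the effect sizes, and the na.omit() step (so that every model uses the same rows) are my own assumptions for illustration, not part of the original analysis.

```r
set.seed(4)
n <- 250

# Simulated data; in this toy example only sex truly affects volume
dat <- data.frame(
  exposure = factor(sample(c("never", "former", "light", "heavy"), n, replace = TRUE)),
  sex      = factor(sample(c("F", "M"), n, replace = TRUE)),
  IQ       = rnorm(n, mean = 100, sd = 15)
)
dat$volume <- 1100 + 60 * (dat$sex == "M") + rnorm(n, sd = 40)

# Drop incomplete rows once, so all models are fitted to the same observations
dat <- na.omit(dat)

mod1 <- lm(volume ~ exposure,       data = dat)
mod2 <- lm(volume ~ exposure + sex, data = dat)
mod3 <- lm(volume ~ exposure * sex, data = dat)
mod4 <- lm(volume ~ exposure + IQ,  data = dat)

# AIC() returns a data frame with one row per model (columns df and AIC)
tab <- AIC(mod1, mod2, mod3, mod4)
tab[order(tab$AIC), ]   # lowest AIC first
```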