Good evening.


My question concerns the correct (and ideally "best") way of finding out which predictors matter most for determining an outcome of interest.

I currently have a very small sample (45 patients), 20 continuous variables describing characteristics of the patients' hearts, and a three-level categorical variable indicating which condition each patient has (control, condition A, condition B). The dataset is balanced, with 15 patients per condition.

| Outcome | P1  | P2  | P3  | ... | P20  |
|---------|-----|-----|-----|-----|------|
| control | 2.5 | 0.7 | 1.1 | ... | 3.5  |
| control | 1.5 | 1.2 | 9.2 | ... | 5    |
| cond. A | 5.5 | 2.3 | 8.2 | ... | 1.2  |
| cond. A | 6.5 | 3.6 | 0.2 | ... | 3.1  |
| control | 2.5 | 1.1 | 2.3 | ... | 0.05 |
| cond. B | 3.5 | 9.8 | 3.5 | ... | 0.7  |

I want to find out which of these 20 variables help most in determining the condition of interest.

What I did was:

- perform an ANOVA for each variable to determine whether a statistically significant difference exists between the means of the three groups. In other words, I ran ANOVA 20 times (once per variable), then corrected the p-values for multiple testing with the Benjamini-Hochberg procedure. This left me with a subset of variables (from 20 down to 10).

- perform a post-hoc analysis (Tukey's HSD) for each of these 10 variables, with multiple-comparison correction, to find out which groups differ. As a side note, I'm only interested in the differences between control and condition A and between control and condition B; I'm not interested in the difference between condition A and condition B.
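For concreteness, the two steps above can be sketched in R roughly as follows. This uses simulated data in place of my real dataset; `dat` and the `P1`...`P20` columns are placeholders, and the group effect on the first three variables is made up purely so the example has something to detect:

```r
# Simulated stand-in for the real data: 45 patients, 3 balanced groups, 20 variables
set.seed(1)
dat <- data.frame(outcome = factor(rep(c("control", "condA", "condB"), each = 15)))
for (i in 1:20) {
  # Variables P1-P3 get an artificial +2 SD shift in condition A
  dat[[paste0("P", i)]] <- rnorm(45) + 2 * (i <= 3) * (dat$outcome == "condA")
}
vars <- paste0("P", 1:20)

# Step 1: one-way ANOVA per variable, then Benjamini-Hochberg correction
p_raw <- sapply(vars, function(v) {
  fit <- aov(reformulate("outcome", response = v), data = dat)
  summary(fit)[[1]][["Pr(>F)"]][1]
})
p_adj <- p.adjust(p_raw, method = "BH")
kept  <- vars[p_adj < 0.05]

# Step 2: Tukey's HSD on each surviving variable, keeping only the
# comparisons that involve the control group
for (v in kept) {
  tk <- TukeyHSD(aov(reformulate("outcome", response = v), data = dat))$outcome
  print(tk[grep("control", rownames(tk)), , drop = FALSE])
}
```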

Let's suppose we focus on **control vs condition A** and adopt a one-vs-rest approach, where I consider the 15 patients with condition A as cases and the other 30 patients as controls. This is my guess:

- take a look at the correlation between each independent variable (IV) and the dependent variable.
- consider the subset of IVs with the highest correlations and build an additive logistic regression model (the following is an R snippet fitting a generalized linear model with a logit link):

      model <- glm(outcome ~ IV1 + IV2 + ... + IVn, data = data, family = binomial)

- take a look at the model's coefficients and keep the statistically significant ones.
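As a concrete sketch of this one-vs-rest step (again on simulated data; `dat`, the `P` columns, and the choice of keeping the top 2 IVs are placeholders, not my real setup):

```r
set.seed(2)
dat <- data.frame(outcome = factor(rep(c("control", "condA", "condB"), each = 15)))
for (i in 1:5) {
  # Artificial signal: P1 and P2 are shifted upward in condition A
  dat[[paste0("P", i)]] <- rnorm(45) + (i <= 2) * (dat$outcome == "condA")
}

# One-vs-rest recoding: condition A = 1, everyone else = 0
dat$is_A <- as.integer(dat$outcome == "condA")

# Screen IVs by (point-biserial) correlation with the binary outcome
vars <- paste0("P", 1:5)
cors <- sapply(vars, function(v) cor(dat[[v]], dat$is_A))
top  <- vars[order(abs(cors), decreasing = TRUE)][1:2]

# Additive logistic regression on the screened IVs
model <- glm(reformulate(top, response = "is_A"), data = dat, family = binomial)
summary(model)$coefficients   # estimates, standard errors, and p-values
```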

The same procedure would then be repeated for **control vs condition B**.

I'm worried about the following points:

- as far as I understand it, correlation helps me determine whether a linear relationship exists between two variables, along with its magnitude and direction. By basing my exploratory analysis on correlation, I risk discarding a variable that shows no correlation with the dependent variable simply because their relationship is not linear. Is that correct?
- By building an additive model, I'm not exploring any interactions between the variables, and hence may be losing important information. Is this correct?
- Given that my sample is so small, I don't think I will be able to build a model with many variables, because the risk of overfitting is very high. I was reading that, as a rule of thumb, I can afford roughly one predictor per 10 data points (for logistic regression this is often stated as 10 events in the rarer outcome class per predictor).
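On the first worry: yes, correlation-based screening can miss a strong but non-linear relationship entirely. A minimal illustrative example (deterministic toy data, not related to my dataset):

```r
# A perfectly deterministic but non-monotonic relationship
x <- seq(-3, 3, length.out = 61)
y <- x^2

# By symmetry, the Pearson correlation is essentially zero,
# even though y is completely determined by x
round(cor(x, y), 10)   # ~0
```

So a variable with a U-shaped relationship to the outcome would be thrown away at the screening step; plotting each IV against the outcome first would help catch such cases.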

Thank you very much!!

Francesco
