Multivariate logistic regression

#1
Hello

For a research project (a prospective analytical study of a cohort of patients with and without a disease state), I am trying to predict the disease state using a few clinical variables, some categorical and some continuous. Cut-off points for the continuous variables were found using AUC. So now there are twelve categorical variables being checked as predictors of the disease state.

On univariate logistic analysis, eight of these variables were found to be statistically significant in predicting the disease (p < 0.05). Odds ratios with 95% confidence intervals have been calculated for all of them. Some of the confidence intervals are very wide.

To identify independent predictors, I then tried multiple logistic regression in SPSS and MedCalc, wanting to calculate adjusted odds ratios, but I have failed to do so.

I understand that screening variables by univariate analysis with a p-value below 0.2 is considered a less-than-ideal way of picking variables for multivariate analysis, but I have no other way of knowing which of the twelve variables are independent risk factors for the disease - in other words, which of them predict it.

Can someone kindly help me to do this analysis, or suggest a better way of doing it?
 

hlsmith

Less is more. Stay pure. Stay poor.
#3
Well for one, are you really predicting disease? Are some of your covariates signs/symptoms of disease, AKA effects? If so, you are doing retrodiction. Further trouble can occur from using actual predictors and retrodictors in the same model. Look up the Markov blanket.

Yes, using univariate analyses in this day and age is considered a faux pas. Variable selection should be determined via content knowledge and existing research. Selecting based on significance alone risks placing irrelevant variables in the model - such as common effects (colliders) or just spurious variables. The best approach is using content knowledge with data splits: build and validate covariates using two data splits, then test them and get estimates using a third holdout set. The splitting should be based on a random process.
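
For what that could look like in practice, here is a minimal R sketch of a random three-way split; df is a hypothetical data frame with one row per patient (an illustration, not a prescription for this particular data set):

Code:
# Minimal sketch of a random three-way split (build / validate / holdout).
# Assumes a data frame df with one row per patient - purely illustrative.
set.seed(123)
split <- sample(rep(c("build", "validate", "test"), length.out = nrow(df)))
build_set    <- df[split == "build", ]
validate_set <- df[split == "validate", ]
test_set     <- df[split == "test", ]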
 
#5
Thank you for the reply.

I am predicting a disease state - specifically, lymph node metastasis in a particular type of cancer. The twelve covariates are clinicopathological and radiological factors, and I am trying to predict lymph node metastasis based on these.

My knowledge of statistics is limited to basic use of SPSS and MedCalc. Due to some constraints I am unable to avail the services of a statistician, so I am doing the statistics myself.
Content knowledge on this matter is variable and not well established.
In layman's terms this is what I was hoping to achieve:

1. There were forty-three patients with cancer who underwent surgery, which included systematic lymph node dissection.

2. Out of the forty-three, only eight patients were actually found to have disease in the lymph nodes.

3. Systematic lymph node dissection has its own set of adverse effects after surgery.

4. So if we could identify a subset of patients at high risk of lymph node metastasis, then lymph node dissection could be done only in those patients. This would save the rest from undergoing a morbid procedure.

5. For this, I used twelve variables (7 continuous interval-scale and 5 dichotomous categorical), such as patient's age, BMI, some blood and biopsy reports, and imaging parameters.

6. Used ROC curves and AUC for the continuous variables to define cut-off levels, and thus had dichotomous categories for all twelve variables (a sketch of this step appears after this list).

7. Did univariate logistic analysis for all 12 variables; found 8 of them to be significant.

8. Have calculated sensitivity, specificity, positive predictive value, negative predictive value, false positive rate, false negative rate, and accuracy for each of these factors individually.
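
For reference, the cut-off search in step 6 could be done in R with the pROC package (the analysis above used SPSS/MedCalc); metastasis and suv_max are hypothetical column names, and "best" here means the Youden index:

Code:
# Hedged sketch of an ROC-based cut-off (step 6); column names are
# hypothetical, and later replies advise against dichotomizing at all.
library(pROC)
roc_obj <- roc(df$metastasis, df$suv_max)  # 0/1 outcome, continuous marker
auc(roc_obj)                               # area under the curve
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))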

Now comes the point where I am stuck:

1. I was hoping to do a multivariate analysis on either the selected 8 or on all 12 variables, to find independent predictors of lymph node metastasis.

2. I wanted to assess various combinations of these 12 (or the 8 significant) factors to predict lymph node metastasis - maybe like a probability risk matrix where, based on the presence or absence of the variables in different combinations, the probability of lymph node metastasis can be predicted (see the sketch after this list).
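
If a multivariable model could be fitted, the risk matrix in point 2 would simply be the model's predicted probabilities over the predictor combinations. A hedged sketch with two hypothetical dichotomous predictors:

Code:
# Sketch of a "probability risk matrix": predicted probability of
# metastasis for each combination of two hypothetical 0/1 predictors.
mod  <- glm(metastasis ~ pet_positive + marker_high,
            family = binomial, data = df)
grid <- expand.grid(pet_positive = 0:1, marker_high = 0:1)
grid$predicted_prob <- predict(mod, newdata = grid, type = "response")
grid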

I am aware that the sample size is small, but given the study period and disease incidence, this was the maximum number I could get. This is part of an academic dissertation, so the approach and the attempt matter more than the validity and representativeness of the results in the population at large.

I have hit a block and don't know how to go about this.
 

hlsmith

Less is more. Stay pure. Stay poor.
#6
Never use cut-off levels; always use the raw distribution. Dichotomizing data only loses information. For example, if you have a BMI of 30, are you truly any different from a person with a BMI of 29.9? And a person could vary between a BMI of 29.9 and 30 depending on the time of day.

You don't have enough data to model your question. The historic rule of thumb is that you need around 20 cases for each predictor. You have 8 cases, so you don't really even have enough data for an empty model (intercept only). Any results you find could easily be spurious, and even if something actually generalizes, the magnitude of the effect will be off. Do these eight people accurately represent all other cases in the world? Given sampling variability, you could simply have an anomalous sample. Also, with eight people the predictor combinations will make no real sense and the standard errors will be very large. In particular, look at how many people fall into the combinations of the groups - most of the time you will have zero people in the subgroup combinations.
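
To put numbers on that rule of thumb: with 8 events and 20 cases per predictor, 8 / 20 = 0.4, i.e. the data support less than one predictor; even the more lenient 10-events-per-variable rule gives 8 / 10 = 0.8. Twelve predictors would call for roughly 120-240 events.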

I get that it is a perfunctory process - but it will serve no utility beyond that. The results will be even worse if you look at different model results and make decisions based on those - that is always a recipe for spurious findings that don't replicate.
 
#7

I understand and agree with your points. Unfortunately the sample size could not be larger, and the study has to be submitted.
On multiple regression analysis, the odds ratios for most variables come out either in the billions or down to the billionth decimal place.

Is there some solution? Can some other test be applied to show the prediction?
 

Karabiner

TS Contributor
#8
So you want to predict just 8 cases (out of 43), using 12 variables or so.

You could maybe just collapse the 12 variables into an index, look at whether this index works reasonably well, and suggest that the index be examined further in some follow-up study with a larger sample size.
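
A minimal sketch of such an index in R, assuming the 12 dichotomous variables are coded 0/1 and named x1..x12 (an unweighted sum, since weights should not be estimated from this same small sample):

Code:
# Hedged sketch: unweighted sum index over 12 dichotomous (0/1)
# variables, then a quick look at how it separates the outcome y.
df$index <- rowSums(df[, paste0("x", 1:12)])
table(df$index, df$y)  # does the index separate cases from non-cases?
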
"And the study has to be submitted."
Where?

With kind regards

Karabiner
 

hlsmith

Less is more. Stay pure. Stay poor.
#9

Are you stating the ORs are crazy big or infinitesimally small? That is the sparsity at play. I am surprised the model converged at all; I bet the SEs are wildly huge. If you had prior context knowledge, a Bayesian model could be fit to regularize the coefficients.
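
For illustration, two hedged options in R: arm::bayesglm() fits a logistic model with weakly informative priors that pull extreme coefficients toward zero, and logistf::logistf() applies Firth's penalization, which is designed for sparse or separated data like this. The predictor names are hypothetical:

Code:
# Hedged sketch of regularized alternatives to plain glm() for sparse
# data; x1 and x2 stand in for whatever predictors are justified.
library(arm)      # bayesglm: weakly informative Cauchy priors by default
m_bayes <- bayesglm(y ~ x1 + x2, family = binomial, data = df)
summary(m_bayes)

library(logistf)  # Firth's bias-reduced logistic regression
m_firth <- logistf(y ~ x1 + x2, data = df)
summary(m_firth)  # finite estimates even under (quasi-)separation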
 

hlsmith

Less is more. Stay pure. Stay poor.
#10
General comment -

Code:
set.seed(555)
# Outcome: 8 cases and 35 non-cases, mirroring the 8/43 data set above
parts_1 = rep(1, 8)
parts_2 = rep(0, 35)
y = c(parts_1, parts_2)
y
# Twelve pure-noise binary predictors, each with a random prevalence
x1 = rbinom(43, 1, runif(1))
x2 = rbinom(43, 1, runif(1))
x3 = rbinom(43, 1, runif(1))
x4 = rbinom(43, 1, runif(1))
x5 = rbinom(43, 1, runif(1))
x6 = rbinom(43, 1, runif(1))
x7 = rbinom(43, 1, runif(1))
x8 = rbinom(43, 1, runif(1))
x9 = rbinom(43, 1, runif(1))
x10 = rbinom(43, 1, runif(1))
x11 = rbinom(43, 1, runif(1))
x12 = rbinom(43, 1, runif(1))

df = data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12)
df
# Full logistic model: 12 noise predictors for only 8 events
mod1 = glm(y ~ ., family = "binomial", data = df)
summary(mod1)
# Odds ratios with 95% profile-likelihood CIs
exp(cbind(OR = coef(mod1), confint(mod1)))
I just sloppily made up a whole bunch of independent variables, and by chance x3 has an OR of 38.6 (95% CI: 2.0, 4107.7) with the outcome. How do you address such issues in your analyses?

Your project is like asking 43 people who they voted for and then collecting their:
political affiliation,
age,
gender identity,
religion,
color,
creed,
marital status,
familial status,
disability status,
race,
sexual orientation

and saying that for the 8 people who voted for person X, these characteristics are associated with all people who vote for person X, and this is the magnitude of those associations for all the voters! If someone told you this, would you believe these eight people's characteristics represented those of all the constituents? So unless this is just a pedagogical exercise never to see the light of day - I would report descriptive stats and state that the data set is absolutely too small to power any analyses or support any inferences. Doing any analytics would be an unprofessional act.
 
#11
Thank you for the reply.

Your suggestion is very wise, but unfortunately I am not aware of how to collapse these variables into an index. I would greatly appreciate it if you could kindly explain how to do so in SPSS or MedCalc.

The study has to be submitted as a postgraduate dissertation. It only has perfunctory value; the approach will be assessed more than the validity and representativeness of the results.

Thank you.
 
#12
A few odds ratios are extremely big and a few infinitesimally small. I understand the limitation of the study in having a very small sample size; I can't have a bigger one. What would you suggest as a possible solution with the given dataset?
In another response Karabiner has suggested collapsing the 12 variables into an index.
I will try to find out more on how to go about it and see if that can help.
Basically I have been stuck for the past week, with no solution in sight. Karabiner's suggestion seems to be the only hope now.
 
#13

I understand your concern and comment.
This is indeed a very small sample size, but given the temporal limitations a bigger sample was not possible.
Like I said earlier, this is part of a postgraduate thesis in the medical field. The results may or may not be applicable at large.

Let me give a little context:

There have been some research articles predicting lymph node metastasis using different clinicopathological variables, and some using imaging parameters like MRI or PET scan. I was hoping to combine all of these and see if any combination could better predict lymph node metastasis.
I agree that it may be unprofessional to do analytics on this data set. But like I said, this is just an academic exercise, in partial fulfillment of eligibility to write the exit exam.
Nonetheless, I believe there must be some way of professionally and fruitfully analysing this dataset, albeit small. It is just that I haven't been able to find a way yet. Karabiner's advice is worth following; I will try that and see if it solves my problem.
Thank you again for your comments and help.
 

hlsmith

Less is more. Stay pure. Stay poor.
#14
"I believe there must be some way of professionally and fruitfully analysing this dataset, albeit small. It is just that I haven't been able to find a way yet."
You are wrong - there really isn't some magic panacea. You just don't have enough data to support or test that many predictors, and the outcome is imbalanced. I would just go to the literature, find the two best predictors, and run a model with those. Boom, you are done, and you have learned a lesson about sample size, power, and generalizability. That is your conclusion and pedagogical moment. Overfitting a model even for perfunctory reasons is still silly when the purpose is to acquire a skill set to apply in the future.
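
As a hedged sketch of what such a two-predictor model could look like in R (the predictor names are hypothetical and would come from the literature, not from this sample's p-values):

Code:
# Sketch: logistic model with just two literature-based predictors,
# kept on their raw (undichotomized) scales; names are hypothetical.
mod2 <- glm(metastasis ~ suv_max + tumor_size,
            family = binomial, data = df)
summary(mod2)
exp(cbind(OR = coef(mod2), confint(mod2)))  # ORs with 95% CIs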
 
#15
Thank you for the reply. I understand your points and totally agree with you. But you did provide me with the magical solution: the option to choose just a couple of predictors. Maybe use the PET-CT parameters combined with one clinicopathological variable at a time, like (PET-CT + Variable 1), (PET-CT + Variable 2), (PET-CT + Variable 3), and compare them.
Would that be an acceptable approach?
The limitation of the study would definitely be the small sample size and the need to re-examine the hypothesis in the future with a bigger sample.
 
#16
Sorry for bothering again.

Just wanted to ask if it would make sense if I tried to do the following:

There are twelve variables, all categorical with dichotomous values: Low or High.
Based on univariate logistic regression analysis, I divide them into three groups:

Group 1. Non-significant
Group 2. Significant - p value 0.001 to 0.05
Group 3. Highly significant - p value <0.001

Each High from Group 1 is given score 1
Each High from Group 2 is given score 2
Each High from Group 3 is given score 3
All Lows are given score 0

A sum of the scores (Index Score) is calculated for each patient and checked to see whether it predicts lymph node metastasis. Does this make any sense at all?

Using ROC analysis, a cut-off point for the Index Score can be found, and sensitivity, specificity, and likelihood ratios can be calculated.

Would it be a correct approach, given the present limitations of this study?
 
#17

I just tried that.
The AUC is 0.945.

This means good discriminatory power: roughly a 94.5% chance that a randomly chosen positive case gets a higher Index Score than a randomly chosen negative case.


I am not sure if I was supposed to, but I also tried a univariate logistic regression analysis of the Index Score.
This is the result.

Variables in the Equation
                        B        S.E.     Wald     df   Sig.   Exp(B)
Step 1a  Index_Score     .438     .132   11.021    1    .001    1.550
         Constant      -5.962    1.717   12.056    1    .001     .003
a. Variable(s) entered on step 1: Index_Score.


I am unable to interpret this.
 

Karabiner

TS Contributor
#18
"Would it be a correct approach, given the present limitations of this study?"
No, I'm afraid not. A p-value cannot be used to summarize the strength of an association. And if you pre-select variables and/or develop scores for the variables within the same sample where you apply the resulting sum scale, you are nearly guaranteed to produce overfitting. In the present case, with just n=8 cases to predict, it gets even worse: non-significant results cannot be distinguished from type II errors, due to extremely low power.

My idea was to include more of the theoretical knowledge instead of trying to twist curls on a shiny bald head: i.e., the 12 (or maybe fewer) indicators are based on theoretical and practical considerations and on pre-existing literature, so they are simply put together and used for the prediction in the current sample. If they show some predictive value, then fine - you can present this and suggest to further check & improve the index in future studies.

With kind regards

Karabiner
 
#19

Thank you for the prompt reply.
I understand that my approach would produce overfitting, as I don't have a pilot study in a different population where the index could have been developed, and it is not ideal to develop and then apply the index in the same population.

I also understand that the p-value measures significance and not the strength of association.

No previous study has taken these twelve parameters together. There are some with a few of these separately or in different combinations, but none of them have provided any practice-changing evidence.

Can you please suggest some other way of making an index score with these variables, if not by p-value magnitude?
I understand that developing and applying the index score in the same sample will cause overfitting, but I have no other option.
Your suggestions will be greatly appreciated.

PS: Can I use the odds ratio values to group the variables into the categories, instead of the p-values?
 

hlsmith

Less is more. Stay pure. Stay poor.
#20
General comment - bivariate models (DV = single IV) carry a tremendous risk: they do not account for the possible relationships between the variables. So X1 may be mediated by X2, or confounded by it. Unless all relevant variables are in the model, you may not know their true statistical association. However, your sample is too small for this. In actuality, it may be too small for any variables.

Why do you keep dichotomizing the continuous variables? Sure, it is fun, but you lose information doing that. Confirm the linearity-in-the-logit assumption for continuous variables and then just use the raw variable in the model. A cut-off threshold created on such a small sample will never hold, and it is poor practice to report its effect on the sample used to create it - a similar issue to what @Karabiner mentioned.
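
One common check for linearity in the logit is a Box-Tidwell style term (the predictor multiplied by its own log); a hedged sketch, with suv_max again a hypothetical continuous predictor:

Code:
# Box-Tidwell style check: a significant x*log(x) term suggests the
# predictor's relationship with the logit is non-linear (requires x > 0).
df$suv_log <- df$suv_max * log(df$suv_max)
bt <- glm(metastasis ~ suv_max + suv_log, family = binomial, data = df)
summary(bt)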

It would be like me saying: does your model suck, yes/no? When there is a small chance it does not completely suck - but given a black-and-white rule - I have to say it sucks. Good luck.