Patient data - Choosing the Right Model

#1
Hello,
I just finished my first year as a statistics major and have been assigned a summer project by my employer at the hospital near my school. The aim of the project is to use data from 50 patients infected with a certain disease to create a data model, and then use that model to predict whether or not other patients should be tested for this disease. I'm given 15 measures, such as blood pressure, heart rate, etc., which we already know are directly related to the disease. My supervisor suggested using a neural network, and it seems like a logical approach after researching the method online; however, with my limited experience in statistical analysis, I would like input from more knowledgeable statisticians on whether or not a neural network would be the best approach. Also, a professor of mine told me that R would be the best software for attacking this project, though I don't know whether that's true. Please let me know what approach might help me achieve my goal of having a fully functional data model by the end of the summer. Like I said, I have pretty limited experience conducting this kind of high-level analysis, so any help would be phenomenal. Thank you for your time!
-Aspiring Statistician
 
#2
I'm no statistician, but I am a nurse with some basic working knowledge of stats - and how they are applied in the "real world"... so let me just help out by asking some questions.

It sounds like you have a great project starting! Congrats!

Are you looking at a group of patients retrospectively? In other words, do you already have the test results for those patients?
What is the test, and what type of result does it give? Is it on a numerical scale (such as a lab value, e.g. HbA1c for diabetes), a nominal result (e.g. malignant/benign for cancer), or perhaps a categorical result (e.g. tumor stage)?
Are you aiming to establish which measure/test/lab is most sensitive for predicting the disease (e.g., A1c is a better predictor than a basic blood glucose fingerstick)?
Or is the aim to understand the risk factors related to a disease (e.g., the effect of obesity, smoking, diabetes, or hypertension on heart failure)?

Good luck!! Enjoy the project! You'll be a valuable resource in any hospital!
 
#3
Thanks! I'm very excited about this as well; it's my first real opportunity to dive into analysis that has real-world repercussions. To answer your questions: yes, I'm looking at information from patients we have already tested and who have been diagnosed with the disease. I'm not sure exactly what the test involves, since I work on the analytic side of things while the physicians and nurses take care of the clinical aspects, but the result is categorical: patients either do not have the disease or are diagnosed with a certain stage of it. My goal is not to determine which measures most influence the result; from my understanding, I'm working with measures that have already been directly linked to the disease. My goal is to take those same measures for a patient whose result I don't have and determine, with as high an accuracy as possible, whether they are free of the disease or, if not, which stage of it they have. Thanks so much for making me think about this more critically, by the way. These questions alone have given me a better understanding of my own data set!
 

#4
"My supervisor suggested using a neural network, and it seems like a logical approach after researching the method online; however, with my limited experience in statistical analysis, I would like input from more knowledgeable statisticians on whether or not a neural network would be the best approach."
Since I am not too familiar with neural networks, I cannot tell whether it is the *best* approach under these circumstances. But due to the unfavourable ratio between the number of predictors (15) and the number of patients (n=50, which is very small anyway), common methods such as multiple logistic regression will certainly be of limited use here: a frequent rule of thumb asks for something like 10 events per predictor, and with 15 predictors that would require far more patients than you have.

One very simple approach would be to construct a scale from the 15 predictors. E.g., each characteristic is given a score of 1 if above a certain threshold and 0 if below; the thresholds could be determined from clinical knowledge or using ROC curve analyses. You could then determine which total score gives the optimal prediction. But this approach would also suffer from the small sample size, so it would need to be validated on new samples.
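
A minimal sketch of that scoring idea in R, on made-up data (the data frame `patients`, the measure names `m1`–`m15`, and the median cutoffs are all hypothetical stand-ins; real thresholds should come from clinical knowledge or ROC analysis):

```r
# Toy data standing in for 50 patients with 15 numeric measures
# and a binary disease status (all names are hypothetical).
set.seed(1)
patients <- data.frame(matrix(rnorm(50 * 15), nrow = 50))
names(patients) <- paste0("m", 1:15)
patients$disease <- rbinom(50, 1, 0.4)

# Score each measure 1 if above its threshold (here: the median,
# purely as a placeholder), 0 otherwise, and sum into one scale.
measures <- paste0("m", 1:15)
flags <- sapply(measures, function(m)
  as.integer(patients[[m]] > median(patients[[m]])))
patients$score <- rowSums(flags)

# Cross-tabulate the scale against disease status to see which
# score cutoff separates the two groups best.
table(score = patients$score, disease = patients$disease)
```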

Just my 2pence

K.
 
#5
Yeah, the small sample size is a very big problem here. There is a serious risk of overfitting (finding a model that predicts your data set very well but does very poorly on new data). In theory, a neural network would be great here with a large data set, but since NNs have even more parameters (predictors × hidden units + hidden units, for one dependent variable), I don't think it's well suited.
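
To put a rough number on that (the hidden layer size here is a hypothetical choice, just for the arithmetic):

```r
# Parameter count for a single-hidden-layer network: even a
# modest hidden layer gives more weights than there are patients.
predictors <- 15
hidden <- 5  # hypothetical choice
predictors * hidden + hidden  # 80 parameters vs. only 50 patients
```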

What is the measurement level of the independent variables? And how many dependent variables do you have and what is their measurement level? I'm guessing "has the disease" and "stage" should be two separate variables?

As a rule of thumb, I've seen numbers of about 15:1 for cases to predictors in biomedical research. So that would mean about 3 predictors in your case (50/15 ≈ 3).

So maybe you could start by looking at the correlations between the predictors and the dependent variable(s), then take the 3 strongest and try some kind of regression (depending on the nature of the predictors and the dependent variable). And once you have the model, make sure you run some kind of study testing it (do you have a gold standard?). Make sure you don't raise expectations too high when you finish your model... take a good look at your R^2 and the specificity/sensitivity.
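
A rough sketch of that screening step in R, again on made-up data (`patients`, `m1`–`m15`, `disease`, and the 0.5 cutoff are all hypothetical placeholders; note the in-sample sensitivity/specificity will be optimistic, which is exactly why a validation study is needed):

```r
# Toy data: 50 patients, 15 measures, binary outcome.
set.seed(1)
patients <- data.frame(matrix(rnorm(50 * 15), nrow = 50))
names(patients) <- paste0("m", 1:15)
patients$disease <- rbinom(50, 1, 0.4)

# Rank the measures by strength of association with the outcome
# and keep the three strongest.
cors <- sapply(paste0("m", 1:15),
               function(m) cor(patients[[m]], patients$disease))
top3 <- names(sort(abs(cors), decreasing = TRUE))[1:3]

# Logistic regression on those three predictors.
fit <- glm(reformulate(top3, response = "disease"),
           family = binomial, data = patients)
summary(fit)

# Classify at a 0.5 probability cutoff and compute in-sample
# sensitivity and specificity from the confusion table.
pred   <- factor(as.integer(fitted(fit) > 0.5), levels = c(0, 1))
actual <- factor(patients$disease, levels = c(0, 1))
tab <- table(predicted = pred, actual = actual)
c(sensitivity = tab["1", "1"] / sum(tab[, "1"]),
  specificity = tab["0", "0"] / sum(tab[, "0"]))
```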

Depending on the size of the correlations I guess you could add a few more. I'm by no means an expert, so I'm curious to hear what others think.
 