# Thread: Master thesis, statistics check!

1. ## Master thesis, statistics check!

About me: Medical student with minimal statistical knowledge or research experience

Question related to: statistical regression/correlation analysis of a medical trial

Trial design: Double blind, randomized controlled trial. Placebo vs vitamin with a few months between baseline and follow-up measurements. Follow-up and baseline measurements are identical.

Measurements performed, possibly relevant for answering my questions:
In total 6 tests are performed twice. Once at baseline, once at follow-up.
Insulin sensitivity testing; 2 tests.
- First provides 3 values usable for analysis (one of these is important, other 2 secondary)
- Second test provides 4 quite comparable values usable for analysis (differences are minor because of different calculation methods)
Vascular function testing; 4 tests.
- First provides one important useable value
- Second provides 4 values
- Third provides 3 values (1 important, 2 secondary)
- Fourth provides 4 equally important variables

Important! Current situation:
- Goal of study: 120 participants
- Current status: 13 participants included with baseline measurements taken, 8 of them have also completed follow-up measurements.
- Double blinded study, with no option to de-blind (until trial completion) to see effect of the vitamin. Solution for me: Looking at the correlations between insulin sensitivity (insulin resistance) and vascular function (vascular dysfunction).

What I did so far:
Baseline correlation (N=13)
Using the 13 baseline measurements, I performed Spearman correlation (in SPSS) using all variables of insulin sensitivity and vascular function. I also included screening data such as BMI, age etc. I figured Pearson would be inferior because of the small sample size (and thus not a normal distribution).
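For anyone who wants to see what the Spearman step amounts to outside SPSS, here is a minimal Python sketch. It assumes Spearman's rho is computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The data values and variable names are made up for illustration and are not from the trial.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical baseline values for 13 participants (illustration only).
insulin_sensitivity = [4.1, 5.3, 3.8, 6.0, 4.7, 5.9, 3.2, 4.4, 5.1, 6.3, 3.9, 4.9, 5.6]
vascular_function = [5.2, 6.1, 4.9, 7.4, 5.0, 6.8, 4.1, 5.5, 6.0, 7.9, 4.4, 5.8, 6.5]

rho = spearman_rho(insulin_sensitivity, vascular_function)
print(f"Spearman rho = {rho:.3f}")
```

Because only the ranks enter the calculation, any monotonic relationship (not just a linear one) gives a rho of 1 or -1.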

Follow-up/baseline difference (N=8)
I made a separate database where I subtracted the baseline (BL) data from the follow-up (FU) data. This results in a database with both positive and negative values (because sometimes the values improved at follow-up, and sometimes they deteriorated).
I did this because I want to know if there is a relationship between the change in insulin sensitivity and the change in vascular functioning between BL and FU. In other words: does a decrease (between BL/FU) in insulin sensitivity go together with an increase in vascular dysfunction?
To analyse this, I used the Spearman correlation again. Also, I tried some linear regression analysis (in SPSS) on these numbers, using insulin sensitivity parameters as the independent variable.
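The subtraction and the single-predictor regression described above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the numbers are invented, the function names are hypothetical, and the fit is plain ordinary least squares with one predictor, which may differ in detail from the SPSS output.

```python
from statistics import mean

def change_scores(baseline, follow_up):
    """Follow-up minus baseline for each participant (paired by position)."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow-up must cover the same participants")
    return [fu - bl for bl, fu in zip(baseline, follow_up)]

def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical values for the 8 participants with both measurements.
insulin_bl = [4.1, 5.3, 3.8, 6.0, 4.7, 5.9, 3.2, 4.4]
insulin_fu = [4.6, 5.0, 4.4, 6.1, 4.2, 6.5, 3.0, 4.9]
vascular_bl = [5.2, 6.1, 4.9, 7.4, 5.0, 6.8, 4.1, 5.5]
vascular_fu = [5.9, 5.8, 5.6, 7.3, 4.6, 7.5, 4.0, 6.1]

delta_insulin = change_scores(insulin_bl, insulin_fu)     # independent variable
delta_vascular = change_scores(vascular_bl, vascular_fu)  # dependent variable

slope, intercept = simple_linear_regression(delta_insulin, delta_vascular)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A positive slope here would mean that participants whose insulin sensitivity rose also tended to show a rise in the vascular measure. It describes an association between the two changes, not a causal direction.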

My questions:
Regarding baseline correlation:
(1) Is Spearman correlation indeed the best (and easiest) way to assess the correlation between insulin resistance and vascular dysfunction in this situation?
Regarding the Follow-up/baseline difference analysis:
(2) Does it even make sense to use Spearman correlation in this situation?
(3) Is linear regression the best way to analyse the difference between follow-up and baseline in this situation, using the subtracted data?
(4) Would it make sense to correct for age, sex or other variables in such a small sample size (N=8) with regression analysis?
General questions:
(5) After reading this, any other suggestions regarding the statistical analysis approach? Or any specific references I could use which deal with situations like this specifically?

PS. Note to administrators: this was my first time on this forum. I wrote this relatively long post in the latest version of Google Chrome, and when I clicked preview after spending a decent amount of time typing, I was prompted to log in again because my session had expired. After logging in again, I got a white screen. Pressing back did not work, and everything I wrote had disappeared. Very frustrating.
I wrote this post again in Word and then pasted it into the forum to prevent this from happening again. Is this a known issue?

2. ## Re: Master thesis, statistics check!

Hi there!

Originally Posted by DutchMedicalStudent
Important! Current situation:
- Goal of study: 120 participants
- Current status: 13 participants included with baseline measurements taken, 8 of them have also completed follow-up measurements.
Can you elaborate on why you're analysing data at this stage, given that your data collection is incomplete and your current sample is so tiny?

3. ## Re: Master thesis, statistics check!

Originally Posted by CowboyBear
Can you elaborate on why you're analysing data at this stage, given that your data collection is incomplete and your current sample is so tiny?
I'm currently doing a research internship which only lasts a few months. The study will last another year or two, and is done by my supervisor here. Only helping with the current study does not meet the university requirements so I also have to do my own research questions. This is the only data available from this study so far and there's only 2 months left before the thesis has to be complete so this is all the data I can use, unfortunately.

4. ## Re: Master thesis, statistics check!

How is your dependent variable measured? That determines if Spearman makes sense or not.

It's common to run ANCOVA to control for things in a randomized test.

5. ## Re: Master thesis, statistics check!

Originally Posted by DutchMedicalStudent
I'm currently doing a research internship which only lasts a few months. The study will last another year or two, and is done by my supervisor here. Only helping with the current study does not meet the university requirements so I also have to do my own research questions. This is the only data available from this study so far and there's only 2 months left before the thesis has to be complete so this is all the data I can use, unfortunately.
To be totally honest I would suggest talking to someone else in your university (not just your supervisor) for advice about how to proceed here, whether this study will really be sufficient for a passing grade, and if not whether you have any alternative options. A correlational study with N=13/8 has very weak power (i.e., even if there is a substantial correlation, you'd have little chance of detecting it with a sample size this small) - meaning that there probably isn't much value in an analysis like this. It sounds like a tricky situation though, so make sure you sound out your options carefully.
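To put a rough number on "very weak power", here is a small Python sketch. It uses the Fisher z approximation, which is a large-sample approximation and only a rough guide at N=8 or N=13. The true correlation of 0.5 is an assumed value chosen for illustration, not an estimate from this study.

```python
from math import atanh, sqrt
from statistics import NormalDist

def approx_power_correlation(rho, n, alpha=0.05):
    """Approximate two-sided power to detect a true correlation rho with n pairs.

    Uses the Fisher z transformation, where atanh(r) is treated as normal
    with standard error 1 / sqrt(n - 3). Rough for very small n.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = atanh(rho) * sqrt(n - 3)
    # Probability that the test statistic lands in either rejection region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Assumed true correlation of 0.5; sample sizes taken from the thread.
for n in (8, 13, 120):
    power = approx_power_correlation(0.5, n)
    print(f"n = {n:3d}: approximate power = {power:.2f}")
```

Under these assumptions the power at N=8 and N=13 comes out well below one half, while the planned 120 participants would give power close to 1.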

Re. the Pearson/Spearman choice: A small sample size does not imply a non-normal distribution of your variables. I'd probably use Pearson, since it doesn't throw away information by turning observations into ranks, and thus should have slightly higher power. If you're worried about non-normality you can calculate confidence intervals via bootstrapping instead of normal theory.
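A minimal sketch of the bootstrapping idea in Python, in case it helps. It assumes a simple percentile bootstrap that resamples participants (pairs) with replacement; other bootstrap intervals exist and may behave better at this sample size. The data and function names are made up for illustration.

```python
import random
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def bootstrap_ci_pearson(x, y, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for Pearson's r."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(x)
    estimates = []
    while len(estimates) < n_resamples:
        # Resample participants with replacement, keeping each (x, y) pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # Skip degenerate resamples where one variable is constant (r undefined).
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    tail = (1 - confidence) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical baseline values for 13 participants (illustration only).
x = [4.1, 5.3, 3.8, 6.0, 4.7, 5.9, 3.2, 4.4, 5.1, 6.3, 3.9, 4.9, 5.6]
y = [5.2, 6.1, 4.9, 7.4, 5.0, 6.8, 4.1, 5.5, 6.0, 7.9, 4.4, 5.8, 6.5]

low, high = bootstrap_ci_pearson(x, y)
print(f"r = {pearson_r(x, y):.3f}, 95% bootstrap CI = ({low:.3f}, {high:.3f})")
```

With so few observations the interval will typically be wide, which is itself a useful thing to report alongside the point estimate.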

I wouldn't use multiple regression in the follow-up analysis: The use of difference scores controls for individual difference variables implicitly, and you don't have degrees of freedom to burn on including other predictor variables.

7. ## The Following User Says Thank You to CowboyBear For This Useful Post:

ondansetron (05-30-2017)

8. ## Re: Master thesis, statistics check!

Originally Posted by noetsi
How is your dependent variable measured? That determines if Spearman makes sense or not.

It's common to run ANCOVA to control for things in a randomized test.
Doesn't that require me to know which group had what? All my variables are continuous/scale.

Originally Posted by CowboyBear
To be totally honest I would suggest talking to someone else in your university (not just your supervisor) for advice about how to proceed here, whether this study will really be sufficient for a passing grade, and if not whether you have any alternative options. A correlational study with N=13/8 has very weak power (i.e., even if there is a substantial correlation, you'd have little chance of detecting it with a sample size this small) - meaning that there probably isn't much value in an analysis like this. It sounds like a tricky situation though, so make sure you sound out your options carefully.

Re. the Pearson/Spearman choice: A small sample size does not imply a non-normal distribution of your variables. I'd probably use Pearson, since it doesn't throw away information by turning observations into ranks, and thus should have slightly higher power. If you're worried about non-normality you can calculate confidence intervals via bootstrapping instead of normal theory.

I wouldn't use multiple regression in the follow-up analysis: The use of difference scores controls for individual difference variables implicitly, and you don't have degrees of freedom to burn on including other predictor variables.
There is no other option. This is what I have to use. Luckily the correlation and the linear regression analysis have both resulted in multiple interesting significant values, so apparently the correlations are strong enough to become significant in this small population.

I have tested for normality, and the majority of the variables are non-normally distributed. So using Pearson for higher power is, in my opinion, a bad choice. (Also, the advantage of higher power is not necessary, as Spearman has already provided sufficient significant correlations.)

I see your point with the multiple regression analysis. Thanks!

9. ## Re: Master thesis, statistics check!

It requires you to have variables that vary on some dimension you are interested in. But if you don't have that then any analysis would be impossible anyway. I don't know what you mean by knowing which group had what.
The same problem CWB mentions for regression, low power, applies to ANCOVA.

10. ## Re: Master thesis, statistics check!

Originally Posted by noetsi
It requires you to have variables that vary on some dimension you are interested in. But if you don't have that then any analysis would be impossible anyway. I don't know what you mean by knowing which group had what.
The same problem CWB mentions for regression, low power, applies to ANCOVA.
I thought that ANCOVA required you to split the population in groups. Since I don't know who had treatment or placebo, I cannot do that (population is still double blinded). I am simply looking at the correlation between variables.
I can't really treat baseline and follow-up as two separate groups either, because they are the same people. I subtracted one from the other to get a delta, which I can use for linear regression (ANOVA).

What exactly do you mean by "variables that vary on some dimension"? I have many variables for each part I'm interested in, and they vary quite a lot haha.

11. ## Re: Master thesis, statistics check!

What I mean is that to run ANOVA or regression you have to know 1) how your dependent variable varied (did they get better, did they die, or whatever you are measuring) and 2) how the predictor variable varied, for example whether someone had a treatment or not. If you don't know this information it's impossible to run any statistic I know of; you have nothing to compare. The point of standard statistics is to see how the dependent variable varied with the predictors. If you don't know how the predictors varied, obviously you cannot run such tests.

I am not aware of any statistics you can run where you don't know if the predictor took on a specific level. But they may well exist, I do not do medical research.

12. ## Re: Master thesis, statistics check!

(1) Is Spearman correlation indeed the best (and easiest) way to assess the correlation between insulin resistance and vascular dysfunction in this situation?
It is appropriate for small sample sizes such as yours. But, as was already mentioned, statistical power to detect effects will be very low.
Regarding the Follow-up/baseline difference analysis:
(2) Does it even make sense to use spearman correlation in this situation?
Yes. And you could do some scatterplots.

(3) Is linear regression the best way to analyse the difference between follow-up and baseline in this situation, using the subtracted data?
(4) Would it make sense to correct for age, sex or other variables in such a small sample size (N=8) with regression analysis?
Multiple regression is inappropriate given your extremely small sample size.

With kind regards

K.

13. ## The Following User Says Thank You to Karabiner For This Useful Post:

noetsi (05-31-2017)

14. ## Re: Master thesis, statistics check!

Originally Posted by noetsi
What I mean is that to run ANOVA or regression you have to know 1) how your dependent variable varied (did they get better, did they die, or whatever you are measuring) and 2) how the predictor variable varied, for example whether someone had a treatment or not. If you don't know this information it's impossible to run any statistic I know of; you have nothing to compare. The point of standard statistics is to see how the dependent variable varied with the predictors. If you don't know how the predictors varied, obviously you cannot run such tests.

I am not aware of any statistics you can run where you don't know if the predictor took on a specific level. But they may well exist, I do not do medical research.
Now I think I understand you. Pretty much all measurements are simply numerical. With all variables I know whether higher is "better" or "worse". I just wanted to say, there is no categorical value to divide people in groups. In the linear regression analysis I just wanted to know, for example, if people who had improved insulin sensitivity after 8 weeks also had better vascular function (and those who deteriorated after 8 weeks also had a worsened vascular function). That was the goal of this analysis. This means I compared two numerical/continuous/scale values. So I suppose I do know how the values "are varied", right?

Originally Posted by Karabiner
It is appropriate for small sample sizes such as yours. But, as was already mentioned, statistical power to detect effects will be very low.
Luckily and apparently, the correlations are very strong, because I found numerous significant results!

Originally Posted by Karabiner
Yes. And you could do some scatterplots.
Yeah, I think I will add some graphs to my results section as well!

Originally Posted by Karabiner
Multiple regression is inappropriate given your extremely small sample size.
That's what I thought. After using one independent variable I got a P of 0.053. After adding another variable, for which I should actually correct, I got P's of around 0.01 and 0.03 for those 2 variables!
Does this mean these P-values are probably too optimistic (and should really be higher) because of the small sample size?
Could I put these results in the thesis (because they are clearly significant now), with a side note that multiple regression analysis is not ideal in such a small sample? Or are 2 variables so "not done" with N=8 that I shouldn't even dare to put it in?

Originally Posted by Karabiner
With kind regards
K.
Thanks

15. ## Re: Master thesis, statistics check!

In general, there's the problem of overfitting. Statistical models tend to fit the sample too perfectly if there are many predictors and only a few observations. That means it is doubtful whether the results can be generalized to new data. On the other hand, you seem to have extremely strong associations, which makes me wonder why this wasn't known beforehand. By the way, you could have mentioned the size of the correlation coefficients and of the regression coefficients (and the adjusted R² of your multiple regression model). Admittedly, I am not sure whether overfitting is still a serious issue if the associations are this strong.
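To illustrate how the adjusted R² penalizes extra predictors at N=8, here is a small Python sketch using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The R² of 0.60 is an assumed value for illustration only, since the actual coefficients were not posted.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R²: shrinks R² according to the residual degrees of freedom."""
    residual_df = n_observations - n_predictors - 1
    if residual_df <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_observations - 1) / residual_df

# Assumed R² of 0.60 with the 8 participants from the thread.
assumed_r_squared = 0.60
for n_predictors in (1, 2, 3, 4):
    adj = adjusted_r_squared(assumed_r_squared, 8, n_predictors)
    print(f"{n_predictors} predictor(s): adjusted R² = {adj:.2f}")
```

The same raw R² is worth less and less as predictors are added, because with 8 observations each extra predictor uses up a large share of the remaining degrees of freedom.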

With kind regards

Karabiner

16. ## The Following User Says Thank You to Karabiner For This Useful Post:

DutchMedicalStudent (06-02-2017)
