Combining Two Samples (one with intervention and one without intervention)

hlsmith

Omega Contributor
#1
I have a local dataset with a continuous predictor and binary outcome. I will use logistic regression to predict the binary outcome with the continuous predictor.


I also have a national sample of comparable individuals, with the same continuous predictor and binary outcome. I will duplicate the above approach, though all individuals in the second dataset also received an intervention after the continuous variable was collected and before the outcome.


I am debating whether it is feasible to also merge these two datasets and predict the binary outcome with the continuous variable while controlling for and testing the effect of intervention status.


I wonder if I can do this or what I need to think about. I would assume (and also test) that the only difference between the datasets is that the national group got an intervention and background characteristics are comparable between datasets. Then examine the effect of the intervention on the outcome.


I am feeling that I may need to run a multi-level model. Then look to see if dataset of origin was a significant predictor (AKA intervention status). Though, I will only have two datasets of origin (local and national), so two groups in level two of the model.


So, the take-home question: can I merge these sets together and test the predictive value of the continuous variable, as well as the effect of which dataset the data came from, using a multi-level model?


Any thoughts?
 

CowboyBear

Super Moderator
#2
You could merge the two. One thing to think about is that, since everyone in dataset2 got the intervention, the two variables "intervention vs no intervention" and "sample1 vs sample2" are perfectly correlated, which makes it very hard to work out what is the effect of the intervention vs. what pre-existing differences there might be across the two samples. As a study of the effect of the intervention, what you have is in effect a static group comparison (a pre-experimental design), so your inferences will be very tentative. On the other hand, this doesn't mean you can't get a good assessment of the relationship between the continuous predictor and the binary outcome.
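
To make the confounding concrete, here is a minimal sketch with made-up rows. The column names `dataset` and `intervention` and the row counts are hypothetical, not taken from the actual data. Because every national row is treated and every local row is untreated, the two indicator columns come out identical, so no model can separate their effects.

```python
import math

# Hypothetical stacked data: 4 local rows (no intervention), 5 national rows (all treated).
rows = (
    [{"dataset": 0, "intervention": 0} for _ in range(4)]
    + [{"dataset": 1, "intervention": 1} for _ in range(5)]
)

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

dataset = [r["dataset"] for r in rows]
intervention = [r["intervention"] for r in rows]

# The two indicators are the same column, so their correlation is 1 (up to float rounding).
columns_identical = dataset == intervention
r = pearson(dataset, intervention)
print(columns_identical, round(r, 6))
```

Entering both indicators in one regression would give a rank-deficient design, so in practice only one of them can be included, and its coefficient mixes the intervention effect with any other difference between the samples.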

I would just want to think about what other differences there might be between the two datasets (in demographics, measurement procedures, etc).

I can't quite envisage how a multilevel model would apply in this instance, but feel free to talk more about your ideas on that.
 

Jake

Cookie Scientist
#3
I agree with CB.
"I would assume (and also test) that the only difference between the datasets is that the national group got an intervention and background characteristics are comparable between datasets."
That's a big assumption, and how exactly do you plan on testing it? It looks to me as if you have no means of testing it.
 

hlsmith

Omega Contributor
#4
Thanks for the feedback.

I neglected to out and out say it, but I have other variables, which could be used to examine background characteristics.

The reason I was thinking of a multilevel model was their use in meta-analyses to control for heterogeneity and to see if cluster (dataset) level data can account for additional model covariance.

I get that assumptions need to be made that are not testable, say related to "s-admissibility" since unknown/not collected confounders could exist.

I don't get why a multilevel model wouldn't work. Say you are looking at IQ in predicting a binary variable and you want to control for which school they went to. Are you saying that just including a variable for school A or B would be better than a multilevel model? If so, why? I get that this is not an ideal setup overall.
 

Jake

Cookie Scientist
#5
"Are you saying that just including a variable for school A or B would be better than a multilevel model? If so, why? I get that this is not an ideal setup overall."
Yes. You can't really fit a multilevel model with only two clusters. In a multilevel model you estimate the between-cluster variance, but getting a variance estimate based on only 2 data points is difficult to say the least. You can try to fit the model but you will probably run into convergence errors. It's generally recommended that you have something like 10 or more clusters.
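
A rough simulation can illustrate the point about between-cluster variance. This is only a sketch under simplifying assumptions: cluster effects are drawn from a normal distribution with true variance 1 and are treated as directly observed, which is more favorable than a real multilevel logistic model. The function name, seed, and replication count are arbitrary choices for illustration.

```python
import random
import statistics

def variance_estimates(n_clusters, n_reps=2000, true_sd=1.0, seed=1):
    """Repeatedly draw n_clusters cluster effects and estimate their variance."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        effects = [rng.gauss(0.0, true_sd) for _ in range(n_clusters)]
        # Sample variance with the usual n - 1 denominator.
        estimates.append(statistics.variance(effects))
    return estimates

for k in (2, 10, 30):
    est = variance_estimates(k)
    q = statistics.quantiles(est, n=20)  # cut points at 5%, 10%, ..., 95%
    print(
        f"clusters={k:2d}  5th pct={q[0]:.2f}  "
        f"median={statistics.median(est):.2f}  95th pct={q[-1]:.2f}"
    )
```

With two clusters the estimates should range from near zero to several times the true value, and the spread narrows as clusters are added. A fixed indicator for dataset sidesteps estimating this variance at all.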
 

hlsmith

Omega Contributor
#6
Yes, that was one of my initial concerns. I need to read a book on multilevel modeling. What do you think of Gelman's book, or do you have a recommendation for a novice? Besides convergence, I think the SE contribution would be very large.