Running three linear regression models?

#1
I'm currently trying to figure out if there's anything wrong with how I am choosing to analyze my data. Specifically, I have three groups, one dependent variable (y), and one independent variable (x). I am trying to predict the same y from the same x for each group. I suspect that one group shows a relationship while others do not. In order to do this, I've filtered a large data set, creating three smaller data sets for each group. I then ran a linear regression on each of these groups using the same DV and IV. My goal is to determine if x predicts y, but only for one group. Is there anything wrong with running an analysis like this?
 

hlsmith

#2
Yeah, just use all the data. The model will kick out estimates for two of the groups, and the reference group will be represented by the intercept.

You are creating pointless additional work; plus, you will need to go through the process of testing model assumptions three times and then explain and justify the process to everyone else.
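For example, a minimal sketch in R, assuming a data frame dat with columns y, x, and a three-level group (all placeholder names):

dat$group <- factor(dat$group)         # categorical grouping variable
fit <- lm(y ~ x + group, data = dat)   # one model on all the data
summary(fit)  # intercept = reference group; the two group terms are offsets from it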

Welcome to the forum!
 
#3
Thank you for the reply! How would I go about testing all the data at once? How would I ensure, in one analysis, that I am able to tease apart the effect of x on y for each group? Would that involve dummy/effect coding or multilevel modelling? Or could this be an ANOVA? I tried MLM and was having a really tough time with it. Whenever I ran a model with random slopes (which is my hypothesis), it failed to converge or reported a singular fit (I'm working in RStudio). I'm not very familiar with MLM and worry I may be doing something wrong with it. Not to mention the groups are very different in size: one is in the hundreds and one is in the thousands, and I'm wondering if that will affect the analysis.
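For reference, this is the kind of model I was attempting (a sketch using the lme4 package, with dat as a placeholder data frame; with only three groups the random-effects variances are very hard to estimate, which is a common cause of convergence and singularity warnings):

library(lme4)
m <- lmer(y ~ x + (1 + x | group), data = dat)  # random intercept and slope per group
summary(m)  # with only 3 levels of group, expect singular-fit warnings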
 

katxt

#4
If you aren't confident with linear models, I see no reason why you shouldn't do the three regressions as you proposed. The advantage of the single linear model is increased sample size and power, but it also comes with the problems of interpreting and explaining interactions and of showing that the residuals have the same distribution for each group. Don't worry about the sample sizes; you have plenty of data.
The hard-core statisticians probably won't agree, but sometimes it is worth sacrificing the extra power of a single analysis for something which is easy to do, easy to interpret and easy to explain, and which is still valid.
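In R, the three separate regressions might look like this (a sketch, with dat, y, x, and group as placeholder names):

fits <- lapply(split(dat, dat$group), function(d) lm(y ~ x, data = d))
lapply(fits, summary)  # one slope, SE, and p value per group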
 
#5
You can create a categorical group variable (i.e. as.factor() in R). The baseline group will be absorbed into the intercept term, and you can interpret the beta coefficients and so forth. Regression and ANOVA are equivalent; it's only the model form that changes. I would advise against doing three analyses: if we try to glean anything from the p-values/hypothesis tests we may run into a multiple comparisons problem, and it's less appropriate and more difficult to compare across groups with three separate analyses.

assuming a linear model, with group entered as a factor:
data$group <- as.factor(data$group)
model <- lm(y ~ x + group, data = data)
summary(model)  # baseline group is absorbed into the intercept
 
#6
Thanks for your help! I followed this up by looking into modelling interactions in linear regression, and that works. I effect coded my categorical grouping variable and ran a linear regression with an interaction between the grouping variable and my IV, e.g., y ~ x*group1 + x*group2 + x*group3. That way I can compare against the overall grand mean.
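One way to set this up in R, assuming dat with a three-level group factor (note that effect coding a three-level factor yields two contrast columns, which lm() constructs automatically under sum-to-zero contrasts):

dat$group <- factor(dat$group)
contrasts(dat$group) <- contr.sum(3)   # effect (sum-to-zero) coding
fit <- lm(y ~ x * group, data = dat)   # base terms plus the x-by-group interaction
summary(fit)  # group and interaction terms are deviations from the grand mean / average slope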
 

hlsmith

#8
Do you have a suspicion of an interaction? Why aren't you including the base terms in the model? What is this model for: school, work? If it is for actual, actionable decisions, I would do a lot more learning about best practices in analytics before drawing any conclusions or sharing results.
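For what it's worth, in R's formula syntax the * operator does include the base terms while : does not (a sketch, with dat as a placeholder):

lm(y ~ x * group, data = dat)  # expands to x + group + x:group (base terms included)
lm(y ~ x:group, data = dat)    # interaction only, no base terms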
 

hlsmith

#10
Perhaps that is why they mentioned MLM.

So you are trying to get the random intercepts and slopes of three groups for another covariate related to the outcome?
 

katxt

#11
This is how I read the situation. RedNightSkies (RNS) has three groups: perhaps C, a non-treatment control group; P, a placebo group; and T, a treatment group. There is a variable x which may or may not influence a response y. RNS has a shrewd idea that in group T the effect of x will be seen, and the graph of y vs x will show a rise (or fall), while the graphs for C and P will be more or less flat. RNS draws the graphs and it looks plausible. No doubt these three graphs will appear in the final paper. All RNS needs now are some p values to confirm it all.
A simple (some might say simplistic) approach is to do the three regressions and see if graph T has a significant slope while C and P do not. A careful researcher will check that the residuals in each graph are normal and even (though not necessarily with the same variance across groups), and a cautious researcher will adjust the critical significance cutoff to allow for the multiple p's.
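That adjustment could be as simple as a Bonferroni correction on the three slope p values (a sketch, reusing the placeholder names dat, y, x, and group):

fits <- lapply(split(dat, dat$group), function(d) lm(y ~ x, data = d))
pvals <- sapply(fits, function(f) summary(f)$coefficients["x", "Pr(>|t|)"])
p.adjust(pvals, method = "bonferroni")  # adjusted p value for each group's slope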
A more hard-core analyst might put all the data into one linear model with Group, x, and the interaction Group*x, or some similar variation. The idea is that, hopefully, the interaction will be significant, indicating that at least one slope differs from the others. (The advantage of this combined LM is that the error df is higher, meaning the critical F values are slightly smaller and so the power is increased, but only extremely slightly with samples of this size.) A careful analyst would check that the residuals were normal within each group and, additionally, that they had equal variance across the groups. This is more stringent, and more work, than the simple approach.
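The combined model and its interaction test, sketched in R (placeholder names again):

m <- lm(y ~ x * Group, data = dat)
anova(m)  # the x:Group row tests whether the slopes differ across groups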
Fortunately, the interaction turns out to be significant. However, this shows only that there are differences among the slopes, not that T's slope is significant while C's and P's are not. That can no doubt be shown by considering the sizes and SEs of the various estimates, but it is hard work and not obvious. In any event, it will involve three comparisons, so a cautious analyst will adjust the critical significance cutoff to allow for the multiple p's.
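One way to extract the per-group slopes and their adjusted tests is the emmeans package (a sketch, assuming the combined model m above):

library(emmeans)
m <- lm(y ~ x * Group, data = dat)
slopes <- emtrends(m, ~ Group, var = "x")  # slope of x within each group, with SEs
test(slopes, adjust = "bonferroni")        # each slope against zero, adjusted
pairs(slopes, adjust = "bonferroni")       # pairwise slope differences, adjusted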
In short, I would suggest that the best approach is the three regressions. As I said before, it is easy to do, easy to interpret and easy to explain.
 

hlsmith

#12
What if C has a mild positive slope and T has a mild negative slope, neither significant in the individual regressions, but, when fitted in the same model, they are disordinal and potentially 'significant'? Would the individual-regression approach miss this?

I like how you created a back story!