Testing linearity assumption in a multiple regression model

Dusan11

New Member
Hello,

I would like to know how one can test the linearity assumption in a multiple regression model.

I know that for a simple regression, this assumption is checked visually by examining the scatter plot of the predictor (X) against the outcome (Y): the dots should form an ellipse aligned along the regression line.

However, I am not sure how to do this in a multiple regression model. Should the residuals vs. fitted curve have a specific shape?

Thanks
BW

Dusan

noetsi

No cake for spunky
You look to see if there is a non-linear pattern in the residuals. Residuals should not have a specific shape.
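A minimal sketch of that check, assuming NumPy and simulated data (the variable names, the three predictors, and the quadratic effect of x0 are all made up for illustration). It fits an ordinary linear model, then averages the residuals within bins of each predictor. Under linearity the bin means should hover around zero; a curved effect shows up as a systematic pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))
# Hypothetical truth: x0 enters quadratically, x1 and x2 enter linearly.
y = (1.0 + 1.5 * X[:, 0] ** 2 + 0.8 * X[:, 1] - 0.5 * X[:, 2]
     + rng.normal(scale=0.5, size=n))

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients, fitted values, residuals."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    return beta, fitted, y - fitted

def binned_means(x, resid, n_bins=5):
    """Mean residual within equal-sized bins of x (sorted from low to high)."""
    order = np.argsort(x)
    return np.array([chunk.mean() for chunk in np.array_split(resid[order], n_bins)])

beta, fitted, resid = fit_ols(X, y)
for j in range(X.shape[1]):
    # A U-shaped or otherwise systematic sequence suggests non-linearity in that predictor.
    print(f"x{j}:", np.round(binned_means(X[:, j], resid), 2))
```

The same residuals can of course be plotted against the fitted values or against each predictor with a smoother overlaid; the binned means are just a numeric stand-in for eyeballing the plot.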

hlsmith

Less is more. Stay pure. Stay poor.
Look for a stochastic spread with no funnels, unless you have crazy big data. What is your sample size?

Dusan11

New Member
Thanks both!
I was thinking of a model with 5-6 covariates, with a sample size of >300.
I know that for the homoscedasticity assumption, the plot of residuals (y) vs. fitted values (x) should show no pattern and no funnel, but I wasn't sure about linearity in a multiple regression setting.
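For reference, a rough numeric version of the "no funnel" check, assuming NumPy and simulated data with five covariates (the coefficients and the noise model are invented for illustration). It compares the residual spread across quarters of the fitted values; a spread that grows or shrinks steadily is the funnel.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.uniform(0, 1, size=(n, 5))
mu = 0.5 + X @ np.array([1.0, 0.5, -0.5, 2.0, 1.0])
# Hypothetical heteroscedastic noise: the standard deviation grows with the mean.
y = mu + rng.normal(scale=0.3 * mu, size=n)

A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
resid = y - fitted

# Residual standard deviation within each quarter of the fitted values, low to high.
order = np.argsort(fitted)
spread = np.array([chunk.std(ddof=1) for chunk in np.array_split(resid[order], 4)])
print(np.round(spread, 2))
```

Roughly equal values across the four quarters would be consistent with homoscedasticity; here the simulated spread increases with the fitted value by construction.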

Best

D.

noetsi

No cake for spunky
I like to fit a generalized additive model to look for nonlinear relations.
How do you actually test this way? Not analyze the results?

I never understood how you analyze some parameters with splines or loess (which are entirely graphical in nature) and others with a specific parameter (the ones that are non-linear). They are apples and oranges: parametric and non-parametric in one model.

Buckeye

Active Member
To go along with noetsi, does a GAM make sense for models with all categorical predictors? Just curious. I'm starting to dig into this as the parametric models have been troublesome recently. It could be that I'm missing important predictors.

hlsmith

Less is more. Stay pure. Stay poor.
@noetsi I just throw a spline term in the model to see what the relationship looks like, and I take that into consideration along with my context knowledge. Below is a supplemental figure from a recent paper I wrote. I like to look at the df, which serves as a pseudo value for how many line segments are in the figure. Here there are ~2, which is supported by the content knowledge, since there are actually two underlying data generating functions. In particular, once someone is old enough their intentions/actions change.
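A stripped-down sketch of the idea, assuming NumPy and simulated data. This is not a penalized GAM smoother: it uses a single fixed-knot linear spline (a hinge term) so that the "two line segments" picture is explicit. The age variable, the knot at 50, and the slopes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
age = rng.uniform(18, 80, size=n)
z = rng.normal(size=n)   # another covariate that enters linearly
knot = 50.0              # hypothetical change point, assumed known here
y = (0.05 * age - 0.15 * np.maximum(age - knot, 0) + 0.5 * z
     + rng.normal(scale=0.5, size=n))

def fit(A, y):
    """Least-squares fit; returns the residual sum of squares and the coefficients."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r), beta

ones = np.ones(n)
hinge = np.maximum(age - knot, 0)            # zero before the knot, linear after it
A_lin = np.column_stack([ones, age, z])
A_spl = np.column_stack([ones, age, hinge, z])
rss_lin, _ = fit(A_lin, y)
rss_spl, beta_spl = fit(A_spl, y)

# F statistic for the one extra hinge term (nested-model comparison).
F = (rss_lin - rss_spl) / (rss_spl / (n - A_spl.shape[1]))
slope_before = beta_spl[1]
slope_after = beta_spl[1] + beta_spl[2]
print(round(F, 1), round(slope_before, 3), round(slope_after, 3))
```

In a real GAM the knot locations and the amount of smoothing are chosen by the fitting procedure rather than fixed in advance, and the effective df summarizes how wiggly the resulting curve is.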

hlsmith

Less is more. Stay pure. Stay poor.
@Buckeye - per my limited knowledge, if the variable was binary, the relationship would be linear, since you just connect the two point estimates. But if you had ordinal categories, you could put them in a model to look at trend or shape. I would imagine you would need to assume that a change from 1 to 2 is the same as a change from 3 to 4, etc. I usually use the splines to visualize suspicions and to direct how I treat the term afterwards. You can always put these ordinal categories in the model as a continuous variable if the spline seems reasonably linear, or the spline may say, hey, treat these as categories, since the estimates bounce around.
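One way to sketch that decision without a spline, assuming NumPy and simulated data: compare a single linear trend term for the ordinal variable against full dummy coding with a nested-model F statistic. The five levels and both sets of level effects are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
level = rng.integers(1, 6, size=n)   # ordinal predictor with levels 1..5

def extra_f(level, y):
    """F statistic for dummy coding versus a single linear trend term."""
    A_trend = np.column_stack([np.ones(len(y)), level])
    # One indicator column per level; together they span the intercept too.
    A_dummy = (level[:, None] == np.arange(1, 6)).astype(float)

    def rss(A):
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ beta
        return float(r @ r)

    rss_t, rss_d = rss(A_trend), rss(A_dummy)
    df_extra = A_dummy.shape[1] - A_trend.shape[1]   # 5 - 2 = 3
    df_resid = len(y) - A_dummy.shape[1]
    return ((rss_t - rss_d) / df_extra) / (rss_d / df_resid)

noise = rng.normal(scale=0.7, size=n)
# Case 1: equal steps between levels, so a continuous term is adequate.
y_steady = 0.5 * level + noise
# Case 2: level effects that bounce around, so categories are the safer coding.
bumps = np.array([0.0, 1.0, 0.2, 1.5, 0.3])
y_bouncy = bumps[level - 1] + noise

f_steady = extra_f(level, y_steady)
f_bouncy = extra_f(level, y_bouncy)
print(round(f_steady, 2), round(f_bouncy, 2))
```

A small F suggests the equal-spacing assumption is tolerable and the variable can go in as continuous; a large F suggests keeping it categorical.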

@noetsi - below is another image from the same paper, where I visualize an interaction using quantile regression, which reveals some potential non-linearity in estimates across values:

You just need to remember to be transparent with your audience about how much of this was suspected a priori and how much is exploratory. With enough data you can do it as exploratory and then apply it to a held-out data split.
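A minimal sketch of the held-out idea, assuming NumPy and simulated data (the quadratic effect, the 50/50 split, and the helper names are assumptions for illustration). The functional form is explored and fitted on one half, and the candidate models are then judged by prediction error on the half that was never looked at.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 600
x = rng.normal(size=n)
z = rng.normal(size=n)
# Hypothetical truth with a curved effect of x.
y = 1.0 + 0.5 * x + 0.7 * x**2 + 0.3 * z + rng.normal(size=n)

# Random split: one half for exploration, one half held out.
idx = rng.permutation(n)
explore, holdout = idx[: n // 2], idx[n // 2:]

def design(x, z, curved):
    """Design matrix with an intercept, x, z, and optionally a squared term for x."""
    cols = [np.ones(len(x)), x, z]
    if curved:
        cols.append(x**2)
    return np.column_stack(cols)

def holdout_mse(curved):
    """Fit on the exploration half, then score predictions on the held-out half."""
    beta, *_ = np.linalg.lstsq(design(x[explore], z[explore], curved),
                               y[explore], rcond=None)
    pred = design(x[holdout], z[holdout], curved) @ beta
    return float(np.mean((y[holdout] - pred) ** 2))

mse_linear = holdout_mse(False)
mse_curved = holdout_mse(True)
print(round(mse_linear, 2), round(mse_curved, 2))
```

If the form found during exploration is real, it should still predict better on the held-out half; if it was an artifact of searching, the advantage tends to disappear there.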

Images were supplemental file to this paper: https://pubmed.ncbi.nlm.nih.gov/34498075/

noetsi

No cake for spunky
Thank you for reminding me that dummy variables are always linear, pretty much by definition. Most of my predictors are dummy variables. The problem I have is that there are 40-plus variables in my model that are statistical controls before I get to look at the things I am interested in. And there is exactly zero theory in my field on relationships or structural form, so I cannot test theory or easily prespecify a model.

hlsmith

Less is more. Stay pure. Stay poor.
@Buckeye he mentions 'controlling' for categorical variables in this, but not using them. I have another resource I will post when I come across it again!