Linear Regression with Negative Response Variable

Hi everyone,

I was reviewing a regression analysis for a colleague the other day. I know that linear regression works fine with negative values for the Y variable. But then I noticed that 20% of the values for the response = 0 exactly. Because of this, I doubt that the model is correctly specified amongst other things. I've never worked with what I would call 'non-positive and zero-inflated continuous data'. Can anyone suggest an alternative? I thought about just leaving out the observations with 0 and analyzing them on their own (in some other fashion). Any help is appreciated.

On second thought, here's an additional note:
The response variable represents amount of money added or subtracted from another value.
However, the data contain only negative values plus about 20% zeros. So, we are essentially estimating the amount of money we are subtracting away.
But, since it's a monetary value I can just flip the sign and interpret the number accordingly.

Thus, I can say I have zero-inflated non-negative continuous data.


Less is more. Stay pure. Stay poor.
Hey, can you provide a histogram of the distribution of the dependent variable? Also, what is the sample size and what predictors may exist?

I simulated some data. It looks somewhat like a lognormal distribution if it were on the positive axis. About 20% of these values are 0. Flipping the sign and interpreting it differently is not an issue here. The problem can be summarized as follows: Someone fit a plain old linear regression to this data. Predictor variables aside, I know this isn't a good idea. I think there may be an explanation for all the 0's which could make modeling them useless. I need more information before I can comment further.
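For anyone who wants to play with this kind of data, here is a minimal sketch of the usual two-part way of thinking about it: one part for zero versus non-zero, one part for the size of the positive values. Everything in it is an assumption for illustration (the lognormal shape, the parameter values, the function names), using the sign-flipped version of the response. It is not the data from this thread.

```python
import math
import random

def simulate(n, p_zero=0.2, mu=6.0, sigma=1.0, seed=0):
    """Hypothetical data: exact zeros with probability p_zero, else lognormal."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p_zero else rng.lognormvariate(mu, sigma)
            for _ in range(n)]

def two_part_summary(y):
    """Summarise the zero part and the positive part separately."""
    positives = [v for v in y if v > 0]
    p_positive = len(positives) / len(y)            # part 1: P(y > 0)
    logs = [math.log(v) for v in positives]
    mean_log = sum(logs) / len(logs)                # part 2: log-scale mean
    var_log = sum((v - mean_log) ** 2 for v in logs) / (len(logs) - 1)
    # Mean of a lognormal is exp(mu + sigma^2 / 2); combine the two parts:
    # E[y] = P(y > 0) * E[y | y > 0]
    overall_mean = p_positive * math.exp(mean_log + var_log / 2)
    return p_positive, mean_log, var_log, overall_mean

y = simulate(20000)
p_pos, mu_hat, var_hat, overall_mean = two_part_summary(y)
print(p_pos, mu_hat, var_hat, overall_mean)
```

With predictors, each part would get its own model (for example a logistic model for zero versus non-zero and a regression on the positive values), but whether that is appropriate depends on why the zeros occur.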


Active Member
It might be productive, instead of taking a linear regression of difference = ..., to move the variable being subtracted to the right-hand side, i.e.
Y1 - Y2 = mx + b ---> Y1 = Y2 + mx + b, ANCOVA sort of style. Or just a scatter plot of Y1 vs. Y2 would be revealing.
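A rough sketch of the difference between the two formulations, on simulated data. The data-generating values (0.9, 2, 5), the sample size, and the helper `ols` are all made up for illustration. The point is only that the change-score model forces the coefficient on Y2 to be exactly 1, while the ANCOVA-style model estimates it.

```python
import random

def ols(X, y):
    """Least squares via the normal equations and Gauss-Jordan elimination.
    X is a list of rows; include a leading 1.0 for the intercept."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] +
         [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data: the true coefficient on Y2 is 0.9, not 1.
rng = random.Random(1)
n = 2000
y2 = [rng.uniform(50, 150) for _ in range(n)]
x = [rng.uniform(0, 5) for _ in range(n)]
y1 = [5 + 0.9 * a + 2 * b + rng.gauss(0, 1) for a, b in zip(y2, x)]

# ANCOVA style: Y1 = b + c*Y2 + m*x, with c estimated from the data.
ancova = ols([[1.0, a, b] for a, b in zip(y2, x)], y1)

# Change score: (Y1 - Y2) = b + m*x, which fixes c = 1.
change = ols([[1.0, b] for b in x], [p - q for p, q in zip(y1, y2)])

print("ANCOVA  [b, c, m]:", ancova)
print("Change  [b, m]:   ", change)
```

If the estimated coefficient on Y2 is clearly different from 1, the change-score model is imposing a restriction the data do not support.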


Less is more. Stay pure. Stay poor.
Look up change scores. Frank Harrell is a big fan of not using differences and instead modelling the second value while controlling for the first value.
It might be productive, instead of taking a linear regression of difference = ..., to move the variable being subtracted to the right-hand side, i.e.
Y1 - Y2 = mx + b ---> Y1 = Y2 + mx + b, ANCOVA sort of style. Or just a scatter plot of Y1 vs. Y2 would be revealing.
I like this idea. The initial value Y1 represents the value of a piece of property. After the property is inspected, we determine how much should be added or subtracted to the value based on its condition. So, it's plausible to think the amount we subtract off is related to the initial value.


Less is more. Stay pure. Stay poor.
In general I like to use this analogy:

If a person weighs 400 pounds, losing 5 pounds may not even be noticeable compared to a 160-pound person losing the same amount. I also use long hair versus short hair and cutting off 2 inches.


Less is more. Stay pure. Stay poor.
P.S., I just learned that buckeye nuts are called buckeyes, since they look like the eyes of bucks. I guess I never thought too hard about that before.
New update: I said earlier that 20% of my response variable is 0. It turns out that one of the independent variables almost completely determines the zeros (>99% of the time), i.e., when x = 3, y = 0. In this case, does it make sense to keep these observations in the model if I can tell you y with almost 100% certainty whenever x takes that specific value?
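A quick way to check this kind of claim is to tabulate the share of exact zeros at each level of the predictor. The sketch below uses a tiny made-up dataset and a hypothetical helper `zero_share_by_level`; it treats x as taking discrete levels, so a continuous x would need to be binned first.

```python
from collections import defaultdict

def zero_share_by_level(x, y):
    """Share of observations with y == 0 at each distinct level of x."""
    counts = defaultdict(lambda: [0, 0])  # level -> [n_zero, n_total]
    for xi, yi in zip(x, y):
        counts[xi][0] += (yi == 0)
        counts[xi][1] += 1
    return {level: nz / n for level, (nz, n) in sorted(counts.items())}

# Made-up illustration data, not the data from this thread.
x = [1, 2, 3, 3, 3, 4, 3, 2, 5, 3]
y = [-40.0, -15.0, 0.0, 0.0, 0.0, -5.0, 0.0, -20.0, -2.0, 0.0]

shares = zero_share_by_level(x, y)
for level, share in shares.items():
    print(level, share)
```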


Less is more. Stay pure. Stay poor.
Can you better describe what these variables are and the data generating function between them? So you want the outcome to be dependent, but if they are literal proxies this 'may' partially drown out some other covariates.
Some of the information is sensitive so I'm trying to respect that aspect. But the variable of interest is the value of a vehicle after inspection. So, there is a starting value and an ending value of the vehicle. We want to be able to talk about how our inspection process impacts the value that we assign to the vehicle. The things I've tried thus far violate some of the linear modeling assumptions.

If I regress Y_post ~ Y_pre, then the R^2 is roughly 98%. lol. Now, consider Y_post ~ Y_pre + x. When x is a specific value I know the value of Y_post 100% of the time without modeling for it. What worries me most are the assumption violations.
Well, the qqplot for log(Y_post) ~ log(Y_pre) + x is left skewed. The residual vs fitted doesn't look terrible, but there are some large negative residuals that create their own pattern such that it's not an "even band around 0". I think the model form could be tweaked. Attached is a pretty accurate depiction of the qqplot. What is your opinion of Y_post ~ Y_pre explaining almost 98% of the variability in the response? I'm almost thinking I should model the difference Y_post - Y_pre with a glm of sorts.


Less is more. Stay pure. Stay poor.
Histogram isn't terrible; perhaps add Huber-White (heteroskedasticity-robust) SEs to the estimates.
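To make the Huber-White suggestion concrete, here is a minimal sketch for a one-predictor regression, comparing the classical slope SE with the HC0 sandwich SE. The simulated data (noise whose spread grows with x) and the function name are assumptions for illustration. In practice a stats package would handle the multi-predictor case and the small-sample variants.

```python
import math
import random

def ols_with_robust_se(x, y):
    """Slope of y ~ x with classical and HC0 (Huber-White) standard errors."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical SE: assumes one common error variance.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # HC0 SE: each squared residual is weighted by its own (x - xbar)^2.
    meat = sum(((xi - xbar) * e) ** 2 for xi, e in zip(x, resid))
    se_robust = math.sqrt(meat) / sxx
    return slope, se_classical, se_robust

# Hypothetical heteroskedastic data: noise SD grows like x^2.
rng = random.Random(7)
x = [rng.uniform(1, 10) for _ in range(5000)]
y = [2 + 3 * xi + rng.gauss(0, 0.1 * xi ** 2) for xi in x]

slope, se_classical, se_robust = ols_with_robust_se(x, y)
print(slope, se_classical, se_robust)
```

The point estimate is unchanged; only the standard error is adjusted, which is why this is a cheap thing to add when the residual spread is uneven.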

So if the setting is price before and after an appraisal, it is what it is. So everyone gets an appraisal score that represents 'X'?

If smaller prices have a disproportionate change I think you need to account for that! Can you confirm this? Is this what is going on in the qqplot tail?
Yes, there is a 'score' we give after the assessment. And we want to know, based on this score, what the expected value for the property is (all else equal). So, you're suggesting that maybe going from 1 to 2 in the score for an expensive property is not the same as going from 1 to 2 for a less expensive property? That's fair, I'll have to check that out. Suppose that's the case. Then, I might be able to add an interaction term.
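One way to write the interaction idea down, assuming the log-log form used earlier in the thread and writing s for the score (the symbols are my own notation):

```latex
\log(Y_{\text{post}}) = \beta_0 + \beta_1 \log(Y_{\text{pre}}) + \beta_2 s
                      + \beta_3 \, s \cdot \log(Y_{\text{pre}}) + \varepsilon

% Effect of a one-unit increase in the score on log(Y_post):
\frac{\partial \log(Y_{\text{post}})}{\partial s} = \beta_2 + \beta_3 \log(Y_{\text{pre}})
```

If \beta_3 = 0, a one-point change in score has the same proportional effect at every price level; a non-zero \beta_3 lets that effect depend on the starting value.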


Less is more. Stay pure. Stay poor.
How big is this dataset?

Also this score then serves as a near deterministic function to change the price?

What does the current output look like? Is it that for every percent increase in the original price the outcome increases by about a percent, and then the score cleans up the slight difference? Have you described what the distribution of the score looks like? If you had enough data, you could enter the score into the model as its component pieces.
I have ~500,000 observations. Right now, I have the pre and post values log transformed and the score as an additional independent variable. The score is an average of several components. So, it's continuous and ranges between 0-5 roughly speaking. I found in my data that a score of 3 means the Y_post = Y_pre with very little variation.


Less is more. Stay pure. Stay poor.
With that much data you should create three random datasets: one for building the models, a second for scoring the candidate models, and a third for fitting the final model, i.e., the one with the best fit in the second set.
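A minimal sketch of such a split. The 60/20/20 proportions, the seed, and the function name are assumptions; nothing in the thread fixes them.

```python
import random

def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle once, then cut into build / score / final subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_build = round(fractions[0] * n)
    n_score = round(fractions[1] * n)
    build = shuffled[:n_build]
    score = shuffled[n_build:n_build + n_score]
    final = shuffled[n_build + n_score:]   # whatever remains
    return build, score, final

# Row indices stand in for the ~500,000 observations.
build, score, final = three_way_split(range(500_000))
print(len(build), len(score), len(final))
```

Splitting on row indices rather than copying the data keeps it cheap, and fixing the seed makes the split reproducible.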

I wonder what the models might look like if you fit post ~ pre as 5 separate models, one for each score range (score < 1, ..., > 5). You could also look at your current model separately for the lower, middle, and top tertiles of the pre values.
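Here is a rough sketch of the fit-by-score-band idea on simulated data. The band edges (0-1, 1-2, ..., 4-5), the data-generating process, and the function names are all assumptions for illustration; it fits log(post) ~ log(pre) separately within each band.

```python
import math
import random
from collections import defaultdict

def simple_ols(x, y):
    """Intercept and slope of y ~ x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def fit_by_band(pre, post, score, n_bands=5):
    """Fit log(post) ~ log(pre) separately within each unit-wide score band."""
    groups = defaultdict(lambda: ([], []))
    for p, q, s in zip(pre, post, score):
        band = min(int(s), n_bands - 1)   # score in [0, 5] -> bands 0..4
        groups[band][0].append(math.log(p))
        groups[band][1].append(math.log(q))
    return {band: simple_ols(xs, ys) for band, (xs, ys) in sorted(groups.items())}

# Hypothetical data: a score of 3 leaves the value roughly unchanged.
rng = random.Random(3)
n = 10000
pre = [rng.lognormvariate(9, 0.5) for _ in range(n)]
score = [rng.uniform(0, 5) for _ in range(n)]
post = [p * math.exp(0.05 * (s - 3) + rng.gauss(0, 0.02))
        for p, s in zip(pre, score)]

fits = fit_by_band(pre, post, score)
for band, (intercept, slope) in fits.items():
    print(band, round(intercept, 3), round(slope, 3))
```

Comparing intercepts and slopes across bands gives a quick read on whether one pooled model with a score term is adequate or whether the relationship really changes by band.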

Lots of fun stuff.