Linear Regression Alternatives if DV Zero Bounded

hlsmith

Less is more. Stay pure. Stay poor.
#5
Yeah, I knew these could be options, but my initial naive reservation is whether the output from them is reasonably interpretable. Any insights before I delve into them?

Currently, with the logged outcome, I can say that for every blank days the outcome goes down blank percent, and it looks like I just have a handful of one-sided, more widely scattered points in the S / SW portion of the residual plots.
 
#6
If the DV is continuous and strictly positive... maybe Gamma Regression?
I would also have suggested gamma regression. (Good link by @spunky.) The inverse link often does not seem to work well, so many people use the log link instead. But you can decide for yourself, based on what fits the data.

The gamma distribution and the log-normal distribution are similar, but one of them has a heavier tail (but I don't remember which one).
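On the tail question: when the two distributions are matched on mean and variance, it is the log-normal that has the heavier right tail. The sketch below checks this numerically for one case. It assumes an integer gamma shape (so the survival function has a closed form and only the standard library is needed); the shape and scale values are hypothetical and not taken from the thread's data.

```python
import math


def gamma_sf(x, k, scale):
    """P(X > x) for a gamma with integer shape k (Erlang) and given scale."""
    t = x / scale
    return math.exp(-t) * sum(t ** j / math.factorial(j) for j in range(k))


def lognormal_sf(x, mu, sigma):
    """P(X > x) for a log-normal with log-scale mean mu and sd sigma."""
    return 0.5 * math.erfc((math.log(x) - mu) / (sigma * math.sqrt(2)))


def matched_lognormal(k, scale):
    """Log-normal parameters with the same mean and variance as the gamma."""
    mean = k * scale
    sigma2 = math.log(1.0 + 1.0 / k)  # because variance / mean**2 = 1 / k
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


k, scale = 4, 0.25  # hypothetical gamma with mean 1 and variance 0.25
mu, sigma = matched_lognormal(k, scale)
for x in (2.0, 4.0, 8.0):
    # Upper-tail probabilities: gamma first, log-normal second.
    print(x, gamma_sf(x, k, scale), lognormal_sf(x, mu, sigma))
```

The further out you go, the larger the log-normal tail probability becomes relative to the gamma one, which is one reason the two models can disagree about large outliers.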

Or you can choose another distribution for positive data - Weibull maybe?
gamlss has a lot of distributions to choose from.

If there are values "under the detection point", so that they are censored, Helsel has written about that.
 

hlsmith

#7
I don't know how to use gamma regression. Any simple tutorials? @spunky's reference is a little too formula-heavy; remember, I am applied-heavy.

I am guessing this can be done in GLM.
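It can: a gamma regression with a log link is an ordinary GLM, and in R the usual call is `glm(y ~ x, family = Gamma(link = "log"))`. To show what the fit is doing without much formula work, here is a minimal sketch that fits the model by iteratively reweighted least squares for a single predictor. The data are simulated, and the variable names (`days`, `outcome`) and true coefficients are assumptions for illustration, not the thread's data.

```python
import math
import random


def fit_gamma_log_link(x, y, max_iter=100, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x for a gamma outcome by IRLS.

    With a gamma family and a log link the IRLS weights are all equal,
    so every iteration is a plain least-squares fit of the working
    response z = eta + (y - mu) / mu on x.
    """
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)

    # Start from an intercept-only model: mu = mean(y) for every case.
    b0, b1 = math.log(sum(y) / n), 0.0

    for _ in range(max_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]

        zbar = sum(z) / n
        new_b1 = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / sxx
        new_b0 = zbar - new_b1 * xbar

        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Simulated data (hypothetical values): 200 strictly positive outcomes.
rng = random.Random(1)
true_b0, true_b1, shape = 3.0, -0.05, 5.0
days = [rng.uniform(0, 60) for _ in range(200)]
outcome = [
    rng.gammavariate(shape, math.exp(true_b0 + true_b1 * d) / shape)
    for d in days
]

b0, b1 = fit_gamma_log_link(days, outcome)
print(f"intercept = {b0:.3f}, slope = {b1:.4f}")
print(f"fitted mean at day 0 = {math.exp(b0):.2f}, "
      f"at day 60 = {math.exp(b0 + 60 * b1):.2f}")
```

Because the fitted mean is `exp(b0 + b1 * x)`, it can approach zero but never go below it, which is the property that matters for a zero-bounded DV.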
 

hlsmith

#8
The gamma isn't looking too difficult; the hard part is just the interpretation of the coefficients when using, say, the log link or the inverse link.
 

hlsmith

#9
Interesting stuff. Any other suggestions (e.g., inverse Gaussian)? Below are the standardized deviance residuals:

Gamma, log link
[image: residual plot]

Normal, identity
[image: residual plot]

Normal, identity, with logged data
[image: residual plot]

The gamma looks like it wins. I may also compare fit statistics between the models. The gamma with log link seems to be multiplicative in nature, so I believe I exponentiate the coefficients and describe them as an estimated blank-times change in the mean.
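That reading of the log link can be written out as a small calculation. With a log link, a coefficient `beta` means the mean is multiplied by `exp(beta)` per one-unit increase in the predictor, and by `exp(beta * delta)` for a change of `delta` units. The coefficient below is a hypothetical value, not an estimate from the thread's data.

```python
import math


def multiplicative_effect(beta, delta=1.0):
    """Factor by which the mean changes when the predictor rises by delta."""
    return math.exp(beta * delta)


def percent_change(beta, delta=1.0):
    """The same effect expressed as a percent change in the mean."""
    return 100.0 * (math.exp(beta * delta) - 1.0)


beta_days = -0.05  # hypothetical log-link coefficient per day

print(f"per day:    x{multiplicative_effect(beta_days):.3f} "
      f"({percent_change(beta_days):+.1f}%)")
print(f"per 7 days: x{multiplicative_effect(beta_days, 7):.3f} "
      f"({percent_change(beta_days, 7):+.1f}%)")
```

One difference from the logged-outcome model is worth keeping in mind when writing up results: in the gamma GLM with a log link, `exp(beta)` is a ratio of arithmetic means, whereas in a normal model fitted to the logged outcome it is a ratio of geometric means (medians, if the outcome is log-normal).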
 

noetsi

Fortran must die
#10
My question is: does it really matter? All that seems to be affected is normality, and if you have enough cases, the sense I get from the literature (I read this a lot) is that normality is not really a big deal.
 

hlsmith

#11
I believe that in this setting, where the DV is bounded by '0' (see first plot in first post), it matters, since the residuals are heteroscedastic, which makes the SEs off, and if one were to extrapolate beyond the far right of the range, the predictions could be worse than when the issue is addressed. For reference, my sample size is permanently fixed at 200.
 

hlsmith

#12
Below are a fitted linear regression (top) and a gamma regression (with log link; bottom). I believe the data-generating process for those outliers in the top middle of the plots involves an additional latent exposure. I may try to address them in a sensitivity analysis. Any feedback or thoughts in general about these fits? I think my preference is still for the latter model/plot, since its fitted curve approaches the zero bound asymptotically.
[image: fitted linear regression]

[image: fitted gamma regression with log link]
 

noetsi

#13
I believe that in this setting, where the DV is bounded by '0' (see first plot in first post), it matters, since the residuals are heteroscedastic, which makes the SEs off, and if one were to extrapolate beyond the far right of the range, the predictions could be worse than when the issue is addressed. For reference, my sample size is permanently fixed at 200.
You are much more of an expert than I, but can't you simply use White SEs if you are concerned about heteroscedasticity? For normality, 200 cases should be plenty, I would think (although I have never found anyone who comes down solidly on this issue; the CLT is often said to kick in at about 30).
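For reference, this is roughly what White standard errors do in the simple-regression case. The sketch below computes the classical slope SE and the White (HC0) slope SE side by side on simulated data whose residual spread grows with the predictor. The data and coefficient values are hypothetical; statistical packages offer this as a robust-covariance option, often with small-sample variants such as HC1 to HC3.

```python
import math
import random


def ols_slope_with_se(x, y):
    """Simple-regression slope with classical and White (HC0) standard errors."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

    # Classical SE assumes one common residual variance for all cases.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(s2 / sxx)

    # White/HC0 SE lets each case contribute its own squared residual.
    meat = sum((xi - xbar) ** 2 * e * e for xi, e in zip(x, resid))
    se_white = math.sqrt(meat) / sxx
    return b1, se_classical, se_white


# Simulated data (hypothetical): residual spread grows with x.
rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 0.3 * xi) for xi in x]

slope, se_classical, se_white = ols_slope_with_se(x, y)
print(f"slope = {slope:.3f}")
print(f"classical SE = {se_classical:.4f}, White (HC0) SE = {se_white:.4f}")
```

Note that the slope itself is identical under both approaches; only the standard error changes. Robust SEs therefore address the inference problem but not the shape of the fitted line, which can still cross below zero when extrapolated.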
 
#14
Maybe I'm wrong, but when we use a generalized linear model doesn't it open the door to non-normal models? Thus, a normal probability plot for gamma distributed data doesn't make sense? Excuse my ignorance.
 

hlsmith

#15
Yes, it does open that door, but if you are looking for the best-fit line regardless of linearity, shouldn't the residuals be balanced on both sides of the line and normally distributed?
 

noetsi

#16
I don't think the distribution of the residuals affects the slope estimates at all, just the p-values. The question is whether that matters if, for example, normality is not an issue because the inference is asymptotically correct via the CLT, and if you use White SEs, which address heteroscedasticity.

Of course, both of those comments rest on assumptions, and not everyone would agree that White's is enough (I think that with enough cases the consensus today would be that normality is not a concern).