Data analysis with a GEE, is my method valid?

#1
Hi everyone, I’ve been doing some statistical analyses in R on some data. It’s for use in a manuscript I’m hoping to get published in a biological journal. Unfortunately, the tests I ended up having to run are kind of at the limit of my understanding, so I was hoping that I could run through what I did with my data, and could get opinions on whether this was a valid approach or not. So let’s start:

I needed to test the effect of a combination of categorical and continuous independent variables on a continuous dependent variable, with a random factor included to control for pseudo replication.

My untransformed data for the responding variable looks like this:



Now, originally, I was going to run a GLMM (generalized linear mixed-model), but no matter what families, links, and transformations I tried, I couldn’t meet the assumption of homoscedasticity (variance always increased)… this was the closest I got:


It was suggested to me that I might have more luck with a GEE (generalized estimation equation) instead, so I went about doing that. First, I fit the data to a GLM (generalized linear model) by transforming the responding variable as such: I took the natural log of the responding variable, added a constant to bring the lowest value up to 1, and used a natural log transformation on the data once again (i.e. an ln-ln transformation). I used a gaussian family with an identity link (the best combination was actually a quasipoisson family with a log link, but there is no quasipoisson family usable for GEE tests in R from what I understand...so I stayed with the gaussian/identity). This rendered the data homoscedastic and normal, as shown by the GLM plot output:


After I fit the data to the GLM, I tested the ln-ln transformed data in a GEE, using, of course, the same family and link function as in the GLM. I used an exchangeable structure as it provided the lowest QIC, QICu and CIC, and the highest quasi likelihood.

Now, other than wondering if this process seems valid, I have two questions:

1. Someone I’ve been talking to said that although they thought my approach was statistically meaningful, it wasn’t scientifically meaningful. They said that transforming my responding variable makes interpretation difficult and should be avoided. They also said it would be much more meaningful to fit an untransformed quasipoisson type of model that violates assumptions than to do the transformations I did to meet assumptions. Is this true? Would a journal find that acceptable?

2. Since my responding variable is continuous, I can’t use the poisson family despite it being a good fit for my data…. And there doesn’t seem to be any way to use a quasipoisson family in the GEE, though I can use link=“log”… is it even possible to fit it to a quasipoisson type of model?

3. If I use the model that I fitted here.... when I report descriptive statistics in my manuscript, would it be more appropriate to discuss untransformed means and medians, or to back-transform? I am leaning towards untransformed since back-transforming from ln or log values does some weird things.

Any input would be extremely valuable!
Thanks!
 
Last edited:

Karabiner

TS Contributor
#2
Heteroscedascity doesn't affect your point estimates, but the
standard errors can be wrong. Maybe you could use robust
standard error estimates (White) within your generalized
linear model.

With kind regards

K.
 
#3
Sorry, I'm not really following. Why would I try and use it for the GLM? Do you mean I should try it in the GLMM?

As for the GEE (using geepack), I have already used sandwich estimation in my model as cluster size n=40.
If I use the package "gee" in R, I get the estimate, as well as naive and robust S.E. and z. And I believe the robust values are using the sandwich estimation as well, so there's that.

I'm not sure if you're suggesting that I use Huber-White sandwich estimators on my untransformed or transformed data, but I've read that using Huber-White sandwich estimators should really only be used if there is a small violation of homoscedasticity, so I'm not sure I should be using it on my untransformed data since it really violates that assumption.
 
Last edited:

CB

Super Moderator
#4
What is your response variable? What are the predictors? Maybe knowing a bit about the substantive context might help?
 
#5
My response variable is the length of bout of scratching by an animal, and my predictors are factors such as age and sex, as well as environmental variables (e.g. are they in a high or low risk area for predation?) and social variables (e.g. group size).
 

CB

Super Moderator
#6
I haven't ever had to model a response variable that was a duration of time before, but I imagine that this is something that comes up for a lot of researchers. (Response times, reaction times, etc etc). Perhaps see if you can find what probability distributions are generally used when the response variable is a duration of time? At a rough guess, the gamma distribution could be useful here, and should be reasonably easy to implement.

Quasi-poisson is still used only for integer-valued variables, as far as I know.

I would agree with the advice that multiple transformations will make your results less interpretable. What would the coefficients produced actually mean to you, given that you've log-transformed the response variable twice?
 
#7
I will look into whether there is a family more suitable for time-based variables, thanks. However, I don't believe finding the most appropriate family will help with meeting assumptions as I've pretty much gone through all the combinations I can already.

I don't have a solid reference that says using quasi-poisson for non-integer data is okay, but from searching around online it seems as though it's not a problem.

That is true about the coefficient estimates, they wouldn't mean a whole lot with transformed data. However, while it would be interesting and informative to have meaningful estimates, I don't feel as though it's integral to the study... I'm mostly concerned about the median and mean values as well as whether significant variables have positive or negative estimates.
 

CB

Super Moderator
#8
I don't believe finding the most appropriate family will help with meeting assumptions as I've pretty much gone through all the combinations I can already.
What do you mean exactly by "gone through"? Are you trying all the families and seeing which one gives you homoscedasticity? Only the Gaussian distribution assumes homoscedasticity. The other families assume different relationships between the variance and the conditional mean. E.g. Poisson assumes conditional variance = conditional mean. Quasipoisson assumes conditional variance = conditional mean * an estimated constant. Etc. So if you are checking to see how well different models fit... make sure you're checking the right assumptions!

while it would be interesting and informative to have meaningful estimates, I don't feel as though it's integral to the study... I'm mostly concerned about the median and mean values as well as whether significant variables have positive or negative estimates.
So you don't care about the sizes of the relationships you're estimating?
 
#9
What do you mean exactly by "gone through"? Are you trying all the families and seeing which one gives you homoscedasticity? Only the Gaussian distribution assumes homoscedasticity. The other families assume different relationships between the variance and the conditional mean. E.g. Poisson assumes conditional variance = conditional mean. Quasipoisson assumes conditional variance = conditional mean * an estimated constant. Etc. So if you are checking to see how well different models fit... make sure you're checking the right assumptions!
By "gone through", I meant having run the tests with various combinations of families, links, and data transformations. Thanks for bringing this up though, this is something I think I might be getting confused about. I was under the impression that regardless of the family and link used in the model, the residuals from the model output should still be normal and homoscedastic.. If that's not the case, how do I assess whether the assumptions of variance for say a gamma or quasipoisson distribution have been met?

So you don't care about the sizes of the relationships you're estimating?
I'm hesitant to say I don't care... it's more that I don't think it's important to my study..but I'm really not sure. Sorry for my confusion about all of this...as I said, this stuff is really reaching the limit of my understanding of statistics.
 

Jake

Cookie Scientist
#10
I was under the impression that regardless of the family and link used in the model, the residuals from the model output should still be normal and homoscedastic
I haven't read the whole thread, but I can at least point out about this point that we usually only make the constant variance assumption when we are using the Normal model. Most (all?) of the other families explicitly assume that the variance is a function of the mean. For example, in the Poisson model the variance is exactly equal to the mean (and thus it increases with the mean).

As for "the residuals from the model output should still be normal"....well, no. That is only true in...you guessed it....the Normal model. Again, if we are using say the Poisson model, the residuals will not be Normal, they will be (shifted) Poisson.
 
#11
I haven't read the whole thread, but I can at least point out about this point that we usually only make the constant variance assumption when we are using the Normal model. Most (all?) of the other families explicitly assume that the variance is a function of the mean. For example, in the Poisson model the variance is exactly equal to the mean (and thus it increases with the mean).

As for "the residuals from the model output should still be normal"....well, no. That is only true in...you guessed it....the Normal model. Again, if we are using say the Poisson model, the residuals will not be Normal, they will be (shifted) Poisson.
Thanks for your reply!

Well, I don't feel very smart right now... For a GEE or a GLMM, how then should I test assumptions for a particular family/link combination? My only guess for testing the distribution of residuals would be to create something like a qqplot, with the qqline being set to the given non-gaussian distribution or something. I wouldn't know how to test if the variance assumption is correct though. Any ideas? I'm using R by the way.
 
#12
It seem to me that the Reese in the first post had a distribution that was quite skewed.

One way of doing it is to try to transform the model to approximative normality and constant variance (and linearity). An other is to formulate a generalized linear model with a distribution (possibly different from a normal distribution) and a link function (possibly a non-linear one) and a linear predictor of explanatory variables.

Reese did log(log(y)+constant). Maybe it did not work with lognormal or gamma distribution. But maybe an inverse Gaussian distribution works - a very skewed distribution.

The normal distribution has often been used to approximate a discrete distribution. That is 'allowed'! I have heard someone say that the great statistician R.A. Fisher once used the Poisson distribution to approximate a continuous distribution. I would say that "that is allowed"! "All models are wrong, but some models are useful" as Box said.
 
#13
Well, I've tried to redo things, starting over with using untransformed data in GLMMs. I seem to have the best luck of creating models without errors using glmmPQL, so I have been using that. I'm going to post the tests I've been going through with the pertinent information in hopes maybe someone can point me in the right direction to go from here... thank you so much to everyone answering my questions..you are life savers!

Here are the test results when I use a gaussian family and identity link:

VPGLMM2<-glmmPQL(VPL~Age+Sex+DET+DOC+GS+NND+Month,random=~1|FocalID,family=gaussian(link="identity"),data=VPtest)


Sigma: 4.855015

**Note: In all plots below, residuals are being tested against NORMAL distributions, not against distribution of the family specified, as I have no idea how to do that**

If I attempt a gamma family with log link:
VPGLMM3<-glmmPQL(VPL~Age+Sex+DET+DOC+GS+NND+Month,random=~1|FocalID,family=Gamma(link="log"),data=VPtest)

sigma:1.007639

If I attempt a quasipoisson family with log link:
VPGLMM4<-glmmPQL(VPL~Age+Sex+DET+DOC+GS+NND+Month,random=~1|FocalID,family=quasipoisson(link="log"),data=VPtest)

sigma:1.924758

If I attempt an inverse gaussian family with log link:
VPGLMM1<-glmmPQL(VPL~Age+Sex+DET+DOC+GS+NND+Month,random=~1|FocalID,family=inverse.gaussian(link="log"),data=VPtest)

sigma:0.4711212
**Results for the inverse gaussian are slightly troubling as the p-value for one independent variable becomes 0.0000


I just have absolutely no idea what to do anymore... None of the tests result in the random effect appearing normally distributed as is required by a GLMM. I know the package hglm allows usage of different families and links for the random effect, but I'm not sure if I should use that package or just head over to doing GEEs again...
 

Dason

Ambassador to the humans
#14
Bump. The last post was marked for moderation so it wasn't being displayed to non-mods. I went ahead and approved the post.