transform y to log(y)


New Member

I'm wondering what kind of data is appropriate for transforming the response y into log(y)?

( So the model becomes log(y) = a*x1 + b*x2 + ... + residuals,
where a and b are coefficients. )



TS Contributor
There are three reasons that people do this.

1) Log( Y ) is closer to normal than Y, which helps speed convergence of means to the normal distribution.

2) There is an interaction between two covariates when using Y. It may be removed by using Log( Y ) in some cases.

3) They just happen to prefer Log( Y )!


Ambassador to the humans
I don't know any good sites. But I'm not a fan of transforming data without good reason in the first place (nonconstant variance is NOT a good reason in my mind unless the ONLY thing you care about is the structure of the expected value with respect to a covariate).
It is just that most of what I am being taught defaults to transforming the data when there is nonconstant variance or the data does not look completely normal. What approaches, if any, would you prefer?


Ambassador to the humans
I prefer approaches that take nonconstant variance into consideration. If there's nonconstant variance, that should be modeled as well. If we care about the expected value of the process, then why wouldn't we care about the variance as well?

If we're truly trying to figure something out and conceptualize the problem statistically, then I don't understand the idea of transforming the data to get rid of your problems, because those 'problems' are part of the data and tell you something interesting.
Good argument, I suppose :)

By a method with nonconstant variance, do you mean generalised least squares fitting? Or could you direct me to a webpage if you don't have the time :)
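One concrete instance of "modeling the nonconstant variance" is weighted least squares (generalised least squares reduces to this when the error covariance is diagonal). A minimal NumPy sketch on simulated data; the variance-proportional-to-x² form and all numbers are illustrative assumptions, not anything from this thread:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 10.0, n)
# Simulated data whose error standard deviation grows with x (illustrative)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3 * x)

X = np.column_stack([np.ones(n), x])

# Ordinary least squares ignores the changing variance
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares: weight each observation by 1/variance (here var ~ x^2),
# equivalently rescale rows by 1/x and solve the ordinary problem on scaled data
w = 1.0 / x
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

print(beta_ols)  # intercept and slope, near (2.0, 0.5)
print(beta_wls)  # same mean structure, but efficient under this variance model
```

The point is that the mean structure is estimated on the original scale, and the heteroscedasticity is handled in the weights rather than transformed away.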


Super Moderator

I'm wondering what kind of data is appropriate for transforming the response y into log(y)?

There may be occasions when it is useful to use Log(Y) rather than Y, besides obviating violations of assumptions such as heteroscedasticity.

More specifically, if we consider a simple linear regression model : Log(Y) = b0 + b1*X, the slope coefficient (b1) measures the constant proportional or relative change in Y for a given absolute change in X. As such, multiplying the relative change in Y by 100 will give the percentage change in Y for an absolute change in X.

This particular model is useful in situations where the X variable is a time trend since in that case the model describes the constant relative (b1) or constant percentage (b1*100) rate of growth (b1>0) or decay (b1<0) in the variable Y, where Y may be a variable such as gross domestic product (GDP), population, money supply, unemployment, profit, sales, etc. In short, the model Log(Y) = b0 + b1*X can be called a growth model.
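The growth-model reading of b1 is easy to check numerically. A short Python sketch; the 3% growth rate and the series values are made-up illustrations, not data from this thread:

```python
import numpy as np

# Illustrative series with a constant 3% growth rate per time step
t = np.arange(0, 30)
y = 100.0 * 1.03 ** t  # e.g. a GDP-like variable

# Fit the growth model Log(Y) = b0 + b1*t
b1, b0 = np.polyfit(t, np.log(y), 1)

print(b1)        # ~0.0296 = log(1.03), the constant relative growth rate
print(b1 * 100)  # ~2.96, i.e. roughly the 3% growth rate in percent
```

Note that b1 recovers log(1.03), not 0.03 exactly; b1*100 is the percentage growth rate only to the approximation log(1 + r) ≈ r.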


Ambassador to the humans
Dragan makes good points. The argument I was making was that we shouldn't just transform data because 'voilà, it's normal with constant variance if we do'. That, in my eye, is not a good reason to transform. If the transformation makes sense scientifically, or from a modeling standpoint, then I'm alright with transforming data. I'm just against the trend I see of people transforming data because it makes their life easier... It might seem like it makes your life easier, but a lot of people don't realize the impact transforming the data has on interpretation, and the subtle modeling 'gotchas'. For instance, consider a simple log transformation.

\(\log(y_i) = \beta_0 + \beta_1x_i + \epsilon_i\) where \(\epsilon_i \sim N(0,\sigma^2)\) which implies

\( \widehat{\log(y_i)} = E[\log(y_i)] = \beta_0 + \beta_1x_i \).

Alright, we're ok up to here. So if all we ever wanted to do was talk about the log transformed data then I'd be fine with this. But people don't always want to talk about log transformed data. They collected their data on the scale they did most likely because it's the scale that makes sense for them. So what if we want to consider our data on the original scale? If we're interested in \(\hat{y_i} = E[y_i]\) most people would just backtransform their predictions from the log model and say

\(\hat{y_i} = E[y_i] = \exp(\beta_0 + \beta_1x_i) \)

but this is wrong! What is actually true is:

\(\hat{y_i} = E[y_i] = \exp(\beta_0 + \beta_1x_i + \sigma^2/2) \).

How many people that just transform their data to make it nice do you think know this? What happens if we use a different transformation and want to backtransform? Can we get a nice form for the expected value then? Who knows...
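The back-transformation bias above is easy to verify by simulation. A Python sketch with made-up parameter values (beta0, beta1, x, and sigma are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, x = 1.0, 0.5, 2.0  # illustrative parameter values
sigma = 0.8

# Simulate y with log(y) = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)
eps = rng.normal(0.0, sigma, 1_000_000)
y = np.exp(beta0 + beta1 * x + eps)

naive = np.exp(beta0 + beta1 * x)                  # back-transform, no correction
corrected = np.exp(beta0 + beta1 * x + sigma**2 / 2)

print(y.mean())   # simulated E[y], ~10.2
print(naive)      # ~7.4: systematically too small
print(corrected)  # ~10.2: matches the simulated mean
```

The naive back-transform recovers the median of y (for lognormal errors), not the mean, which is why it undershoots.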

Sorry for the rant but as you can probably tell I'm not a huge fan of transformations.
Hi Dason,

I hope you're still watching this old post :)

I came across an econometrics textbook which disagrees with you (I think it is wrong).

It essentially says:




Not really any point to this post, just a thought as I was reading...


Ambassador to the humans
No, that doesn't disagree with me. That's a perfectly true statement. I'm assuming your u is a random error term with a normal distribution? My post was about the expected value, and how interpretation isn't as straightforward as most people assume it is.

You have


and this is true, but notice that most people want to conclude from this that E[wage] = exp(beta_0 + beta_1*education). That isn't true.

If instead of wage = exp(beta_0 + beta_1*education + u) we had wage = exp(beta_0 + beta_1*education) + u,
then it would be ok to make that jump. But what we have is wage = exp(beta_0 + beta_1*education + u), so
E[wage] = E[exp(beta_0 + beta_1*education + u)] = exp(beta_0 + beta_1*education)*E[exp(u)]

And notice that the expected value of exp(u) is not 1; for normally distributed u it is exp(sigma^2/2).
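That E[exp(u)] is not 1 for normal u is also quick to check by simulation; a minimal Python sketch (sigma = 1 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
u = rng.normal(0.0, sigma, 1_000_000)

print(np.exp(u).mean())      # ~1.65, clearly not 1
print(np.exp(sigma**2 / 2))  # exp(1/2) ~ 1.65, the lognormal mean formula
```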

It's very easy to get confused and mixed up when dealing with expected values and transformations.