Demystifying the Log Transform

#1
It's common to log-transform a dependent or independent variable to help with skewness. Log transformations can be helpful when the goal is to predict a continuous but skewed response variable. My question concerns how to communicate this to a less stat-savvy person. I want to show a boxplot of a highly right-skewed but strictly positive variable. For context, the variable is a time to event measured in days, with a median of 1 day and a maximum of 500. The goal is to show the distribution, with no intention of inference or any other statistical methodology. I'm wondering if I should log-transform the variable but show the axis labels in the original units. Or keep the variable untransformed, but on a log axis? Or should I log both the variable and the axis? What is common practice when visualization/interpretation is the goal?
 
#2
I'm wondering if I should log-transform the variable but show the axis labels in the original units. Or keep the variable untransformed, but on a log axis?
I like the log axis; the labels would appear in original units. I think using a log axis is really equivalent to log-transforming the variable, so if you do one you've done the other, like it or not.
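
For concreteness, a minimal ggplot2 sketch of the two options on simulated data (my own example, not the poster's actual variable):

library(ggplot2)

# Hypothetical right-skewed time-to-event data, in days
set.seed(1)
df <- data.frame(days = rlnorm(500, meanlog = 0, sdlog = 1.5))

# Option A: transform the variable; the axis now reads in log units
ggplot(df, aes(x = "", y = log10(days))) + geom_boxplot()

# Option B: keep the variable raw but use a log axis;
# tick labels stay in the original units (days)
ggplot(df, aes(x = "", y = days)) + geom_boxplot() + scale_y_log10()

Both produce the same box shape, because scale_y_log10() transforms the data before the boxplot statistics are computed; only the axis labels differ, which is the sense in which the two are equivalent.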
 
#4
On the other hand, your effects will probably appear larger on an untransformed plot, especially if you don't include error bars, so that should be considered.
 

hlsmith

Less is more. Stay pure. Stay poor.
#5
Is there an actual person you are trying to convey this to? In words, perhaps a way to convey the concept is this: if you plot cumulative COVID cases, it looks like a ramp (exponential growth). However, if you log-transform it, it becomes a nearly straight 45-degree line. So log and anti-log let you convert between such things. Probably not a great example, but visually it kind of does the trick.
 
#6
Is there an actual person you are trying to convey this to? In words, perhaps a way to convey the concept is this: if you plot cumulative COVID cases, it looks like a ramp (exponential growth). However, if you log-transform it, it becomes a nearly straight 45-degree line. So log and anti-log let you convert between such things. Probably not a great example, but visually it kind of does the trick.
The purpose is strictly visualization. On a regular scale the untransformed variable looks "scrunched down" at the bottom of the plot, making the visual essentially useless. I basically want to zoom in on the middle 50 percent without misrepresenting the numbers. I don't believe I need to transform the raw variable, since I'm trying to preserve the original interpretation. I would transform if I were trying to satisfy distributional assumptions in a model, though.

In R ggplot:

p + scale_y_log10()
# or
p + coord_trans(y = "log10")
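
One subtlety worth knowing (a detail the post doesn't mention): the two calls are not identical for boxplots. scale_y_log10() transforms the data before the boxplot statistics are computed, while coord_trans(y = "log10") computes the statistics on the raw data and only warps the axis afterwards, so the boxes and whiskers can land in different places. A self-contained sketch:

library(ggplot2)

set.seed(42)
d <- data.frame(days = rlnorm(1000, meanlog = 0, sdlog = 1.5))
p <- ggplot(d, aes(x = "", y = days)) + geom_boxplot()

p + scale_y_log10()           # stats computed on log(days)
p + coord_trans(y = "log10")  # stats computed on raw days, axis warped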
 

noetsi

Fortran must die
#8
In time series, you log the data if the variance changes over time (so the series is not stationary), because that invalidates or distorts common time series models like ARIMA.

Probably not worth explaining, though, because if they don't know skew they won't want to deal with stationarity :)
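
For what it's worth, a base-R sketch of that idea using the built-in AirPassengers series, whose seasonal swings grow with the level of the series:

plot(AirPassengers)       # variance grows over time
plot(log(AirPassengers))  # roughly constant variance after logging

# Classic "airline" ARIMA fit on the logged series
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecasts come back on the log scale; exp() undoes the transform
exp(predict(fit, n.ahead = 12)$pred)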
 
#11
that is strange data...
Yes, I simulated it from a few uniform distributions and blended them together, lol. The point is, my data is highly skewed with many outliers, so visualizing the middle 50% on a regular scale is difficult. The first and third quartiles are 0.25 days and 3.525 days, respectively. These enormous outliers skew the heck out of the data. Part of the story is helping people see the outliers so they can investigate them, but I would like some visual instead of just giving percentiles.
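
A guess at what "blended uniforms" might look like (my own recipe, not the poster's actual code):

set.seed(7)
days <- c(runif(700, 0, 2),     # bulk of events resolve quickly
          runif(250, 2, 30),    # a slower middle group
          runif(50, 30, 500))   # rare long-tail cases
summary(days)
boxplot(days)              # middle 50% is crushed against zero
boxplot(days, log = "y")   # log axis makes the quartiles visible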
 
#13
The above plots resemble the actual data more closely. It took a couple of iterations to find plots that worked best: invg(mean=3, shape=0.5).
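
Assuming invg(mean=3, shape=0.5) refers to an inverse Gaussian, one way to reproduce such a draw in R is with the statmod package (an assumption on my part; the post doesn't say which software was used):

library(statmod)

set.seed(1)
x <- rinvgauss(1000, mean = 3, shape = 0.5)  # heavy right tail
summary(x)
boxplot(x, log = "y")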
 

noetsi

Fortran must die
#14
Yes, I simulated it from a few uniform distributions and blended them together, lol. The point is, my data is highly skewed with many outliers, so visualizing the middle 50% on a regular scale is difficult. The first and third quartiles are 0.25 days and 3.525 days, respectively. These enormous outliers skew the heck out of the data. Part of the story is helping people see the outliers so they can investigate them, but I would like some visual instead of just giving percentiles.
lol

well, another thing to consider is that logging is only one solution. Box-Cox transformations other than the log might be better. It might be useful to point out that many outlier measures are built on the assumption that the data is normal, and when it is not, you may see many "outliers". If you use the correct distribution, the "outliers" go away. I had a massive number of supposed outliers in one project; when I realized my distribution was exponential and worked from there, my outliers went away. Robust regression is another way to address this issue if you have an interval DV.
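
A sketch of choosing a Box-Cox lambda with MASS::boxcox (lambda = 0 corresponds to the log transform, so the log is just one point in this family):

library(MASS)

set.seed(1)
y <- rlnorm(200, meanlog = 1, sdlog = 1)

bc <- boxcox(lm(y ~ 1))            # profile likelihood over lambda
lambda <- bc$x[which.max(bc$y)]    # lambda maximizing the likelihood

# Apply the chosen transform; lambda near 0 reduces to log(y)
y_bc <- if (abs(lambda) < 1e-2) log(y) else (y^lambda - 1) / lambda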
 
#15
lol

well, another thing to consider is that logging is only one solution. Box-Cox transformations other than the log might be better. It might be useful to point out that many outlier measures are built on the assumption that the data is normal, and when it is not, you may see many "outliers". If you use the correct distribution, the "outliers" go away. I had a massive number of supposed outliers in one project; when I realized my distribution was exponential and worked from there, my outliers went away. Robust regression is another way to address this issue if you have an interval DV.
Thanks for the response. I suppose I can give a little information about the data. This is a simulated, "fake" distribution of the time to close an insurance claim (I did the best I could to find plots that resemble it closely). We have large outliers because liability, attorney, and injury details, etc., might prolong a claim investigation for more than a year in some cases. The outliers in these plots were flagged based on the upper fence Q3 + 1.5(IQR).
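
For concreteness, a small sketch of that upper-fence rule on simulated skewed data (the claim data itself isn't shown here):

# Flag values above Q3 + 1.5 * IQR
flag_upper_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  x > q[2] + 1.5 * (q[2] - q[1])
}

set.seed(1)
days <- rlnorm(1000, meanlog = 0, sdlog = 1.5)
mean(flag_upper_outliers(days))  # share of claims flagged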
 

noetsi

Fortran must die
#16
Well, if your data is not normal, the IQR is better than measures that rely on normality to define outliers. But it is not perfect: skew will inflate the number of outliers, I believe (I worked with this a long time ago when building outlier checks for data). R has a package to estimate outliers given skewed data, although I have never worked with it.

Using a median is better than a mean with skew, but there are estimators that are even better (again, it's been a long time since I worked with this).
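
The post doesn't name the package, but one candidate is robustbase, whose adjbox() adjusts the boxplot fences for skewness using the medcouple (an assumption on my part; the package is left unspecified above):

library(robustbase)

set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 1.5)

par(mfrow = c(1, 2))
boxplot(x, main = "Standard fences")      # skew inflates the flags
adjbox(x, main = "Skew-adjusted fences")  # medcouple-based fences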
 

katxt

Active Member
#17
Can you post any of these visual gems for us to cheer and jeer?
An actual teaching example: http://www.dinodatabase.com/dinorcds.asp gives the length (m) and weight (kg) of a range of dinosaurs. The left graph is the raw data off the net. All looks fine except perhaps the extreme right point. After the log transforms, the right point turns out to be fine, but there is now one major outlier. It turns out that the original internet data had a decimal point in front of a weight of 113 kg, making it .113 kg.
The log-transformed graph made everything clearer, and showed a power relationship as well. [Attachment: Logs.jpg — raw-scale and log-log scatterplots]
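
To illustrate the power-law point without the original attachment, here is a simulated stand-in (weight made roughly proportional to length cubed; not the actual dinosaur data):

set.seed(1)
length_m  <- runif(30, 1, 30)
weight_kg <- 10 * length_m^3 * exp(rnorm(30, sd = 0.3))

par(mfrow = c(1, 2))
plot(length_m, weight_kg)                # curved; errors hard to spot
plot(log10(length_m), log10(weight_kg))  # straight line, slope near 3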
 
#20
This is why I like to query my own data rather than rely on others to provide it.
Ya. In my case, the people entering the data into the system do not always update certain fields or enter them accurately. We just have to be aware of the business process when querying data.