Unusual distribution of the continuous outcome

Hello dear forum members!

I am seeking your guidance with the following issue. I am estimating a "classic" difference-in-differences model with a pooled OLS estimator: Y = time + treated + time*treated, where Y is the average Medicare payment (amp) in dollars.
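For readers less familiar with the setup, here is a minimal sketch of what the interaction term measures. It assumes a plain 2x2 design (binary time, binary treated, no other covariates), in which the OLS coefficient on time*treated equals the difference-in-differences of the four cell means. The function name and the toy numbers are made up for illustration and are not the data discussed in this thread.

```python
from statistics import mean

def did_estimate(rows):
    """rows: iterable of (y, time, treated) with time and treated in {0, 1}.

    In the saturated model Y = b0 + b1*time + b2*treated + b3*time*treated,
    the OLS estimate of b3 equals the difference-in-differences of cell means.
    """
    cells = {}
    for y, time, treated in rows:
        cells.setdefault((time, treated), []).append(y)
    m = {key: mean(values) for key, values in cells.items()}
    change_treated = m[(1, 1)] - m[(0, 1)]   # after minus before, treated group
    change_control = m[(1, 0)] - m[(0, 0)]   # after minus before, control group
    return change_treated - change_control

# Hypothetical toy data: (payment, time, treated)
toy = [
    (10.0, 0, 0), (12.0, 0, 0),
    (11.0, 1, 0), (13.0, 1, 0),
    (20.0, 0, 1), (22.0, 0, 1),
    (26.0, 1, 1), (28.0, 1, 1),
]
did = did_estimate(toy)
print(did)
```

Because the estimate is built from means, a handful of very large payments in any one cell can move it a lot, which is the concern raised in this post.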

Here are the descriptive statistics for amp: Mean = 37.2, SD = 51.4, min = .2, max = 929.7, Median = 2.99. Additionally, the distribution of amp is shown in the histogram attached below:

Screen Shot 2018-02-04 at 8.55.57 PM.png

As a first approach to such an "abnormal" distribution, I winsorize the outcome (i.e., values below the 1st percentile and above the 99th percentile are replaced by those percentiles). The descriptive statistics for amp_w: Mean = 36.8, SD = 49.6, min = 2.8, max = 154, Median = 2.99. Below is another histogram for amp_w:
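A minimal sketch of winsorizing, assuming linear-interpolation percentiles (other percentile definitions give slightly different cut-offs). The helper names and the toy values are hypothetical; the toy example uses the 10th and 90th percentiles only because it has ten observations.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in [0, 100]) of an already sorted list."""
    if not sorted_values:
        raise ValueError("empty data")
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)

def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Clamp values below/above the given percentiles to those percentiles."""
    s = sorted(values)
    lo_cut = percentile(s, lower_pct)
    hi_cut = percentile(s, upper_pct)
    return [min(max(v, lo_cut), hi_cut) for v in values]

# Hypothetical toy payments, not the thread's data
toy_amp = [0.2, 2.9, 3.0, 3.0, 3.1, 15.0, 40.0, 120.0, 150.0, 929.7]
toy_amp_w = winsorize(toy_amp, 10, 90)
print(toy_amp_w)
```

Winsorizing only pulls in the tails; it leaves the median and the clumps in the middle of the distribution untouched, so it cannot remove multimodality.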
Screen Shot 2018-02-04 at 8.57.21 PM.png

As you can see, (1) a substantial number of values are around $3; (2) there are a couple of groups between $10 and $100; (3) there is another group between $100 and $150; and (4) there are some extreme values between $200 and $900 (in the case of amp).

I am somewhat concerned about running pooled OLS with such an outcome. What do you think? Is there a particular estimator that you would recommend for it? Any help or advice will be greatly appreciated :)


Less is more. Stay pure. Stay poor.
What happens when you natural-log transform the data in the first histogram? In addition, as you likely know, it comes down to looking at the residuals.
Thank you for your response, hlsmith.

When I transform amp using the natural log, the estimates are consistent (the histogram is still not good). However, the residuals are still "bad" (although the model fit in terms of R-squared goes up from .2 to .32). The following are three Q-Q plots (quantiles of residuals against quantiles of the normal distribution) for the non-transformed, winsorized, and ln-transformed amp, respectively:

Screen Shot 2018-02-05 at 2.26.28 PM.png amp

Screen Shot 2018-02-05 at 2.07.37 PM.png amp_w

Screen Shot 2018-02-05 at 2.21.05 PM.png amp_ln
spunky, thank you very much for your advice. I will explore gamma regression (by means of a GLM, I'd assume) in greater detail.


TS Contributor
This looks more like a mixture of different groups, which should not be transformed, but further investigated.
Miner, thank you for your response. Indeed, these are groups, and I have investigated them. Let me clarify. The average Medicare payment (outcome) is calculated based on the HCPCS codes (these identify different procedures conducted by the doctors). Since there are many different procedures, they are generally categorized into anesthesiology, radiology, medical services, management and evaluation, lab and pathology, (some others), and also surgery -- my focus. Therefore, for the analysis I am selecting only those HCPCS codes that belong to the surgery category. But even within the surgery category, the variation in the usage of those procedures is quite large -- that is where the "groups" are formed, I believe (see the attached frequency table). For example, while 36415 (routine venipuncture) is done 13,594 times, 20610 (aspiration [removal of fluid] from, or injection into, a major joint) is done only 31 times.
Screen Shot 2018-02-06 at 10.22.30 AM.png

As such, I am not sure how I should model such an outcome given that my focus is on the surgery-related HCPCS codes.
Does a quantile regression approach sound plausible? Given the abnormal distribution of amp, I can explore the impact of the predictor at different points of amp's distribution (i.e., the 50th percentile [given that the median splits the sample in half], and then the 60th, 70th, 80th, 90th, and 95th [given the "groups" observed in the upper tail]).
Following spunky's advice, I did run a GLM with gamma family and log link; however, the estimates are somewhat inconsistent (the story doesn't line up).
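To make the quantile regression idea concrete: it replaces squared error with the check (pinball) loss, and the level tau picks which part of the distribution is targeted. Below is a minimal sketch of the intercept-only case, where the fitted value is simply the sample quantile. It is not a full quantile regression with predictors; the function names and the toy values are hypothetical.

```python
def pinball_loss(y, q_hat, tau):
    """Average check (pinball) loss of predicting q_hat for every y at level tau."""
    total = 0.0
    for v in y:
        u = v - q_hat
        total += tau * u if u >= 0 else (tau - 1.0) * u
    return total / len(y)

def best_constant(y, tau):
    """Intercept-only quantile fit: the observed value minimizing the check loss."""
    return min(sorted(set(y)), key=lambda c: pinball_loss(y, c, tau))

# Hypothetical toy payments with a clump near $3 and a long upper tail
toy_amp = [3.0, 3.0, 3.1, 2.9, 3.0, 15.0, 40.0, 120.0, 150.0, 900.0, 3.0]
q50 = best_constant(toy_amp, 0.50)
q90 = best_constant(toy_amp, 0.90)
print(q50, q90)
```

With an outcome this clumped, the median sits inside the $3 group while the upper quantiles jump between groups, so effects at different tau values may describe quite different sets of procedures.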


TS Contributor
It appears there are 4 major groupings. Would it make sense to perform a cluster analysis of the HCPCS codes into 4 clusters then treat these clusters as an indicator variable in your regression? Would the results be interpretable?
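A minimal sketch of this suggestion, assuming the clustering is done on a single number per HCPCS code (for example its typical payment) with one-dimensional k-means. The starting centers, the toy values, and the choice of cluster 0 as the reference category are all assumptions made for illustration.

```python
def kmeans_1d(values, k=4, iterations=100):
    """Lloyd's algorithm in one dimension with deterministic, quantile-spread starts."""
    s = sorted(values)
    centers = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center becomes the mean of its members
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical typical payments for twelve procedure codes
toy = [3.0, 3.1, 2.9, 3.0, 40.0, 45.0, 50.0, 120.0, 130.0, 140.0, 600.0, 650.0]
labels, centers = kmeans_1d(toy, k=4)

# Indicator (dummy) columns for clusters 1..3; cluster 0 is the reference category
dummies = [[1 if lab == j else 0 for j in range(1, 4)] for lab in labels]
print(labels, centers)
```

The dummy columns would then enter the regression alongside time, treated, and time*treated. On raw dollars the largest values dominate the distances, so clustering on a log scale may be worth comparing.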


Less is more. Stay pure. Stay poor.
Good advice Miner.

I once heard the following critique of quantile regression, and it has stuck in my mind: QR is based on your sample's quantiles. Others may not have the same quantile cut-offs as you, so your results may not generalize to their patient sample. If my distribution is a little different, comparisons between the quantiles are hard to interpret in a new sample.
Dear Miner, hlsmith, thank you for your advice. Here is how I addressed the aforementioned issue.

Following Miner's suggestion, I first experimented with various clusters. However, the results (as well as the model fit) were not appropriate.

Then, given that I have two variables -- (1) average submitted charge (asc) (the claim amount that doctors submit for a given procedure) and (2) average Medicare payment (amp) (what doctors get paid for the claim they submit) -- I decided to use the percent of asc that gets paid through amp (i.e., amp*100/asc). The resulting variable has the following distribution:
Screen Shot 2018-02-07 at 10.57.59 PM.png
And resulted in the following distribution of the residuals:
Screen Shot 2018-02-07 at 10.56.08 PM.png
Next, I added the cluster variable (as a categorical predictor), which resulted in a substantially improved R-squared (.4 --> .12), appropriate results, and the following distribution of the residuals:
Screen Shot 2018-02-07 at 11.09.35 PM.png

How does this solution look to you?

I am very thankful for all your input.


Less is more. Stay pure. Stay poor.
I will note that if the latter two plots are Q-Q plots, they do look suspect. It is important to look at the residuals and try to understand where the model is failing (e.g., over-predicting large or small y values, etc.).

Which variable is your target (dependent) and which are independent, given your description above? If your dependent variable is a percentage, beta regression could be more appropriate.
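One practical detail if beta regression is tried: the beta distribution lives on the open interval (0, 1), so a percentage such as amp*100/asc has to be divided by 100, and any exact 0 or 1 has to be nudged inward. A minimal sketch follows, assuming the commonly used squeeze (y*(n-1) + 0.5)/n; the function names and toy values are hypothetical.

```python
def percent_paid(amp, asc):
    """Percent of the submitted charge (asc) that is paid (amp)."""
    return amp * 100.0 / asc

def to_open_unit_interval(proportions):
    """Squeeze proportions in [0, 1] into (0, 1) via (y*(n-1) + 0.5)/n."""
    n = len(proportions)
    return [(p * (n - 1) + 0.5) / n for p in proportions]

# Hypothetical toy values, not the thread's data
toy_amp = [2.99, 50.0, 120.0, 80.0]
toy_asc = [10.0, 50.0, 400.0, 160.0]

pct = [percent_paid(a, s) for a, s in zip(toy_amp, toy_asc)]
props = [p / 100.0 for p in pct]          # percentages -> proportions
squeezed = to_open_unit_interval(props)   # safe input for a beta likelihood
print(pct, squeezed)
```

If no observation is exactly 0 or 1, the squeeze step can be skipped and the proportions used directly.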
Yes, the latter two plots are quantiles of residuals against quantiles of the normal distribution. Here is the histogram of the residuals:
Screen Shot 2018-02-08 at 10.49.20 AM.png
It seems to me that the model is failing at both the small AND the large values of the outcome (i.e., those that are about 2 SD below and above the mean).

My DV is the percent (amp*100/asc), and the IVs are time, treated, and time*treated (diff-in-diff). Following your suggestion, I did run the beta regression -- the estimates are quite close to the OLS ones. Below are the two interaction plots for OLS and beta, respectively:
Screen Shot 2018-02-08 at 10.54.02 AM.png OLS
Screen Shot 2018-02-08 at 10.54.37 AM.png Beta


Less is more. Stay pure. Stay poor.
Hmm, I heard of FMMs, but haven't used one. They are for when there is an omitted latent variable?

Yeah, a linear model can sometimes be used in lieu of a beta regression with comparable results. Issues come into play when you have a bounded dependent variable (contained within 0 and 1) with its central location near either 0 or 1.