Artifact in plotting CDF with oil and gas data?

Looking for some help explaining why my intuition is failing me in exploring this data.

I've binned my dataset by a predictor variable to examine the response variable through CDF plots. I realize it may not be the optimal display, but it's familiar to my audience and conveniently displays the entire population. When I plot the entire dataset, the results make good sense, with some separation among the bins. (See the image below labeled "Entire Population".)

Perhaps this is a flawed investigation, but I'm interested in understanding the impact of the predictor variable on only the top 1/3 of results. When I pare the dataset down to the upper tertile, I see a different relationship (or lack of relationship) among the bins of the predictor variable. This may indicate that the upper 1/3 of the population is governed by a variable not yet investigated. No problem with this conclusion on its own.

If the entire population displays a fairly clear relation to the predictor variable, but the upper 1/3 does not, then I'd expect to see a very apparent relationship in the middle and bottom 1/3 of the dataset. This is where I'm surprised by the data. The plots below show that each tertile, when investigated on its own, displays very little difference between bins of the predictor variable. How can this be the case if there's a fairly clear relationship in the aggregate dataset?

I appreciate any insight you all can provide!


I guess the question is what it means for the relationship to be 'very apparent'. Some of it may be an artifact of reduced sample size in some tertiles; for example, you don't have a lot of blue dots on your upper tertile plot. You may want to try a log-rank test or other formal CDF comparison procedures.

Thanks for the reply. I'll look into some other methods of comparison in the meantime. But because I'm stubborn, I'd still like to explain this.

If I compare median values for only the 1600-2000 (yellow) and >2400 (pink) bins, I see that pink is about 9% greater than yellow when examining the entire population.
Comparing the median of each individual tertile, though, I find -3%, +1% and +1%... how could these possibly aggregate to +9%?
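
One possible explanation, offered as a sketch rather than a diagnosis of the actual data: if the tertiles were cut on the response variable itself (an assumption, though "top 1/3 of results" suggests it), each tertile restricts the response to a narrow range, so the bins cannot differ much inside it. The predictor then shows up mostly in how many points from each bin land in each tertile, and medians of subsets do not average back to the overall median. The simulation below uses made-up lognormal data with hypothetical `yellow` and `pink` bins whose medians differ by 9%; none of the numbers come from the dataset in this thread.

```python
import math
import random
import statistics


def pct_diff(a, b):
    """Percent by which median(b) exceeds median(a)."""
    return 100.0 * (statistics.median(b) / statistics.median(a) - 1.0)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000               # points per bin (arbitrary, illustrative)

# Hypothetical bins: same spread, pink's true median is 9% above yellow's.
yellow = [rng.lognormvariate(0.0, 0.5) for _ in range(n)]
pink = [rng.lognormvariate(math.log(1.09), 0.5) for _ in range(n)]

# Assumption: tertile cut points are taken on the RESPONSE, pooled over bins.
lo_cut, hi_cut = statistics.quantiles(yellow + pink, n=3)


def tertile(x):
    """0 = bottom third, 1 = middle third, 2 = top third of the response."""
    if x <= lo_cut:
        return 0
    return 1 if x <= hi_cut else 2


overall = pct_diff(yellow, pink)
within = []  # pink-vs-yellow median difference inside each tertile
shares = []  # fraction of each tertile's points that come from pink
for t in range(3):
    y = [x for x in yellow if tertile(x) == t]
    p = [x for x in pink if tertile(x) == t]
    within.append(pct_diff(y, p))
    shares.append(len(p) / (len(y) + len(p)))

print(f"overall median difference: {overall:+.1f}%")
for name, w, s in zip(("bottom", "middle", "top"), within, shares):
    print(f"{name:>6} tertile: median difference {w:+.1f}%, pink share {s:.2f}")
```

If the assumption holds, the within-tertile differences should come out much smaller than the overall one, while pink's share of the points rises from the bottom tertile to the top. In that case, comparing how each bin is distributed across the tertiles may be more informative than comparing medians inside them.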