1. Data scattered around y=x

Hello everyone,
I am a researcher in engineering. I do have some background on statistics and mathematics in general, but I am not a professional like many of you here. I hope you can give me some insight regarding the problem I am facing.

I have a set of data scattered more or less around the y=x line but the scatter does not start from the (0,0) point and it seems to increase with "x" (figure below). Up to a certain value of x, the data fit perfectly to a straight line. The actual best fit line is not y=x (let's say it is y=0.98x), however, it is meaningful in terms of the physical phenomena represented by the data to relate the statistics with the y=x line.

How would you describe this distribution in a meaningful and "powerful" manner as a statistician? For example, can I just calculate R^2 and say the data fit y=x with R^2=...? This does not seem to be enough to me.

best regards!

2. Re: Data scattered around y=x

The usual regression methods assume that the residuals are at about the same level and evenly spread along the line. Yours are not, so any results that your regression gives are suspect - slopes, intercepts, confidence intervals for those coefficients, R^2, and p values.
Quite often the variance in graphs like these can be stabilized by logging both axes (which turns the fit into a power graph), so long as no values are zero.

3. The Following User Says Thank You to katxt For This Useful Post:

math4engz (04-23-2017)

4. Re: Data scattered around y=x

Originally Posted by katxt
Quite often the variance in graphs like these can be stabilized by logging both axes (which turns the fit into a power graph), so long as no values are zero.
Thanks for your reply. Could you provide some more information regarding "logging the axes"? Are you saying I should use LogNormal axes? The point (0,0) is actually meaningful, so I have at least one zero value.

5. Re: Data scattered around y=x

I'm trying to guess the form of the relationship from the graph. I would try this informal test, which is equivalent to drawing by hand on log-log paper. Dump the data into Excel. Remove any zero values, assuming that they are real but too small to measure. Don't put the (0,0) point in, because a power law passes through the origin automatically. Draw a scatter plot. Now format the Y axis and make it logarithmic. Next format the X axis and make it logarithmic too. If the true relationship is a power law (y = a x^n with n close to 1) then there is a good chance that the graph will look quite ordinary, with the residuals spread fairly evenly on both sides along the line. Put in a power trendline. If you're not into Excel let me know.
If this gives a nice picture, then log all the values (ignoring for the moment the zeros) and redo the regression. The slope is the power which should be near to 1. Your output may well give you a confidence interval for the power which should include 1.
The uncertainty for a power graph is +/- a percentage; for an ordinary graph it is +/- some constant.
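For readers not using Excel, the same steps can be sketched in plain Python. This is only an illustration: the function name fit_power_law and the synthetic data are assumptions made up for the example, not the poster's data or method.

```python
import math
import random

def fit_power_law(xs, ys):
    """Least-squares fit of log10(y) = log10(a) + n*log10(x); returns (a, n)."""
    # Zeros (and negatives) cannot be logged, so those pairs are dropped first.
    pairs = [(math.log10(x), math.log10(y))
             for x, y in zip(xs, ys) if x > 0 and y > 0]
    m = len(pairs)
    mean_u = sum(u for u, _ in pairs) / m
    mean_v = sum(v for _, v in pairs) / m
    sxx = sum((u - mean_u) ** 2 for u, _ in pairs)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pairs)
    n = sxy / sxx                      # slope of the log-log line = the power
    a = 10 ** (mean_v - n * mean_u)    # back-transformed intercept = the constant
    return a, n

# Synthetic stand-in data (assumption): y = x with a +/- percentage error.
random.seed(0)
xs = [10 ** random.uniform(-4, -1) for _ in range(1000)]
ys = [x * math.exp(random.gauss(0, 0.05)) for x in xs]
xs.append(0.0)   # a (0,0) point, which the fit silently skips
ys.append(0.0)

a, n = fit_power_law(xs, ys)
print(f"y = {a:.3f} * x^{n:.3f}")
```

With data that really follow y = x up to a percentage error, both a and n should come out close to 1.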

6. Re: Data scattered around y=x

Hello katxt,
thanks for your time. This is what I get when I use base 10 Logarithmic scale. I added a power trendline in Excel and the equation is plotted on the graph as well.

As you can see (compare to the previous graph), the R^2 value is closer to 1.
Is this what you expected? Any thought? What would you say about the scatter near the (0.1; 0.1) point?
Also, I figured out that I can directly add a power trendline without converting the axes to Logarithmic scale, and I get the same trendline equation (I guess that's the way Excel Log axes work).

7. Re: Data scattered around y=x

Wow. What a range of values. The errors seem to be different above and below about 0.03. Is the measuring equipment or protocol different?
The graph is interesting, and not quite what I expected. I thought there might be more spread around the line at the left hand end. This graph is more informative than the original though because the points below 0.01 were virtually invisible before and things could have diverged from y = x and that wouldn't have been noticed. It does show that the power law extends right to the smallest limit.
If you regress logY against logX, there will be a confidence interval for the intercept (the log of the constant) and for the slope (the power). If the pattern is truly y = x, then the confidence interval for the intercept should include 0, because log 1 = 0, and the confidence interval for the slope should include 1, because the slope is the power. If they do, there is no reason to abandon y = x. kat
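A rough Python sketch of that check follows. In the log-log regression, y = x corresponds to an intercept of 0 and a slope of 1. The function name and the synthetic data are assumptions for illustration, and the normal quantile is used as a large-sample stand-in for the t quantile. These intervals also rely on the usual constant-variance assumption, which may not hold for the data discussed in this thread.

```python
import math
import random
from statistics import NormalDist

def loglog_confidence_intervals(xs, ys, level=0.95):
    """Return ((lo, hi) for the intercept, (lo, hi) for the slope)
    of the regression of log10(y) on log10(x). All values must be > 0."""
    u = [math.log10(x) for x in xs]
    v = [math.log10(y) for y in ys]
    m = len(u)
    mean_u, mean_v = sum(u) / m, sum(v) / m
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    slope = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v)) / sxx
    intercept = mean_v - slope * mean_u
    # Residual variance with m - 2 degrees of freedom.
    s2 = sum((vi - intercept - slope * ui) ** 2 for ui, vi in zip(u, v)) / (m - 2)
    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1 / m + mean_u ** 2 / sxx))
    # Normal quantile instead of the t quantile (assumption: large sample).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return ((intercept - z * se_intercept, intercept + z * se_intercept),
            (slope - z * se_slope, slope + z * se_slope))

# Synthetic stand-in data (assumption): y = x with a constant percentage error.
random.seed(0)
xs = [10 ** random.uniform(-4, -1) for _ in range(1000)]
ys = [x * math.exp(random.gauss(0, 0.05)) for x in xs]

intercept_ci, slope_ci = loglog_confidence_intervals(xs, ys)
print("intercept CI:", intercept_ci, "contains 0:", intercept_ci[0] <= 0 <= intercept_ci[1])
print("slope CI:", slope_ci, "contains 1:", slope_ci[0] <= 1 <= slope_ci[1])
```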

8. The Following User Says Thank You to katxt For This Useful Post:

math4engz (04-25-2017)

9. Re: Data scattered around y=x

You should use log base e, otherwise you are not really rescaling them, I believe. Also you can look up "funnel shaped residuals" online to better understand the phenomenon.

10. The Following User Says Thank You to hlsmith For This Useful Post:

math4engz (04-25-2017)

11. Re: Data scattered around y=x

It doesn't really matter whether you use loge or log10. The final results will be the same after you have back transformed them. Log10 may perhaps come more naturally to an engineer.
The first diagram certainly suggested your common funnel errors but the pattern usually disappears with logging or some other suitable transformation. Not so in this case, which is why I suggested that the nature of the errors may differ for higher values.
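The point about the base of the logarithm can be checked numerically. In this small sketch the five data pairs are made up for illustration and loglog_fit is a hypothetical helper name. The slope is the same for either base, and so is the constant once each intercept is back transformed with its own base.

```python
import math

def loglog_fit(xs, ys, log):
    """Least-squares line through (log(x), log(y)); returns (intercept, slope)."""
    u = [log(x) for x in xs]
    v = [log(y) for y in ys]
    m = len(u)
    mean_u, mean_v = sum(u) / m, sum(v) / m
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    slope = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v)) / sxx
    return mean_v - slope * mean_u, slope

# Made-up illustration data, roughly along y = x.
xs = [0.001, 0.003, 0.01, 0.03, 0.1]
ys = [0.0011, 0.0029, 0.0102, 0.028, 0.105]

b10, n10 = loglog_fit(xs, ys, math.log10)   # base 10
be, ne = loglog_fit(xs, ys, math.log)       # base e

a10 = 10 ** b10        # back transform the base-10 intercept
ae = math.exp(be)      # back transform the base-e intercept
print(n10, ne)         # same power
print(a10, ae)         # same constant
```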

12. The Following User Says Thank You to katxt For This Useful Post:

hlsmith (04-25-2017)

13. Re: Data scattered around y=x

Originally Posted by katxt
Is the measuring equipment or protocol different?
The data actually represent the results of a numerical simulation on a computer, in the field of structural engineering. That being said, the scatter starts once the structural behavior becomes nonlinear and inelastic. Also, the data fit the y=x line perfectly for as long as the response is linear elastic.

14. Re: Data scattered around y=x

Originally Posted by hlsmith
....you can look up "funnel shaped residuals" online to better understand the phenomenon.
Thanks a lot. The term "funnel shaped residuals" will definitely help me narrow down my research online.

15. Re: Data scattered around y=x

I am not sure if I correctly understood katxt's suggestions but here is what I did:
- I calculated the base 10 Log of x and base 10 Log of y (I can't think of any advantage in using the Natural Logarithm in my case);
- I plotted the data and added a linear trendline in Excel:

It shows that Log y ≈ Log x and it gives R^2 close to 99%. This seems strange to me. Why is the equation "Log y = Log x" describing my data better than "y=x"? Let's put it this way: I know the data do not fit perfectly to y=x, I just need to quantify the "scatter".

16. Re: Data scattered around y=x

You're quite right. Log y = log x doesn't describe the data any better than y = x. I just suggested the log-log transformed graph in the hope that it would show a pattern in the residuals. Quite commonly when this is done, the residuals are much the same all along the graph, which would show that the errors are multiplicative rather than additive. Your residuals don't fit the classical assumptions of either a multiplicative or an additive model, so you are into what may well be a new area.
One suggestion is that you assume y = x, and just work on the residuals in the inelastic region from about 0.03 up and see if you can find a pattern there. Perhaps you can post a log log graph for the data above 0.03.
kat
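One possible way to start on that suggestion, sketched in Python: take y = x as given and look only at the residuals for x at or above 0.03. The function names and the six data pairs are made up for illustration. Both the absolute residual y - x and the relative residual (y - x)/x are kept, since it is not known in advance which one behaves more regularly.

```python
import math

def residuals_about_identity(xs, ys, threshold=0.03):
    """Residuals from the line y = x for points with x >= threshold.
    Returns a list of (x, absolute residual, relative residual)."""
    return [(x, y - x, (y - x) / x)
            for x, y in zip(xs, ys) if x >= threshold]

def rms(values):
    """Root-mean-square size of a list of residuals."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Made-up illustration data: exact below 0.03, scattered above it.
xs = [0.01, 0.02, 0.04, 0.06, 0.08, 0.10]
ys = [0.01, 0.02, 0.041, 0.057, 0.085, 0.092]

rows = residuals_about_identity(xs, ys)
print("points used:", len(rows))
print("RMS absolute residual:", rms([r[1] for r in rows]))
print("RMS relative residual:", rms([r[2] for r in rows]))
```

Plotting the second or third column of rows against the first is then a direct way to look for a pattern in the inelastic region.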

17. Re: Data scattered around y=x

Originally Posted by katxt
Perhaps you can post a log log graph for the data above 0.03.
Here it goes.

From this graph above I can understand that the Logarithmic plot somehow "hides" the important dots (i.e. dots for x>0.03) and falsely displays a good fit to a straight line. Keep in mind that there are many dots in that plot (around 1000 dots).

I don't see any pattern in the graph above...

18. Re: Data scattered around y=x

Thanks. The main pattern here is that the errors (residuals) increase as you get higher. I don't know exactly what you are looking for, but I imagine that it is something about how accurate predictions are at some level. Because of non-linearity there may be no simple solution.
Is the simulation still available? If so you could rerun it with x set at fixed levels, say 0.02 to 0.11 in steps of 0.01, and get 1000 points at each level. Find the SD of the relative error at each level and plot the SD against x. See what happens. kat
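The experiment described above could be organized roughly as follows. The function run_simulation here is a purely hypothetical stand-in (the real structural simulation is not shown in this thread), and its noise model is an assumption chosen only so that the script produces output.

```python
import random
import statistics

def run_simulation(x, rng):
    """Hypothetical stand-in for the real simulation (assumption):
    exact below x = 0.03, relative noise growing with x above it."""
    if x <= 0.03:
        return x
    return x * (1 + rng.gauss(0, 2.0 * (x - 0.03)))

def sd_of_relative_error(levels, runs_per_level, rng):
    """SD of (y - x)/x at each fixed level of x."""
    result = {}
    for x in levels:
        rel = [(run_simulation(x, rng) - x) / x for _ in range(runs_per_level)]
        result[x] = statistics.stdev(rel)
    return result

rng = random.Random(1)
levels = [round(0.02 + 0.01 * i, 2) for i in range(10)]   # 0.02, 0.03, ..., 0.11
sds = sd_of_relative_error(levels, 1000, rng)
for x, sd in sds.items():
    print(f"x = {x:.2f}  SD of relative error = {sd:.4f}")
```

Replacing run_simulation with a call to the actual model and plotting the printed SD values against x would give the graph suggested in the post.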

19. Re: Data scattered around y=x

Originally Posted by math4engz
It shows that Log y ≈ Log x and it gives R^2 close to 99%. This seems strange to me. Why is the equation "Log y = Log x" describing my data better than "y=x"? Let's put it this way: I know the data do not fit perfectly to y=x, I just need to quantify the "scatter".
Keep in mind that log(y) is a different dependent variable (DV) than y, so it isn't quite right to say that either model explains the data better, since a different DV is being explained. It's like claiming that test scores as a function of study hours are "better" explained than the price of cabbage as a function of rainfall because the R-square is higher for the former. In general, it doesn't make sense to compare R-square values for different dependent variables. Also keep in mind that taking the log of a variable significantly reduces the variation in the data, so it's not uncommon to see a higher R-square (but remember it's not appropriate to compare these R-square values). If you want to compare the R-square from the log(y) = log(x) model with the original model, you can obtain a "pseudo" R-square: obtain the predicted log(y) values, take the anti-log to get predictions on the original scale, then use these to calculate residuals and a "pseudo" R-square. This pseudo R-square can be compared with the R-square of the original, untransformed Y.
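A minimal Python sketch of that pseudo R-square calculation, assuming a simple least-squares fit of log10(y) on log10(x). The function names and the five data pairs are made up for illustration.

```python
import math

def r_squared(ys, preds):
    """1 - SS_res / SS_tot, computed on the scale of ys."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def pseudo_r_squared(xs, ys):
    """Fit log10(y) on log10(x), anti-log the predictions,
    then score them against the untransformed y values."""
    u = [math.log10(x) for x in xs]
    v = [math.log10(y) for y in ys]
    m = len(u)
    mean_u, mean_v = sum(u) / m, sum(v) / m
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    slope = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v)) / sxx
    intercept = mean_v - slope * mean_u
    preds = [10 ** (intercept + slope * ui) for ui in u]   # back to original scale
    return r_squared(ys, preds)

# Made-up illustration data, roughly along y = x.
xs = [0.001, 0.003, 0.01, 0.03, 0.1]
ys = [0.0011, 0.0029, 0.0102, 0.028, 0.105]

pseudo = pseudo_r_squared(xs, ys)
identity = r_squared(ys, xs)   # R-square of the fixed line y = x, for comparison
print("pseudo R-square (log-log fit, original scale):", pseudo)
print("R-square about y = x:", identity)
```

Both numbers are computed on the original y scale, so unlike the raw log-log R-square they can be set side by side.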
