
Thread: How to determine if two R2 values (rsquared) are significantly different

  #1

    How to determine if two R2 values (rsquared) are significantly different




    Hi TalkStats

    I have fit two different functions (f1, f2) to some data. These functions have different forms (see below), in that they have different numbers of parameters, but the data are the same and the number of predictor variables is the same.

    f_1 = \frac{1}{1 + \exp((V - V_{2h})/k_h)}

    f_2 = y_0 + A_0 \left( \frac{B_0}{1 + \exp((V - V_0)/k_0)} + \frac{1 + B_0}{1 + \exp((V - V_1)/k_1)} \right)

    Here, V is the predictor (independent) variable. All other symbols are parameters fit by a least-squares method. f1 is a sigmoid with 2 parameters; f2 is a double sigmoid with 7 parameters.
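    (For concreteness, a minimal sketch of the two functions in Python/NumPy, transcribing the formulas exactly as written above; the parameter names mirror the notation.)

    Code:
        import numpy as np

        def f1(V, V_2h, k_h):
            # single sigmoid, 2 free parameters
            return 1.0 / (1.0 + np.exp((V - V_2h) / k_h))

        def f2(V, y0, A0, B0, V0, k0, V1, k1):
            # double sigmoid, 7 free parameters (form copied as posted)
            return y0 + A0 * (B0 / (1.0 + np.exp((V - V0) / k0))
                              + (1.0 + B0) / (1.0 + np.exp((V - V1) / k1)))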

    I obtain an R2 for each fit: one is R^2_{f1} = 0.99975 and the other is R^2_{f2} = 0.995.

    I want to know which function the data fit best, i.e. is one of these R2 values statistically greater than the other? Here they are almost indistinguishable, but in other cases they may differ more (e.g. R2 = 0.95 vs 0.9). Can I do a test that tells me which function I should use? Ultimately, that is what I want to know; whether it is decided by the R2 value or not doesn't matter (in fact, I may be wrong in thinking that R2 is what I should use).

    If I am to use the R-squared to determine which function (f1 or f2) to use, I have done some reading and think I need to do something like what is described on the following webpage:

    http://www.analytictech.com/ba762/ha...rersquares.htm

    Basically, for regression analysis you can test whether two R2 values generated from two different regression models (typically with different numbers of predictor variables) are statistically different by calculating an F statistic. Say we have two models m1 and m2, with m1 the model with more predictors; then you can calculate:

    F = \frac{(R^2_{m1} - R^2_{m2})/(df_{m1} - df_{m2})}{(1 - R^2_{m1})/(n - df_{m1} - 1)}

    where df_{m1} and df_{m2} are the numbers of predictor variables in the two models, n is the number of observations, and the R^2 terms are the r-squareds of the two models (m1 and m2). You obviously cannot use this analysis if the number of predictor variables is the same in the two models.
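    (For illustration only, a minimal sketch of that test in Python, assuming m1 is the model with more predictors; scipy.stats.f supplies the reference distribution.)

    Code:
        from scipy.stats import f as f_dist

        def r2_f_test(r2_full, r2_red, df_full, df_red, n):
            # F test for two *nested* linear models: the gain in R^2 per added
            # predictor, relative to the full model's unexplained variance.
            f_val = ((r2_full - r2_red) / (df_full - df_red)) \
                    / ((1.0 - r2_full) / (n - df_full - 1))
            p_val = f_dist.sf(f_val, df_full - df_red, n - df_full - 1)
            return f_val, p_val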

    My problem is that I have the same number of predictor variables in my functions f1 and f2; the difference between them is the number of parameters to be fit. Do I set df_{m1} and df_{m2} to the numbers of parameters used in the functions (2 and 7, respectively)?

    Any help on this would be really appreciated!

    Thanks,
    MG

  #2
    Jake (Cookie Scientist)

    Re: How to determine if two R2 values (rsquared) are significantly different

    Hi MG. The F-ratio approach is only valid for comparing linear models that are nested. Because your models are neither linear nor nested, there's no good reason to think this will work well.

    I think the most common thing to do in a situation like this would be to compare the models' information criteria, such as AIC or BIC, and pick the model with the lower value -- but note that this is technically a bit different from comparing the two models on which has the higher R^2. So this may or may not work for you, depending on how committed you are to using R^2 as the basis for comparison, or whether you just want to know, more generally, which model is more "consistent with the data."

    What I personally would probably do here is use a bootstrap approach. In each iteration, sample the rows of the dataset with replacement, fit both models to the resampled data, compute their R^2 values, and then take their difference. Do this enough times and you'll have a bootstrap sampling distribution of the difference in R^2 values. Now check to see where the null value of 0 lies in this distribution.
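    (A minimal sketch of that procedure in Python, assuming the data are already in arrays V and y and reusing the f1/f2 definitions sketched earlier; the starting values are placeholders that would need tuning for real data.)

    Code:
        import numpy as np
        from scipy.optimize import curve_fit

        p0_1 = [-30.0, 5.0]                             # placeholder [V_2h, k_h]
        p0_2 = [0.0, 1.0, 0.5, -40.0, 5.0, -20.0, 5.0]  # placeholder [y0, A0, B0, V0, k0, V1, k1]

        def r_squared(y_obs, y_hat):
            ss_res = np.sum((y_obs - y_hat) ** 2)
            ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
            return 1.0 - ss_res / ss_tot

        rng = np.random.default_rng(0)
        n = len(y)
        delta_r2 = []
        for _ in range(2000):                    # number of bootstrap resamples
            idx = rng.integers(0, n, size=n)     # sample row indices with replacement
            Vb, yb = V[idx], y[idx]
            p1, _ = curve_fit(f1, Vb, yb, p0=p0_1, maxfev=20000)
            p2, _ = curve_fit(f2, Vb, yb, p0=p0_2, maxfev=20000)
            delta_r2.append(r_squared(yb, f1(Vb, *p1)) - r_squared(yb, f2(Vb, *p2)))
        delta_r2 = np.array(delta_r2)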
    “In God we trust. All others must bring data.”
    ~W. Edwards Deming


  #3
    Dragan (Super Moderator)

    Re: How to determine if two R2 values (rsquared) are significantly different

    Very good suggestion on the use of the bootstrap, Jake. You virtually "read my mind": I was thinking the same thing while reading the OP's post, before I read yours. :-)

  #4
    mahdieh.godazgar

    Re: How to determine if two R2 values (rsquared) are significantly different

    Hi Jake and Dragan,

    Thank you very much for your informative and speedy replies. This is not something I am 100% confident in, so I would appreciate your help with the below.

    Firstly, the bootstrap approach... Am I correct in thinking that you are suggesting I take my dataset (say, 30 data points) and create lots of sub-datasets by deletion and 'replacement'? (Replacing what with what? Do you replace missing data with mean data?) I then conduct both fits on each such dataset, calculate \Delta R^2, and draw its distribution. I then look at the distribution, see where 0 lies, and check whether the mass is +ve or -ve; if it is mostly positive, say, then one function fits better than the other. My issue is that this still sounds quite subjective: at the end of this process, I am still having to interpret a distribution by eye.

    After doing some reading, perhaps the best approach would be to use this "Akaike Information Criterion" (or the Bayesian one); however, I am not very familiar with calculating it. The equation I have found is:

    AIC = -2\ln(\text{likelihood}) + 2K

    where "likelihood" is the probability of the data given a model and K is the number of free parameters in the model.

    The input to the criterion that I am not clear about is the "likelihood". What exactly is it, and how do I compute it? I have done some reading and think it is somehow associated with the error from the least-squares fit, but I am not sure. Is this value something that could be "spat out" by something like MATLAB?
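    (For what it's worth, under the usual assumption of independent Gaussian errors the least-squares likelihood has a closed form, so the AIC can be computed directly from the residual sum of squares. A minimal sketch, where rss is the residual sum of squares of a fit, n the number of data points, and K the number of fitted parameters; note that some texts count the error variance as an extra parameter, i.e. use K + 1.)

    Code:
        import numpy as np

        def aic_from_rss(rss, n, K):
            # maximized Gaussian log-likelihood of a least-squares fit:
            # logL = -(n/2) * (log(2*pi) + log(rss/n) + 1)
            logL = -0.5 * n * (np.log(2.0 * np.pi) + np.log(rss / n) + 1.0)
            return -2.0 * logL + 2.0 * K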

    Thanks!!!!
    MG :-)

  #5
    Jake (Cookie Scientist)

    Re: How to determine if two R2 values (rsquared) are significantly different


    Quote Originally Posted by mahdieh.godazgar View Post
    Am I correct in thinking that what you are suggesting is take my dataset (say, 30 datapoints), and creating lots of sub-datasets by deletion and 'replacement'. (Replacing what with what? Do you replace missing data with mean data?)
    You build each resampled dataset like so: first, randomly draw 1 row from the original dataset. Next, randomly draw another row from the original dataset and add it to the first row you drew. (Note that you may draw the same row twice; this is okay, and it is what I meant by "sampling with replacement.") Do this 30 times until you have built up a new dataset that is the same size as the original but is made up of a random selection of rows, some duplicated and some omitted. Then compute \Delta R^2 on this resampled dataset, just as you said.

    Quote Originally Posted by mahdieh.godazgar View Post
    I then look at the distribution, see where 0 lies and look to see if it is +ve or -ve. If its positively skewed, then one function fit is better than the other, say. My issue is that this sounds quite subjective still? At the end of this process, I am still having to interpret a distribution by eye.
    You can be more precise by doing things like (a) finding the 2.5% and 97.5% quantiles of the distribution, i.e., the middle 95%, and seeing whether 0 lies in this interval, or (b) computing the proportion of the distribution that falls below 0 (or above 0, depending on which direction you computed the R^2 differences). Method (a) is called the percentile bootstrap confidence interval, and method (b) is kind of like a p-value, but not exactly.
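    (Both checks are one-liners on the delta_r2 array from the bootstrap sketch above.)

    Code:
        lo_q, hi_q = np.percentile(delta_r2, [2.5, 97.5])  # (a) 95% percentile interval
        interval_excludes_zero = not (lo_q <= 0.0 <= hi_q)
        prop_below_zero = np.mean(delta_r2 < 0.0)          # (b) share of differences below 0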

    Quote Originally Posted by mahdieh.godazgar View Post
    Is this value something that could be "spat out" in something like MATLAB?
    I don't personally use MATLAB, but I am pretty confident there will be some way to have the MATLAB nonlinear regression function that you're using spit out the log-likelihood, or possibly -2*log(likelihood), in which case it may be called the "deviance." Some googling for terms like "matlab aic model comparison" should help.
    “In God we trust. All others must bring data.”
    ~W. Edwards Deming

