
Thread: How to choose data for your regression

  1. #1

    How to choose data for your regression




    Hi guys,

    I'm developing a statistical method to analyse a player's performance in tennis (annotation). To do this I need to decide which performance indicators to analyse. I went to the WTA's website and got the "matchfact" data for every available player ranked from 1 to 104 (n=78). These include total aces, double faults, first-serve percentage and break points saved percentage, among some 20 other stats. I want to see which of these independent variables best predict a player's rank.

    The problem is, there are about 20 in total. I've used reason to remove some of them, but I feel I'm still left with too many to run a decent regression. How best to decide what data to put into a regression?
    Should I compute a Pearson correlation matrix and then choose the top 5 by significance (or some other arbitrary number)?

    Does anyone think correlations and coefficients alone are enough (i.e., do I really need multiple regression)?

    Should I follow this up with a stepwise regression, or just keep it as forced entry (hierarchical or not)?

    Mind blown.

    Thanks in advance.
    Last edited by ianhargreaves80; 07-17-2013 at 12:27 PM.

  2. #2
    noetsi

    Re: How to choose data for your regression

    78 cases is probably not enough to run a regression, period. That said, every extra variable you add to your model leaves you in worse shape, so the fewer the better.

    The best answer to your question is to have theory determine which variables you leave in and out. Some use stepwise regression instead, which many sharply object to. Sometimes people run a series of models with different variables and see which has the highest R squared or the lowest AIC. The problem with that approach is that the chance of making an error in the statistical tests increases as you use the same data set to analyze different models (corrections such as Bonferroni are used to address this).
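
    A rough illustration of the multiple-testing point, assuming the tests are independent (an idealisation, since models fitted to the same data are not) and using 20 tests only because the original post mentions about 20 candidate statistics. The function names are made up for this sketch.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # 20 is an assumption taken from the "about 20" stats in the first post.
    alpha, n_tests = 0.05, 20
    print(f"familywise error rate: {familywise_error_rate(alpha, n_tests):.3f}")
    print(f"Bonferroni per-test threshold: {bonferroni_threshold(alpha, n_tests):.4f}")
```

    Under these assumptions the chance of at least one false positive is roughly 0.64, and the Bonferroni per-test threshold is 0.0025.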

    Multiple regression is superior if more than one variable influences the dependent variable, as seems likely here. Pairwise correlation does not address that issue.
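
    A small sketch of why pairwise correlation can mislead when predictors overlap. The data are synthetic (not WTA stats) and deliberately idealised: y depends on x1 only, while x2 merely tracks x1. The two-predictor standardized coefficients are computed from the three pairwise correlations; the helper names are assumptions of this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standardized_betas(y, x1, x2):
    """Standardized coefficients for y regressed on x1 and x2 together."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2


# Synthetic data: y is an exact linear function of x1; x2 is x1 slightly shuffled.
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x2 = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
y = [2 * v + 1 for v in x1]

r_y2 = pearson(y, x2)
beta1, beta2 = standardized_betas(y, x1, x2)
print(f"corr(y, x2) = {r_y2:.3f}")
print(f"beta1 = {beta1:.3f}, beta2 = {beta2:.3f}")
```

    Here x2 correlates with y at about 0.94, yet its coefficient is essentially zero once x1 is in the model. Real data will never be this clean, but the direction of the effect is the same.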
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  3. #3

    noetsi, thanks for your response.

    Hmmmm....

    Yes, 78 may not be enough, but that is all the data available on the ATP/WTA website http://www.wtatennis.com/singles-rankings. The stats just are not available for the lower rankings. I will have to mention this as a limitation. Thank god this isn't a clinical trial. I am adding this to some similar research on the men's game that used 2007 data.

  4. #4
    noetsi

    Sometimes you simply have to say there was not enough data to use a method. Regression is one of those. It is common for elements of a method to be true only asymptotically (as n gets larger).

    One thing that confuses me is that you mention ranks 1 to 104 and your n is 78. What happened to the other 26?

  5. #5
    Dason

    If you're just looking for prediction you could consider fitting a regression tree or a random forest.
    I don't have emotions and sometimes that makes me very sad.

  6. #6
    noetsi

    What is a regression tree?
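
    For readers with the same question: a regression tree repeatedly splits the cases on one predictor at a time and predicts the mean outcome within each resulting group. The sketch below fits only the first split (a "stump") by brute force. The numbers are made up, not WTA data, and the function names are assumptions of this example rather than any package's API.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_stump(x, y):
    """Find the single threshold on x that minimises total squared error in y.

    Returns (threshold, mean_left, mean_right).
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        left = [py for _, py in pairs[:i]]
        right = [py for _, py in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold, mean(left), mean(right))
    if best is None:
        raise ValueError("x has no two distinct values to split on")
    return best[1:]


def predict(stump, x_new):
    """Predict the group mean on whichever side of the threshold x_new falls."""
    threshold, mean_left, mean_right = stump
    return mean_left if x_new <= threshold else mean_right


# Made-up numbers for illustration only.
first_serve_pct = [55, 58, 60, 62, 66, 68, 70, 72]
rank = [90, 85, 95, 88, 20, 25, 15, 22]

stump = fit_stump(first_serve_pct, rank)
print(stump)  # (64.0, 89.5, 20.5)
```

    A full tree applies the same search recursively inside each group, and a random forest averages many such trees grown on resampled data.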

  7. #7
    trinker

    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  9. #8

    It is 78 because the WTA website does not carry all the stats for all the players. It holds less data for players in the lower echelons.

    I know I'm going to have to be careful about generalising to other players from the stats of only 78, but unfortunately these are the only data I have. The other option, of course, is collating the men's data too, but that could also be a problem: they may play a different game tactically, so it could be argued these data should be treated separately.

    I guess I need to report findings, and be careful about making strong predictions from these.

    I wonder if SPSS Modeler can make a regression tree. Does it show more, or is it just another way of showing the same info?

  10. #9
    noetsi

    You have to be really careful about one thing (which has nothing to do with regression; it has to do with the logic of your study). If the players below the top 100 differ significantly from top players in the way these statistics predict your dependent variable (and that seems possible to me), then the results you get will not generalize to them. Predicting outside the range of your actual data with regression is always dangerous (if commonly done).
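
    A toy illustration of the extrapolation risk. The "true" relationship below is a curve chosen purely for the example (an assumption, nothing to do with tennis): a straight line fitted on a narrow range looks fine there and goes badly wrong outside it.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def true_relationship(x):
    """Hypothetical curved relationship, assumed only for this sketch."""
    return x ** 2


# Observed range: x from 1 to 5 only.
x_obs = [1, 2, 3, 4, 5]
y_obs = [true_relationship(v) for v in x_obs]
slope, intercept = fit_line(x_obs, y_obs)

# Largest error inside the observed range.
worst_in_sample = max(
    abs(slope * v + intercept - true_relationship(v)) for v in x_obs
)

# Error when predicting well outside the observed range.
x_new = 10
extrapolation_error = abs(slope * x_new + intercept - true_relationship(x_new))

print(f"worst error inside the data range: {worst_in_sample:.1f}")
print(f"error at x = {x_new}: {extrapolation_error:.1f}")
```

    Inside the observed range the line is never off by more than 2; at x = 10 it is off by 47.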

  11. #10

    Correlation is useful for information about the relationship between two variables. You would want to examine the correlations between potential IVs and performance, as well as the correlations among the IVs themselves.
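
    A minimal sketch of such a correlation matrix in plain Python. The column names and values are hypothetical placeholders, not real match statistics.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return {name: {other_name: r}} for every pair of columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Hypothetical values for three candidate predictors.
data = {
    "aces": [1, 2, 3, 4, 5],
    "double_faults": [5, 4, 3, 2, 1],
    "first_serve": [2, 1, 4, 3, 5],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

    Pairs of IVs with correlations near 1 or -1 (here the first two columns) carry largely the same information, which is one reason to drop or combine them before fitting a regression.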

    Multiple regression takes into account the shared variance among IVs. When creating your regression model, remember the idea of parsimony. Think about what the most important variables are in predicting performance (based on logic and reasoning). You can add additional variables using a building-up method and keep them if they show incremental validity.

    Stepwise uses the data to determine the best-fitting model. This is frowned upon because it is atheoretical and problematic when you have a small sample as you do here.
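
    A simulation sketch of the small-sample problem with data-driven selection. Every predictor here is pure noise, yet picking the "top 5" by correlation still tends to turn up something that looks significant. The sample size and predictor count follow the thread (n = 78, about 20 stats); the critical value of roughly 0.223 is the approximate two-tailed 5% cutoff for |r| at n = 78, and all names are assumptions of this example.

```python
import random
from math import sqrt

CRITICAL_R = 0.223  # approximate two-tailed 5% critical |r| for n = 78


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spurious_selection(n_cases=78, n_predictors=20, top_k=5, rng=None):
    """Return the top_k largest |r| between a noise outcome and noise predictors."""
    rng = rng or random.Random()
    y = [rng.gauss(0, 1) for _ in range(n_cases)]
    correlations = []
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_cases)]
        correlations.append(abs(pearson(x, y)))
    return sorted(correlations, reverse=True)[:top_k]


def share_with_false_hit(n_sims=300, seed=0):
    """Fraction of simulated data sets whose best noise predictor passes the cutoff."""
    rng = random.Random(seed)
    hits = sum(
        spurious_selection(rng=rng)[0] > CRITICAL_R for _ in range(n_sims)
    )
    return hits / n_sims


if __name__ == "__main__":
    print("top 5 |r| in one noise data set:",
          [round(r, 3) for r in spurious_selection(rng=random.Random(1))])
    print("share of data sets with a 'significant' noise predictor:",
          share_with_false_hit())
```

    The exact figures depend on the random seed, but a majority of simulated data sets typically contain at least one noise predictor that clears the cutoff.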

  12. #11
    noetsi

    It is also frowned on because coincidence can force one variable into the model while a variable correlated with it gets left out, which will cause your error term to be correlated with a variable outside the model. It is also frequently not robust between samples.

  13. #12
    GretaGarbo

    Suppose that there is a player who never hits aces, never double-faults, etc., but simply outplays the opponent. Isn't a record of wins against good players a more important predictor?

  14. #13
    SmoothJohn

    Quote Originally Posted by noetsi View Post

    The best answer to your question is to have theory that determines which you leave in and out.

    I think that this is the most important consideration. Generate a hypothesis based on solid theory (and previous empirical work, if possible) and then test it.

    Not a big fan of regressional fishing expeditions.

    John

  15. #14
    noetsi

    Not a big fan of regressional fishing expeditions.
    A real life section of a statistical text was called: "Death to Stepwise: Think for Yourself"

  16. #15
    SmoothJohn


    Quote Originally Posted by noetsi View Post
    A real life section of a statistical text was called: "Death to Stepwise: Think for Yourself"
    Very nice.
