1. ## How to choose data for your regression

Hi guys,

I'm developing a statistical method to analyse a player's performance in tennis (annotation). To do this I need to decide on certain performance indicators to analyse. I've gone on to the WTA's website and got the "matchfact" data for every available rank from 1 to 104 (n=78). These include total aces, double faults, first serve percent and break points saved percent, among some 20 other stats. I'm going to see which of these independent variables best predict a player's rank.

The problem is, there are about 20 in total. I've used reason to remove some of them but I feel I'm still left with too many to run a decent regression. How best to decide what data to put into a regression?
Should I do a Pearson correlation matrix and then choose the top 5 on significance (or some other arbitrary number)?

Does anyone think correlation coefficients alone are enough (i.e., do I really need multiple regression)?

Should I follow this up with a stepwise or just keep it forced entry - hierarchical (or not hierarchical)?

Mind blown.

2. ## Re: How to choose data for your regression

78 cases is probably not enough to run a regression, period. That said, you would be in worse shape with every extra variable you add to your model, so the fewer the better.

The best answer to your question is to have theory that determines which variables you leave in and out. Some use stepwise regression instead, which is sharply objected to by many. Sometimes people will run a series of models with different variables and see which has the highest R squared or the lowest AIC. The problem with that approach is that the chance of a Type I error increases as you use the same data set to test different models (corrections such as Bonferroni are used to address this).
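
To put rough numbers on the multiple-testing problem, here is a minimal sketch in Python. It assumes the tests are independent, which correlated tennis stats will not be, so treat it as an illustration only. The function names are mine, not from any library.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests,
    each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / m


m = 20  # roughly the number of candidate predictors mentioned in this thread
print(f"family-wise error rate at 0.05: {familywise_error_rate(0.05, m):.3f}")
print(f"Bonferroni per-test level:      {bonferroni_alpha(0.05, m):.4f}")
```

With 20 tests at the usual 0.05 level, the chance of at least one spurious "significant" result is roughly 64%, which is why screening 20 predictors on p-values alone is risky.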

Multiple regression is superior if more than one variable influences the dependent variable as seems likely here. Correlation does not address that issue.

3. ## Re: How to choose data for your regression

Hmmmm....

Yes, 78 may not be enough, but that is all the data available on the ATP/WTA website http://www.wtatennis.com/singles-rankings. The stats just are not available for the lower rankings. I will have to mention this as a limitation. Thank god this isn't a clinical trial. I am adding this to some similar research that was done on 2007 data for the men's game.

4. ## Re: How to choose data for your regression

Sometimes you simply have to say there was not enough data to use a method. Regression is one of those. It is common for elements of a method to be true only asymptotically (as n gets larger).

One thing that confuses me is that you mention rank 1 to 104 and your n is 78. What happened to the other 26?

5. ## Re: How to choose data for your regression

If you're just looking for prediction you could consider fitting a regression tree or a random forest.
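
As a rough illustration of the idea: a regression tree repeatedly splits the data on one predictor at a time and predicts the mean of the outcome within each resulting group. The sketch below finds only the single best split (a "stump"); a real tree repeats this inside each group, and a random forest averages many such trees. The function name and the numbers are made up for illustration and are not real WTA data.

```python
from statistics import mean


def best_split(x, y):
    """Find the threshold on x that minimises the total squared error when
    each side of the split is predicted by its own mean."""
    def sse(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighbouring values
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t, mean(left), mean(right))
    return best  # (error, threshold, mean below, mean above)


# Invented toy values, purely for illustration
first_serve_pct = [55, 57, 58, 60, 64, 66, 67, 70]
rank = [90, 85, 95, 80, 30, 25, 35, 20]

err, threshold, left_mean, right_mean = best_split(first_serve_pct, rank)
print(f"split at {threshold}: predict rank {left_mean} below, {right_mean} above")
```

The output of a tree is a set of if/then rules rather than coefficients, which some people find easier to read than a regression table.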

6. ## Re: How to choose data for your regression

What is a regression tree?

9. ## Re: How to choose data for your regression

78 because the WTA website does not carry all the stats for all the players. It holds less data for players in the lower echelons.

I know I'm going to have to be careful about generalising to other players from the stats of only 78, but unfortunately these are the only data I have. The other option, of course, is collating the men's data too, but that could also be a problem as they may play a different game tactically, so it could be argued these data should be treated separately.

I guess I need to report findings, and be careful about making strong predictions from these.

I wonder if SPSS Modeler can make a regression tree. Does this show more, or is it just another way of showing the same info?

10. ## Re: How to choose data for your regression

You have to be really careful about one thing (which has nothing to do with regression; it has to do with the logic of your study). If the players below the top 100 differ significantly from top players in the way these statistics predict your dependent variable (and that seems possible to me), then the results you get will not generalize to them. Predicting outside the range of your actual data with regression is always dangerous (if commonly done).

11. ## Re: How to choose data for your regression

Correlation is useful for information about a relationship between two variables. You would want to examine correlations with potential IVs and performance as well as IVs with other IVs.
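
A minimal sketch of that check in plain Python, looking at every pair of variables (each IV with the outcome, and each IV with every other IV). The variable names and the numbers are invented for illustration and are not real WTA data.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlation_table(data):
    """Correlation for every unordered pair of columns in a dict of lists."""
    names = list(data)
    return {
        (a, b): pearson_r(data[a], data[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Invented toy values, purely for illustration
data = {
    "rank": [5, 12, 20, 33, 47, 60, 75, 90],
    "aces": [310, 280, 250, 200, 210, 150, 120, 100],
    "first_serve_pct": [68, 66, 61, 63, 60, 58, 59, 55],
    "double_faults": [120, 150, 140, 180, 160, 210, 190, 230],
}

table = correlation_table(data)
for (a, b), r in table.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Pairs of IVs with a very high correlation are candidates for dropping one of the two, since they carry largely the same information.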

Multiple regression takes into account the shared variance among IVs. When creating your regression model, remember the idea of parsimony. Think about what the most important variables are in predicting performance (based on logic and reasoning). You can add additional variables using a building-up method and keep them if they show incremental validity.
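
The building-up idea can be sketched as follows: fit a base model, add one variable, and look at how much R squared goes up. This is plain Python with a hand-rolled least-squares fit so that it runs without any packages; in practice you would use your stats software. The helper name and the data are invented for illustration.

```python
def ols_r_squared(columns, y):
    """R squared of an ordinary least-squares fit of y on the given
    predictor columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Invented toy values, purely for illustration
rank = [5, 12, 20, 33, 47, 60, 75, 90]
aces_per_match = [6.0, 5.5, 5.0, 4.0, 4.2, 3.0, 2.5, 2.0]
first_serve_pct = [68, 66, 61, 63, 60, 58, 59, 55]

r2_base = ols_r_squared([aces_per_match], rank)
r2_full = ols_r_squared([aces_per_match, first_serve_pct], rank)
print(f"base R^2 = {r2_base:.3f}, full R^2 = {r2_full:.3f}, "
      f"increment = {r2_full - r2_base:.3f}")
```

Keep in mind that R squared never goes down when you add a variable, so a small increment on its own proves nothing. You would judge it with an F test for the change or with adjusted R squared.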

Stepwise uses the data to determine the best-fitting model. This is frowned upon because it is atheoretical and problematic when you have a small sample as you do here.

12. ## Re: How to choose data for your regression

It is also frowned on because coincidence can force one variable into a model while a variable correlated with it gets left out, which will cause your error term to be correlated with the omitted variable. The selected model is also frequently not robust between samples.

13. ## Re: How to choose data for your regression

Suppose that there is a player who never hits aces, never makes double faults, etc., but simply outplays the opponent. Isn't winning against previously good players a more important predictor?

14. ## Re: How to choose data for your regression

Originally Posted by noetsi

The best answer to your question is to have theory that determines which you leave in and out.

I think that this is the most important consideration. Generate a hypothesis based on solid theory (and previous empirical work, if possible) and then test it.

Not a big fan of regressional fishing expeditions.

John

15. ## Re: How to choose data for your regression

Not a big fan of regressional fishing expeditions.
A real life section of a statistical text was called: "Death to Stepwise: Think for Yourself"

16. ## Re: How to choose data for your regression

Originally Posted by noetsi
A real life section of a statistical text was called: "Death to Stepwise: Think for Yourself"
Very nice.
