# Thread: Choosing the Appropriate Method for Highly Unbalanced Dataset


I am trying to put together a research study and have some statistical concerns that I was hoping to get some help with. I'm working with a set of data that is semi-longitudinal and highly unbalanced. I have no control over the actual data set as far as the number of participants or the number of trials per participant or per year, so I am trying to come up with the most appropriate statistical method to examine my hypotheses given this constraint.

Here's my study in a nutshell: I am trying to validate the results of a pilot study that indicated 10 variables (such as knee angle at touchdown, release height, release angle, etc.) as being critical to performance in the shot put. My current data set is made up of the women who made the finals of the USA Track & Field National Championships between 2002 and 2006. Because I have no control over who shows up to compete, who makes the finals, or how many fair (as opposed to fouled) throws each athlete records, the data set is highly variable both by year and by performer. I would like to control for between-performer differences and look only at the effect of each of the 10 variables on the measured distance of the throw. Here's some pertinent info about the data set:
• As previously stated, I have 10 independent variables and one dependent variable (measured distance of the throw). None of the independent variables is normally distributed.
• There are 5 different data collections (2002, 2003, 2004, 2005, 2006) available for this study. I could argue either way whether this is even something that needs to be considered.
• I have data on 83 different throws.
• In this data set there are 17 different participants (performers), only two of whom have data for every year. Some participants have only one throw examined; others have as many as 12. The mean number of trials per participant is 5.3 and the median is 4.5. As you can see, the data set is nowhere near balanced.
• The data is similarly not balanced by year. For example, in 2002, I only have data for 7 throws. In 2006 I have data for 24 throws.
• While the number of participants is relatively small, it represents a very large share (90+%) of the elite American shot put thrower population.

With such an unbalanced data set, I am trying either to come up with a data reduction method that will bring some balance to the set or to find another way to control for bias (because the best people tend to have the most trials). I'm also concerned that if I pool all of the participants' data together and look at the effect of the 10 variables on the measured distance of the throw, things may get lost in the mix or be misinterpreted because pooling would ignore performer differences. In the end I'd like to validate the results of the pilot study by determining whether the parameters it indicated as important are in fact good indicators or correlates of performance.

After doing my own research, the best options I've come up with are a Wilcoxon signed-rank test looking at pairs of throws for each performer OR transforming the data to be normally distributed and then performing a paired t-test on all the data. Using these methods I could examine the effects of the variables while controlling for performer differences and also potentially bring some balance to the data set (depending on how I set up the pairs). If these are the best options (and I'm not sure they are; if you have another alternative I'd love to hear it), I've come up with several pairing options, each of which appears to have pros and cons.
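To make the signed-rank idea concrete, here is a minimal sketch in plain Python of an exact two-sided Wilcoxon signed-rank test on one pair of throws per performer. Everything in it is an assumption for illustration: the function name `signed_rank_exact`, the choice of pairing each performer's best throw with her worst, and the release-angle numbers, which are made up and are not from the actual data set. It enumerates every sign assignment, so it is only practical for a small number of pairs, which is the situation here.

```python
from itertools import product


def signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Test statistic: sum of ranks attached to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)

    # Under the null every sign pattern is equally likely, so count the
    # patterns at least as far from the expected value as the observed one.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed_dev - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical release angles (degrees): one best/worst pair per performer.
best_throw = [37.2, 35.8, 38.4, 36.1, 39.0, 34.9, 37.7, 36.6]
worst_throw = [35.9, 36.2, 36.8, 35.0, 37.1, 35.3, 36.0, 35.4]

w_plus, p_value = signed_rank_exact(best_throw, worst_throw)
print(f"W+ = {w_plus}, exact two-sided p = {p_value:.4f}")
```

Note that a performer with only one throw contributes no pair under this kind of setup, so the number of usable pairs will be smaller than 17.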

I've attached a quick review of various pairwise arrangements that I've come up with. The second and fourth options reviewed look the best to me. Is there anything else I should consider, and do you prefer one over the other? Of these two, I'm partial to the second option, mainly because of the physiological changes issue. I'm not sure how to apply a Bonferroni adjustment to such a setup, though. I also wasn't sure whether I should run the t-test or the Wilcoxon test.
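For reference, the Bonferroni adjustment itself is simple arithmetic: with m tests, each raw p-value is compared against alpha / m (equivalently, each p-value is multiplied by m and capped at 1). A minimal sketch, assuming one test per variable so that m = 10; the helper name `bonferroni` and the p-values below are hypothetical and only illustrate the calculation:

```python
def bonferroni(p_values, alpha=0.05):
    """Apply a Bonferroni correction to a list of raw p-values.

    Returns (adjusted_p_values, reject_flags), where a test is rejected
    when its raw p-value is at most alpha divided by the number of tests.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p <= alpha / m for p in p_values]
    return adjusted, reject


# Hypothetical raw p-values, one per biomechanical variable (10 tests).
raw_p = [0.001, 0.004, 0.012, 0.030, 0.049, 0.080, 0.150, 0.300, 0.600, 0.900]

adjusted_p, reject = bonferroni(raw_p)
for p, p_adj, r in zip(raw_p, adjusted_p, reject):
    print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  reject = {r}")
```

With 10 tests and an overall alpha of 0.05, the per-test threshold works out to 0.005. If the pairing scheme leads to more than one test per variable, m would be the total number of tests actually run.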

If anyone can help, it would be greatly appreciated. Thanks in advance.
