What is unbalanced data

noetsi

Fortran must die
#1
As in this example:

Another consideration is the method of estimation used by these programs to produce the parameter estimates, either maximum likelihood (ML) or restricted maximum likelihood (REML). Each has its own advantages and disadvantages. ML is better for unbalanced data, but it produces biased results. REML is unbiased, but it cannot be used when comparing two nested models with a likelihood ratio test. Both methods will produce the same estimates for fixed effects, yet they do differ on the random effect estimates (Albright & Marinova, 2010).
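To make the bias point in the quote concrete, here is a minimal sketch for the simplest case I can think of: an intercept-only model with no random effects. That simplification is mine, not the authors' setting, and the data are made up. In that case the ML estimate of the variance divides by n, while the REML estimate divides by n - 1.

```python
import statistics

# Hypothetical sample, made up purely for illustration.
y = [4.0, 6.0, 5.0, 9.0]

# ML estimate of the variance in an intercept-only model: sum of squares / n.
ml_var = statistics.pvariance(y)

# REML estimate in the same model: sum of squares / (n - 1),
# i.e. the usual unbiased sample variance.
reml_var = statistics.variance(y)

print("ML variance estimate:  ", ml_var)
print("REML variance estimate:", reml_var)
```

The ML figure is always the smaller of the two, and the gap shrinks as n grows. In a real mixed model the same idea applies to the variance components, but the arithmetic is no longer this simple.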
 

Miner

TS Contributor
#2
To some extent the definition depends on the type of analysis. To me, definition 1 below describes balanced and unbalanced data. However, I also found definition 2, which has more of a repeated-measures flavor to it. It does not frame the definition in terms of differences in sample size, but goes directly to missing data. The Wikipedia article on panel data has a similar description.

1. In ANOVA and DOE, a balanced design has an equal number of observations for all possible combinations of factor levels. An unbalanced design has an unequal number of observations.

2. A balanced data set is one in which all elements are observed in all time frames, whereas an unbalanced data set is one in which, for certain years, a data category is not observed.
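If it helps, the two definitions can be checked mechanically. This is only a rough sketch: the function names and the example data are made up, and "all possible combinations" is taken to mean all combinations of the factor levels that actually appear in the data.

```python
from collections import Counter
from itertools import product


def is_balanced_design(rows):
    """Definition 1: every combination of factor levels has the same count.

    rows is a list of tuples, one tuple of factor levels per observation.
    """
    counts = Counter(rows)
    # Levels of each factor, taken from the data itself (an assumption).
    levels = [sorted(set(column)) for column in zip(*rows)]
    cell_counts = [counts.get(combo, 0) for combo in product(*levels)]
    return len(set(cell_counts)) == 1


def is_balanced_panel(observations):
    """Definition 2: every unit is observed in every time period.

    observations is a list of (unit, period) pairs.
    """
    units = {unit for unit, _ in observations}
    periods = {period for _, period in observations}
    return set(observations) == set(product(units, periods))


# Hypothetical two-factor designs.
design_equal = [("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")]
design_unequal = design_equal + [("A", "x")]

# Hypothetical panels: firm2 is missing 2020 in the second one.
panel_full = [("firm1", 2019), ("firm1", 2020), ("firm2", 2019), ("firm2", 2020)]
panel_gap = panel_full[:-1]

print(is_balanced_design(design_equal), is_balanced_design(design_unequal))
print(is_balanced_panel(panel_full), is_balanced_panel(panel_gap))
```

Note that the first check also treats an empty cell as unbalanced, since a count of zero differs from the other cells.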
 

hlsmith

Omega Contributor
#3
Hmm, I am guessing it is MLM-specific in this context, which isn't too far removed from Miner's #2. If that is the case, it may refer to the number of observations varying across the clusters defined by the level-two clustering variable.
 

noetsi

Fortran must die
#4
I finally found the author's comments on this. It appears that balanced means having the same number of observations in each group, which seems to be Miner's definition 1 to me.
 

hlsmith

Omega Contributor
#5
Yeah, I totally misread Miner's #2, which on a second reading seems more like a data-missingness definition. What I was thinking of was having the same number of observations within each cluster for the outcome of interest, which may or may not be totally correct.