Hierarchical stats problem

nml

New Member
#1
Imagine I've got a dataset with 10000 rows, consisting of 100 observations from each of 100 groups. Within a group the samples are correlated, but between groups they aren't really. For example, the groups could be the month in which the measurement was made. In my fairly limited understanding, data like this calls for a hierarchical model, or a mixed effects model - is this correct?

If I wanted to train a classification model (e.g. logistic regression) on the above, I'm guessing I'd need to account for the grouping of the data? What would actually happen if I didn't - is it just that my estimate of the model's performance will be wrong?

Thanks!
 

j58

Active Member
#2
Imagine I've got a dataset with 10000 rows, consisting of 100 observations from each of 100 groups. Within a group the samples are correlated, but between groups they aren't really. For example, the groups could be the month in which the measurement was made. In my fairly limited understanding, data like this calls for a hierarchical model, or a mixed effects model - is this correct?
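Correct. Strictly speaking, it is the data that are grouped (clustered); a hierarchical or mixed effects model is the kind of model you fit to such data.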
If I wanted to train a classification model (e.g. logistic regression) on the above, I'm guessing I'd need to account for the grouping of the data? What would actually happen if I didn't - is it just that my estimate of the model's performance will be wrong?
Ignoring the grouping structure of the data is equivalent to treating all the observations as independent. The model will then underestimate the uncertainty of its estimates and predictions (the standard errors come out too small), so it will appear to be more accurate than it is.
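To put a rough number on that, here is a minimal sketch. It assumes equally sized groups and a made-up within-group correlation of 0.1 (the thread gives no value), and uses the standard design effect 1 + (m - 1) * rho for m observations per group. The function names are only illustrative, and the formula strictly applies to the variance of a sample mean, so treat it as a rough guide for regression coefficients.

```python
def design_effect(group_size: int, icc: float) -> float:
    """Variance inflation from clustering, for equally sized groups.

    icc is the intra-class (within-group) correlation.
    """
    return 1.0 + (group_size - 1) * icc


def effective_sample_size(n_groups: int, group_size: int, icc: float) -> float:
    """Number of independent observations carrying the same information."""
    n_total = n_groups * group_size
    return n_total / design_effect(group_size, icc)


# 100 groups of 100 rows, as in the question; icc = 0.1 is an assumed value.
n_eff = effective_sample_size(n_groups=100, group_size=100, icc=0.1)

# Naive standard errors are too small by roughly this factor.
se_inflation = design_effect(group_size=100, icc=0.1) ** 0.5

print(f"effective sample size: {n_eff:.0f}")
print(f"standard error inflation: {se_inflation:.2f}")
```

With these assumed numbers, the 10000 rows carry about as much information as roughly 900 independent ones, and naive standard errors would be too small by a factor of about 3.3.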
 

nml

New Member
#3
OK, thanks. In terms of properly handling this dependent data, is it really necessary to go Bayesian? I've tried pymc3, but it's incredibly slow even for a simple logistic regression model.

What I'm doing at the moment is using the months as groups for cross-validation, so that I always leave out an entire month at a time. The idea is that there is little correlation between months. I think this is called blocked (or leave-one-group-out) cross-validation.
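A minimal sketch of that leave-one-month-out splitting, in plain Python with toy month labels. The helper name `leave_one_group_out` and the data are made up for illustration. scikit-learn ships the same idea as `LeaveOneGroupOut` and `GroupKFold` in `sklearn.model_selection`.

```python
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices), one fold per group."""
    # Collect the row indices belonging to each group, in first-seen order.
    rows_by_group = defaultdict(list)
    for row_index, group in enumerate(groups):
        rows_by_group[group].append(row_index)

    for held_out, test_indices in rows_by_group.items():
        # Train on every row whose group is not the held-out one.
        train_indices = [
            i
            for group, rows in rows_by_group.items()
            if group != held_out
            for i in rows
        ]
        yield held_out, train_indices, test_indices


# Toy data: 6 rows measured in 3 months (hypothetical labels).
months = ["jan", "jan", "feb", "feb", "mar", "mar"]
folds = list(leave_one_group_out(months))

for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Because a month is never split between the training and test rows, the within-month correlation cannot leak into the performance estimate.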
 

j58

Active Member
#4
Unfortunately, I have no experience with cross-validating mixed models. Nonetheless, I can't imagine why the model would have to be Bayesian to perform cross-validation. A Google search for "mixed model cross-validation" seems to turn up promising results.