Hi all,
I'm currently modelling running performance using multiple linear regression. The data has GENDER and AGE among its inputs; the target is RACE_TIME.
I've partitioned the data into training and test sets for cross-validation purposes. I've tried two approaches: 1) fitting one model on the entire data set, and 2) splitting the data by gender (males in one group, females in the other) and fitting two separate models. When I compare the sum of squared errors (SSE) on the test data, the two-model approach shows a considerable improvement over the one-model approach.
What are your views in general on splitting the data into groups and modelling each group separately versus modelling everything in one model? Can you see any advantages or disadvantages? Are there any pitfalls that I should bear in mind?
Thanks in advance
Rob