When to work with aggregated data for a linear regression?

I am trying to run a linear regression with pedestrian walking speed as the dependent variable and population size as the independent variable. I have collected the same number of samples (200) for each city. Other studies work with the average walking speed per city; my guess is that they do so because their sample sizes are unequal. My question: should I run the regression on all the individual samples, or on the aggregated values (average walking speed per city)?


Active Member
Could the ecological fallacy be an issue here?

I would go with the non-aggregated data and use a mixed model to capture the correlation within each city. I think that would be the 9-out-of-10-dentists recommendation.
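For reference, one common form of such a mixed model is a random intercept per city. The notation here is my own, not from any of the studies mentioned: y_ij is the speed of pedestrian j in city i, x_i is that city's population size (possibly log-transformed), u_i is a city-level effect shared by everyone in city i, and e_ij is the individual-level noise.

```latex
y_{ij} = \beta_0 + \beta_1 x_i + u_i + \varepsilon_{ij},
\qquad u_i \sim N(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)

% Under this model, the variance of a city mean based on n pedestrians is
\operatorname{Var}(\bar{y}_i) = \sigma_u^2 + \frac{\sigma_\varepsilon^2}{n}
```

The shared term u_i is what makes two pedestrians from the same city correlated, and it does not shrink no matter how many pedestrians you sample per city.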

I think the regression on the city means will give the same slope as the regression on the non-aggregated data. This holds if the sample sizes are all the same, but the p-values and standard errors will differ, I think. Either way, you should confirm this carefully on your own data before pre-aggregating.
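A quick way to check the equal-slope claim is a small simulation. This is only a sketch on made-up numbers: the six log-population values, the effect sizes, and the `ols` helper are all invented for illustration, and it uses nothing beyond the Python standard library.

```python
import random

def ols(x, y):
    """Return (intercept, slope) of a simple least-squares fit of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(0)
log_pop = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5]  # hypothetical log10 population per city
n_per_city = 200                          # balanced design, as in the question

x_all, y_all, x_city, y_city = [], [], [], []
for lp in log_pop:
    city_effect = random.gauss(0, 0.05)   # shared by every pedestrian in this city
    speeds = [1.0 + 0.1 * lp + city_effect + random.gauss(0, 0.2)
              for _ in range(n_per_city)]
    x_all += [lp] * n_per_city
    y_all += speeds
    x_city.append(lp)
    y_city.append(sum(speeds) / n_per_city)

b0_all, b1_all = ols(x_all, y_all)      # every pedestrian is a data point
b0_city, b1_city = ols(x_city, y_city)  # one data point per city

# Residual degrees of freedom a naive OLS would report in each case
df_all = len(x_all) - 2
df_city = len(x_city) - 2

print("slope, all samples:", b1_all, "residual df:", df_all)
print("slope, city means: ", b1_city, "residual df:", df_city)
```

With equal sample sizes the two slopes agree up to floating-point error, while the residual degrees of freedom differ enormously, which is where the different p-values come from.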


Well-Known Member
Yes, I think that's true. If you run the regression on the non-aggregated data you have a problem: the residuals within a city are not independent, so your df are much too large. Hence the difference in the p-values.
In general, a regression on the aggregated data has as many data points as you have cities, each with its average. If the sample sizes differ, you can account for that by using weighted regression.
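To illustrate the weighted-regression idea, here is a sketch with unequal, made-up sample sizes, weighting each city mean by its number of observations. The `wls` helper and all numbers are hypothetical. Weighting by n is the natural choice when the individual-level noise dominates; if there is substantial between-city variation as well, other weights may be more appropriate.

```python
import random

def wls(x, y, w):
    """Return (intercept, slope) of a weighted least-squares fit of y on x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(1)
log_pop = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5]  # hypothetical log10 population per city
n_obs = [30, 200, 75, 120, 50, 400]       # hypothetical unequal sample sizes

x_all, y_all, y_mean = [], [], []
for lp, n in zip(log_pop, n_obs):
    speeds = [1.0 + 0.1 * lp + random.gauss(0, 0.2) for _ in range(n)]
    x_all += [lp] * n
    y_all += speeds
    y_mean.append(sum(speeds) / n)

# Pooled fit on individual samples (all weights equal to 1)
_, slope_pooled = wls(x_all, y_all, [1.0] * len(x_all))
# City means weighted by sample size
_, slope_weighted = wls(log_pop, y_mean, n_obs)
# City means with no weighting, for comparison
_, slope_unweighted = wls(log_pop, y_mean, [1.0] * len(log_pop))

print("pooled:", slope_pooled)
print("weighted means:", slope_weighted)
print("unweighted means:", slope_unweighted)
```

Weighting the city means by their sample sizes reproduces the pooled slope, whereas the unweighted fit on the means will in general give a somewhat different one.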