Ok I'm pretty sure someone will have a simple answer to this but I don't entirely understand it.
Using a large dataset (2000+ cases) of demographic information and haemoglobin levels in Malawi, I am trying to find the model that best predicts haemoglobin level (a continuous variable).
My question is: why does a simple linear regression of haemoglobin level on age have a lower R-squared than a simple linear regression of haemoglobin level on age squared? Other figures also seem to suggest that age squared gives the better fit.
Any help with this would be fantastic, thanks. If any more details or anything else is required just ask.
Last edited by Fingbut; 09-05-2008 at 03:59 PM.
Giving a model more parameters will always produce an equivalent or better fit to the data it was fitted on. With one parameter for each data point, the fit should be perfect. The Parsimony Principle should make you question whether the additional variables are really adding any predictive power or broadening the applicability of your hypothesis. Unless your data is curved in such a way that it necessitates a quadratic (two age groups having the same hemoglobin levels, or vice versa), you should avoid quadratic terms. Otherwise, treat the 'skew' as irrelevant (and disclose it as such), as relevant but unmeasured/uncontrolled, or as dependent on some other variable that is lurking in your data.
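A minimal sketch of the first two claims, using made-up numbers rather than the Malawi data. The ages, haemoglobin values and the helper `r_squared` are all hypothetical. It fits haemoglobin on age, then on age plus age squared, and finally fits six points with six parameters.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (an intercept is added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)

# Synthetic data (an assumption, not the real survey): a truly linear trend plus noise.
age = rng.uniform(0, 5, size=200)
hb = 10 + 0.2 * age + rng.normal(0, 1.5, size=200)

r2_lin = r_squared(age, hb)                               # hb ~ age
r2_quad = r_squared(np.column_stack([age, age**2]), hb)   # hb ~ age + age^2

# Six points, intercept plus five polynomial terms: one parameter per data point.
x6 = np.linspace(0, 5, 6)
y6 = rng.normal(11, 1.5, size=6)
r2_sat = r_squared(np.column_stack([x6**k for k in range(1, 6)]), y6)

print("R^2, age only:       ", r2_lin)
print("R^2, age + age^2:    ", r2_quad)
print("R^2, 6 pts, 6 params:", r2_sat)
```

The second R-squared can never be lower than the first, even though the synthetic data has no curvature at all, and the last fit is essentially perfect although the six points are pure noise. Note that this only covers adding age squared alongside age; swapping age for age squared keeps the parameter count the same, so that comparison can go either way.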
Personally, I'd put good money on your old people being less active than your younger ones, so, in addition to a general decline with age, there's a decline with activity as well. In the US (??? wrt Malawi) older age groups can be biased towards women, and women generally have a lower hematocrit than men. Your chemistry analyzer may also be saturating/bottoming out. The hypotheticals are endless.
There's a quote about string theory and 11-dimensional space that goes something along the lines of:
'Give me eleven dimensions and I can make an elephant, give me twelve and I can make him dance.'
Sorry, I should have given more details: the data is for children aged 0-5. Between these two answers I think I understand it now. The scatterplots resemble bars with a very slight positive correlation, but the variance is much greater at a very young age (0-1 months), where there is a cluster of cases with what seem to be high levels of haemoglobin. I figured these were children of healthier mothers, or that it had something to do with foetal haemoglobin and the test used.
But this is beside the point. In simple terms (the only way I can put this), squaring age compresses the very young ages together, meaning they have less influence overall in the model.
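One rough way to check that intuition is to compare leverage, which measures how much pull each case can have on a fitted straight line. The sketch below assumes ages spread evenly from 0 to 5 years, which the real data may not be, and the helper `leverage` is hypothetical.

```python
import numpy as np

def leverage(x):
    """Hat-matrix diagonal for a straight-line fit (with intercept) on predictor x."""
    centred = x - x.mean()
    return 1.0 / len(x) + centred**2 / np.sum(centred**2)

# Hypothetical, evenly spaced ages in years (an assumption, not the survey data).
age = np.linspace(0, 5, 61)

lev_age = leverage(age)         # predictor is age
lev_age_sq = leverage(age**2)   # predictor is age squared

print("youngest case, age vs age^2:", lev_age[0], lev_age_sq[0])
print("oldest case,   age vs age^2:", lev_age[-1], lev_age_sq[-1])
```

Under these assumptions the youngest case loses leverage when age is squared and the oldest case gains it, which fits the idea that the noisy 0-1 month cases count for less in the age squared model.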
Last edited by Fingbut; 09-05-2008 at 05:10 PM. Reason: spelling and grammar