Deleting outliers in the AGE variable

#1
Hi all,

I am conducting a bivariate analysis of age groups (<40 years, 40-49, and 50-75) and basic demographic characteristics of women who had a breast cancer screening. The first thing I want to do is delete outliers (this seems logical to me). Some participants wrote that they are 1, 2, or 130 years old, which of course is not possible. What would be a standard procedure for dealing with this? What is the command?

Thank you in advance!
Marvin
 

bukharin

RoboStataRaptor
#2
Of course the best solution is to find out how old they really were. Assuming you can't do that, you should replace those ages with missing. You could specify a plausible age range (eg 40-75) and use that:
Code:
* Set ages outside 40-75 to missing (ages that are already missing stay missing)
replace age = . if !inrange(age, 40, 75)
You should be careful about doing this because some of the outliers might have correct data (eg age 39 or 77).
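
Before overwriting anything, it may help to look at what would be affected. The lines below are only a sketch: age_raw is a hypothetical name for a backup copy of the variable, and the 40-75 bounds are the same illustrative range as above, not a recommendation.

```stata
* Keep an untouched copy so the recoding can be audited or undone
* (age_raw is a hypothetical variable name)
generate age_raw = age

* Count non-missing ages outside the illustrative 40-75 range
count if !inrange(age, 40, 75) & !missing(age)

* List the distinct out-of-range values and how often each occurs
tabulate age if !inrange(age, 40, 75) & !missing(age)
```

If the table shows values such as 39 or 77 alongside 1, 2 and 130, that suggests the bounds are too tight rather than that the data are wrong.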

If only a tiny proportion of your dataset is outside your allowed range then that's probably all you need to do - those patients will be excluded from your models. If it's a larger proportion then you may want to consider imputing ages based on the other data available, in order to preserve power and minimise selection bias.
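
If imputation does turn out to be worthwhile, one possible starting point is Stata's mi suite. This is a rough sketch under stated assumptions, not a worked analysis: educ and income are hypothetical covariates standing in for whatever predictors of age the dataset really contains, they are assumed to have no missing values, and the number of imputations and the seed are arbitrary. It also assumes the implausible ages have already been set to missing.

```stata
* Declare the data as multiple-imputation data in wide style
mi set wide

* Tell mi that age is the variable to be imputed
mi register imputed age

* Impute age by predictive mean matching, so imputed values are
* drawn from observed ages; educ and income are hypothetical
mi impute pmm age educ income, add(20) knn(5) rseed(12345)
```

Models would then be fitted with the mi estimate: prefix so that the uncertainty from imputation carries through to the standard errors.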
 
#3
Yeah!! We receive the data from health centers, and they have indicated that they do have clients aged 13 years old. Couldn't I delete cases using the standard deviation? Perhaps by deleting those observations that fall more than 3 SD away from the mean?
 

bukharin

RoboStataRaptor
#4
I would not do that - your handling of outliers should primarily be driven by your content knowledge. For example, you know that a 1 year old is not going to be screened for breast cancer, so that should be excluded. Depending on your research question you may want to restrict the age range to that recommended for breast cancer screening in your country, or some other clinically relevant categories. Excluding patients based on standard deviations is likely to throw away good information without a good reason.
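
To see how the two approaches differ on your own data, a quick comparison along these lines may be useful. It is only a sketch: out_sd and out_range are hypothetical flag variables, and the 13-75 bounds are an assumption pieced together from this thread, to be replaced by whatever range is clinically sensible for your study.

```stata
* Flag ages more than 3 SD from the mean (for comparison only)
quietly summarize age
generate byte out_sd = abs(age - r(mean)) > 3 * r(sd) if !missing(age)

* Flag ages outside a content-driven range (bounds are an assumption)
generate byte out_range = !inrange(age, 13, 75) if !missing(age)

* Cross-tabulate the two flags to see where the rules disagree
tabulate out_sd out_range
```

Note that the mean and SD used by the first rule are themselves computed with values like 1, 2 and 130 still in the data, so the cut-offs it produces depend on the very errors you are trying to remove.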