Outlier analysis with IQR and different sample sizes

#1
I'm doing an extreme outlier analysis on the price distribution of the New York Airbnb listings in 2019.
I divided the overall distribution into 15 distributions, grouping the listings by "room_type" (3 types) and "borough" (5 boroughs): for example, one distribution can be ("Private room", "Queens"), another ("Shared room", "Brooklyn"), and so on.
I did this division because obviously the price distribution of ("Entire apartment", "Manhattan") is very different from that of ("Shared room", "Bronx").
Moreover, because the price distributions are right-skewed and have no negative values, I used the median as the location metric and the interquartile range (IQR) as the dispersion metric.
If I had only one overall distribution, I would use the threshold Q3 + 3*IQR for outlier detection.
In the link below you can see that I found this threshold for each of the 15 distributions, but, in order to simplify the analysis, I treated the distributions as independent. This assumption is not true, because, for example, if I know that the prices of ("Shared room", "Bronx") increase, the prices of ("Private room", "Bronx") also increase.
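For concreteness, here is a minimal sketch of that per-group threshold in pandas; the column names ("price", "room_type", "borough") and the synthetic data are just stand-ins for the real listings table:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 2019 NYC listings table (column names assumed).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "room_type": rng.choice(["Entire home/apt", "Private room", "Shared room"], size=n),
    "borough": rng.choice(["Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"], size=n),
    "price": rng.lognormal(mean=4.5, sigma=0.6, size=n),
})

def upper_threshold(prices: pd.Series) -> float:
    """Q3 + 3*IQR, the extreme-outlier cut-off described above."""
    q1, q3 = prices.quantile([0.25, 0.75])
    return q3 + 3 * (q3 - q1)

# One threshold per (room_type, borough) group, i.e. per distribution.
thresholds = (
    df.groupby(["room_type", "borough"])["price"]
      .apply(upper_threshold)
      .rename("upper_threshold")
)

# Flag each listing against the threshold of its own group.
df = df.join(thresholds, on=["room_type", "borough"])
df["extreme_outlier"] = df["price"] > df["upper_threshold"]
print(df.groupby(["room_type", "borough"])["extreme_outlier"].mean())
```

The join keeps the per-group thresholds aligned with each listing, so every flag uses the cut-off of the listing's own (room_type, borough) group.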

Another problem is that each empirical distribution has a different sample size.

My question is: does the Q3 + 3*IQR threshold, calculated separately for each distribution, make sense given that the distributions are not independent and have different sample sizes?
How can I define a method for outlier detection in this exploratory, descriptive analysis?

If you want to see the problem in more detail, here is the link: https://antonio-catalano.github.io/NY_Airbnb_outliers.html

At the same link you can also find the first part of the article (but it's not necessary in order to understand the problem I highlighted).

In other words: does the Q3 + 3*IQR threshold make sense if calculated for different but not independent distributions?
Thanks.
 

obh

#2
Hi,

Per my understanding, the goal of "removing outliers" is to remove one of the following:
1. A mistake - leaving the mistake in will influence the results.
2. A valid value with a very low probability that got into your small sample; if you took another 10 similar samples, you shouldn't get such an extreme value.

Why do you want to remove the "outliers"?
If the distribution is right-skewed and you expect to get a long tail, it seems that you are trying to remove valid data?

What question do you want to solve after removing the outliers?
 
#3
Hi,

I'm not sure that outlier detection serves only those purposes.
I think it can also serve to "describe" the empirical distribution and whether or not it is "well-behaved".

For example, if I generate two samples with n = 1000, one from a normal distribution and another from a Student's t with df = 3, and I calculate the sample mean and the sample standard deviation, I find that for the normal sample there are fewer values more than 3 sigma above or below the mean, and those values are smaller in absolute value than those of the t sample.
In this example there is no need to remove outliers, because I know that those outliers are generated by the same distribution (normal or Student's t).
Nevertheless, the frequency of outliers and the maximum outlier (in absolute value) are useful for differentiating the two generated samples.
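A minimal sketch of that comparison (NumPy only; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
samples = {
    "normal": rng.normal(size=n),
    "t, df=3": rng.standard_t(df=3, size=n),
}

for name, x in samples.items():
    z = (x - x.mean()) / x.std(ddof=1)   # standardize with the sample mean and SD
    extreme = np.abs(z) > 3              # values more than 3 sigma from the mean
    print(f"{name}: {extreme.sum()} values beyond 3 sigma, max |z| = {np.abs(z).max():.2f}")
```

Typically the t sample shows both more points beyond 3 sigma and a larger maximum |z|, which is the contrast described above.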
 

obh

#4
Hi C,

Usually, when n=1000 you won't get df=3, but this was only an example and it describes your point well :)

Per my understanding, when trying to identify the distribution, you shouldn't look only at the tails but at the entire distribution.
This is what Shapiro-Wilk does for the normal distribution, and you can use a chi-square goodness-of-fit test for any distribution.
Why should you restrict yourself to only part of the distribution? Even if it is an important part, I agree.
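For what it's worth, a minimal sketch of those two checks with SciPy (the data here are synthetic and the chi-square binning is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=100, scale=15, size=500)   # placeholder sample; swap in the real prices

# Shapiro-Wilk tests the whole sample against normality.
w_stat, w_p = stats.shapiro(x)

# Chi-square goodness of fit against a fitted normal: bin the data and compare
# the observed counts with the counts the fitted normal would expect.
edges = np.quantile(x, np.linspace(0, 1, 11))            # 10 equal-frequency bins
observed, _ = np.histogram(x, bins=edges)
cdf = stats.norm.cdf(edges, loc=x.mean(), scale=x.std(ddof=1))
expected = len(x) * np.diff(cdf)
expected *= observed.sum() / expected.sum()              # make the totals match exactly
chi_stat, chi_p = stats.chisquare(observed, expected, ddof=2)  # 2 estimated parameters

print(f"Shapiro-Wilk W = {w_stat:.3f} (p = {w_p:.3f}); chi-square = {chi_stat:.1f} (p = {chi_p:.3f})")
```

Both tests look at the full distribution rather than only at the extremes, which is the point above.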

Back to your original question.
What question do you want to solve after removing the outliers? Or is it a theoretical question only?
 
#5
Per my understanding, when trying to identify the distribution, you shouldn't look only at the tails but at the entire distribution.
Yes, I did that in the first part of the article, after truncating the distribution at the 98th percentile.
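(For reference, a minimal sketch of that truncation in pandas, with a synthetic price column standing in for the real one:)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
prices = pd.Series(rng.lognormal(mean=4.5, sigma=0.7, size=1000), name="price")

cutoff = prices.quantile(0.98)          # 98th percentile of the prices
truncated = prices[prices <= cutoff]    # drop the top 2%
print(f"cutoff = {cutoff:.0f}, kept {len(truncated)} of {len(prices)} values")
```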


Back to your original question.
What question do you want to solve after removing the outliers? Or is it a theoretical question only?
In this specific case it is theoretical.
I'm not even sure whether I have framed the problem correctly.
I have 15 distributions whose summary metrics are not independent (if the average price of a group increases, the average price of a "near" group also increases), but if we consider the individual random variables, I'm not sure whether they are dependent or not.
In other words, when we evaluate the correlation between 2 random variables, we can do it because:
1) either the 2 RVs are paired at the same moments in time, for example (X1, Y1) at t1, (X2, Y2) at t2, etc.;
2) or the 2 RVs are mapped to the same object, for example (X1, X2, X3, ...) are the weights of a population and (Y1, Y2, Y3, ...) are the heights of the same population, where each pair (X1, Y1), (X2, Y2), ... concerns the same person.

Instead, in my problem we have (X1, X2, X3, ...) "prices of group 1", (Y1, Y2, Y3, ...) "prices of group 2", ..., (Z1, Z2, Z3, ...) "prices of group 15", but (X1, Y1, ..., Z1) don't concern the same object and they are not observed at the same time.
The only relation we have is that the moments of those distributions are not independent.

I don't know if I'm getting a little confused, but it doesn't seem easy to frame the problem in the right way.
 

obh

#6
Okay, so let's think about the theoretical question...

Probably doing it separately for each of the 15 distributions will be okay.

Just an idea: what about multiple regression? (I didn't say which kind; for linear regression you need to meet its assumptions, and normality is only required for the residuals.)
Then maybe there is an option to flag the outliers based on the expected Y instead of the mean/median?
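A minimal sketch of that idea, assuming statsmodels and hypothetical column names; it regresses log price on the two grouping factors and then applies the same Q3 + 3*IQR rule to the residuals instead of the raw prices:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the listings table (column names assumed).
rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({
    "room_type": rng.choice(["Entire home/apt", "Private room", "Shared room"], size=n),
    "borough": rng.choice(["Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"], size=n),
    "price": rng.lognormal(mean=4.5, sigma=0.6, size=n),
})

# Regress log(price) on the two categorical predictors (log to tame the right skew).
model = smf.ols("np.log(price) ~ C(room_type) * C(borough)", data=df).fit()

# Flag observations whose residual is extreme under the same Q3 + 3*IQR rule.
resid = model.resid
q1, q3 = resid.quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = (resid < q1 - 3 * iqr) | (resid > q3 + 3 * iqr)
print(df["outlier"].sum(), "listings flagged out of", len(df))
```

With the full interaction the expected Y is just the group mean of log price, so this pools the residual spread across the 15 groups instead of computing a separate IQR for each one.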

You may want to look up "outliers in regression".
I found the following, but it may not be relevant (I didn't have time to read it ...):
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-123