Weights for weighted mean: Do these weights make sense?

trinker

ggplot2orBust
#1
I'm attempting to weight a mean to make outliers less impactful. I know people often use a median here, but I'm looking to make a mean that's robust to outliers. That is, the further an observation is from the mean, the less weight it gets. I'm sure this has been done before. In meta-analysis, weighted least squares, or multilevel modeling, the less precision a study has, the less weight it is given. So the thinking here is similar.

Here I attempt to do this weighting in a convoluted way that I think works. I first calculate the standard scores for each observation. Take that absolute value and calculate the logit link to constrain to be between -1 and 1. I multiply everything by 2 and subtract 1 from it to get everything between 0 and 1. This means that things close to the mean are now a value of 0 and things far from the mean approach 1. That's the opposite of what I want, so I reverse score it (1 - i) and now I have weights. I then use the weighted.mean function in R to calculate the mean using the weights. It seems logical to me that I now have a mean that's robust against outliers. But I'm not a mathematician nor a statistician.

Code:
n <- 50
x <- sample(1:5, n, T, c(.1, .1, .2, .2, .4))

mean(x)

z <- 1 - (((1 - (1/(1 + exp(abs(scale(x)))))) * 2) - 1)
weighted.mean(x, z)

barplot(table(x))
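
If I've done the algebra right, that whole expression collapses to 2 / (1 + exp(|z|)): weight 1 for a value sitting exactly at the mean, shrinking toward 0 the further out it is. A quick check on the x above:
Code:
z_abs <- abs(scale(x))                                  # absolute standard scores
w_long  <- 1 - (((1 - (1/(1 + exp(z_abs)))) * 2) - 1)   # expression from above
w_short <- 2 / (1 + exp(z_abs))                         # simplified form
all.equal(as.vector(w_long), as.vector(w_short))        # TRUE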
  1. Is this logical to achieve what I want?
  2. Poke holes in the thinking.
  3. Is there a better way?
  4. Additional thoughts folks have?
 

vinux

Dark Knight
#2
As a central tendency estimate it looks fair to me. But it is effectively rescaling the observed distribution, and it may not be robust in small samples.


  1. Is this logical to achieve what I want?
    It is logical. It penalises the outlying observations by giving them smaller weights.
  2. Poke holes in the thinking.
    Why this particular set of weights?
  3. Is there a better way?
    You could use the trim option in the mean function, e.g. mean(x, trim = .1) (see the sketch after this list).
  4. Additional thoughts folks have?
    Maybe after your reply.
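
A rough side-by-side, using trinker's x from post #1 (just a sketch):
Code:
mean(x)                                                  # ordinary mean
mean(x, trim = .1)                                       # 10% trimmed mean: drops the lowest/highest 10% of values
z <- 1 - (((1 - (1/(1 + exp(abs(scale(x)))))) * 2) - 1)  # trinker's weights from post #1
weighted.mean(x, z)                                      # weighted mean with those weights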
 

Jake

Cookie Scientist
#3
Take that absolute value and calculate the logit link to constrain to be between -1 and 1. I multiply everything by 2 and subtract 1 from it to get everything between 0 and 1. This means that things close to the mean are now a value of 0 and things far from the mean approach 1.
This doesn't make sense. The logit function maps values from [0,1] to the real numbers, not from the real numbers to [-1,1]. The inverse logit function would be closer, but still, this maps from the real numbers to [0,1], not to [-1,1]. Then you say that you multiply everything by 2 and subtract 1 to get everything in [0,1]. But if you did this to [-1,1] as you say, then you would end up with values in [-3,1], not [0,1]. Maybe what you mean to say is that you applied the inverse logit function to your standard scores to get them in [0,1], and then multiplied by 2 and subtracted 1 to get them in [-1,1]. But then that would contradict the last sentence that I quoted, which says you end up with values in [0,1]! Some correction/clarification is definitely needed here. The whole first part makes no sense, before I even start thinking about the deeper motivation of it all.

Edit: Also I just looked at your code and see that you are I guess attempting to apply the inverse logit function, although you're missing a minus sign in the exponent (should be \(logit^{-1}(x)=1/(1+e^{-x})\)), which means all your scores are being inadvertently reverse-scored at the initial inverse-logit step. And I have no idea why you take an absolute value at any point in the calculation. WHAT IS HAPPENING
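
For reference, base R has both directions, which makes the distinction easy to check: plogis() is the inverse logit (reals to (0, 1)) and qlogis() is the logit ((0, 1) to the reals).
Code:
plogis(0)        # inverse logit: 0 on the real line -> 0.5
plogis(3)        # ~0.95; large positive inputs approach 1
qlogis(0.5)      # logit: 0.5 -> 0 on the real line
qlogis(0.95)     # ~2.94; probabilities near 1 map to large positive values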
 

vinux

Dark Knight
#4
Hi Jake,

I think trinker wrote the function in a complicated way, with an equally complicated interpretation, and I haven't read it closely. But looking at the values, it seems fair to me.

See the function here
Code:
z <- 1 - (((1 - (1/(1 + exp(abs(seq(-3,3,length.out = 100)))))) * 2) - 1)
plot(z, type="l")
 

hlsmith

Not a robit
#5
It would help me if I knew why you were doing this, the context. Reason being, how do you know what the true mean or precision really is?

Naturally, values already have weights in the mean calculation, which are just 1/n; you know this.
When you calculate the standard scores, aren't you using the mean already? So you use the mean to find their distances, then recalculate the mean? I could be wrong about this, but if you don't want certain observations to carry so much weight, why would you use the very mean you distrust to calculate their weights?
 

Dason

Ambassador to the humans
#7
Yeah - if anything you could just use a trimmed mean. That's a helluva lot simpler and better known.

And I'm confused as to where your "outliers" are in your sample data.
 
#8
Have a look at "robust statistics", "M-estimators" or "outlier resistant". Isn't Huber a name that is often mentioned? Trimming is one of the methods.
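
For instance, the MASS package has a Huber M-estimator of location; a minimal sketch, using trinker's x:
Code:
library(MASS)          # ships with standard R installations

# Huber M-estimator of location: observations far from the current centre
# estimate are downweighted rather than dropped outright
huber(x)$mu            # robust location estimate (default k = 1.5)
huber(x, k = 2)$mu     # larger k = gentler downweighting, closer to the plain mean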

(One simple and common way is to take the log, calculate the mean of the logged values and use exp(mean). That would estimate the (population) median in the log-normal distribution. That is, this is one way if the data are right skewed.)
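
In code that's just the geometric mean. A tiny sketch with made-up right-skewed data y:
Code:
set.seed(42)
y <- rlnorm(1000, meanlog = 1, sdlog = 0.8)   # right-skewed (log-normal) example data
mean(y)               # pulled up by the long right tail (population mean ~3.7)
exp(mean(log(y)))     # estimates the log-normal median, exp(1) ~ 2.72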

But aren't M-estimators and all that just kind of "old"?

There is nothing wrong if the data happens to be skewed. Why should it be symmetric or even normally distributed? Isn't a more modern way to specify a (say) skewed distribution and estimate it with maximum likelihood?
 

trinker

ggplot2orBust
#9
The original purpose for this turned out not to be useful, in that it didn't account for what I was attempting to account for. I'm working with folks who have a 5-point response scale (1-5). The instrument looks at subsets of text (chunks of dialogue) and uses the response scale to score each of the subsets. So let's say you have 9 subsets (this is not fixed). Here's some sample data.

Code:
library(qdap); set.seed(1)

x <- random_sent(9, 7)
set.seed(3)
dat <- data.frame(score = sample(1:5, 9, T), wc=wc(x), text=x)

print(left_just(dat, 3), row.names=F)
Code:
score wc text                                                         
    1  4 For use we any.                                              
    5  5 Does part want America no.                                   
    2  8 Animal an much or about way and no|                          
    2 12 Has come just may your ask right big be show first why.      
    4  3 Large live little.                                           
    4 12 To get around same get home oil their are I make take.       
    1 13 Than play these long into much up get does with try time men.
    2  9 Into get picture home way well answer who came.              
    3  9 Him put there came from if by how was.
Their idea was to take the average of all the scores (this is for a single item of a rubric, and we need to go from 9 scores to one combined score).

I said you can't do that, because in the worst-case scenario you have 3 subsets. Let's say 2 of them are 1s but their word length is minimal (e.g., 4 words each), and you have a third subset that's 50 words and scores highest (5). The two small subsets will pull the larger one's score down even though they represent very little of the dialogue. I figured we needed a way to weight this sucker.
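
To make that worst case concrete (the numbers are just the hypothetical ones above), weighting by word count keeps the long, high-scoring subset from being dragged down:
Code:
score <- c(1, 1, 5)        # two short low-scoring subsets, one long high-scoring one
words <- c(4, 4, 50)       # hypothetical word counts
mean(score)                # 2.33 -- the 1s dominate
weighted.mean(score, words)   # 4.45 -- weighting by word count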

I suggested weighting by word length, but they retorted that a high word count isn't necessarily better. For example, a professor who uses terrific language to describe an ANOVA but belabors the subject for 4 classes beyond what was necessary for understanding actually does a disservice. Valid point.

Then I thought, well, the problem is variability and outliers pulling down the average (hence the original questions). But then I realized that in the worst-case scenario I posed above, the score of 5 is actually the outlier (a very liberal use of the word outlier), not the two scores of 1. Weighting the way I proposed in the OP actually makes the situation worse and gives higher weight to the 1s.
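
You can see that directly by running the OP weights on the same hypothetical worst case:
Code:
score <- c(1, 1, 5)
z <- 1 - (((1 - (1/(1 + exp(abs(scale(score)))))) * 2) - 1)
round(as.vector(z), 2)        # ~0.72, 0.72, 0.48 -- the 5 gets the smallest weight
weighted.mean(score, z)       # ~2.0, even lower than the plain mean of 2.33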

Alas, I think the mean may be the best approach. I like weighting by word count, but they are the content experts and bring valid reasons why this ought not be done.

So that's the background. You have all provided tremendously insightful responses and questions that give me a ton to look at (not really to solve my original problem but the sub-curiosity that arose). Thank you.

If anyone has additional thoughts on the original problem I'm all ears (eyes in this case).