I have an algorithm that takes text data and predicts the polarity of the data as a value between -1 and 1. I want to test this algorithm for its ability to predict sentiment from text.
The NLP field tests such models against human judgments, which I have done. For each item I have the mean score from 2 to 3 scorers who rated the data for polarity on a scale from -1 to 1 (any value in that range is allowed). I've been considering numerous tests of the algorithm, including sensitivity and specificity (these won't work because I don't have a simple yes/no outcome). I think it would make sense to simply use the Pearson correlation between the algorithm's scores and the mean human scores. A perfect algorithm would have a correlation close to 1, and the regression line would sit at about a 45 degree angle.
Does this seem like a reasonable approach to assess my algorithm's success?
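One point in this plan is worth checking in R: Pearson correlation measures linear association, so it can be perfect even when the algorithm's scores sit on a different scale from the human scores. The sketch below uses made-up vectors (`human` and `algo` are hypothetical names and values, not real data) to show why the regression slope and intercept are worth inspecting alongside the correlation.

```r
# Hypothetical scores for five sentences (illustration only, not real data).
human <- c(-0.8, -0.3, 0.0, 0.4, 0.9)  # mean human polarity
algo  <- 0.5 * human                   # an algorithm that compresses the scale

# Pearson correlation is perfect because the relationship is exactly linear.
r <- cor(algo, human, method = "pearson")

# Regressing the algorithm score on the human score exposes the scale mismatch:
# the slope is 0.5, not the slope of 1 that a 45 degree line requires.
fit <- lm(algo ~ human)
slope <- unname(coef(fit)["human"])
intercept <- unname(coef(fit)["(Intercept)"])

print(c(r = r, slope = slope, intercept = intercept))
```

Here the correlation is 1 while the slope is 0.5, so a high correlation alone does not establish that the two sets of scores agree in magnitude.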
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
That approach is easy enough and would probably work fine for your purposes. (Although I guess I don't really know much about your purposes, so maybe not.) A more rigorous test would involve cross validation. I'm not sure how familiar you are with this technique. The Quick R website has a very brief section on R procedures for it, but it shouldn't be too hard to code up yourself.
In God we trust. All others must bring data.
~W. Edwards Deming
I should add: cross validation is only an option if your algorithm uses free parameters estimated from the data. I don't have any idea how your algorithm works, but if it consists simply of a series of built-in rules that it applies the same way to any dataset, cross validation wouldn't make sense because there would be no need for any training data. (With no parameter estimates there is nothing to generalize.)
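For the case where parameters are estimated from the data, k-fold cross validation can be coded by hand roughly as follows. This is only a sketch under stated assumptions: the data are simulated, `feature` is a hypothetical text-derived predictor, and a plain `lm()` stands in for whatever model the real algorithm fits.

```r
set.seed(1)  # reproducible simulated data

# Simulated stand-in data: one hypothetical text feature and a human score.
n <- 60
feature <- runif(n, -1, 1)
human   <- 0.7 * feature + rnorm(n, sd = 0.2)
dat <- data.frame(feature = feature, human = human)

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, then predict the held-out rows.
pred <- numeric(n)
for (i in seq_len(k)) {
  test_rows <- fold == i
  fit <- lm(human ~ feature, data = dat[!test_rows, ])
  pred[test_rows] <- predict(fit, newdata = dat[test_rows, ])
}

# Out-of-sample agreement between predictions and human scores.
cv_cor <- cor(pred, dat$human)
print(cv_cor)
```

Every prediction in `pred` comes from a model that never saw that row, so `cv_cor` estimates how well the fitted parameters generalize rather than how well they fit the training data.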
The h.m.s.standarized column (spelled that way in the data) holds the human scores standardized by person. By the way, this data is totally fictitious. The algorithm would have scored these sentences differently.
Code:
         person sex adult                       state algor.polarity hum.mean.score h.m.s.standarized
1           sam   m     0            Computer is fun.          0.256          0.600        2.25518769
1.1         sam   m     0                Not too fun.          0.000          0.250        0.51411339
2          greg   m     0       No its not, its ****.          0.400          0.100        0.04886514
3       teacher   m     1          What should we do?          0.600          0.050       -0.10621761
4           sam   m     0        You liar, it stinks!          0.000          0.300        0.66919614
5          greg   m     0     I am telling the truth!          0.053          0.100        0.04886514
6         sally   f     0      How can we be certain?          0.000         -0.300       -1.19179687
7          greg   m     0            There is no way.          0.000         -0.350       -1.34687962
8           sam   m     0             I distrust you.          0.033          0.200        0.35903064
9         sally   f     0 What are you talking about?          0.222          0.550        2.10010494
10   researcher   f     1           Shall we move on?          0.400          0.250        0.51411339
10.1 researcher   f     1                  Good then.          0.000          0.075       -0.02867624
11         greg   m     0                  Im hungry.          0.250          0.055       -0.09070934
11.1       greg   m     0                   Lets eat.          0.000          0.050       -0.10621761
11.2       greg   m     0                You already?          0.350          0.050       -0.10621761
Code:
example<-structure(list(person = structure(c(4L, 4L, 1L, 5L, 4L, 1L, 3L,
1L, 4L, 3L, 2L, 2L, 1L, 1L, 1L), .Label = c("greg", "researcher",
"sally", "sam", "teacher"), class = "factor"), sex = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("f",
"m"), class = "factor"), adult = c(0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), state = structure(c(5L, 4L,
10L, 14L, 15L, 7L, 6L, 12L, 8L, 13L, 11L, 1L, 9L, 2L, 3L), .Label = c(" Good then.",
" Lets eat.", " You already?", " Not too fun.", "Computer is fun.",
"How can we be certain?", "I am telling the truth!", "I distrust you.",
"Im hungry.", "No its not, its ****.", "Shall we move on?", "There is no way.",
"What are you talking about?", "What should we do?", "You liar, it stinks!"
), class = "factor"), algor.polarity = c(0.256, 0, 0.4, 0.6,
0, 0.053, 0, 0, 0.033, 0.222, 0.4, 0, 0.25, 0, 0.35), hum.mean.score = c(0.6,
0.25, 0.1, 0.05, 0.3, 0.1, -0.3, -0.35, 0.2, 0.55, 0.25, 0.075,
0.055, 0.05, 0.05), h.m.s.standarized = c(2.25518769436277, 0.514113390124449,
0.0488651369550992, -0.106217614101351, 0.669196141180899, 0.0488651369550992,
-1.1917968714965, -1.34687962255295, 0.359030639067999, 2.10010494330632,
0.514113390124449, -0.0286762385731257, -0.0907093389957057,
-0.106217614101351, -0.106217614101351)), .Names = c("person",
"sex", "adult", "state", "algor.polarity", "hum.mean.score",
"h.m.s.standarized"), row.names = c("1", "1.1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "10.1", "11", "11.1", "11.2"), class = "data.frame")
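As a rough sketch of the correlation check on the fictitious data above, the two score columns can be passed to `cor.test()` from base R, which reports the Pearson estimate together with a confidence interval. The vectors are re-entered here only so the snippet runs on its own; because the data are made up, the result says nothing about the real algorithm.

```r
# Columns copied from the fictitious example data frame posted above.
algor.polarity <- c(0.256, 0, 0.4, 0.6, 0, 0.053, 0, 0, 0.033, 0.222,
                    0.4, 0, 0.25, 0, 0.35)
hum.mean.score <- c(0.6, 0.25, 0.1, 0.05, 0.3, 0.1, -0.3, -0.35, 0.2, 0.55,
                    0.25, 0.075, 0.055, 0.05, 0.05)

# Pearson correlation with a significance test and a 95% confidence interval.
ct <- cor.test(algor.polarity, hum.mean.score, method = "pearson")

print(ct$estimate)  # the sample correlation
print(ct$conf.int)  # the interval; expect it to be wide with only 15 sentences
```

With so few sentences the interval matters more than the point estimate, which is one reason to score as many items as the human raters can manage.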
I haven't gotten the multiple confirmations on this that I was hoping for. This is just a *bump* to get the thread back up to the top. Hopefully I'll get a few more responses.
The mean human score... is it the same people doing the scoring every time? You said "2 or 3" people, so does this imply that some of those are the mean of 2 and some are the mean of 3?
I think the easiest way to explain is to show you with a data frame:
Code:
    group1   group2   group3 group1.scores       group2.scores       group3.scores
1 scorer 1 scorer 4       NA    2.21926899 -0.0695034395480398                  NA
2 scorer 1 scorer 4       NA   -0.91914158    0.21539195249389                  NA
3 scorer 1 scorer 4       NA   -0.78943285  -0.593739815464766                  NA
4 scorer 2       NA scorer 5   -0.12123523                  NA -0.0464242493234577
5 scorer 2       NA scorer 5   -0.57183893                  NA   0.978547963399828
6 scorer 2       NA scorer 5    1.39600043                  NA  -0.836448983464415
7 scorer 3       NA scorer 6   -0.07372781                  NA  0.0585154022531354
8 scorer 3       NA scorer 6    0.62242375                  NA  0.0463662419365834
9 scorer 3       NA scorer 6    1.52761071                  NA -0.0885663043068995
So to get the mean score sometimes it was three people's average and sometimes two.
Again I know this is poor collection but I did the best with what I had.
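The averaging over whichever scorers are available can be sketched like this. The column names and values are hypothetical and exist only for illustration; `NA` marks a scorer who did not rate that item.

```r
# Hypothetical scorer columns; NA means that scorer did not rate the item.
scores <- data.frame(
  scorer.a = c(0.2, -0.1, 0.5),
  scorer.b = c(0.4,  0.1,  NA),
  scorer.c = c(NA,   0.3, 0.1)
)

# Mean over whichever scorers rated each item (2 or 3 of them).
hum.mean.score <- rowMeans(scores, na.rm = TRUE)

# Number of scorers behind each mean, worth keeping alongside the mean.
n.scorers <- rowSums(!is.na(scores))

print(data.frame(hum.mean.score, n.scorers))
```

Keeping `n.scorers` next to the mean makes it possible to check later whether items rated by two people behave differently from items rated by three.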