SiBorg (08-19-2012), TheEcologist (08-16-2012)
This message only pertains to those with chatbox privileges. If you do not have those privileges yet, keep posting and being a part of the community and soon you'll have access.
I'm working on a paper right now on the speech formality of educators (this actually has a great deal to do with the formality that Greta has been discussing). One of the elements of the language teachers use is that of formal language vs. contextual language. Here's a fascinating (IMHO) paper regarding this:
http://pespmc1.vub.ac.be/Papers/Formality.pdf
I have written a function to measure formality in speech using R. I ran it on the chat box and thought I'd share. You can run the code yourself; just download the qdap and talkstats packages from my GitHub repo (if you already downloaded qdap, I'd update).
I warn you that the function takes a while to run the first time. Bear in mind that it's generating a part of speech for every word. On my i7 quad-core Windows 7 machine it took about 10 minutes to run.
Here are the results (formality ranges from 0 to 100, with neither extreme being possible). Also, the second visual leaves out people with fewer than 300 words, as the measure isn't recommended below that number of words.
Results:
Code:
          person word.count formality
1         bugman         15     86.67
2     SmoothJohn         15     73.33
3         ledzep         74     71.62
4   TheEcologist        195     67.95
5         spunky        278     63.31
6       bukharin         25     62.00
7          quark        995     61.46
8  bryangoodrich      10957     60.76
9       duskstar         92     60.33
10         vinux       2664     58.20
11         Lazar       2162     57.28
12        Dragan         84     57.14
13          Jake       8249     57.10
14       trinker       8194     57.04
15    victorxstc       8765     56.81
16      SiBorg77        985     55.69
17        noetsi       1588     55.57
18    GretaGarbo       5872     55.29
19         Dason      13415     54.16
Visuals: [formality plots by person; images not preserved]
Code:
# install.packages("devtools")
library(devtools)
install_github("qdap", "trinker")
install_github("talkstats", "trinker")
library(qdap)       # formality() comes from qdap
library(talkstats)  # assumed to provide ts_chatbox()

x <- ts_chatbox()
# the first call takes a long time as it's parsing parts of speech
(res <- formality(x$dialogue, x$person, plot = TRUE))
# later calls reuse the parts of speech saved in res and take seconds
formality(res, x$person, plot = TRUE, min.wrd = 300)
formality(res, x$date, plot = TRUE)
with(x, formality(res, list(person, date)))
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
Bwahah! I'm winning the informality race!
Any chance we could get some sort of standard error on those measurements?
"His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich
I guess it's not a serious question, but it did make me think that you could in principle get bootstrapped standard errors by resampling on people's samples of words, right?
In God we trust. All others must bring data.
~W. Edwards Deming
Not that I can devise, as the formula works off the speech as a whole. No SD, sorry. Formality isn't necessarily a good thing. I'm guessing (in fact I'd bet) that in threads we're all way more formal, as there's a greater chance of being misunderstood.
Actually, Jake, that's a very interesting concept, thanks.
I thought about it for a little while but the problem with that approach is that there is a structure to our sentences that wouldn't be accounted for with a naive bootstrap. You would probably have to do some sort of block bootstrap to make it work.
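To make the difference between the two resampling schemes concrete, here is a minimal sketch in base R. It assumes the parts of speech have already been tagged and stored as one character vector per turn of talk; `formality_score`, `naive_boot_se`, `block_boot_se`, the tag labels, and the toy data are all made up for illustration and are not part of qdap. The "block" version simply resamples whole turns with replacement (sometimes called a cluster bootstrap), which keeps the words within a turn together.

```r
# Hypothetical helper (not qdap's formality()): percent formal minus
# percent contextual, plus 100, divided by 2.
formality_score <- function(pos) {
  formal     <- c("noun", "adjective", "preposition", "article")
  contextual <- c("pronoun", "verb", "adverb", "interjection")
  n <- length(pos)
  (100 * sum(pos %in% formal) / n - 100 * sum(pos %in% contextual) / n + 100) / 2
}

# Naive bootstrap: resample individual words with replacement, same n.
naive_boot_se <- function(pos, B = 1000) {
  sd(replicate(B, formality_score(sample(pos, replace = TRUE))))
}

# Block (cluster) bootstrap: resample whole turns of talk with replacement,
# so the dependence between words inside a turn is preserved.
block_boot_se <- function(pos_by_turn, B = 1000) {
  n_turns <- length(pos_by_turn)
  sd(replicate(B, {
    idx <- sample.int(n_turns, n_turns, replace = TRUE)
    formality_score(unlist(pos_by_turn[idx], use.names = FALSE))
  }))
}

# Made-up tagged turns for one speaker
pos_by_turn <- list(
  c("article", "noun", "verb", "adjective", "noun"),
  c("pronoun", "verb", "adverb"),
  c("interjection"),
  c("article", "adjective", "noun", "preposition", "article", "noun"),
  c("pronoun", "verb", "pronoun", "adverb", "verb")
)

set.seed(1)
se_naive <- naive_boot_se(unlist(pos_by_turn))
se_block <- block_boot_se(pos_by_turn)
c(naive = se_naive, block = se_block)
```

If turns differ a lot in their mix of word types, the block version would be expected to give the larger standard error, since the naive version treats every word as independent.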
"His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich
Code:
37 2011-08-16 15:46:00 trinker Formal language is useful in that it is good for those with little contextual knowledge but terribly inefficient. Thus the goal is to get the student to have greater context and thus be less formal.
36 2011-08-16 15:51:00 trinker What's interesting is that Dason is less formal than Greta who is female (females are less formal in spoken dialogue). That makes me question if bots are less formal still because their programming lacks sophistication. This could be an interesting way to detect bots.
35 2011-08-16 15:54:00 trinker @Jake I was actually wanting to get SE for this and bootstrapping didn't occur to me. Nice idea.
34 2011-08-16 15:55:00 Jake based on what you said about how long it took for a single run, i guess getting the SEs would take a while
33 2011-08-16 15:55:00 trinker Would you resample with replacement each time but use the same n?
32 2011-08-16 15:55:00 Jake yeah
31 2011-08-16 15:56:00 trinker No Jake once you run it once it's easy. It saves the parts of speech in a list, that's what takes a while. After that you feed the first one to the next and it takes seconds.
30 2011-08-16 15:56:00 Jake the key to this working is the fact that the formality algorithm just analyzes individual words, not sentence structure - that's true, right?
29 2011-08-16 15:57:00 GretaGarbo What is the unit of investigation in this case, the word, sentence or message?
28 2011-08-16 15:57:00 trinker correct jake, but the parts of speech algorithm needs the sentence struct. However if I smell what you're cooking you just sample from the parts of speech after they've been determined
27 2011-08-16 15:57:00 Dason Like I mentioned in the thread I think you would need to do a block bootstrap - not a naive bootstrap where you just resample all of the words
26 2011-08-16 15:58:00 trinker The formula is rather simple:
25 2011-08-16 15:58:00 Dason But I would probably need to learn more about the actual measure used to be sure of that
24 2011-08-16 15:58:00 trinker bootstrap I quasi understand now you lost me with block and naive
23 2011-08-16 15:59:00 Jake block means basically you would resample at the level of sentences rather than words
22 2011-08-16 15:59:00 trinker the tex doesn't come through but you add up all formal parts of speech minus all contextual parts plus 100 divided by 2
21 2011-08-16 15:59:00 trinker why that way dason?
20 2011-08-16 15:59:00 trinker It's doable but why?
Code:
19 2011-08-16 16:00:00 GretaGarbo Is it multilevel with: message, sentence, word?
18 2011-08-16 16:00:00 trinker The parts of speech are actually saved in a list by turn of talk, not necessarily by sentence.
17 2011-08-16 16:00:00 Dason Because people don't just throw out random words - there is structure that needs to be accounted for.
16 2011-08-16 16:01:00 Dason Ok - people don't usually just throw out random words.
15 2011-08-16 16:01:00 trinker Oh that makes sense
14 2011-08-16 16:01:00 Jake for the simple analyses that only look at characteristics of the word (ignoring what sentence it came from) a simple bootstrap should be fine. but stuff that depends on sentence structure may need a more complicated bootstrap like dason suggests
13 2011-08-16 16:01:00 trinker Though you've been known to...
12 2011-08-16 16:01:00 Dason that's why I added the "usually"
11 2011-08-16 16:02:00 Dason How do you determine if something is formal or contextual?
10 2011-08-16 16:03:00 trinker Very interesting. I'm working on the lit review now and the analysis is a few weeks off so I think I may add the SE to the analysis. It's on 3 subjects 3 pre and 3 post measures but not a large enough sample to run sound statistical analysis on, however the SEs I think add to the information conveyed.
9 2011-08-16 16:04:00 Jake the pre and post design also adds some more complexity because now you also have to block by subject
8 2011-08-16 16:05:00 trinker Dason it's rather simple, verbs, adverbs, pronouns and interjections are contextual, whereas nouns, articles, prepositions and adjectives are formal and conjunctions are neither.
7 2011-08-16 16:05:00 trinker I may start a thread on this.
6 2011-08-16 16:06:00 Jake the more i think about it, you probably cant get away with a simple bootstrap even for the very simple measures. it is probably the case that there are correlations across sentence in the composition of words. i.e., sentence with lots of verbs also have lots of nouns, that kind of thing
5 2011-08-16 16:07:00 Jake so sentences probably always introduce dependence
4 2011-08-16 16:07:00 trinker You may say well in this instance... blah blah blah. This is an overall measure so that's why we need 300+ words. It's pretty robust from everything I've read on it and pretty standard among linguists.
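The calculation described in the chat (formal parts of speech minus contextual ones, plus 100, divided by 2) can be sketched in a few lines of base R. This assumes the counts are first expressed as percentages of all words, as in the paper linked in the first post; `formality_score` and the tag labels are hypothetical names for illustration, and this is not the qdap implementation.

```r
# Hypothetical helper, not qdap's formality().
# pos: character vector with one part-of-speech label per word.
formality_score <- function(pos) {
  formal     <- c("noun", "adjective", "preposition", "article")
  contextual <- c("pronoun", "verb", "adverb", "interjection")
  n <- length(pos)
  if (n == 0) return(NA_real_)
  pct_formal     <- 100 * sum(pos %in% formal) / n      # percent formal words
  pct_contextual <- 100 * sum(pos %in% contextual) / n  # percent contextual words
  # Conjunctions and anything else count toward n but toward neither group.
  (pct_formal - pct_contextual + 100) / 2
}

# Made-up tags for a ten-word utterance: 5 formal, 4 contextual, 1 neither
tags <- c("article", "noun", "verb", "article", "adjective", "noun",
          "conjunction", "pronoun", "verb", "adverb")
score <- formality_score(tags)
score
```

With only a handful of words a single tag moves the score a lot, which fits the advice to use 300+ words.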
Here is a related post on preferred equation display that I wanted to link to this: http://www.talkstats.com/showthread....lay?highlight=
I was speculating a little bit about trinker's model yesterday while doing something else. I just wrote it as a comment in the chat box since it was, and still is, no more than speculation on a passing idea.
Let the response be 1/0 (formal/not formal) for each individual, for the t:th word within the s:th sentence.
Trinker was referring to some authors who added 50 or so. I think that is unnecessary and just makes the situation more complicated. That is just a linear transformation and that can be done afterwards (after the estimation of proportion of formality). Besides, formality proportion and the other proportions will add up to 100%.
One can think of individuals as a third level but I ignore that and model one person and think of sentences and words as a multilevel model.
Where $\alpha_s$ is a random variable for sentence number s and where $\omega_{st}$ is a random variable for word st. (I think of the words in a sentence like a time series, so I use index t.) Possibly the alpha:s can be considered to be independent. The omega:s have a dependence, maybe an autocorrelation. I think a first order autocorrelation would be too simple, but maybe a second order model:
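A sketch of how the two pieces could be written, assuming a logit link, a response written $y_{st}$, an overall mean $\mu$, and autoregressive coefficients $\phi_1, \phi_2$ (these names and the link function are assumptions for illustration, not necessarily the notation originally intended):

```latex
\begin{align}
\operatorname{logit}\,\Pr(y_{st} = 1) &= \mu + \alpha_s + \omega_{st}, \\
\omega_{st} &= \phi_1\,\omega_{s,t-1} + \phi_2\,\omega_{s,t-2} + \varepsilon_{st}.
\end{align}
```

Here $\alpha_s$ varies between sentences and $\omega_{st}$ carries the second order dependence between neighbouring words within a sentence, with $\varepsilon_{st}$ as independent noise.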
I have not run a model exactly like this but there are standard models like generalized linear models (with binomial errors) and repeated measurements. Thus a multilevel model. Some sentences can be very short. I know. And that might be a complication for estimating the autocorrelation model. Models where estimates are shrunken towards the overall mean might be useful.
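To make the structure concrete, here is a small simulation sketch of a model of this kind in base R. It assumes a logit link, zero starting values for the word-level process, and arbitrary parameter values; `simulate_formality` and all of its arguments are hypothetical. It only generates data for one person and does not fit anything; fitting would need a mixed-model package.

```r
# Simulate one speaker: sentences s = 1..S, words t = 1..n_s within sentence.
# All parameter values are arbitrary illustrations.
simulate_formality <- function(n_words, mu = 0.2, sd_alpha = 0.5,
                               phi = c(0.3, 0.2), sd_eps = 0.5) {
  out <- lapply(seq_along(n_words), function(s) {
    n     <- n_words[s]
    alpha <- rnorm(1, 0, sd_alpha)   # sentence-level random effect
    eps   <- rnorm(n, 0, sd_eps)     # independent word-level noise
    omega <- numeric(n)
    for (t in seq_len(n)) {
      lag1 <- if (t > 1) omega[t - 1] else 0
      lag2 <- if (t > 2) omega[t - 2] else 0
      omega[t] <- phi[1] * lag1 + phi[2] * lag2 + eps[t]  # second order dependence
    }
    p <- plogis(mu + alpha + omega)  # inverse logit gives P(formal)
    data.frame(sentence = s, word = seq_len(n), formal = rbinom(n, 1, p))
  })
  do.call(rbind, out)
}

set.seed(42)
sim <- simulate_formality(n_words = c(8, 3, 12, 5, 20))
100 * mean(sim$formal)                  # overall percent formal
tapply(sim$formal, sim$sentence, mean)  # per-sentence proportions
```

The very short sentences (three or five words here) show why estimating the autocorrelation part could be hard in practice.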
It was suggested to trinker to have standard error for the estimates. Such a model would give standard errors.
First I thought of a Kalman filter with gradually changing proportions of formality. In a way I like the idea that the parameters are changing, drifting, over time. Like when a conversation starts more formal, moves over to something more contextual and ends a little bit formal.
Please correct this if it is of any use. Meanwhile, I agree with Socrates.
trinker (08-19-2012)
This occurs (according to the linguists I'm reading) because the beginning is spent building context. Once the context is built, the speaker may become less formal and thus more efficient.
Thanks for your thoughts, Greta. I'd appreciate it if people had challenges, confirmations, or new ideas.