View Full Version : Use R to break a word into syllables
trinker
12-19-2011, 12:36 AM
I moved this portion of a thread from (HERE (http://www.talkstats.com/showthread.php/22361-Using-a-website-s-tool-s-in-R)) as the original problem was solved.
Posting a problem I'm wrestling with though it's likely outside your fields your brains are generally pretty good at applying your knowledge to other fields. Thought maybe you could help me.
I'm looking to use R for Syllabication or break words into number of syllables (In my field this has all sorts of uses related to readability of a text etc).
Here's a bit of really condensed, not too hard, theory on syllabication that may be useful if you're attempting to help with this problem:
http://allenporter.tumblr.com/post/9776954743/syllables
A suggestion for R from SO to tackle this problem is to use:
nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", y)))
I tested it out using text from a post on here (The frequent stats misunderstandings thread (http://www.talkstats.com/showthread.php/18573-Frequent-Statistical-Misunderstandings)) and got surprisingly accurate responses from the simple code(76% accuracy on n =74). The major problem with the above code is it doesn't detect silent 'e' at the end of a word. If I could add a piece to the code it would surely improve the accuracy a lot. Just telling R to not look for ending e's won't work because the words "people" and "little" for instance actually uses the e to for syllabication.
QUESTION: Any ideas on how to use R's reg ex to find silent e's.
x <- "I know TheEcologist is working on an FAQ to put up at some point but I didn't think it would be a
bad idea to compile a list of frequent misunderstandings that we see. I've been seeing quite a few people
making the mistake of assuming that in a linear model we expect the predictor variables to be normally distributed.
Or seeing that they expect the response itself to be normally distributed. This is wrong, of course, because we
don't make any of those assumptions but instead assume that the error term is normally distributed.
So what other misunderstands have you come across quite a bit (either in real life or here at Talk Stats)?"
bracketX <- function(text, bracket='all'){
switch(bracket,
square=sapply(text, function(x)gsub("\\[.+?\\]", "", x)),
round=sapply(text, function(x)gsub("\\(.+?\\)", "", x)),
curly=sapply(text, function(x)gsub("\\{.+?\\}", "", x)),
all={P1<-sapply(text, function(x)gsub("\\[.+?\\]", "", x))
P1<-sapply(P1, function(x)gsub("\\(.+?\\)", "", x))
sapply(P1, function(x)gsub("\\{.+?\\}", "", x))})
}
y <- gsub("\n", " ", x)
y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
y <- bracketX(y)
y <- gsub(" +", " ", y)
y <- c(sapply(y, function(x)as.vector(unlist(strsplit(x, " ")))))
y <- tolower(y)
y <- levels(as.factor(y))
n <- nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", y)))
DF <- data.frame(words=y, syllables=n)
words syllables actual
1 a 1 1
2 across 2 2
3 an 1 1
4 any 2 2
5 assume 3 2
6 assuming 3 3
7 assumptions 3 3
8 at 1 1
9 bad 1 1
10 be 1 1
11 because 3 2
12 been 1 1
13 bit 1 1
14 but 1 1
15 come 2 1
16 compile 3 2
17 course 2 1
18 didn't 1 2
19 distributed 4 4
20 don't 1 1
21 error 2 2
22 expect 2 2
23 faq 1 NA
24 few 1 1
25 frequent 2 2
26 have 2 1
27 i 1 1
28 i've 2 1
29 idea 2 3
30 in 1 1
31 instead 2 2
32 is 1 1
33 it 1 1
34 itself 2 2
35 know 1 1
36 linear 2 3
37 list 1 1
38 make 2 1
39 making 2 2
40 mistake 3 2
41 misunderstandings 5 5
42 misunderstands 4 4
43 model 2 2
44 normally 3 3
45 of 1 1
46 on 1 1
47 or 1 1
48 other 2 2
49 people 2 2
50 point 1 1
51 predictor 3 3
52 put 1 1
53 quite 2 1
54 response 3 2
55 see 1 1
56 seeing 1 2
57 so 1 1
58 some 2 1
59 term 1 1
60 that 1 1
61 the 1 1
62 theecologist 4 4
63 they 1 1
64 think 1 1
65 this 1 1
66 those 2 1
67 to 1 1
68 up 1 1
69 variables 3 4
70 we 1 1
71 what 1 1
72 working 2 2
73 would 1 1
74 wrong 1 1
75 you 1 1
table(with(DF, syllables-actual))
sum(table(with(DF, syllables-actual)))
round(prop.table(table(with(DF, syllables-actual))), digits=3)*100
COUNTS
-1 0 1
5 56 13
n = 74
PERCENTAGES
-1 0 1
6.8 75.7 17.6
bryangoodrich
12-19-2011, 10:15 AM
Well, I don't know how to identify something as "silent." What conditions would make that true? But looking for 'e' at the end of the sentence, you should be thinking "e$" or something like that.
Dason
12-19-2011, 10:17 AM
The hard part is that you can't just look for an "e" at the end of a word because sometimes it is silent and sometimes it isn't. trinker is there a good way to determine if something is a silent e?
trinker
12-19-2011, 11:28 AM
OK I added a bit about silent e's to the code and have improved the accuracy to 93%. I think I'm going to a combined approach where I start with a dictionary of known syllable rule breakers (hopefully less than 5000). match the words to that list. Any words in that dictionary will be calculated using the dictionary with pre set syllable values. Anything not in the dictionary will be calculated using the formula below. I think this will increase accuracy even more to well above the 95% accuracy I was hoping for.
y <- gsub("\n", " ", x)
y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
y <- bracketX(y)
y <- gsub("\\d+", "", y)
y <- gsub(" +", " ", y)
y <- c(sapply(y, function(x)as.vector(unlist(strsplit(x, " ")))))
y <- tolower(y)
y <- levels(as.factor(y))
m <- gsub( "eeing", "XX", y)
m <- gsub( "eing", "XX", m)
magice<-function(z){
if(substring(z,nchar(z), nchar(z))=='e' &
length(intersect(unlist(strsplit(z, NULL)), c('a','e','i','o','u','y')))>1){
substring(z,1, nchar(z)-1)
}else{
z
}
}
m <- sapply(m, magice)
n <- nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", m)))
DF <- data.frame(words=y, syllables=n)
table(with(DF2, syllables-actual))
-1 0
5 69
n = 74
-1 0
6.8 93.2
words syllables actual diff
18 didn't 1 2 -1
29 idea 2 3 -1
36 linear 2 3 -1
49 people 1 2 -1
69 variables 3 4 -1
trinker
12-19-2011, 11:34 AM
I just saw Dason and bryangoodrich's post. I used a convoluted way to find ending e's. I'll fix that in the code. I have two manuals on regular expression, knew the e$ use and just didn't apply it in this situation. I'm still very uncomfortable with regexs but I'm getting there.
Dason there really isn't a good way to tell if an e is silent or not so I'm going to assume it is in the formula. But in my previous post I talk about using a dictionary of known rule breakers first. That should help with that problem. Gotta find the dictionary now (By the way I should define that term "dictionary"; a dictionary in natural language processing (NLP) is a word list of terms. So I actually have a dictionary of positive and a dictionary of negative terms [ie love and hate] that I use for sentiment analysis)
trinker
12-19-2011, 11:40 AM
Currently, for punctuation I'm using
y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
to say all [[:punct:]] minus apostrophes. Is there a way to do this more simply? I tried [[:punt:]]^"'"
EDIT: THE carrot has to go inside the brackets somehow. Still reading
bryangoodrich
12-19-2011, 11:41 AM
Do syllables follow morphemes or phonemes? I would assume the latter, and not necessarily the former. My thought on using a dictionary is that if you can break common syllables into their smallest component parts, you could create a dictionary to match those using some C module and a quick searching algorithm. You could have a pretty large dictionary that works quickly. But using a rule is always better, I think. Brute forcing known exceptions to the rule + the rule should be a very good approach. You're making something very valuable for R!
bryangoodrich
12-19-2011, 11:46 AM
to say all [[:punct:]] minus apostrophes. Is there a way to do this more simply? I tried [[:punt:]]^"'"
You might use the exclusion grouping (it's prefixed by some symbol; it's stated in the help files). You'll have to play around with how to make it work, but basically you're saying "all punctuation except apostrophes." That's the only thing I can think of it.
trinker
12-19-2011, 11:51 AM
Do syllables follow morphemes or phonemes? I would assume the latter, and not necessarily the former.
You're correct more so the latter but in reality not so much the latter either. English is an untamed beast with lots of rules and just as many broken rules. I have it a bit easier than some with this task because I'm not looking to actually segment the words just count the syllables. Segmenting the words requires even more rules. For the purposes of the package I'm creating the counting of syllables is enough. On the surface this seems like a rather simple problem but it's very complex.
bryangoodrich
12-19-2011, 12:18 PM
Of course it seems simple, but that's because understanding a phenomena in a language we are fluent with is easy! The hard part is making that phenomena formal, and because that phenomena is not exactly a property of the syntax (grammar) of the language. As a logician, I look at the language in terms of morphemes. I was hoping that maybe there was a way of looking at it in terms of phonemes, of which I'm largely unfamiliar, might prove useful, especially since syllables are a property of the way a word is sounded, not with its meaning.
trinker
12-19-2011, 07:24 PM
I've improved the code to within 95% accuracy on the original data set using the following code (found one more rule with 'consonant + le' that improves accuracy). Still looking for an initial dictionary to use. I've inquired at linguistics.stackexchange.com (post located here (http://linguistics.stackexchange.com/questions/1182/dictionary-of-atypical-syllabication)) a stack site dedicated to linguistics about the dictionary. Hopefully I get some information there.
y <- gsub("\n", " ", x)
y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
y <- bracketX(y)
y <- gsub("\\d+", "", y)
y <- gsub(" +", " ", y)
y <- c(sapply(y, function(x)as.vector(unlist(strsplit(x, " ")))))
y <- tolower(y)
y <- levels(as.factor(y))
m <- gsub( "eeing", "XX", y)
m <- gsub( "eing", "XX", m)
conle<-function(z){
if(substring(z,nchar(z)-1, nchar(z))=='le' &
!substring(z,nchar(z)-2, nchar(z)-2)%in%c('a','e','i','o','u','y')){
paste(substring(z,1, nchar(z)-1), 'X', sep="")
}else{
if(substring(z,nchar(z)-1, nchar(z))=='le' &
substring(z,nchar(z)-2, nchar(z)-2)%in%c('a','e','i','o','u','y')){
substring(z,1, nchar(z)-1)
}else{
z
}
}
}
m <- sapply(m, conle)
conles<-function(z){
if(substring(z,nchar(z)-2, nchar(z))=='les' &
!substring(z,nchar(z)-3, nchar(z)-3)%in%c('a','e','i','o','u','y')){
paste(substring(z,1, nchar(z)-2), 'X', sep="")
}else{
if(substring(z,nchar(z)-2, nchar(z))=='les' &
substring(z,nchar(z)-3, nchar(z)-3)%in%c('a','e','i','o','u','y')){
substring(z,1, nchar(z)-2)
}else{
z
}
}
}
m <- sapply(m, conles)
magice<-function(z){
if(substring(z,nchar(z), nchar(z))=='e' &
length(intersect(unlist(strsplit(z, NULL)), c('a','e','i','o','u','y')))>1){
substring(z,1, nchar(z)-1)
}else{
z
}
}
m <- sapply(m, magice)
n <- nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", m)))
DF <- data.frame(words=y, syllables=n)
#Raw scores
-1 0
4 70
n = 74
#Percentages correct (0 means difference -1 means the algorithm calculated 1 too few syllables)
-1 0
5.4 94.6
#Words not counted correctly
words syllables actual diff
18 didn't 1 2 -1
29 idea 2 3 -1
36 linear 2 3 -1
69 variables 3 4 -1
EDIT: Sorry for the lack of indentation. I'll do it on a future post when the function is more near complete.
trinker
12-19-2011, 09:52 PM
Alright I got my dictionary after much searching and paper reading. It's the NETtalk dictionary (LINK) (http://jklp.org/public/projects/lists/talk/corpus) and the usage defined here (LINK (http://jklp.org/public/projects/lists/talk/index.html)). Here's the paper that lead me to search for NETtalk (LINK (http://eprints.ecs.soton.ac.uk/14285/1/MarchandAdsettDamper_ISCA07.pdf))
My next step is to read the code into R pull out the word list, count the digits in the 3rd column and make a new data frame with word and number of syllables. It says it's tab separated so I'll try reading it right into R (haven't done this for real yet in R; don't be scared trink you got this). This about 20,000 words first derived from Webster's pocket dictionary. I'm guessing this will get me pretty darn accurate between it and the algorithm.
trinker
12-19-2011, 11:12 PM
SUBTITLE: THE DICTIONARY (NECESSARY FOR THE FUNCTION TO WORK)
Might as well take you all along on my learning process. Until today I always saved web resources to a text file and then loaded them in. A while back bryangoodrich showed how easy it was to just load in a file from the web. No time to learn like the present. So following a script he freely posted I imported the NETtalk file in and doctored it up to my liking with:
j <- 'http://jklp.org/public/projects/lists/talk/corpus'
header <- c("word", "phonemes", "stress.n.structure", "origin")
NETtalkraw <- read.delim(url(j), header=FALSE, strip.white = TRUE, sep="\t",
col.names=header, na.strings= c("999","NA"," "))
NETtalkraw$stress.n.structure2 <- gsub("<", "", NETtalkraw$stress.n.structure)
NETtalkraw$stress.n.structure2 <- gsub(">", "", NETtalkraw$stress.n.structure2)
NETtalkraw$stress.n.structure3<-nchar(NETtalkraw$stress.n.structure2)
NETtalk <- data.frame(word=NETtalkraw$word, syllables=NETtalkraw$stress.n.structure3)
NETtalk<-NETtalk[which(!duplicated(NETtalk$word)), ]
contractions <- structure(list(word = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L,
21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L,
34L, 35L, 36L, 37L, 38L, 39L, 40L, 43L, 41L, 42L, 44L, 45L, 46L,
47L, 48L, 49L, 50L), .Label = c("aren't", "can't", "couldn't",
"didn't", "doesn't", "don't", "hadn't", "hasn't", "haven't",
"he'd", "he'll", "he's", "I'd", "I'll", "I'm", "I've", "isn't",
"let's", "mightn't", "mustn't", "shan't", "she'd", "she'll",
"she's", "shouldn't", "that's", "there's", "they'd", "they'll",
"they're", "they've", "we'd", "we're", "we've", "weren't", "what'll",
"what're", "what's", "what've", "where's", "who'll", "who're",
"who's", "who've", "won't", "wouldn't", "you'd", "you'll", "you're",
"you've"), class = "factor"), syllables = c(1L, 1L, 2L, 2L, 2L,
1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L)), .Names = c("word",
"syllables"), class = "data.frame", row.names = c(NA, -50L))
NETtalk<-data.frame(rbind(NETtalk, contractions))
NETtalk[12503, 2] <- 1 #an error in the word pace
compound <- structure(list(word = c("backfire", "baseboard", "cheesecloth",
"daredevil", "eavesdrop", "elsewhere", "eyeball", "eyebrow",
"eyeglass", "eyelash", "eyelet", "eyelid", "eyepiece", "eyesight",
"eyesore", "eyetooth", "eyewitness", "figurehead", "fireball",
"firebrand", "firebug", "firedamp", "firefly", "firepower", "fireplug",
"fireproof", "fireside", "firetrap", "firewood", "firework",
"flameout", "forecastle", "forefinger", "forefoot", "forefront",
"foregone", "forehand", "foreleg", "forelimb", "forelock", "foremast",
"foremost", "forenoon", "forerunner", "foresail", "foreshadow",
"foreshore", "foreword", "framework", "gamesome", "gatecrasher",
"gatekeeper", "gateway", "globetrotter", "grapefruit", "grapeshot",
"grapevine", "guidebook", "guideline", "guidepost", "harebrained",
"harelip", "hedgehog", "hedgehop", "henceforth", "homebody",
"homecoming", "homeland", "homemaker", "homespun", "homestead",
"homestretch", "homeward", "homework", "horseback", "horseflesh",
"horsefly", "horseflies", "horsehair", "horsehide", "horselaugh",
"horseman", "horsemanship", "horseplay", "horsepower", "horseradish",
"housebroke", "housebreaking", "housefly", "houseflies", "housekeeper",
"housewarming", "iceberg", "icebound", "iceboxes", "iceland",
"lifeblood", "lifeguard", "lifetime", "lifework", "likewise",
"limelight", "lovebird", "lovelorn", "lovesick", "milestone",
"moleskin", "namesake", "notebook", "pacesetter", "peacetime",
"policeman", "racehorse", "racetrack", "safeguard", "safekeeping",
"sagebrush", "shakedown", "shakeup", "shamefaced", "sharecrop",
"sharecropper", "shareholder", "shoreline", "sideboard", "sideburns",
"sidecar", "sidekick", "sideline", "sidelong", "sidesaddle",
"sidestep", "sideswipe", "sidetrack", "sidewalk", "sideways",
"smokehouse", "smokestack", "someday", "somehow", "someone",
"someplace", "sometime", "sometimes", "someway", "somewhat",
"somewhere", "spacecraft", "spaceman", "stagecoach", "stateside",
"storehouse", "storekeeper", "stovepipe", "takeoff", "talebearer",
"theretofore", "therewith", "tideland", "tidewater", "timekeeper",
"timeworn", "trademark", "typeface", "typesetter", "typewrite",
"typewriter", "warehouse", "wasteland", "waveform", "wavelength",
"whaleboat", "whalebone", "whereby", "wherefrom", "whitewash",
"wholehearted", "wiretap", "wisecrack"), syllables = c(2L, 2L,
2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L,
3L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L,
2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L,
2L)), .Names = c("word", "syllables"), row.names = c(NA, 179L
), class = "data.frame")
NETtalk<-data.frame(rbind(NETtalk, compound))
qview(NETtalk)
This yields the following data frame (15 row head)
========================================================================
n = 20031 # of vars = 2 NETtalk
========================================================================
word syllables
1 aardvark 2
2 aback 2
3 abacus 3
4 abaft 2
5 abalone 4
6 abandon 3
Tomorrow's task is to incorporate the dictionary and algorithm together as a working function. 99% accuracy would be nice.
EDIT: ADDED A LINE OF CODE TO REMOVE DUPLICATED WORDS
EDIT2: ADDED CONTRACTIONS TO THE DICTIONARY BECAUSE OF THEIR IRREGULAR NATURE
EDIT3: ADDED COMMON COMPOUND WORDS WITH FIRST WORD PIECE ENDING IN E
trinker
12-20-2011, 05:20 PM
All right I'm at 100% accuracy on my original data set. I'm going to gather a new data set and retest. I think I'm at the accuracy level I was hoping for. I may make some tweaks later but I'm pretty happy with where it's at now.
syllable.count <- function(text, remove.bracketed = TRUE, algorithm.report = FALSE) {
q <- gsub("\n", " ", text)
q <- if (remove.bracketed == TRUE) {
bracketX(q)
} else {
q
}
q <- gsub("[\\,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]",
"", as.character(q))
q <- gsub("[\\(\\)\\[\\]\\{\\}]", "", q)
q <- gsub("\\d+", "", q)
q <- gsub(" +", " ", q)
q <- c(sapply(q, function(x) as.vector(unlist(strsplit(x, " ")))))
y <- tolower(q)
y <- levels(as.factor(y))#######################remove (for testing purposes only)
q <- y#######################remove (for testing purposes only)
SYLL <- function(x) {
if (x %in% NETtalk$word) {
NETtalk[which(NETtalk$word %in% x), "syllables"]
} else {
if (substring(x, nchar(x), nchar(x)) == "s" & substring(x,
1, nchar(x) - 1) %in% NETtalk$word) {
NETtalk[which(NETtalk$word %in% substring(x, 1, nchar(x) -
1)), "syllables"]
} else {
m <- gsub("eeing", "XX", x)
m <- gsub("eing", "XX", m)
conle <- function(z) {
if (substring(z, nchar(z) - 1, nchar(z)) == "le" & !substring(z,
nchar(z) - 2, nchar(z) - 2) %in% c("a", "e", "i",
"o", "u", "y")) {
paste(substring(z, 1, nchar(z) - 1), "X", sep = "")
} else {
if (substring(z, nchar(z) - 1, nchar(z)) == "le" &
substring(z, nchar(z) - 2, nchar(z) - 2) %in% c("a",
"e", "i", "o", "u", "y")) {
substring(z, 1, nchar(z) - 1)
} else {
z
}
}
}
m <- conle(m)
conles <- function(z) {
if (substring(z, nchar(z) - 2, nchar(z)) == "les" &
!substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a",
"e", "i", "o", "u", "y")) {
paste(substring(z, 1, nchar(z) - 2), "X", sep = "")
} else {
if (substring(z, nchar(z) - 2, nchar(z)) == "les" &
substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a",
"e", "i", "o", "u", "y")) {
substring(z, 1, nchar(z) - 2)
} else {
z
}
}
}
m <- conles(m)
magice <- function(z) {
if (substring(z, nchar(z), nchar(z)) == "e" &
length(intersect(unlist(strsplit(z, NULL)),
c("a", "e", "i", "o", "u", "y"))) > 1) {
substring(z, 1, nchar(z) - 1)
} else {
z
}
}
m <- magice(m)
nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", m)))
}
}
}
n <- sapply(y, function(x) SYLL(x))
k <- ifelse(y %in% NETtalk$word, "-", "NF")
DF <- data.frame(words = q, syllables = n, in.dictionary = k)
row.names(DF) <- 1:nrow(DF)
if (algorithm.report == TRUE){
list("ALGORITHM REPORT" = DF[which(DF$in.dictionary == 'NF'), ], "SYLLABLE DATAFRAME" = DF)
} else {
DF
}
}
syllable.count(text)
Thanks for following along and thanks for the suggestions idea bouncing.;)
Lazar
12-20-2011, 05:37 PM
Excellent work. If you change the line load("NETtalk.RData") to be able to read from a url some of us could also test it out.
I think this is great work
trinker
12-20-2011, 05:58 PM
Lazar Thanks for the feedback.
I eliminated the line you are referring to (it will be in the final version where the NETtalk dataset will be available). It is necessary though to get and create the dictionary modified from the netTalk data set. To get this dictionary copy and paste the code from the sub tieled post of this thread that reads: SUBTITLE: THE DICTIONARY (NECESSARY FOR THE FUNCTION TO WORK)
Also note that you need to have the bracketX function (this could be inserted into the body of the function)
bracketX <- function(text, bracket='all'){
switch(bracket,
square=sapply(text, function(x)gsub("\\[.+?\\]", "", x)),
round=sapply(text, function(x)gsub("\\(.+?\\)", "", x)),
curly=sapply(text, function(x)gsub("\\{.+?\\}", "", x)),
all={P1<-sapply(text, function(x)gsub("\\[.+?\\]", "", x))
P1<-sapply(P1, function(x)gsub("\\(.+?\\)", "", x))
sapply(P1, function(x)gsub("\\{.+?\\}", "", x))})
}
bryangoodrich
12-20-2011, 06:02 PM
Impressive! Test, test, and retest this on different data sets to find out your accuracy. Keep testing until you've tested it on every word in the English language!!! hahaha
Might I add, once you're comfortable with this thing, just keep the dictionary data set with this item, and try to give a presentation of how awesome you are. This way, you can cement your name to this work, as I'm sure it will be a useful tool for your developing package, and no reason not to impress your colleagues or potential users.
Oh, and I'm curious to know, how quick is this function on large texts? I'm curious how efficient it is.
trinker
12-20-2011, 06:28 PM
I know this is not 100% accurate. It doesn't work on some compound words with the first word ending in e such as 'bikestand' that isn't in the dictionary and my algorithm would say is 3 syllables when in reality it's 2. I have 3 options. Don't bother and accept this. Add a list of compound words to the dictionary. Add lines to the algorithm that makes compound words from the dictionary.
I may use an approach something like this in the algorithm to expand the dictionary:
miniNETtalk <- data.frame(word = c('dog', 'cat', 'pony', 'cracker', 'shoe',
'Popsicle', 'pronunciation' ), syllables = c(1, 1, 2, 2, 1, 3, 5))
data.frame(
word = as.vector(outer(miniNETtalk$word, miniNETtalk$word, FUN = function(x, y)paste(x, y, sep=""))),
syllables = as.vector(outer(miniNETtalk$syllable, miniNETtalk$syllable, FUN = function(x, y) x + y)))
I've got to think this over first. I'm not sure of the ramifications of doing this computationally as there are currently 198521 words in the dictionary so 198521^198521 is pretty big. I doubt R could handle this and it'd slow the process down considerably even if it did.
Perhaps there's a way to match a compound word to two words in the list beyond my proposal but I haven't thought of it yet.
So I think I may just add a list of common compound words to the dictionary instead.
Thoughts?
bryangoodrich
12-20-2011, 06:44 PM
(1) Learn a little C and make a C module do some of the work: always efficient.
(2) Check out partial (fuzzy) matches so that you can try to match compound words (e.g., bikestand matches 'bike' and 'stand').
(3) Let it simply be part of the error that your process fails to pick up on. If that amounts to 99% of the words in the English language, you're still gonna have a **** good success rate!
Seriously, though, "stand" seems like 2 syllables to me due to the hard 'd' at the end, but I'm not expert!
trinker
12-20-2011, 07:15 PM
Again thanks for the feedback.
That's a colloquialism probably in how you pronounce the word stand. Another wrench in the process to consider, but I dictionary.com says it's one syllable (LINK) (http://www.thefreedictionary.com/stand).
For right now C is out of the question as I've got too many things on my plate. I has just asked today to run some stats and co author on a piece. That's a big chunk of my Christmas break. It's on my to do list but it's an eventual not eminent.
trinker
12-20-2011, 08:36 PM
I tested it on a enw data set n=126 taken from here (LINK) (http://en.wikipedia.org/wiki/Standard_deviation). I get about 94% accuracy this time. This revealed a problem in the algorithm I hadn't figured in. The addition of ed to words (see the code box at the bottom of the post). The algorithm reads this as an extra vowel when it is only an extra vowel (syllable) in the case where words are > 4 and end in:
a vowel or a consonant then t or d and then ed
Once I add this piece to the code it should bring the accuracy up to 97.6% on this piece of data.
I also want to test the speed of this on larger text files as well.
TEST 2
text2 <- 'Standard deviation is a widely used measure of variability or
diversity used in statistics and probability theory. It shows how much
variation or "dispersion" exists from the average (mean, or expected
value). A low standard deviation indicates that the data points tend to
be very close to the mean, whereas high standard deviation indicates that
the data points are spread out over a large range of values. The standard
deviation of a statistical population, data set, or probability
distribution is the square root of its variance. It is algebraically
simpler though practically less robust than the average absolute
deviation.[1][2] A useful property of standard deviation is that,
unlike variance, it is expressed in the same units as the data.
In addition to expressing the variability of a population, standard
deviation is commonly used to measure confidence in statistical
conclusions. For example, the margin of error in polling data is
determined by calculating the expected standard deviation in the results
if the same poll were to be conducted multiple times. The reported margin
of error is typically about twice the standard deviation *– the radius of
95 percent confidence interval. In science, researchers commonly report
the standard deviation of experimental data, and only effects that fall
far outside the range of standard deviation are considered statistically
significant – normal random error or variation in the measurements
is in this way distinguished from causal variation. Standard deviation is also
important in finance, where the standard deviation on the rate of return
on an investment is a measure of the volatility of the investment.'
DF <- syllable.count(text2)
DF$actual <- c(1L, 2L, 3L, 3L, 6L, 2L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 4L, 2L,
1L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 4L, 3L, 3L, 4L, 4L, 2L, 2L, 3L,
2L, 3L, 5L, 2L, 3L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 3L,
3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 3L, 1L, 3L, 2L, 1L,
1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 4L, 4L, 5L, 3L, 3L, 2L,
1L, 1L, 2L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 4L, 2L, 1L,
1L, 2L, 4L, 5L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 4L,
2L, 2L, 1L, 2L, 2L, 6L, 3L, 4L, 2L, 5L, 1L, 1L, 1L, 2L, 2L)
DF$diff<-with(DF, syllables-actual)
DF[with(DF, which(diff!=0)),]
table(DF$diff)
sum(table(with(DF, syllables-actual)))
round(prop.table(table(with(DF, syllables-actual))), digits=3)*100
words syllables actual diff
5 algebraically 5 6 -1
21 considered 4 3 1
23 determined 4 3 1
26 distinguished 4 3 1
35 expressed 3 2 1
114 used 2 1 1
126 widely 3 2 1
RAW SCORES (0 is spot on)
-1 0 1
1 119 6
n = 126
PERCENTAGES CORRERCT (0 is spot on)
-1 0 1
0.8 94.4 4.8
EDIT: I made the change described above (did not worry about the > 4 parameter because the dictionary should alleviate the need for specifying this) plus I added a rule about words ending with ely as this causes the same problem as a word ending in ed. This brings my accuracy for this data set to 99.2%, missing only one word.
-1 0
EDIT: Now to test it on a third data set and then speed on large data sets.
0.8 99.2
> DF2b[with(DF2b, which(diff!=0)),]
words syllables actual diff in.dictionary
5 algebraically 5 6 -1 NF
trinker
12-20-2011, 11:24 PM
I added my two fixes and I'm at 97.5 % accuracy using a splice of Bob Dylan's Last Thoughts On Woody Guthrie and Hamlet's Soliloquy (missed 3). Two of the misses were compound words that the first word ends with e. I may augment the dictionary with a list of these, not sure yet. Now to test the speed on a larger data set.
Function Code
syllable.count <- function(text, remove.bracketed = TRUE, algorithm.report = FALSE) {
# load("NETtalk.RData") #To be included in the final code
q <- gsub("\n", " ", text)
q <- if (remove.bracketed == TRUE) {
bracketX(q)
} else {
q
}
q <- gsub("[\\,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\*–\\_\\^\\%\\$\\#\\<\\>]",
"", as.character(q))
q <- gsub("[\\(\\)\\[\\]\\{\\}]", "", q)
q <- gsub("\\d+", "", q)
q <- gsub('"', "", q)
q <- gsub('-', " ", q)
q <- gsub('—', " ", q)
q <- gsub(" +", " ", q)
q <- c(sapply(q, function(x) as.vector(unlist(strsplit(x, " ")))))
y <- tolower(q)
y <- levels(as.factor(y))#######################remove
q <- y#######################remove
SYLL <- function(x) {
if (x %in% NETtalk$word) {
NETtalk[which(NETtalk$word %in% x), "syllables"]
} else {
if (substring(x, nchar(x), nchar(x)) == "s" & substring(x,
1, nchar(x) - 1) %in% NETtalk$word) {
NETtalk[which(NETtalk$word %in% substring(x, 1, nchar(x) -
1)), "syllables"]
} else {
m <- gsub("eeing", "XX", x)
m <- gsub("eing", "XX", m)
ended <- function(z) {
if (substring(z, nchar(z) - 1, nchar(z)) == "ed" & substring(z,
nchar(z) - 2, nchar(z) - 2) %in% c("t", "d")) {
z
} else {
if (substring(z, nchar(z) - 1, nchar(z)) == "ed" & !substring(z,
nchar(z) - 2, nchar(z) - 2) %in% c("t", "d")) {
substring(z, 1, nchar(z) - 2)
} else {
z
}
}
}
m <- ended(m)
conely <- function(z) {
if (substring(z, nchar(z) - 2, nchar(z)) == "ely"){
paste(substring(z, 1, nchar(z) - 3), "ly", sep = "")
} else {
z
}
}
m <- conely(m)
conle <- function(z) {
if (substring(z, nchar(z) - 1, nchar(z)) == "le" & !substring(z,
nchar(z) - 2, nchar(z) - 2) %in% c("a", "e", "i",
"o", "u", "y")) {
paste(substring(z, 1, nchar(z) - 1), "X", sep = "")
} else {
if (substring(z, nchar(z) - 1, nchar(z)) == "le" &
substring(z, nchar(z) - 2, nchar(z) - 2) %in% c("a",
"e", "i", "o", "u", "y")) {
substring(z, 1, nchar(z) - 1)
} else {
z
}
}
}
m <- conle(m)
conles <- function(z) {
if (substring(z, nchar(z) - 2, nchar(z)) == "les" &
!substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a",
"e", "i", "o", "u", "y")) {
paste(substring(z, 1, nchar(z) - 2), "X", sep = "")
} else {
if (substring(z, nchar(z) - 2, nchar(z)) == "les" &
substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a",
"e", "i", "o", "u", "y")) {
substring(z, 1, nchar(z) - 2)
} else {
z
}
}
}
m <- conles(m)
magice <- function(z) {
if (substring(z, nchar(z), nchar(z)) == "e" &
length(intersect(unlist(strsplit(z, NULL)),
c("a", "e", "i", "o", "u", "y"))) > 1) {
substring(z, 1, nchar(z) - 1)
} else {
z
}
}
m <- magice(m)
nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", m)))
}
}
}
n <- sapply(y, function(x) SYLL(x))
k <- ifelse(y %in% NETtalk$word, "-", "NF")
DF <- data.frame(words = q, syllables = n, in.dictionary = k)
row.names(DF) <- 1:nrow(DF)
if (algorithm.report == TRUE){
list("ALGORITHM REPORT" = DF[which(DF$in.dictionary == 'NF'), ], "SYLLABLE DATAFRAME" = DF)
} else {
DF
}
}
TEST 3
text3 <- "When yer head gets twisted and yer mind grows numb
When you think you're too old, too young, too smart or too ****
When yer laggin' behind an' losin' yer pace
In a slow-motion crawl of life's busy race
No matter what yer doing if you start givin' up
If the wine don't come to the top of yer cup
If the wind's got you sideways with with one hand holdin' on
And the other starts slipping and the feeling is gone
And yer train engine fire needs a new spark to catch it
And the wood's easy findin' but yer lazy to fetch it
And yer sidewalk starts curlin' and the street gets too long
To be, or not to be--that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles
And by opposing end them. To die, to sleep--
No more--and by a sleep to say we end
The heartache, and the thousand natural shocks
That flesh is heir to. 'Tis a consummation
Devoutly to be wished. To die, to sleep--
To sleep--perchance to dream: ay, there's the rub,"
DF2c<- syllable.count(text3)
DF2c$actual <- DF2c$actual <- c(1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
4L, 1L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)
DF2c$diff<-with(DF2c, syllables-actual)
table(DF2c$diff)
sum(table(with(DF2c, syllables-actual)))
round(prop.table(table(with(DF2c, syllables-actual))), digits=3)*100
DF2c[with(DF2c, which(diff!=0)),]
Yielding:
RAW SCORES (OUTCOME - ACTUAL)
0 1
116 3
n = 119
PERCENTAGES (0 = CORRECT OUTCOME)
0 1
97.5 2.5
MIS-CALCULATED WORDS
52 life's 2 1 NF 1
80 sidewalk 3 2 NF 1
81 sideways 3 2 NF 1
EDIT: I noticed that TalkStats blocks out the word d-u-m-b and replaces it with ****. in text3 line 2 this needs alter to achieve the results I get
trinker
12-21-2011, 12:50 AM
I tested the function using an excerpt from the book MoneyBall freealy available here (LINK) (http://www.myvsp.cn/technology/My%20ebooks33/Moneyball%20-%20The%20Art%20Of%20Winning%20An%20Unfair%20Game.pdf). Using the preface and chapter1, which is about 5212 words, I get
> system.time(DF2d<- syllable.count(text4))
user system elapsed
17.05 0.00 18.82
trinker
12-21-2011, 02:37 AM
I made an addition of 266 common compound words with the first word part ending in silent e to the dictionary. This list was generated from here (LINK) (http://www.morewords.com/ends-with/ed/).
On the Dylan Hamlet test set (test3) I now have a 99.2% accuracy. I'll confirm with one more data set and then move on to writing the readability statistics functions that will utilize these syllable counts (that's the easy part)
RAW SCORES (0 INDICATES PERFECT SYLLABLE ASSIGNMENT)
0 1
118 1
n = 119
PERCENTAGES OF CORRECT OUTCOME (0 IS A MATCH)
0 1
99.2 0.8
WORDS THAT THE ALGORITHM FUNCTION MISSED
words syllables actual in.dictionary diff syllable
51 life's 2 1 NF 1 2
TheEcologist
12-21-2011, 02:52 AM
Trinker, this comment is not going to be of any use for you per se but I would like to thank you for solving a problem I was wondering about.
With this code I can now make
1) word clouds
2) frequency counts of words in manuscripts to make quantify the importance of words to put in the title of a paper.
Cool. I will save the code as trinkersmagic.r.
trinker
12-21-2011, 12:32 PM
The function is consistently accurate in random text selections ranging from 97-100 percent accuracy. The function is a bit slower than I had hoped on data sets exceeding 10,000. I don't have the C capabilities to increase its speed yet. Maybe for another time. Any ideas or input/help here would be appreciated. There may be ways to simply speed up the code in R with better code writing.
I'm currently working to apply the function to transcripts or structured data in data frames and then generate readability statistics around the speech for individuals or groups or the transcript as a whole. A transcript of a 40 minute block of time would rarely exceed 10,0000 words. I have tested the function on a fake manufactured data set of 20,352 words and 5760 observations (sentences). That should far exceed anything encountered in the classroom. It took a bit of system time (see bottom of post) but for now it's acceptable. I'd love to have any feedback about speeding up the code.
#SEED DATA SET
DF<-structure(list(tot = structure(c(1L, 2L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 3L, 4L, 5L, 6L, 7L), .Label = c("1.1", "1.2",
"10.1", "10.2", "11.1", "11.2", "11.3", "2.1", "3.1", "4.1",
"5.1", "6.1", "7.1", "8.1", "9.1"), class = "factor"), person = structure(c(4L,
4L, 1L, 5L, 4L, 1L, 3L, 1L, 4L, 3L, 2L, 2L, 1L, 1L, 1L), .Label = c("greg",
"researcher", "sally", "sam", "teacher"), class = "factor"),
sex = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L,
1L, 1L, 2L, 2L, 2L), .Label = c("f", "m"), class = "factor"),
adult = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L), state = structure(c(5L, 4L, 10L, 14L, 15L, 7L,
6L, 12L, 8L, 13L, 11L, 1L, 9L, 2L, 3L), .Label = c(" Good then.",
" Lets eat.", " You already?", " Not too fun.", "Computer is fun.",
"How can we be certain?", "I am telling the truth!", "I distrust you.",
"Im hungry.", "No its not, its ****.", "Shall we move on?",
"There is no way.", "What are you talking about?", "What should we do?",
"You liar, it stinks!"), class = "factor"), code = structure(c(1L,
1L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 2L, 2L, 3L, 3L, 3L), .Label = c("K1",
"K10", "K11", "K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9"
), class = "factor"), syllable.count = c(5, 3, 5, 4, 5, 6,
6, 4, 4, 7, 4, 2, 3, 2, 4)), .Names = c("tot", "person",
"sex", "adult", "state", "code", "syllable.count"), row.names = c(NA,
15L), class = "data.frame")
#CREATING A MASSIVE DATA SET
DF2 <- data.frame(rbind(DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF, DF))
DF3 <- data.frame(rbind(DF2, DF2, DF2, DF2, DF2, DF2, DF2, DF2, DF2, DF2, DF2, DF2, DF2, DF2, DF2, DF2))
#THE NEW FUNCTION THAT APPLIES THE SYLLABLE FUNCTION BY ROW.
syllable.sum <- function(text){
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
sapply(as.character(text), function(x) sum(syllable.count(Trim(x))$syllables))
}
system.time(DF3$syllable.count <- syllable.sum(DF3$state))
head(DF3, 15)
n = 20352
> system.time(DF3$syllable.count <- syllable.sum(DF3$state))
user system elapsed
377.84 7.68 421.90
bryangoodrich
12-21-2011, 12:41 PM
If it's expected to be used on pieces of limited size, you shouldn't worry too much about speed at this time. That is something for version 2! Instead, focus on the applicability and accuracy, which you seem to have done a **** good job achieving! Seriously, this is impressive work. I still haven't gone over your code, yet. But when I get the time, I'll be glad to take a look at making suggestions for code improvement, and maybe eventually working on making C modules for optimizing any parts of it (I could use the practice lol). The real issue, from what I understand of your process, is that you do a look of looking up values in your table (dictionary (http://en.wikipedia.org/wiki/Associative_array)). This is probably where optimization can be achieved through C. There are a lot of good algorithms for doing sorting and searching. A hash table (http://en.wikipedia.org/wiki/Hash_table)/map might even be appropriate in this case, at least for simply searching out the value you want to return.
Thinking about associative arrays, and I don't know exactly how you set it up in R, would be to use lists. Each element of the list, by index, is the number of syllables. The words within that indexed element have that number of syllables. The question for speed becomes how you search for your word. Of course, you can do a simple
for (i in seq(dictionaryList))
if (myWord %in% dictionaryList[[i]])
return(i)
return(99) # This could be an error code for "not found in dictionary" and handled appropriately
Better, I would think, would be to take a guess about how many syllables the word has, because you don't want to have to search 1, 2, and 3 if the word has 4 syllables. Maybe you do. The question about the rule for this search hinges on the frequency (size) of many-syllable words. If most are 1 or 2, then doing a sequential search make sense, most of the words will be found in the first or second check. On the other hand, maybe most are 2 or 3, and you should search 2, 3, 1, 4, 5, .... That can easily be set up in the search algorithm.
Dason
12-21-2011, 12:44 PM
There are a lot of good algorithms for doing sorting and searching. A hash table (http://en.wikipedia.org/wiki/Hash_table)/map might even be appropriate in this case, at least for simply searching out the value you want to return.
I haven't taken a look at the code but this is a good suggestion. For lookups with small dictionaries it doesn't matter too much what search algorithm you use or how you store your data. But you can see significant speed ups by using a more appropriate data structure and searching algorithm.
bryangoodrich
12-21-2011, 12:52 PM
I haven't taken a look at the code but this is a good suggestion. For lookups with small dictionaries it doesn't matter too much what search algorithm you use or how you store your data. But you can see significant speed ups by using a more appropriate data structure and searching algorithm.
Yeah, I thought about that a bit more and made a suggested in my edited comment above. If we use a list as an appropriate associative array (dictionary) in R, the search is rather simple, and since it is vectorized, the actual code is pretty **** simple ("word %in% vector").
trinker
12-21-2011, 01:00 PM
Couldn't thank your original post again after the edit because I already thanked it so I'll thank this one instead :)
Powered by vBulletin™ Version 4.1.3 Copyright © 2013 vBulletin Solutions, Inc. All rights reserved.