Use R to break a word into syllables

trinker

ggplot2orBust
#1
I moved this portion of a thread from (HERE) as the original problem was solved.

Posting a problem I'm wrestling with. Though it's likely outside your fields, your brains are generally pretty good at applying your knowledge to other areas, so I thought maybe you could help me.

I'm looking to use R for syllabication, i.e., breaking words into their number of syllables (in my field this has all sorts of uses related to the readability of a text, etc.).

Here's a bit of really condensed, not too hard, theory on syllabication that may be useful if you're attempting to help with this problem:
http://allenporter.tumblr.com/post/9776954743/syllables

A suggestion from SO for tackling this problem in R is to use:
Code:
nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", y)))
I tested it out using text from a post on here (the frequent stats misunderstandings thread) and got surprisingly accurate results from this simple code (76% accuracy on n = 74). The major problem with the code above is that it doesn't detect a silent 'e' at the end of a word. If I could add a piece to the code to handle that, it would surely improve the accuracy a lot. Just telling R to ignore ending e's won't work, because words like "people" and "little" actually use the final e in syllabication.
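For instance, the final e inflates the count for some words but is needed for others:

Code:
# the vowel-group counter overcounts words with a silent final e
words <- c("make", "people", "little", "course")
nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", words)))
# 2 2 2 2 ... right for 'people' and 'little', one too many for 'make' and 'course'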

QUESTION: Any ideas on how to use R's regex facilities to find silent e's?

Code:
x <- "I know TheEcologist is working on an FAQ to put up at some point but I didn't think it would be a 
bad idea to compile a list of frequent misunderstandings that we see. I've been seeing quite a few people 
making the mistake of assuming that in a linear model we expect the predictor variables to be normally distributed.
 Or seeing that they expect the response itself to be normally distributed. This is wrong, of course, because we 
don't make any of those assumptions but instead assume that the error term is normally distributed.
So what other misunderstands have you come across quite a bit (either in real life or here at Talk Stats)?"

# remove bracketed text: square [.], round (.), curly {.}, or all three
bracketX <- function(text, bracket = 'all'){
    switch(bracket,
        square = sapply(text, function(x) gsub("\\[.+?\\]", "", x)),
        round  = sapply(text, function(x) gsub("\\(.+?\\)", "", x)),
        curly  = sapply(text, function(x) gsub("\\{.+?\\}", "", x)),
        all    = {P1 <- sapply(text, function(x) gsub("\\[.+?\\]", "", x))
                  P1 <- sapply(P1, function(x) gsub("\\(.+?\\)", "", x))
                  sapply(P1, function(x) gsub("\\{.+?\\}", "", x))})
}


y <- gsub("\n", " ", x)                                  # newlines to spaces
y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))  # strip punctuation (apostrophes kept)
y <- bracketX(y)                                         # drop bracketed text
y <- gsub(" +", " ", y)                                  # collapse multiple spaces
y <- c(sapply(y, function(x) as.vector(unlist(strsplit(x, " ")))))  # split into words
y <- tolower(y)
y <- levels(as.factor(y))                                # unique, sorted words
n <- nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", y)))  # count vowel groups
DF <- data.frame(words = y, syllables = n)
Code:
               words syllables actual
1                  a         1      1
2             across         2      2
3                 an         1      1
4                any         2      2
5             assume         3      2
6           assuming         3      3
7        assumptions         3      3
8                 at         1      1
9                bad         1      1
10                be         1      1
11           because         3      2
12              been         1      1
13               bit         1      1
14               but         1      1
15              come         2      1
16           compile         3      2
17            course         2      1
18            didn't         1      2
19       distributed         4      4
20             don't         1      1
21             error         2      2
22            expect         2      2
23               faq         1     NA
24               few         1      1
25          frequent         2      2
26              have         2      1
27                 i         1      1
28              i've         2      1
29              idea         2      3
30                in         1      1
31           instead         2      2
32                is         1      1
33                it         1      1
34            itself         2      2
35              know         1      1
36            linear         2      3
37              list         1      1
38              make         2      1
39            making         2      2
40           mistake         3      2
41 misunderstandings         5      5
42    misunderstands         4      4
43             model         2      2
44          normally         3      3
45                of         1      1
46                on         1      1
47                or         1      1
48             other         2      2
49            people         2      2
50             point         1      1
51         predictor         3      3
52               put         1      1
53             quite         2      1
54          response         3      2
55               see         1      1
56            seeing         1      2
57                so         1      1
58              some         2      1
59              term         1      1
60              that         1      1
61               the         1      1
62      theecologist         4      4
63              they         1      1
64             think         1      1
65              this         1      1
66             those         2      1
67                to         1      1
68                up         1      1
69         variables         3      4
70                we         1      1
71              what         1      1
72           working         2      2
73             would         1      1
74             wrong         1      1
75               you         1      1

Code:
table(with(DF, syllables - actual))        # 'actual' = hand-counted syllables added to DF
sum(table(with(DF, syllables - actual)))   # n
round(prop.table(table(with(DF, syllables - actual))), digits = 3) * 100
Code:
COUNTS
-1  0  1 
 5 56 13 

n = 74

PERCENTAGES
 -1    0    1 
6.8 75.7 17.6
 

bryangoodrich

Probably A Mammal
#2
Well, I don't know how to identify something as "silent." What conditions would make that true? But for finding an 'e' at the end of a word, you should be thinking "e$" or something like that.
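For example, "e$" anchors the match to the end of the string:

Code:
grepl("e$", c("make", "little", "madden"))
# [1]  TRUE  TRUE FALSE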
 

Dason

Ambassador to the humans
#3
The hard part is that you can't just look for an "e" at the end of a word, because sometimes it is silent and sometimes it isn't. trinker, is there a good way to determine if something is a silent e?
 

trinker

ggplot2orBust
#4
OK, I added a bit about silent e's to the code and have improved the accuracy to 93%. I think I'm going to use a combined approach where I start with a dictionary of known syllable-rule breakers (hopefully fewer than 5,000) and match the words to that list. Any word in that dictionary will be counted using the dictionary's preset syllable values; anything not in the dictionary will be counted using the formula below. I think this will increase accuracy even more, to well above the 95% I was hoping for.
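The lookup-then-fallback idea is something like this (rule.based.count here is just a placeholder for the formula below):

Code:
# dictionary first, formula as fallback
count.syllables <- function(w, dict) {
    hit <- match(w, dict$word)
    if (!is.na(hit)) dict$syllables[hit] else rule.based.count(w)
}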

Code:
y <- gsub("\n", " ", x)                                  # newlines to spaces
y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))  # strip punctuation (apostrophes kept)
y <- bracketX(y)                                         # drop bracketed text
y <- gsub("\\d+", "", y)                                 # drop digits
y <- gsub(" +", " ", y)                                  # collapse multiple spaces
y <- c(sapply(y, function(x) as.vector(unlist(strsplit(x, " ")))))
y <- tolower(y)
y <- levels(as.factor(y))                                # unique words
m <- gsub("eeing", "XX", y)                              # 'seeing' type words: force two syllables
m <- gsub("eing", "XX", m)                               # 'being' type words: force two syllables

# strip a silent final 'e' when the word contains at least one other vowel
magice <- function(z){
    if (substring(z, nchar(z), nchar(z)) == 'e' &
        length(intersect(unlist(strsplit(z, NULL)), c('a','e','i','o','u','y'))) > 1){
        substring(z, 1, nchar(z) - 1)
    } else {
        z
    }
}

m <- sapply(m, magice)

n <- nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", m)))  # count vowel groups
DF <- data.frame(words = y, syllables = n)
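A quick check of what magice does to a few words:

Code:
sapply(c("make", "the", "people"), magice)
# make: "mak"; the: "the"; people: "peopl"
# "make" now counts as 1 syllable, but "people" drops to 1 as well;
# hence its appearance in the misses below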

Code:
table(with(DF, syllables - actual))   # after adding the hand-counted 'actual' column
-1  0 
 5 69 

n = 74

  -1    0 
 6.8 93.2 

       words syllables actual diff
18    didn't         1      2   -1
29      idea         2      3   -1
36    linear         2      3   -1
49    people         1      2   -1
69 variables         3      4   -1
 

trinker

ggplot2orBust
#5
I just saw Dason's and bryangoodrich's posts. I used a convoluted way to find ending e's; I'll fix that in the code. I have two manuals on regular expressions and knew the e$ usage, I just didn't apply it in this situation. I'm still very uncomfortable with regexes, but I'm getting there.

Dason, there really isn't a good way to tell if an e is silent or not, so I'm going to assume it is in the formula. But in my previous post I talked about using a dictionary of known rule breakers first; that should help with that problem. Gotta find the dictionary now. (By the way, I should define the term "dictionary": a dictionary in natural language processing (NLP) is a word list of terms. So I actually have a dictionary of positive terms and a dictionary of negative terms [e.g., love and hate] that I use for sentiment analysis.)
 

trinker

ggplot2orBust
#6
Currently, for punctuation I'm using

Code:
y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
to say all [[:punct:]] minus apostrophes. Is there a way to do this more simply? I tried [[:punt:]]^"'"


EDIT: The caret has to go inside the brackets somehow. Still reading.
 

bryangoodrich

Probably A Mammal
#7
Do syllables follow morphemes or phonemes? I would assume the latter, and not necessarily the former. My thought on using a dictionary is that if you can break common syllables into their smallest component parts, you could create a dictionary to match against using some C module and a quick searching algorithm. You could have a pretty large dictionary that works quickly. But using a rule is always better, I think. Brute-forcing known exceptions to the rule + the rule should be a very good approach. You're making something very valuable for R!
 

bryangoodrich

Probably A Mammal
#8
trinker said:
to say all [[:punct:]] minus apostrophes. Is there a way to do this more simply? I tried [[:punt:]]^"'"
You might use the exclusion grouping (it's prefixed by some symbol; it's stated in the help files). You'll have to play around with how to make it work, but basically you're saying "all punctuation except apostrophes." That's the only thing I can think of.
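Off the top of my head, something like one of these might do it:

Code:
y <- "don't stop, it's (mostly) fine!"

# negate a character class: keep letters, digits, spaces, and apostrophes
gsub("[^[:alnum:][:space:]']", "", y)
# [1] "don't stop it's mostly fine"

# or, with perl = TRUE: any punctuation character that is not an apostrophe
gsub("(?!')[[:punct:]]", "", y, perl = TRUE)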
 

trinker

ggplot2orBust
#9
bryangoodrich said:
Do syllables follow morphemes or phonemes? I would assume the latter, and not necessarily the former.
You're correct, it's more the latter, but in reality not entirely the latter either. English is an untamed beast with lots of rules and just as many broken rules. I have it a bit easier than some with this task because I'm not looking to actually segment the words, just count the syllables; segmenting the words requires even more rules. For the purposes of the package I'm creating, counting syllables is enough. On the surface this seems like a rather simple problem, but it's very complex.
 

bryangoodrich

Probably A Mammal
#10
Of course it seems simple, but that's because understanding a phenomenon in a language we are fluent in is easy! The hard part is making that phenomenon formal, especially because it is not exactly a property of the syntax (grammar) of the language. As a logician, I look at language in terms of morphemes. I was hoping that looking at it in terms of phonemes, with which I'm largely unfamiliar, might prove useful, especially since syllables are a property of the way a word is sounded, not of its meaning.
 

trinker

ggplot2orBust
#11
I've improved the code to 95% accuracy on the original data set using the following code (I found one more rule, 'consonant + le', that improves accuracy). Still looking for an initial dictionary to use. I've inquired about the dictionary at linguistics.stackexchange.com (post located here), a Stack Exchange site dedicated to linguistics. Hopefully I get some information there.

Code:
y <- gsub("\n", " ", x)                                  # newlines to spaces
y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))  # strip punctuation (apostrophes kept)
y <- bracketX(y)                                         # drop bracketed text
y <- gsub("\\d+", "", y)                                 # drop digits
y <- gsub(" +", " ", y)                                  # collapse multiple spaces
y <- c(sapply(y, function(x) as.vector(unlist(strsplit(x, " ")))))
y <- tolower(y)
y <- levels(as.factor(y))                                # unique words
m <- gsub("eeing", "XX", y)                              # 'seeing' type words: force two syllables
m <- gsub("eing", "XX", m)                               # 'being' type words: force two syllables

# consonant + 'le': the 'le' is its own syllable; vowel + 'le': silent e
conle <- function(z){
    if (substring(z, nchar(z) - 1, nchar(z)) == 'le' &
        !substring(z, nchar(z) - 2, nchar(z) - 2) %in% c('a','e','i','o','u','y')){
        paste(substring(z, 1, nchar(z) - 1), 'X', sep = "")
    } else {
        if (substring(z, nchar(z) - 1, nchar(z)) == 'le' &
            substring(z, nchar(z) - 2, nchar(z) - 2) %in% c('a','e','i','o','u','y')){
            substring(z, 1, nchar(z) - 1)
        } else {
            z
        }
    }
}

m <- sapply(m, conle)

# same rule for plural 'les' endings
conles <- function(z){
    if (substring(z, nchar(z) - 2, nchar(z)) == 'les' &
        !substring(z, nchar(z) - 3, nchar(z) - 3) %in% c('a','e','i','o','u','y')){
        paste(substring(z, 1, nchar(z) - 2), 'X', sep = "")
    } else {
        if (substring(z, nchar(z) - 2, nchar(z)) == 'les' &
            substring(z, nchar(z) - 3, nchar(z) - 3) %in% c('a','e','i','o','u','y')){
            substring(z, 1, nchar(z) - 2)
        } else {
            z
        }
    }
}

m <- sapply(m, conles)

# strip a silent final 'e' when the word contains at least one other vowel
magice <- function(z){
    if (substring(z, nchar(z), nchar(z)) == 'e' &
        length(intersect(unlist(strsplit(z, NULL)), c('a','e','i','o','u','y'))) > 1){
        substring(z, 1, nchar(z) - 1)
    } else {
        z
    }
}

m <- sapply(m, magice)

n <- nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", m)))  # count vowel groups
DF <- data.frame(words = y, syllables = n)
Code:
#Raw scores
-1  0 
 4 70 

n = 74

#Percentages (0 means no difference; -1 means the algorithm counted one syllable too few)
  -1    0 
 5.4 94.6 

#Words not counted correctly
       words syllables actual diff
18    didn't         1      2   -1
29      idea         2      3   -1
36    linear         2      3   -1
69 variables         3      4   -1
 

trinker

ggplot2orBust
#12
Alright, I got my dictionary after much searching and paper reading. It's the NETtalk dictionary (LINK), with its usage defined here (LINK). Here's the paper that led me to search for NETtalk (LINK).

My next step is to read the corpus into R, pull out the word list, count the digits in the 3rd column, and make a new data frame with each word and its number of syllables. It says it's tab-separated, so I'll try reading it right into R (haven't done this for real yet in R; don't be scared trink, you got this). This is about 20,000 words, originally derived from Webster's pocket dictionary. I'm guessing that between it and the algorithm this will get me pretty darn accurate.
 

trinker

ggplot2orBust
#13
SUBTITLE: THE DICTIONARY (NECESSARY FOR THE FUNCTION TO WORK)

Might as well take you all along on my learning process. Until today I always saved web resources to a text file and then loaded them in. A while back bryangoodrich showed how easy it is to just load a file from the web. No time like the present to learn. So, following a script he freely posted, I imported the NETtalk file and doctored it up to my liking with:

Code:
j <- 'http://jklp.org/public/projects/lists/talk/corpus'

header <- c("word", "phonemes", "stress.n.structure", "origin")

# read the tab-separated corpus straight from the web
NETtalkraw <- read.delim(url(j), header = FALSE, strip.white = TRUE, sep = "\t",
              col.names = header, na.strings = c("999", "NA", " "))

# drop the structural markers '<' and '>' from the stress column and
# count what remains: one symbol per syllable
NETtalkraw$stress.n.structure2 <- gsub("<", "", NETtalkraw$stress.n.structure)
NETtalkraw$stress.n.structure2 <- gsub(">", "", NETtalkraw$stress.n.structure2)
NETtalkraw$stress.n.structure3 <- nchar(NETtalkraw$stress.n.structure2)
NETtalk <- data.frame(word = NETtalkraw$word, syllables = NETtalkraw$stress.n.structure3)
NETtalk <- NETtalk[which(!duplicated(NETtalk$word)), ]   # remove duplicated words

contractions <- structure(list(word = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 
21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 
34L, 35L, 36L, 37L, 38L, 39L, 40L, 43L, 41L, 42L, 44L, 45L, 46L, 
47L, 48L, 49L, 50L), .Label = c("aren't", "can't", "couldn't", 
"didn't", "doesn't", "don't", "hadn't", "hasn't", "haven't", 
"he'd", "he'll", "he's", "I'd", "I'll", "I'm", "I've", "isn't", 
"let's", "mightn't", "mustn't", "shan't", "she'd", "she'll", 
"she's", "shouldn't", "that's", "there's", "they'd", "they'll", 
"they're", "they've", "we'd", "we're", "we've", "weren't", "what'll", 
"what're", "what's", "what've", "where's", "who'll", "who're", 
"who's", "who've", "won't", "wouldn't", "you'd", "you'll", "you're", 
"you've"), class = "factor"), syllables = c(1L, 1L, 2L, 2L, 2L, 
1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L)), .Names = c("word", 
"syllables"), class = "data.frame", row.names = c(NA, -50L))


NETtalk<-data.frame(rbind(NETtalk, contractions))
NETtalk[12503, 2] <- 1  #an error in the word pace

compound <- structure(list(word = c("backfire", "baseboard", "cheesecloth", 
"daredevil", "eavesdrop", "elsewhere", "eyeball", "eyebrow", 
"eyeglass", "eyelash", "eyelet", "eyelid", "eyepiece", "eyesight", 
"eyesore", "eyetooth", "eyewitness", "figurehead", "fireball", 
"firebrand", "firebug", "firedamp", "firefly", "firepower", "fireplug", 
"fireproof", "fireside", "firetrap", "firewood", "firework", 
"flameout", "forecastle", "forefinger", "forefoot", "forefront", 
"foregone", "forehand", "foreleg", "forelimb", "forelock", "foremast", 
"foremost", "forenoon", "forerunner", "foresail", "foreshadow", 
"foreshore", "foreword", "framework", "gamesome", "gatecrasher", 
"gatekeeper", "gateway", "globetrotter", "grapefruit", "grapeshot", 
"grapevine", "guidebook", "guideline", "guidepost", "harebrained", 
"harelip", "hedgehog", "hedgehop", "henceforth", "homebody", 
"homecoming", "homeland", "homemaker", "homespun", "homestead", 
"homestretch", "homeward", "homework", "horseback", "horseflesh", 
"horsefly", "horseflies", "horsehair", "horsehide", "horselaugh", 
"horseman", "horsemanship", "horseplay", "horsepower", "horseradish", 
"housebroke", "housebreaking", "housefly", "houseflies", "housekeeper", 
"housewarming", "iceberg", "icebound", "iceboxes", "iceland", 
"lifeblood", "lifeguard", "lifetime", "lifework", "likewise", 
"limelight", "lovebird", "lovelorn", "lovesick", "milestone", 
"moleskin", "namesake", "notebook", "pacesetter", "peacetime", 
"policeman", "racehorse", "racetrack", "safeguard", "safekeeping", 
"sagebrush", "shakedown", "shakeup", "shamefaced", "sharecrop", 
"sharecropper", "shareholder", "shoreline", "sideboard", "sideburns", 
"sidecar", "sidekick", "sideline", "sidelong", "sidesaddle", 
"sidestep", "sideswipe", "sidetrack", "sidewalk", "sideways", 
"smokehouse", "smokestack", "someday", "somehow", "someone", 
"someplace", "sometime", "sometimes", "someway", "somewhat", 
"somewhere", "spacecraft", "spaceman", "stagecoach", "stateside", 
"storehouse", "storekeeper", "stovepipe", "takeoff", "talebearer", 
"theretofore", "therewith", "tideland", "tidewater", "timekeeper", 
"timeworn", "trademark", "typeface", "typesetter", "typewrite", 
"typewriter", "warehouse", "wasteland", "waveform", "wavelength", 
"whaleboat", "whalebone", "whereby", "wherefrom", "whitewash", 
"wholehearted", "wiretap", "wisecrack"), syllables = c(2L, 2L, 
2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 
3L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 
2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 
2L)), .Names = c("word", "syllables"), row.names = c(NA, 179L
), class = "data.frame")

NETtalk<-data.frame(rbind(NETtalk, compound))
qview(NETtalk)
This yields the following data frame (head shown):
Code:
======================================================================== 
 n =  20031           # of vars =  2             NETtalk 
======================================================================== 
      word syllables
1 aardvark         2
2    aback         2
3   abacus         3
4    abaft         2
5  abalone         4
6  abandon         3
Tomorrow's task is to incorporate the dictionary and algorithm together as a working function. 99% accuracy would be nice.

EDIT: ADDED A LINE OF CODE TO REMOVE DUPLICATED WORDS
EDIT2: ADDED CONTRACTIONS TO THE DICTIONARY BECAUSE OF THEIR IRREGULAR NATURE
EDIT3: ADDED COMMON COMPOUND WORDS WITH FIRST WORD PIECE ENDING IN E
 

trinker

ggplot2orBust
#14
All right, I'm at 100% accuracy on my original data set. I'm going to gather a new data set and retest. I think I'm at the accuracy level I was hoping for. I may make some tweaks later, but I'm pretty happy with where it's at now.

Code:
syllable.count <- function(text, remove.bracketed = TRUE, algorithm.report = FALSE) {
    q <- gsub("\n", " ", text)
    q <- if (remove.bracketed == TRUE) {
        bracketX(q) 
    } else {
        q
    }
    
    q <- gsub("[\\,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", 
        "", as.character(q))
    q <- gsub("[\\(\\)\\[\\]\\{\\}]", "", q)
    q <- gsub("\\d+", "", q)
    q <- gsub(" +", " ", q)
    q <- c(sapply(q, function(x) as.vector(unlist(strsplit(x, " ")))))
    y <- tolower(q)
    y <- levels(as.factor(y))#######################remove (for testing purposes only)
    q <- y#######################remove (for testing purposes only)
    
    SYLL <- function(x) {
        if (x %in% NETtalk$word) {
            NETtalk[which(NETtalk$word %in% x), "syllables"]
        } else {
            if (substring(x, nchar(x), nchar(x)) == "s" & substring(x, 
                1, nchar(x) - 1) %in% NETtalk$word) {
                NETtalk[which(NETtalk$word %in% substring(x, 1, nchar(x) - 
                  1)), "syllables"]
            } else {
                m <- gsub("eeing", "XX", x)
                m <- gsub("eing", "XX", m)
                
                conle <- function(z) {
                  if (substring(z, nchar(z) - 1, nchar(z)) == "le" & !substring(z, 
                    nchar(z) - 2, nchar(z) - 2) %in% c("a", "e", "i", 
                    "o", "u", "y")) {
                    paste(substring(z, 1, nchar(z) - 1), "X", sep = "")
                  } else {
                    if (substring(z, nchar(z) - 1, nchar(z)) == "le" & 
                      substring(z, nchar(z) - 2, nchar(z) - 2) %in% c("a", 
                        "e", "i", "o", "u", "y")) {
                      substring(z, 1, nchar(z) - 1)
                    } else {
                      z
                    }
                  }
                }
                
                m <- conle(m)
                
                conles <- function(z) {
                  if (substring(z, nchar(z) - 2, nchar(z)) == "les" & 
                    !substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a", 
                      "e", "i", "o", "u", "y")) {
                    paste(substring(z, 1, nchar(z) - 2), "X", sep = "")
                  } else {
                    if (substring(z, nchar(z) - 2, nchar(z)) == "les" & 
                      substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a", 
                        "e", "i", "o", "u", "y")) {
                      substring(z, 1, nchar(z) - 2)
                    } else {
                      z
                    }
                  }
                }

                m <- conles(m)
                
                magice <- function(z) {
                  if (substring(z, nchar(z), nchar(z)) == "e" & 
                    length(intersect(unlist(strsplit(z, NULL)), 
                    c("a", "e", "i", "o", "u", "y"))) > 1) {
                    substring(z, 1, nchar(z) - 1)
                  } else {
                    z
                  }
                }
                
                 m <- magice(m)
                
                nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", m)))
            }
        }
    }  

    n <- sapply(y, function(x) SYLL(x))
    k <- ifelse(y %in% NETtalk$word, "-", "NF")
    DF <- data.frame(words = q, syllables = n, in.dictionary = k)
    row.names(DF) <- 1:nrow(DF)
    if (algorithm.report == TRUE){
      list("ALGORITHM REPORT" = DF[which(DF$in.dictionary == 'NF'), ], "SYLLABLE DATAFRAME" = DF)
    } else {
      DF
    }
}
Code:
syllable.count(text)
Thanks for following along, and thanks for the suggestions and idea bouncing. ;)
 

Lazar

Phineas Packard
#15
Excellent work. If you change the line load("NETtalk.RData") so it reads from a URL, some of us could also test it out.

I think this is great work.
 

trinker

ggplot2orBust
#16
Lazar, thanks for the feedback.

I eliminated the line you are referring to (it will be in the final version, where the NETtalk dataset will be available). It is necessary, though, to get and create the dictionary modified from the NETtalk data set. To get this dictionary, copy and paste the code from the post in this thread subtitled: SUBTITLE: THE DICTIONARY (NECESSARY FOR THE FUNCTION TO WORK)

Also note that you need to have the bracketX function (this could be inserted into the body of the function):

Code:
bracketX <- function(text, bracket='all'){
    switch(bracket,
        square=sapply(text, function(x)gsub("\\[.+?\\]", "", x)),
        round=sapply(text, function(x)gsub("\\(.+?\\)", "", x)),
        curly=sapply(text, function(x)gsub("\\{.+?\\}", "", x)),
        all={P1<-sapply(text, function(x)gsub("\\[.+?\\]", "", x))
             P1<-sapply(P1, function(x)gsub("\\(.+?\\)", "", x))
             sapply(P1, function(x)gsub("\\{.+?\\}", "", x))})
}
 

bryangoodrich

Probably A Mammal
#17
Impressive! Test, test, and retest this on different data sets to find out your accuracy. Keep testing until you've tested it on every word in the English language!!! hahaha

Might I add, once you're comfortable with this thing, keep the dictionary data set with it and try to give a presentation on how awesome you are. This way you can cement your name to this work, as I'm sure it will be a useful tool for your developing package, and there's no reason not to impress your colleagues or potential users.

Oh, and I'm curious to know: how quick is this function on large texts? I wonder how efficient it is.
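One thing that might help if it's slow (a sketch, untested against your data): the function scans the dictionary with %in% once per word inside sapply; a single vectorized match over all the words should cut that down.

Code:
# one match() call instead of one dictionary scan per word
idx <- match(y, NETtalk$word)      # NA where the word isn't in the dictionary
n   <- NETtalk$syllables[idx]
# words with is.na(n) would then fall through to the rule-based algorithm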
 

trinker

ggplot2orBust
#18
I know this is not 100% accurate. It doesn't work on some compound words with the first word ending in e, such as 'bikestand', which isn't in the dictionary; my algorithm would say it is 3 syllables when in reality it's 2. I have three options: (1) don't bother and accept this; (2) add a list of compound words to the dictionary; (3) add lines to the algorithm that make compound words from the dictionary.

I may use an approach something like this in the algorithm to expand the dictionary:


Code:
miniNETtalk <- data.frame(word = c('dog', 'cat', 'pony', 'cracker', 'shoe',
               'Popsicle', 'pronunciation'), syllables = c(1, 1, 2, 2, 1, 3, 5))

# every pairwise concatenation of words, with summed syllable counts
data.frame(
    word = as.vector(outer(miniNETtalk$word, miniNETtalk$word,
        FUN = function(x, y) paste(x, y, sep = ""))),
    syllables = as.vector(outer(miniNETtalk$syllables, miniNETtalk$syllables,
        FUN = function(x, y) x + y)))
I've got to think this over first. I'm not sure of the ramifications of doing this computationally: there are currently about 20,000 words in the dictionary, so pairing every word with every word gives roughly 20,000^2 = 400 million candidate compounds. I doubt R could handle this, and it'd slow the process down considerably even if it did.

Perhaps there's a way to match a compound word to two words in the list beyond my proposal, but I haven't thought of it yet.

So I think I may just add a list of common compound words to the dictionary instead.

Thoughts?
 

bryangoodrich

Probably A Mammal
#19
(1) Learn a little C and make a C module do some of the work: always efficient.

(2) Check out partial (fuzzy) matches so that you can try to match compound words (e.g., bikestand matches 'bike' and 'stand'); see the sketch after this list.

(3) Let it simply be part of the error that your process fails to pick up on. If what does work still covers 99% of the words in the English language, you're gonna have a **** good success rate!
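On (2), even without true fuzzy matching, a simple exact-split check might do (just a sketch; assumes the NETtalk dictionary built earlier):

Code:
# try to read an unknown word as two dictionary words and sum their syllables
compound.syllables <- function(word, dict) {
    if (nchar(word) < 6) return(NA)            # require pieces of >= 3 letters
    for (i in 3:(nchar(word) - 3)) {
        left  <- substring(word, 1, i)
        right <- substring(word, i + 1, nchar(word))
        if (left %in% dict$word & right %in% dict$word) {
            return(dict$syllables[match(left, dict$word)] +
                   dict$syllables[match(right, dict$word)])
        }
    }
    NA                                         # no split found; use the rule-based count
}

# compound.syllables("bikestand", NETtalk)  # 'bike' + 'stand' -> 2, if both are listed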

Seriously, though, "stand" seems like 2 syllables to me due to the hard 'd' at the end, but I'm no expert!
 

trinker

ggplot2orBust
#20
Again, thanks for the feedback.

That's probably a colloquialism in how you pronounce the word stand, another wrench in the process to consider, but dictionary.com says it's one syllable (LINK).

For right now C is out of the question, as I've got too many things on my plate. I was just asked today to run some stats and co-author on a piece; that's a big chunk of my Christmas break. It's on my to-do list, but it's eventual, not imminent.