+ Reply to Thread
Page 1 of 2 1 2 LastLast
Results 1 to 15 of 30

Thread: Use R to break a word into syllables

  1. #1
    FormerlyKnownAsRaptor
    Points: 25,003, Level: 95
    Level completed: 66%, Points required for next Level: 347
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,218
    Thanks
    919
    Thanked 562 Times in 509 Posts

    Use R to break a word into syllables



    I moved this portion of a thread from (HERE) as the original problem was solved.

    Posting a problem I'm wrestling with though it's likely outside your fields your brains are generally pretty good at applying your knowledge to other fields. Thought maybe you could help me.

    I'm looking to use R for Syllabication or break words into number of syllables (In my field this has all sorts of uses related to readability of a text etc).

    Here's a bit of really condensed, not too hard, theory on syllabication that may be useful if you're attempting to help with this problem:
    http://allenporter.tumblr.com/post/9776954743/syllables

    A suggestion for R from SO to tackle this problem is to use:
    Code: 
    nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", y)))
    I tested it out using text from a post on here (The frequent stats misunderstandings thread) and got surprisingly accurate responses from the simple code(76% accuracy on n =74). The major problem with the above code is it doesn't detect silent 'e' at the end of a word. If I could add a piece to the code it would surely improve the accuracy a lot. Just telling R to not look for ending e's won't work because the words "people" and "little" for instance actually uses the e to for syllabication.

    QUESTION: Any ideas on how to use R's reg ex to find silent e's.

    Code: 
    x <- "I know TheEcologist is working on an FAQ to put up at some point but I didn't think it would be a 
    bad idea to compile a list of frequent misunderstandings that we see. I've been seeing quite a few people 
    making the mistake of assuming that in a linear model we expect the predictor variables to be normally distributed.
     Or seeing that they expect the response itself to be normally distributed. This is wrong, of course, because we 
    don't make any of those assumptions but instead assume that the error term is normally distributed.
    So what other misunderstands have you come across quite a bit (either in real life or here at Talk Stats)?"
    
    bracketX <- function(text, bracket='all'){
        switch(bracket,
            square=sapply(text, function(x)gsub("\\[.+?\\]", "", x)),
            round=sapply(text, function(x)gsub("\\(.+?\\)", "", x)),
            curly=sapply(text, function(x)gsub("\\{.+?\\}", "", x)),
            all={P1<-sapply(text, function(x)gsub("\\[.+?\\]", "", x))
                 P1<-sapply(P1, function(x)gsub("\\(.+?\\)", "", x))
                 sapply(P1, function(x)gsub("\\{.+?\\}", "", x))})
    }
    
    
    y <- gsub("\n", " ", x)
    y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
    y <- bracketX(y)
    y <- gsub(" +", " ", y)
    y <- c(sapply(y, function(x)as.vector(unlist(strsplit(x, " ")))))
    y <- tolower(y)
    y <- levels(as.factor(y))
    n <- nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", y))) 
    DF <- data.frame(words=y, syllables=n)
    Code: 
                   words syllables actual
    1                  a         1      1
    2             across         2      2
    3                 an         1      1
    4                any         2      2
    5             assume         3      2
    6           assuming         3      3
    7        assumptions         3      3
    8                 at         1      1
    9                bad         1      1
    10                be         1      1
    11           because         3      2
    12              been         1      1
    13               bit         1      1
    14               but         1      1
    15              come         2      1
    16           compile         3      2
    17            course         2      1
    18            didn't         1      2
    19       distributed         4      4
    20             don't         1      1
    21             error         2      2
    22            expect         2      2
    23               faq         1     NA
    24               few         1      1
    25          frequent         2      2
    26              have         2      1
    27                 i         1      1
    28              i've         2      1
    29              idea         2      3
    30                in         1      1
    31           instead         2      2
    32                is         1      1
    33                it         1      1
    34            itself         2      2
    35              know         1      1
    36            linear         2      3
    37              list         1      1
    38              make         2      1
    39            making         2      2
    40           mistake         3      2
    41 misunderstandings         5      5
    42    misunderstands         4      4
    43             model         2      2
    44          normally         3      3
    45                of         1      1
    46                on         1      1
    47                or         1      1
    48             other         2      2
    49            people         2      2
    50             point         1      1
    51         predictor         3      3
    52               put         1      1
    53             quite         2      1
    54          response         3      2
    55               see         1      1
    56            seeing         1      2
    57                so         1      1
    58              some         2      1
    59              term         1      1
    60              that         1      1
    61               the         1      1
    62      theecologist         4      4
    63              they         1      1
    64             think         1      1
    65              this         1      1
    66             those         2      1
    67                to         1      1
    68                up         1      1
    69         variables         3      4
    70                we         1      1
    71              what         1      1
    72           working         2      2
    73             would         1      1
    74             wrong         1      1
    75               you         1      1

    Code: 
    table(with(DF, syllables-actual))
    sum(table(with(DF, syllables-actual)))
    round(prop.table(table(with(DF, syllables-actual))), digits=3)*100
    Code: 
    COUNTS
    -1  0  1 
     5 56 13 
    
    n = 74
    
    PERCENTAGES
     -1    0    1 
    6.8 75.7 17.6
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  2. The Following User Says Thank You to trinker For This Useful Post:

    TheEcologist (12-21-2011)

  3. #2
    Probably A Mammal
    Points: 14,712, Level: 78
    Level completed: 66%, Points required for next Level: 138
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,963
    Thanks
    223
    Thanked 422 Times in 389 Posts

    Re: Use R to break a word into syllables

    Well, I don't know how to identify something as "silent." What conditions would make that true? But looking for 'e' at the end of the sentence, you should be thinking "e$" or something like that.

  4. The Following User Says Thank You to bryangoodrich For This Useful Post:

    trinker (12-19-2011)

  5. #3
    RotParaTon
    Points: 47,119, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Discussion EnderPosting AwardCommunity AwardMaster TaggerFrequent Poster
    Dason's Avatar
    Location
    Ames, IA
    Posts
    9,185
    Thanks
    212
    Thanked 1,640 Times in 1,401 Posts

    Re: Use R to break a word into syllables

    The hard part is that you can't just look for an "e" at the end of a word because sometimes it is silent and sometimes it isn't. trinker is there a good way to determine if something is a silent e?

  6. #4
    FormerlyKnownAsRaptor
    Points: 25,003, Level: 95
    Level completed: 66%, Points required for next Level: 347
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,218
    Thanks
    919
    Thanked 562 Times in 509 Posts

    Re: Use R to break a word into syllables

    OK I added a bit about silent e's to the code and have improved the accuracy to 93%. I think I'm going to a combined approach where I start with a dictionary of known syllable rule breakers (hopefully less than 5000). match the words to that list. Any words in that dictionary will be calculated using the dictionary with pre set syllable values. Anything not in the dictionary will be calculated using the formula below. I think this will increase accuracy even more to well above the 95% accuracy I was hoping for.

    Code: 
    y <- gsub("\n", " ", x)
    y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
    y <- bracketX(y)
    y <- gsub("\\d+", "", y) 
    y <- gsub(" +", " ", y)
    y <- c(sapply(y, function(x)as.vector(unlist(strsplit(x, " ")))))
    y <- tolower(y)
    y <- levels(as.factor(y))
    m <- gsub( "eeing", "XX", y)
    m <- gsub( "eing", "XX", m)
    
    magice<-function(z){
    if(substring(z,nchar(z), nchar(z))=='e' & 
    length(intersect(unlist(strsplit(z, NULL)), c('a','e','i','o','u','y')))>1){
    substring(z,1, nchar(z)-1)
    }else{
    z
    }
    }
    
    m <- sapply(m, magice)
    
    n <- nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", m))) 
    DF <- data.frame(words=y, syllables=n)

    Code: 
    table(with(DF2, syllables-actual))
    -1  0 
     5 69 
    
    n = 74
    
      -1    0 
     6.8 93.2 
    
           words syllables actual diff
    18    didn't         1      2   -1
    29      idea         2      3   -1
    36    linear         2      3   -1
    49    people         1      2   -1
    69 variables         3      4   -1
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  7. #5
    FormerlyKnownAsRaptor
    Points: 25,003, Level: 95
    Level completed: 66%, Points required for next Level: 347
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,218
    Thanks
    919
    Thanked 562 Times in 509 Posts

    Re: Use R to break a word into syllables

    I just saw Dason and bryangoodrich's post. I used a convoluted way to find ending e's. I'll fix that in the code. I have two manuals on regular expression, knew the e$ use and just didn't apply it in this situation. I'm still very uncomfortable with regexs but I'm getting there.

    Dason there really isn't a good way to tell if an e is silent or not so I'm going to assume it is in the formula. But in my previous post I talk about using a dictionary of known rule breakers first. That should help with that problem. Gotta find the dictionary now (By the way I should define that term "dictionary"; a dictionary in natural language processing (NLP) is a word list of terms. So I actually have a dictionary of positive and a dictionary of negative terms [ie love and hate] that I use for sentiment analysis)
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  8. #6
    FormerlyKnownAsRaptor
    Points: 25,003, Level: 95
    Level completed: 66%, Points required for next Level: 347
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,218
    Thanks
    919
    Thanked 562 Times in 509 Posts

    Re: Use R to break a word into syllables

    Currently, for punctuation I'm using

    Code: 
    y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
    to say all [[:punct:]] minus apostrophes. Is there a way to do this more simply? I tried [[:punt:]]^"'"


    EDIT: THE carrot has to go inside the brackets somehow. Still reading
    Last edited by trinker; 12-19-2011 at 11:45 AM. Reason: Added info
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  9. #7
    Probably A Mammal
    Points: 14,712, Level: 78
    Level completed: 66%, Points required for next Level: 138
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,963
    Thanks
    223
    Thanked 422 Times in 389 Posts

    Re: Use R to break a word into syllables

    Do syllables follow morphemes or phonemes? I would assume the latter, and not necessarily the former. My thought on using a dictionary is that if you can break common syllables into their smallest component parts, you could create a dictionary to match those using some C module and a quick searching algorithm. You could have a pretty large dictionary that works quickly. But using a rule is always better, I think. Brute forcing known exceptions to the rule + the rule should be a very good approach. You're making something very valuable for R!

  10. The Following User Says Thank You to bryangoodrich For This Useful Post:

    trinker (12-19-2011)

  11. #8
    Probably A Mammal
    Points: 14,712, Level: 78
    Level completed: 66%, Points required for next Level: 138
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,963
    Thanks
    223
    Thanked 422 Times in 389 Posts

    Re: Use R to break a word into syllables

    Quote Originally Posted by trinker View Post
    to say all [[:punct:]] minus apostrophes. Is there a way to do this more simply? I tried [[:punt:]]^"'"
    You might use the exclusion grouping (it's prefixed by some symbol; it's stated in the help files). You'll have to play around with how to make it work, but basically you're saying "all punctuation except apostrophes." That's the only thing I can think of it.

  12. #9
    FormerlyKnownAsRaptor
    Points: 25,003, Level: 95
    Level completed: 66%, Points required for next Level: 347
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,218
    Thanks
    919
    Thanked 562 Times in 509 Posts

    Re: Use R to break a word into syllables

    Quote Originally Posted by bryangoodrich
    Do syllables follow morphemes or phonemes? I would assume the latter, and not necessarily the former.
    You're correct more so the latter but in reality not so much the latter either. English is an untamed beast with lots of rules and just as many broken rules. I have it a bit easier than some with this task because I'm not looking to actually segment the words just count the syllables. Segmenting the words requires even more rules. For the purposes of the package I'm creating the counting of syllables is enough. On the surface this seems like a rather simple problem but it's very complex.
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  13. #10
    Probably A Mammal
    Points: 14,712, Level: 78
    Level completed: 66%, Points required for next Level: 138
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,963
    Thanks
    223
    Thanked 422 Times in 389 Posts

    Re: Use R to break a word into syllables

    Of course it seems simple, but that's because understanding a phenomena in a language we are fluent with is easy! The hard part is making that phenomena formal, and because that phenomena is not exactly a property of the syntax (grammar) of the language. As a logician, I look at the language in terms of morphemes. I was hoping that maybe there was a way of looking at it in terms of phonemes, of which I'm largely unfamiliar, might prove useful, especially since syllables are a property of the way a word is sounded, not with its meaning.

  14. #11
    FormerlyKnownAsRaptor
    Points: 25,003, Level: 95
    Level completed: 66%, Points required for next Level: 347
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,218
    Thanks
    919
    Thanked 562 Times in 509 Posts

    Re: Use R to break a word into syllables

    I've improved the code to within 95% accuracy on the original data set using the following code (found one more rule with 'consonant + le' that improves accuracy). Still looking for an initial dictionary to use. I've inquired at linguistics.stackexchange.com (post located here) a stack site dedicated to linguistics about the dictionary. Hopefully I get some information there.

    Code: 
    y <- gsub("\n", " ", x)
    y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
    y <- bracketX(y)
    y <- gsub("\\d+", "", y)  
    y <- gsub(" +", " ", y)
    y <- c(sapply(y, function(x)as.vector(unlist(strsplit(x, " ")))))
    y <- tolower(y)
    y <- levels(as.factor(y))
    m <- gsub( "eeing", "XX", y)
    m <- gsub( "eing", "XX", m)
    
    conle<-function(z){
    if(substring(z,nchar(z)-1, nchar(z))=='le' & 
    !substring(z,nchar(z)-2, nchar(z)-2)%in%c('a','e','i','o','u','y')){
    paste(substring(z,1, nchar(z)-1), 'X', sep="")
    }else{
    if(substring(z,nchar(z)-1, nchar(z))=='le' & 
    substring(z,nchar(z)-2, nchar(z)-2)%in%c('a','e','i','o','u','y')){
    substring(z,1, nchar(z)-1)
    }else{
    z
    }
    }
    }
    
    m <- sapply(m, conle)
    
    conles<-function(z){
    if(substring(z,nchar(z)-2, nchar(z))=='les' & 
    !substring(z,nchar(z)-3, nchar(z)-3)%in%c('a','e','i','o','u','y')){
    paste(substring(z,1, nchar(z)-2), 'X', sep="")
    }else{
    if(substring(z,nchar(z)-2, nchar(z))=='les' & 
    substring(z,nchar(z)-3, nchar(z)-3)%in%c('a','e','i','o','u','y')){
    substring(z,1, nchar(z)-2)
    }else{
    z
    }
    }
    }
    m <- sapply(m, conles)
    
    magice<-function(z){
    if(substring(z,nchar(z), nchar(z))=='e' & 
    length(intersect(unlist(strsplit(z, NULL)), c('a','e','i','o','u','y')))>1){
    substring(z,1, nchar(z)-1)
    }else{
    z
    }
    }
    
    m <- sapply(m, magice)
    
    n <- nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", m))) 
    DF <- data.frame(words=y, syllables=n)
    Code: 
    #Raw scores
    -1  0 
     4 70 
    
    n = 74
    
    #Percentages correct (0 means difference -1 means the algorithm calculated 1 too few syllables)
      -1    0 
     5.4 94.6 
    
    #Words not counted correctly
           words syllables actual diff
    18    didn't         1      2   -1
    29      idea         2      3   -1
    36    linear         2      3   -1
    69 variables         3      4   -1
    EDIT: Sorry for the lack of indentation. I'll do it on a future post when the function is more near complete.
    Last edited by trinker; 12-19-2011 at 07:26 PM. Reason: Apologize for lack of indent
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  15. The Following User Says Thank You to trinker For This Useful Post:

    Lazar (12-19-2011)

  16. #12
    FormerlyKnownAsRaptor
    Points: 25,003, Level: 95
    Level completed: 66%, Points required for next Level: 347
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,218
    Thanks
    919
    Thanked 562 Times in 509 Posts

    Re: Use R to break a word into syllables

    Alright I got my dictionary after much searching and paper reading. It's the NETtalk dictionary (LINK) and the usage defined here (LINK). Here's the paper that lead me to search for NETtalk (LINK)

    My next step is to read the code into R pull out the word list, count the digits in the 3rd column and make a new data frame with word and number of syllables. It says it's tab separated so I'll try reading it right into R (haven't done this for real yet in R; don't be scared trink you got this). This about 20,000 words first derived from Webster's pocket dictionary. I'm guessing this will get me pretty darn accurate between it and the algorithm.
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  17. #13
    FormerlyKnownAsRaptor
    Points: 25,003, Level: 95
    Level completed: 66%, Points required for next Level: 347
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,218
    Thanks
    919
    Thanked 562 Times in 509 Posts

    Re: Use R to break a word into syllables

    SUBTITLE: THE DICTIONARY (NECESSARY FOR THE FUNCTION TO WORK)

    Might as well take you all along on my learning process. Until today I always saved web resources to a text file and then loaded them in. A while back bryangoodrich showed how easy it was to just load in a file from the web. No time to learn like the present. So following a script he freely posted I imported the NETtalk file in and doctored it up to my liking with:

    Code: 
    j <- 'http://jklp.org/public/projects/lists/talk/corpus'
    
    header   <- c("word", "phonemes", "stress.n.structure", "origin")
    
    NETtalkraw <- read.delim(url(j), header=FALSE, strip.white = TRUE, sep="\t",
                  col.names=header, na.strings= c("999","NA"," "))
    
    
    NETtalkraw$stress.n.structure2 <- gsub("<", "", NETtalkraw$stress.n.structure)
    NETtalkraw$stress.n.structure2 <- gsub(">", "", NETtalkraw$stress.n.structure2)
    NETtalkraw$stress.n.structure3<-nchar(NETtalkraw$stress.n.structure2)
    NETtalk <- data.frame(word=NETtalkraw$word, syllables=NETtalkraw$stress.n.structure3)
    NETtalk<-NETtalk[which(!duplicated(NETtalk$word)), ]
    
    contractions <- structure(list(word = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
    8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 
    21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 
    34L, 35L, 36L, 37L, 38L, 39L, 40L, 43L, 41L, 42L, 44L, 45L, 46L, 
    47L, 48L, 49L, 50L), .Label = c("aren't", "can't", "couldn't", 
    "didn't", "doesn't", "don't", "hadn't", "hasn't", "haven't", 
    "he'd", "he'll", "he's", "I'd", "I'll", "I'm", "I've", "isn't", 
    "let's", "mightn't", "mustn't", "shan't", "she'd", "she'll", 
    "she's", "shouldn't", "that's", "there's", "they'd", "they'll", 
    "they're", "they've", "we'd", "we're", "we've", "weren't", "what'll", 
    "what're", "what's", "what've", "where's", "who'll", "who're", 
    "who's", "who've", "won't", "wouldn't", "you'd", "you'll", "you're", 
    "you've"), class = "factor"), syllables = c(1L, 1L, 2L, 2L, 2L, 
    1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 
    1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L)), .Names = c("word", 
    "syllables"), class = "data.frame", row.names = c(NA, -50L))
    
    
    NETtalk<-data.frame(rbind(NETtalk, contractions))
    NETtalk[12503, 2] <- 1  #an error in the word pace
    
    compound <- structure(list(word = c("backfire", "baseboard", "cheesecloth", 
    "daredevil", "eavesdrop", "elsewhere", "eyeball", "eyebrow", 
    "eyeglass", "eyelash", "eyelet", "eyelid", "eyepiece", "eyesight", 
    "eyesore", "eyetooth", "eyewitness", "figurehead", "fireball", 
    "firebrand", "firebug", "firedamp", "firefly", "firepower", "fireplug", 
    "fireproof", "fireside", "firetrap", "firewood", "firework", 
    "flameout", "forecastle", "forefinger", "forefoot", "forefront", 
    "foregone", "forehand", "foreleg", "forelimb", "forelock", "foremast", 
    "foremost", "forenoon", "forerunner", "foresail", "foreshadow", 
    "foreshore", "foreword", "framework", "gamesome", "gatecrasher", 
    "gatekeeper", "gateway", "globetrotter", "grapefruit", "grapeshot", 
    "grapevine", "guidebook", "guideline", "guidepost", "harebrained", 
    "harelip", "hedgehog", "hedgehop", "henceforth", "homebody", 
    "homecoming", "homeland", "homemaker", "homespun", "homestead", 
    "homestretch", "homeward", "homework", "horseback", "horseflesh", 
    "horsefly", "horseflies", "horsehair", "horsehide", "horselaugh", 
    "horseman", "horsemanship", "horseplay", "horsepower", "horseradish", 
    "housebroke", "housebreaking", "housefly", "houseflies", "housekeeper", 
    "housewarming", "iceberg", "icebound", "iceboxes", "iceland", 
    "lifeblood", "lifeguard", "lifetime", "lifework", "likewise", 
    "limelight", "lovebird", "lovelorn", "lovesick", "milestone", 
    "moleskin", "namesake", "notebook", "pacesetter", "peacetime", 
    "policeman", "racehorse", "racetrack", "safeguard", "safekeeping", 
    "sagebrush", "shakedown", "shakeup", "shamefaced", "sharecrop", 
    "sharecropper", "shareholder", "shoreline", "sideboard", "sideburns", 
    "sidecar", "sidekick", "sideline", "sidelong", "sidesaddle", 
    "sidestep", "sideswipe", "sidetrack", "sidewalk", "sideways", 
    "smokehouse", "smokestack", "someday", "somehow", "someone", 
    "someplace", "sometime", "sometimes", "someway", "somewhat", 
    "somewhere", "spacecraft", "spaceman", "stagecoach", "stateside", 
    "storehouse", "storekeeper", "stovepipe", "takeoff", "talebearer", 
    "theretofore", "therewith", "tideland", "tidewater", "timekeeper", 
    "timeworn", "trademark", "typeface", "typesetter", "typewrite", 
    "typewriter", "warehouse", "wasteland", "waveform", "wavelength", 
    "whaleboat", "whalebone", "whereby", "wherefrom", "whitewash", 
    "wholehearted", "wiretap", "wisecrack"), syllables = c(2L, 2L, 
    2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
    2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 
    3L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 
    2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 
    2L)), .Names = c("word", "syllables"), row.names = c(NA, 179L
    ), class = "data.frame")
    
    NETtalk<-data.frame(rbind(NETtalk, compound))
    qview(NETtalk)
    This yields the following data frame (15 row head)
    Code: 
    ======================================================================== 
     n =  20031           # of vars =  2             NETtalk 
    ======================================================================== 
          word syllables
    1 aardvark         2
    2    aback         2
    3   abacus         3
    4    abaft         2
    5  abalone         4
    6  abandon         3
    Tomorrow's task is to incorporate the dictionary and algorithm together as a working function. 99% accuracy would be nice.

    EDIT: ADDED A LINE OF CODE TO REMOVE DUPLICATED WORDS
    EDIT2: ADDED CONTRACTIONS TO THE DICTIONARY BECAUSE OF THEIR IRREGULAR NATURE
    EDIT3: ADDED COMMON COMPOUND WORDS WITH FIRST WORD PIECE ENDING IN E
    Last edited by trinker; 12-20-2011 at 04:27 PM. Reason: Additional code needed
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  18. #14
    FormerlyKnownAsRaptor
    Points: 25,003, Level: 95
    Level completed: 66%, Points required for next Level: 347
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,218
    Thanks
    919
    Thanked 562 Times in 509 Posts

    Re: Use R to break a word into syllables

    All right I'm at 100% accuracy on my original data set. I'm going to gather a new data set and retest. I think I'm at the accuracy level I was hoping for. I may make some tweaks later but I'm pretty happy with where it's at now.

    Code: 
    syllable.count <- function(text, remove.bracketed = TRUE, algorithm.report = FALSE) {
        q <- gsub("\n", " ", text)
        q <- if (remove.bracketed == TRUE) {
            bracketX(q) 
        } else {
            q
        }
        
        q <- gsub("[\\,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", 
            "", as.character(q))
        q <- gsub("[\\(\\)\\[\\]\\{\\}]", "", q)
        q <- gsub("\\d+", "", q)
        q <- gsub(" +", " ", q)
        q <- c(sapply(q, function(x) as.vector(unlist(strsplit(x, " ")))))
        y <- tolower(q)
        y <- levels(as.factor(y))#######################remove (for testing purposes only)
        q <- y#######################remove (for testing purposes only)
        
        SYLL <- function(x) {
            if (x %in% NETtalk$word) {
                NETtalk[which(NETtalk$word %in% x), "syllables"]
            } else {
                if (substring(x, nchar(x), nchar(x)) == "s" & substring(x, 
                    1, nchar(x) - 1) %in% NETtalk$word) {
                    NETtalk[which(NETtalk$word %in% substring(x, 1, nchar(x) - 
                      1)), "syllables"]
                } else {
                    m <- gsub("eeing", "XX", x)
                    m <- gsub("eing", "XX", m)
                    
                    conle <- function(z) {
                      if (substring(z, nchar(z) - 1, nchar(z)) == "le" & !substring(z, 
                        nchar(z) - 2, nchar(z) - 2) %in% c("a", "e", "i", 
                        "o", "u", "y")) {
                        paste(substring(z, 1, nchar(z) - 1), "X", sep = "")
                      } else {
                        if (substring(z, nchar(z) - 1, nchar(z)) == "le" & 
                          substring(z, nchar(z) - 2, nchar(z) - 2) %in% c("a", 
                            "e", "i", "o", "u", "y")) {
                          substring(z, 1, nchar(z) - 1)
                        } else {
                          z
                        }
                      }
                    }
                    
                    m <- conle(m)
                    
                    conles <- function(z) {
                      if (substring(z, nchar(z) - 2, nchar(z)) == "les" & 
                        !substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a", 
                          "e", "i", "o", "u", "y")) {
                        paste(substring(z, 1, nchar(z) - 2), "X", sep = "")
                      } else {
                        if (substring(z, nchar(z) - 2, nchar(z)) == "les" & 
                          substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a", 
                            "e", "i", "o", "u", "y")) {
                          substring(z, 1, nchar(z) - 2)
                        } else {
                          z
                        }
                      }
                    }
    
                    m <- conles(m)
                    
                    magice <- function(z) {
                      if (substring(z, nchar(z), nchar(z)) == "e" & 
                        length(intersect(unlist(strsplit(z, NULL)), 
                        c("a", "e", "i", "o", "u", "y"))) > 1) {
                        substring(z, 1, nchar(z) - 1)
                      } else {
                        z
                      }
                    }
                    
                     m <- magice(m)
                    
                    nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", m)))
                }
            }
        }  
    
        n <- sapply(y, function(x) SYLL(x))
        k <- ifelse(y %in% NETtalk$word, "-", "NF")
        DF <- data.frame(words = q, syllables = n, in.dictionary = k)
        row.names(DF) <- 1:nrow(DF)
        if (algorithm.report == TRUE){
          list("ALGORITHM REPORT" = DF[which(DF$in.dictionary == 'NF'), ], "SYLLABLE DATAFRAME" = DF)
        } else {
          DF
        }
    }
    Code: 
    syllable.count(text)
    Thanks for following along and thanks for the suggestions idea bouncing.
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  19. The Following 2 Users Say Thank You to trinker For This Useful Post:

    bryangoodrich (12-20-2011), Lazar (12-20-2011)

  20. #15
    TS Contributor
    Points: 6,701, Level: 53
    Level completed: 76%, Points required for next Level: 49
    Lazar's Avatar
    Location
    Sydney
    Posts
    675
    Thanks
    111
    Thanked 167 Times in 152 Posts

    Re: Use R to break a word into syllables


    Excellent work. If you change the line load("NETtalk.RData") to be able to read from a url some of us could also test it out.

    I think this is great work

+ Reply to Thread
Page 1 of 2 1 2 LastLast

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts








Advertise on Talk Stats