Using a website's tools in R

trinker

ggplot2orBust
#1
Alright, I'm trying to develop this package for quantitative discourse analysis. I decided I need to do syllable counts (my ultimate goal) and found out this is pretty darn complex SO LINK. So I read a bunch of articles and found out entire dissertations can be devoted to this. I'm not interested in reinventing the wheel, so I looked at what's already been done that's good, and I found a site, http://www.syllablecount.com, that is pretty darn good, maybe even the best. I also know LaTeX uses an algorithm to split words into syllables (this can be useful for readability statistics and such).

So this is what I've got so far:
1) a website that does syllabication pretty well
2) LaTeX, which does syllabication pretty well too

I want to use either one to create an R function that will do syllable counts for a vector of words. So for instance:

Code:
x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle', 'pronunciation' )
would yield: 1, 1, 2, 2, 1, 3, 5

The web-based one is more accurate than the conservative LaTeX solution. I'm open to any solution that gets the job done reasonably well (95% accuracy) and fairly fast. Ideally the solution would not be web based, but if I don't want to reinvent the wheel and still want to be accurate, I may have to go that route.

What I need help with is making this work. Questions I have:

1) Is it possible to use a web application like this through R?
2) Is it possible to harness the algorithm LaTeX uses (basically call LaTeX to compute the syllable counts) from R?
3) Is either legal?


I'm used to using R at a pretty basic level, so I don't even know what questions to ask here. Maybe there's a simple answer I'm missing.
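
Re question 1, the kind of thing I'm imagining is below — purely an untested sketch, where the query URL and the "grab the first number on the page" step are made up; the real page (and the site's terms of use) would have to be inspected first:

Code:
## hypothetical: fetch a page for one word and pull a count out of it
syllables_web <- function(word) {
    ## guessed URL pattern -- not the site's real query form
    url <- paste("http://www.syllablecount.com/syllables/", word, sep = "")
    page <- tryCatch(readLines(url, warn = FALSE),
                     error = function(e) character(0))  # empty on connection failure
    hits <- regmatches(page, regexpr("[0-9]+", page))   # first number on each matching line
    if (length(hits) == 0) return(NA_integer_)
    as.integer(hits[1])  # placeholder: real scraping would target the count's markup
}

sapply(c("dog", "pony", "pronunciation"), syllables_web)

For question 2, I gather one could write a tiny .tex file that calls \showhyphens{word}, run latex on it with system2(), and parse the hyphenation points out of the log, though that seems heavy for a per-word lookup.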
 

jpkelley

TS Contributor
#2
Whoa. I would be very interested in this as well, as I'd like to tackle syllable counts in the singing bouts of birds.

The web app is in the public domain, right? And you're not reselling it (if you incorporate the code into an R package)?
 

trinker

ggplot2orBust
#5
Posting a problem I'm wrestling with. Though it's likely outside your fields, your brains are generally pretty good at applying your knowledge to other areas, so I thought maybe you could help me.

I have included everything you'd need to begin linguistic studies :) so you can follow along if this is interesting to you.

A suggestion from SO for tackling this problem in R is to use:
Code:
nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", y)))
I tested it out using text from a post on here (the frequent stats misunderstandings thread) and got surprisingly accurate results from such simple code (76% accuracy on n = 74). The major problem with the above code is that it doesn't detect a silent 'e' at the end of a word. If I could add a piece to the code for that, it would surely improve the accuracy a lot. Just telling R to ignore ending e's won't work, because words like "people" and "little" actually use the e for syllabication.

QUESTION: Any ideas on how to use R's regex to find silent e's?
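
Here's as far as I've gotten — an untested sketch where a trailing 'e' after a consonant is treated as silent unless the word ends in consonant + "le" (my guess at handling "people"/"little"):

Code:
## count vowel groups, then drop one for a trailing silent 'e',
## except after consonant + "le" ("people", "little")
syl <- function(w) {
    w <- tolower(w)
    cnt <- nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", w)))
    silent.e <- grepl("[^aeiouy]e$", w) & !grepl("[^aeiouy]le$", w)
    pmax(cnt - silent.e, 1)  # logical silent.e coerces to 0/1; never report 0 syllables
}
syl(c("make", "people", "little", "see"))  # hoping for: 1 2 2 1

And here's the full reproducible example I tested the bare vowel-group count on: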

Code:
x <- "I know TheEcologist is working on an FAQ to put up at some point but I didn't think it would be a 
bad idea to compile a list of frequent misunderstandings that we see. I've been seeing quite a few people 
making the mistake of assuming that in a linear model we expect the predictor variables to be normally distributed.
 Or seeing that they expect the response itself to be normally distributed. This is wrong, of course, because we 
don't make any of those assumptions but instead assume that the error term is normally distributed.
So what other misunderstands have you come across quite a bit (either in real life or here at Talk Stats)?"

## strip bracketed text: square, round, curly, or all three
bracketX <- function(text, bracket = 'all'){
    switch(bracket,
        square = sapply(text, function(x) gsub("\\[.+?\\]", "", x)),
        round  = sapply(text, function(x) gsub("\\(.+?\\)", "", x)),
        curly  = sapply(text, function(x) gsub("\\{.+?\\}", "", x)),
        all    = {P1 <- sapply(text, function(x) gsub("\\[.+?\\]", "", x))
                  P1 <- sapply(P1, function(x) gsub("\\(.+?\\)", "", x))
                  sapply(P1, function(x) gsub("\\{.+?\\}", "", x))})
}


y <- gsub("\n", " ", x)                                   # collapse line breaks
y <- gsub("[,&*?.!;:+=_^%$#<>-]", "", y)                  # strip punctuation (apostrophes kept)
y <- bracketX(y)                                          # drop bracketed text
y <- gsub(" +", " ", y)                                   # squeeze repeated spaces
y <- unlist(strsplit(y, " "))                             # split into words
y <- tolower(y)
y <- sort(unique(y))                                      # unique words, alphabetized
n <- nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", y)))   # vowel-group count per word
DF <- data.frame(words = y, syllables = n)
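(The 'actual' column below is my hand count; I left "faq" as NA.)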
Code:
               words syllables actual
1                  a         1      1
2             across         2      2
3                 an         1      1
4                any         2      2
5             assume         3      2
6           assuming         3      3
7        assumptions         3      3
8                 at         1      1
9                bad         1      1
10                be         1      1
11           because         3      2
12              been         1      1
13               bit         1      1
14               but         1      1
15              come         2      1
16           compile         3      2
17            course         2      1
18            didn't         1      2
19       distributed         4      4
20             don't         1      1
21             error         2      2
22            expect         2      2
23               faq         1     NA
24               few         1      1
25          frequent         2      2
26              have         2      1
27                 i         1      1
28              i've         2      1
29              idea         2      3
30                in         1      1
31           instead         2      2
32                is         1      1
33                it         1      1
34            itself         2      2
35              know         1      1
36            linear         2      3
37              list         1      1
38              make         2      1
39            making         2      2
40           mistake         3      2
41 misunderstandings         5      5
42    misunderstands         4      4
43             model         2      2
44          normally         3      3
45                of         1      1
46                on         1      1
47                or         1      1
48             other         2      2
49            people         2      2
50             point         1      1
51         predictor         3      3
52               put         1      1
53             quite         2      1
54          response         3      2
55               see         1      1
56            seeing         1      2
57                so         1      1
58              some         2      1
59              term         1      1
60              that         1      1
61               the         1      1
62      theecologist         4      4
63              they         1      1
64             think         1      1
65              this         1      1
66             those         2      1
67                to         1      1
68                up         1      1
69         variables         3      4
70                we         1      1
71              what         1      1
72           working         2      2
73             would         1      1
74             wrong         1      1
75               you         1      1

Code:
table(with(DF, syllables - actual))                              # error distribution
sum(table(with(DF, syllables - actual)))                         # n (the NA is dropped)
round(prop.table(table(with(DF, syllables - actual))), 3) * 100  # percentages
Code:
COUNTS
-1  0  1 
 5 56 13 

n = 74

PERCENTAGES
 -1    0    1 
6.8 75.7 17.6
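
As a quick sanity check against the example vector from my first post, the bare vowel-group count gives:

Code:
x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle', 'pronunciation')
nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", tolower(x))))
## tracing by hand: 1 1 2 2 1 3 4 -- 'pronunciation' loses a
## syllable because 'ia' collapses into a single vowel group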
 
#8
This is the one I'm looking for....