I moved this portion of a thread from (HERE) as the original problem was solved.
Posting a problem I'm wrestling with though it's likely outside your fields your brains are generally pretty good at applying your knowledge to other fields. Thought maybe you could help me.
I'm looking to use R for Syllabication or break words into number of syllables (In my field this has all sorts of uses related to readability of a text etc).
Here's a bit of really condensed, not too hard, theory on syllabication that may be useful if you're attempting to help with this problem:
http://allenporter.tumblr.com/post/9776954743/syllables
A suggestion for R from SO to tackle this problem is to use:
I tested it out using text from a post on here (The frequent stats misunderstandings thread) and got surprisingly accurate responses from the simple code(76% accuracy on n =74). The major problem with the above code is it doesn't detect silent 'e' at the end of a word. If I could add a piece to the code it would surely improve the accuracy a lot. Just telling R to not look for ending e's won't work because the words "people" and "little" for instance actually uses the e to for syllabication.
QUESTION: Any ideas on how to use R's reg ex to find silent e's.
Posting a problem I'm wrestling with though it's likely outside your fields your brains are generally pretty good at applying your knowledge to other fields. Thought maybe you could help me.
I'm looking to use R for Syllabication or break words into number of syllables (In my field this has all sorts of uses related to readability of a text etc).
Here's a bit of really condensed, not too hard, theory on syllabication that may be useful if you're attempting to help with this problem:
http://allenporter.tumblr.com/post/9776954743/syllables
A suggestion for R from SO to tackle this problem is to use:
Code:
nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", y)))
QUESTION: Any ideas on how to use R's reg ex to find silent e's.
Code:
x <- "I know TheEcologist is working on an FAQ to put up at some point but I didn't think it would be a
bad idea to compile a list of frequent misunderstandings that we see. I've been seeing quite a few people
making the mistake of assuming that in a linear model we expect the predictor variables to be normally distributed.
Or seeing that they expect the response itself to be normally distributed. This is wrong, of course, because we
don't make any of those assumptions but instead assume that the error term is normally distributed.
So what other misunderstands have you come across quite a bit (either in real life or here at Talk Stats)?"
bracketX <- function(text, bracket='all'){
switch(bracket,
square=sapply(text, function(x)gsub("\\[.+?\\]", "", x)),
round=sapply(text, function(x)gsub("\\(.+?\\)", "", x)),
curly=sapply(text, function(x)gsub("\\{.+?\\}", "", x)),
all={P1<-sapply(text, function(x)gsub("\\[.+?\\]", "", x))
P1<-sapply(P1, function(x)gsub("\\(.+?\\)", "", x))
sapply(P1, function(x)gsub("\\{.+?\\}", "", x))})
}
y <- gsub("\n", " ", x)
y <- gsub("[,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\_\\^\\%\\$\\#\\<\\>]", "", as.character(y))
y <- bracketX(y)
y <- gsub(" +", " ", y)
y <- c(sapply(y, function(x)as.vector(unlist(strsplit(x, " ")))))
y <- tolower(y)
y <- levels(as.factor(y))
n <- nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", y)))
DF <- data.frame(words=y, syllables=n)
Code:
words syllables actual
1 a 1 1
2 across 2 2
3 an 1 1
4 any 2 2
5 assume 3 2
6 assuming 3 3
7 assumptions 3 3
8 at 1 1
9 bad 1 1
10 be 1 1
11 because 3 2
12 been 1 1
13 bit 1 1
14 but 1 1
15 come 2 1
16 compile 3 2
17 course 2 1
18 didn't 1 2
19 distributed 4 4
20 don't 1 1
21 error 2 2
22 expect 2 2
23 faq 1 NA
24 few 1 1
25 frequent 2 2
26 have 2 1
27 i 1 1
28 i've 2 1
29 idea 2 3
30 in 1 1
31 instead 2 2
32 is 1 1
33 it 1 1
34 itself 2 2
35 know 1 1
36 linear 2 3
37 list 1 1
38 make 2 1
39 making 2 2
40 mistake 3 2
41 misunderstandings 5 5
42 misunderstands 4 4
43 model 2 2
44 normally 3 3
45 of 1 1
46 on 1 1
47 or 1 1
48 other 2 2
49 people 2 2
50 point 1 1
51 predictor 3 3
52 put 1 1
53 quite 2 1
54 response 3 2
55 see 1 1
56 seeing 1 2
57 so 1 1
58 some 2 1
59 term 1 1
60 that 1 1
61 the 1 1
62 theecologist 4 4
63 they 1 1
64 think 1 1
65 this 1 1
66 those 2 1
67 to 1 1
68 up 1 1
69 variables 3 4
70 we 1 1
71 what 1 1
72 working 2 2
73 would 1 1
74 wrong 1 1
75 you 1 1
Code:
table(with(DF, syllables-actual))
sum(table(with(DF, syllables-actual)))
round(prop.table(table(with(DF, syllables-actual))), digits=3)*100
Code:
COUNTS
-1 0 1
5 56 13
n = 74
PERCENTAGES
-1 0 1
6.8 75.7 17.6