Create a fast dictionary

trinker

ggplot2orBust
#1
This conversation started in the chat box about speeding up the use of a dictionary (wordlist) to match a text string and return a numeric output corresponding to the number of syllables in each word. The dictionary is currently stored as a data frame and consists of two columns (words, syllables).

The dictionary lookup is very slow and I want to trim the fat and speed it up as much as possible. The function that uses this dictionary first checks the dictionary for the word; on a match it spits out the syllable count, otherwise it moves on to use an algorithm. Generally algorithms are much faster than dictionary matching but less accurate because of corner cases (and the English language is full of them).

Ideas I have to speed up the function:
  1. Find out which words the algorithm accurately predicts and remove them from the dictionary.
  2. Possibly use C and a hash table (terms I don't yet really comprehend) to encode the dictionary.

Ideas bryangoodrich has suggested:
  1. Don't match every word against the entire dictionary (about 20,000 terms); instead generate sub-lists within the dictionary (possibly by first letter, and recently he has suggested by length too).

Ideas Dason has suggested:
  1. Don't use a data frame; it's slow. Use a faster matrix structure.
  2. Possibly use: https://stat.ethz.ch/pipermail/r-help/2011-April/274018.html (see the sketch just below)
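From what I gather, the suggestion there is to use a hashed environment as a lookup table; something like this sketch (untested on my end):

Code:
# A hashed environment acts like a hash table: word in, syllable count out
dictionary <- new.env(hash = TRUE)
assign("about", 2, envir = dictionary)
get("about", envir = dictionary)   # [1] 2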

I have tried to write up a mock conversion of the current dictionary to a list of matrices stored in alphabetical order. I have then tried to take a text string, remove punctuation, split it into words, and apply the dictionary to each word. I am failing in this task because my discomfort with matrix structures gives me limited knowledge in handling them (see the code below).

From this thread I hope to:
  1. generate ideas/code to speed up the dictionary matching
  2. generate a method/code to convert my present data frame dictionary to a faster form
  3. learn some cool ninja tricks to make me a better programmer

I moved the discussion here because it's easier to write and share code, plus this gives us a mock example to use. Any ideas to speed up the matching are very much appreciated. I definitely plan on removing words from the dictionary that the algorithm correctly assigns syllable values to as well. This should garner a speed boost.

Thank you in advance! :D

Tyler

Code:
#############################################################################
# THE DICTIONARY CREATION: CONVERSION FROM A DATA FRAME TO A LIST OF MATRICES #
#############################################################################
dict <- structure(list(word = c("a", "about", "all", "an", "and", "are", 
"as", "at", "be", "been", "but", "by", "call", "can", "come", 
"could", "day", "did", "do", "down", "each", "find", "first", 
"for", "from", "get", "go", "had", "has", "have", "he", "her", 
"him", "his", "how", "I", "if", "in", "into", "is", "it", "its", 
"like", "long", "look", "made", "make", "many", "may", "more", 
"my", "no", "not", "now", "number", "of", "oil", "on", "one", 
"or", "other", "out", "part", "people", "said", "see", "she", 
"so", "some", "than", "that", "the", "their", "them", "then", 
"there", "these", "they", "this", "time", "to", "two", "up", 
"use", "was", "water", "way", "we", "were", "what", "when", "which", 
"who", "will", "with", "word", "would", "write", "you", "your"
), syll = c(1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 
1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("word", 
"syll"), row.names = c(NA, -100L), class = "data.frame")

dict$lett1 <- substring(dict$word, 1, 1)       # first letter of each word
dict2 <- lapply(unique(dict$lett1), function(x)
    subset(dict, dict$lett1 %in% x, select = -c(lett1)))  # split the dictionary by first letter
names(dict2) <- unique(dict$lett1)
dict3 <- lapply(dict2, as.matrix)              # one character matrix per starting letter

############
# THE DATA #
############
y <- "I look at other people all day."
y <- gsub("[[:punct:]]", "", y)
y <- strsplit(y, " ")
x <- "the"
############
# MATCHING #
############
#A matching function that finds lists matching 
#first letter then matches the word
MATCH <- function(x) {
    LET <- substring(x, 1, 1)
    SUBdict <- dict3[[LET]]
    as.numeric(SUBdict[SUBdict[,1]%in%x, 2])
}

lapply(x, MATCH) # good here
lapply(y, MATCH) #R has been angered by my lack of understanding and burped forth a "You idiot" message
 

bryangoodrich

Probably A Mammal
#2
Working from what I talked about in the chatbox (and not what you presented here, yet), just look at how easy this is.

Code:
dictionary <- list(c=list(a=list(t=list(s=1))))
dictionary$d$i$g$g$e$r <- 2

dictionary
# $c
# $c$a
# $c$a$t
# $c$a$t$s
# $c$a$t$s[[1]]
# [1] 1

# $d
# $d$i
# $d$i$g
# $d$i$g$g
# $d$i$g$g$e
# $d$i$g$g$e$r
# [1] 2
And to access

Code:
getValue <- function(word) {
  x <- unlist(strsplit(word, ""))            # split the word into single letters
  n <- length(x)
  expr <- "dictionary[["
  for (i in seq(n))
    expr <- paste(expr, x[i], "]][[", sep = "'")
  expr <- substr(expr, 1, nchar(expr) - 2)   # drop the trailing "[["
  x <- eval(parse(text = expr))              # e.g. dictionary[['c']][['a']][['t']][['s']]
  return(unlist(x))
}

getValue("cats")
# [1] 1

getValue("digger")
# [1] 2
I'm actually curious to know: what takes up more size in memory/on disk, that list-tree, as I call it, or the frame data.frame(word = c("cats", "digger"), value = c(1, 2))? Anyone know how to check?

EDIT: To check an object's size it is ... wait for it ... object.size
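For example, on the two toy objects above:

Code:
object.size(dictionary)                                               # the nested list-tree
object.size(data.frame(word = c("cats", "digger"), value = c(1, 2)))  # the flat frame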

The list is twice the size, but that may not reflect how things look at a larger scale, mind you. The same goes for benchmarking a search on this versus the frame, because the benefits show up when we check things at scale.

I don't think size will ultimately be an issue here in performance. I'll argue that my list-tree approach will have benefits at scale because the search is comparable to scanning a vector of around 30 names (the size of the names at any level of the tree) as many times as the length of the word. Compare that with searching a 20,000-element vector. If that strain is multiplicative (it very well may not be), then we're looking at needing a hundred-letter word to get comparable strain on the list-tree approach. My benchmark on the above example was that the list-tree was lagging 1.5% behind, and this was on searching "cats", requiring the list-tree to do a 2x2x2x2 search on each position supplied.

I arbitrarily extended the size of the data frame to over 8,000 records and did the same test (not fair, since the list-tree dictionary didn't grow by the same amount, but it's not easily extended! lol). The data frame approach took 3 times as long.
 

bryangoodrich

Probably A Mammal
#3
To create the list-tree off your data frame, it's actually just a tweak on the accessor method.

Code:
putValue <- function(word, value) {
  x <- unlist(strsplit(word, ""))         # split the word into single letters
  n <- length(x)
  expr <- "dictionary"
  for (i in seq(n))
    expr <- paste(expr, x[i], sep = "$")  # builds e.g. dictionary$c$a$t$s
  expr <- paste(expr, "<-", value)        # turn it into an assignment statement
  eval(parse(text = expr), envir = .GlobalEnv) # This may not be the best way to specify the environment, but it works for now!
}
The only problem is I need to control the environment in the evaluation statement at the end (updated/fixed). Those commands work if you do it manually; call it from the function and it's no good. At least, I'm assuming this works (it works in individual cases). The idea is then to call

Code:
apply(dict, 1, function(col) putValue(col["word"], col["syll"]))
I'm sure this will be very, very slow when converting 20,000 words, but as I said in the chatbox, it only needs to be run once! I'd recommend testing it on the shorter list provided here to make sure it works, and it will give something of better scale to play with. Note that by doing this I've effectively created a new class of object with two accessor methods: one for getting and one for setting values.

Scratch this

This doesn't work because it only populates up to the assigned portion; it doesn't maintain what is already there. I think an approach using environments may prove more useful.
 
Last edited:

bryangoodrich

Probably A Mammal
#4
You might want to consider using a new environment.

Code:
dictionary <- new.env(hash = TRUE)  # The environment is based on a hash table; speed boost
assign(key, value, envir = dictionary)
get(key, envir = dictionary)
In particular

Code:
dictionary <- new.env(hash = TRUE, size = nrow(dict))  # Good call on specifying size, Dason!
apply(dict, 1, function(col) assign(col[1], col[2], envir = dictionary))  # note: apply() coerces rows to character, so the values are stored as strings
get("their", envir=dictionary)
 
Last edited:

Dason

Ambassador to the humans
#5
Code:
eh <- new.env(hash = TRUE, size = dim(dict)[1])
# Add your dictionary to the environment
lapply(seq(dim(dict)[1]), function(x){assign(dict[x, 1], dict[x, 2], envir = eh)})

lookup <- function(val){
	get(val, envir = eh)
}

lapply(y[[1]], lookup)
#or
sapply(y[[1]], lookup)
 

bryangoodrich

Probably A Mammal
#6
O
M
G

I expanded (randomly) the data provided here to nearly 15,000 key/value combos, and then I searched for "their" with a lookup function I named "getSyll".

Code:
                           test replications elapsed relative user.self
1              getSyll("their")          100   0.007  1.00000     0.008
2 z[z$word == "their", "value"]          100   0.627 89.57143     0.464
That's a 90 fold difference! With 1000 replications it was 100 fold!
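For anyone who wants to reproduce something like this, the table above is the kind of output rbenchmark gives; a sketch of the setup (eh is the hashed environment, z the expanded data frame, and getSyll here is just a thin wrapper around get()):

Code:
library(rbenchmark)

getSyll <- function(word) get(word, envir = eh)   # environment lookup

benchmark(
    getSyll("their"),                  # hashed environment
    z[z$word == "their", "value"],     # data frame vector scan
    replications = 100
)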
 

trinker

ggplot2orBust
#7
Nice, guys. Would converting to a matrix help as well?
What about subsetting into alphabetized sub-lists, or is that really unnecessary?

I think I'm going to reduce the dictionary size and then assign it to its own environment as you guys suggested.

Should I store the dictionary in an .Rdata file, load it, and then assign it to its own environment? Eventually I'll store the dictionary as a data set in the package.
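Something like this is what I have in mind (a rough sketch on my part; the file and object names are placeholders):

Code:
# Build the hashed lookup environment once, right after loading the .RData file
load("dict.RData")                                  # assumed to contain the data frame 'dict'
dict.env <- new.env(hash = TRUE, size = nrow(dict))
apply(dict, 1, function(col) assign(col[1], col[2], envir = dict.env))
as.numeric(get("their", envir = dict.env))          # values come back as character, hence as.numeric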
 

Dason

Ambassador to the humans
#9
Until then here's a function to create a random dictionary:
Code:
makeword <- function(n = 10){
	return(paste(sample(LETTERS, n, replace = TRUE), sep = "", collapse = ""))
}

createdict <- function(asize = 100000){
	words <- unique(unlist(lapply(rpois(asize, 7)+1, makeword)))
	syl <- rpois(length(words), 2)
	return(data.frame(word = words, syll = syl, stringsAsFactors = FALSE))
}

# Will contain a little less than 10000 words
dict <- createdict(10000)
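Pairing that with the environment approach from earlier gives something more realistic to play with (a quick sketch):

Code:
eh <- new.env(hash = TRUE, size = nrow(dict))
invisible(apply(dict, 1, function(col) assign(col[1], col[2], envir = eh)))
system.time(get(sample(dict$word, 1), envir = eh))   # a single lookup is essentially instant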
 

bryangoodrich

Probably A Mammal
#10
Seriously, unless you write some C code, this is probably going to be the fastest approach, and it is very simple to use. Unless you can transport environments like data, I'd

(1) Keep it as a data frame, unless you find making it a character matrix is any faster (probably not significantly in speed or size)

(2) Create a function that loads it into its own environment, sort of like connecting to a database

(3) Create accessor methods, most importantly for getting values out of the table.

Code:
hash <- function(x) {
    e <- new.env(hash = TRUE, size = nrow(x))                     # hashed environment sized to the table
    apply(x, 1, function(col) assign(col[1], col[2], envir = e))  # key = word, value = syllable count
    return(e)
}
data(dictionary)
env <- hash(dictionary)
get("their", envir=env)
Really though, the "get" function is already your accessor method.
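If you do want a named accessor on top of that, a sketch might look like this (the exists() fallback is just one way to handle misses; the as.numeric is there because apply() stores the values as character):

Code:
getSyll <- function(word, env) {
    if (exists(word, envir = env, inherits = FALSE)) {
        as.numeric(get(word, envir = env))   # dictionary hit
    } else {
        NA_real_                             # miss: the caller falls back to the algorithm
    }
}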
 

Jake

Cookie Scientist
#11
Trinker, the reason your match works for x but not y is because x is a character vector while y is a list that contains a single character vector. The method you are currently implementing doesn't work for both data types at once. You should decide what form you want to work with and either require the input to be in that form, or coerce the input to the desired form internally.
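In other words, with the objects from post #1, either of these should run:

Code:
lapply(y[[1]], MATCH)     # index into the list first, then apply over the words
sapply(unlist(y), MATCH)  # or flatten the list to a plain character vector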

Also, just for clarification, do you want the return value to be a numeric vector of syllable counts, one per word in the input sentence?

One final suggestion, have you tried simply storing your dictionary as a data.table (with key = "word") and looking up words using data.table's built-in binary search?
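For reference, a rough sketch of that idea using the small dict from post #1 (column names assumed to be word and syll):

Code:
library(data.table)

DT <- data.table(dict, key = "word")   # keying sorts the table and enables binary search on 'word'
DT["their"]                            # keyed lookup returns the matching row
DT["their"]$syll                       # or just the syllable count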

If you could post a much larger dictionary somewhere on the Web we could play around with different implementations and compare their performance on a more realistic scale.
 

Dason

Ambassador to the humans
#12
I'm not sure how much of a speed gain data.table would have over using a hashed environment but I'd be interested to see any results.

You could probably use that dictionary function I posted earlier to play around with different implementations if you want. It won't give the same list as trinker but it will contain words of various lengths that are matched with some sort of number - pretty much what we're dealing with.
 

trinker

ggplot2orBust
#13
I'm not sure how much (or if) the algorithm after the dictionary lookup utilizes the dictionary at some points (i.e., to search for dictionary words with certain endings).

To make this process genuine, I've provided the function and the dictionary.

HERE is the current dictionary. It's comma separated, but because of the size this was the only way I could think to deliver the file. The dictionary has 20,031 rows (nrow(NETtalk) returns 20031). Below is the current code for the function as well. I have the dictionary saved as NETtalk.RData and load it from within the function. This obviously will not work for you since you don't have the RData file.

Code:
syllable.count <- function(text, remove.bracketed = TRUE, algorithm.report = FALSE) {

bracketX <- function(text, bracket='all'){
    switch(bracket,
        square=sapply(text, function(x)gsub("\\[.+?\\]", "", x)),
        round=sapply(text, function(x)gsub("\\(.+?\\)", "", x)),
        curly=sapply(text, function(x)gsub("\\{.+?\\}", "", x)),
        all={P1<-sapply(text, function(x)gsub("\\[.+?\\]", "", x))
             P1<-sapply(P1, function(x)gsub("\\(.+?\\)", "", x))
             sapply(P1, function(x)gsub("\\{.+?\\}", "", x))
             }
      )                                                                                                                                                              
}


    load("NETtalk.RData")
    if (is.na(text)) {
        NA
    } else {
        q <- gsub("\n", " ", text)
        q <- gsub("\t", " ", q)
        q <- if (remove.bracketed == TRUE) {
        bracketX(q) 
        } else {
            q
        }
    
        q <- gsub("[\\,\\&\\*\\?\\.\\!\\;\\:\\,\\+\\=\\-\\*–\\_\\^\\%\\$\\#\\<\\>]", 
            "", as.character(q))
        q <- gsub("[\\(\\)\\[\\]\\{\\}]", "", q)
        q <- gsub("\\d+", "", q)
        q <- gsub('"', "", q) 
        q <- gsub('-', " ", q) 
        q <- gsub('—', " ", q) 
        q <- gsub(" +", " ", q)
        q <- c(sapply(q, function(x) as.vector(unlist(strsplit(x, " ")))))
        y <- tolower(q)
    #y <- levels(as.factor(y))#######################remove
    #q <- y#######################remove
    
        SYLL <- function(x) {
            if (x %in% NETtalk$word) {
                NETtalk[which(NETtalk$word %in% x), "syllables"]
            } else {
                if (substring(x, nchar(x), nchar(x)) == "s" & substring(x, 
                    1, nchar(x) - 1) %in% NETtalk$word) {
                    NETtalk[which(NETtalk$word %in% substring(x, 1, nchar(x) - 
                      1)), "syllables"]
                } else {
                    m <- gsub("eeing", "XX", x)
                    m <- gsub("eing", "XX", m)

                    ended <- function(z) {
                      if (substring(z, nchar(z) - 1, nchar(z)) == "ed" & substring(z, 
                        nchar(z) - 2, nchar(z) - 2) %in% c("t", "d")) {
                        z
                      } else {
                      if (substring(z, nchar(z) - 1, nchar(z)) == "ed" & !substring(z, 
                        nchar(z) - 2, nchar(z) - 2) %in% c("t", "d")) {
                          substring(z, 1, nchar(z) - 2)
                        } else {
                          z
                        }                      
                      }
                    }

                    m <- ended(m)

                    conely <- function(z) {
                      if (substring(z, nchar(z) - 2, nchar(z)) == "ely"){
                        paste(substring(z, 1, nchar(z) - 3), "ly", sep = "")
                      } else {
                          z
                      }
                    }
                
                    m <- conely(m)
                
                    conle <- function(z) {
                      if (substring(z, nchar(z) - 1, nchar(z)) == "le" & !substring(z, 
                        nchar(z) - 2, nchar(z) - 2) %in% c("a", "e", "i", 
                        "o", "u", "y")) {
                        paste(substring(z, 1, nchar(z) - 1), "X", sep = "")
                      } else {
                        if (substring(z, nchar(z) - 1, nchar(z)) == "le" & 
                          substring(z, nchar(z) - 2, nchar(z) - 2) %in% c("a", 
                            "e", "i", "o", "u", "y")) {
                          substring(z, 1, nchar(z) - 1)
                        } else {
                          z
                        }
                      }
                    }
                
                    m <- conle(m)
                
                    conles <- function(z) {
                      if (substring(z, nchar(z) - 2, nchar(z)) == "les" & 
                        !substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a", 
                          "e", "i", "o", "u", "y")) {
                        paste(substring(z, 1, nchar(z) - 2), "X", sep = "")
                      } else {
                        if (substring(z, nchar(z) - 2, nchar(z)) == "les" & 
                          substring(z, nchar(z) - 3, nchar(z) - 3) %in% c("a", 
                            "e", "i", "o", "u", "y")) {
                          substring(z, 1, nchar(z) - 2)
                        } else {
                          z
                        }
                      }
                    }

                    m <- conles(m)
                
                    magice <- function(z) {
                      if (substring(z, nchar(z), nchar(z)) == "e" & 
                        length(intersect(unlist(strsplit(z, NULL)), 
                        c("a", "e", "i", "o", "u", "y"))) > 1) {
                        substring(z, 1, nchar(z) - 1)
                      } else {
                        z
                      }
                    }
                
                     m <- magice(m)
                    
                    nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", m)))
                }
            }
        }  

        n <- sapply(y, function(x) SYLL(x))
        k <- ifelse(y %in% NETtalk$word, "-", "NF")
        DF <- data.frame(words = q, syllables = n, in.dictionary = k)
        row.names(DF) <- 1:nrow(DF)
        if (algorithm.report == TRUE){
          list("ALGORITHM REPORT" = DF[which(DF$in.dictionary == 'NF'), ], "SYLLABLE DATAFRAME" = DF)
        } else {
            DF
        }
    }
}
EXAMPLES
Code:
 text <- "I know TheEcologist is working on an FAQ to put up at some point but I didn't think it would be a           
 bad idea to compile a list of frequent misunderstandings that we see. I've been seeing quite a few people            
 making the mistake of assuming that in a linear model we expect the predictor variables to be normally distributed.  
 Or seeing that they expect the response itself to be normally distributed. This is wrong, of course, because we      
 don't make any of 345 those assumptions but instead assume that the error term is normally distributed.              
 So what other misunderstands have you come across quite a bit (either in real life or here at Talk Stats)?"          
                                                                                                                      
 syllable.count(text)                                                                                                 
                                                                                                                      
 #PROVIDES LIST OUTPUT WITH WORDS
 #THE ALGORITHM WAS USED ON
 syllable.count(text, algorithm.report = TRUE)
EDIT: It's prudent for me to mention that the name of the dictionary data frame utilized by the function is NETtalk, so loading the dictionary under the same-named data frame object should allow the function to work for you all as well.
 

bryangoodrich

Probably A Mammal
#14
Thanks, loading that **** web page almost killed my Firefox! lol

I ran a benchmark on get("zombie", env) versus a vector search, and it was a 120 fold difference! Time to try out this data.table already!

Oh yeah, and your dictionary with 'xz' compression is only 75KB. Not that big a file to deal with!
 

bryangoodrich

Probably A Mammal
#15
There's something up with the way benchmark works; it isn't right. It's not a 1000+ fold difference! The use of system.time, however, gives 0 for the environment approach and 0.068 (overall) for the data.table approach. As I expected, the hash table built into R environments is going to be the best you can get in R!
 

Dason

Ambassador to the humans
#16
There's something up with the way benchmark works; it isn't right. It's not a 1000+ fold difference! The use of system.time, however, gives 0 for the environment approach and 0.068 (overall) for the data.table approach. As I expected, the hash table built into R environments is going to be the best you can get in R!
What's not right? It replicates the call multiple times whereas system.time only evaluates the call once.
 

bryangoodrich

Probably A Mammal
#17
Okay, I used a while loop to have them do the same search repeatedly within a system.time block. Apparently data.table fails miserably! I then randomly picked out 10,000 words from the list and had each approach look them all up. The environment approach still takes fractions of a second, the vector search (data frame) is still not too slow, and the data.table approach just sucks! Is anyone getting different results? I expected better from them!
 

bryangoodrich

Probably A Mammal
#18
What's not right? It replicates the call multiple times whereas system.time only evaluates the call once.
It just doesn't seem right that data.table's binary search should be going so **** slow compared to the exact same vector search with frames. Yet the tests keep showing the same thing. I don't get it. Maybe there's a lot of non-searching overhead that the tables are producing? Maybe I should loop through the search list and see how long it takes to search consecutive single items instead of a cluster of them.
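Something like this, maybe (a sketch; eh, DT, and the data frame z with columns word/value are from the earlier tests, and the word sample is arbitrary):

Code:
words <- sample(z$word, 1000, replace = TRUE)             # keys to look up one at a time

system.time(for (w in words) get(w, envir = eh))          # hashed environment
system.time(for (w in words) DT[.(w)])                    # keyed data.table lookup
system.time(for (w in words) z[z$word == w, "value"])     # data frame vector scan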
 

trinker

ggplot2orBust
#19
Dason and bryangoodrich, I tried to assign the NETtalk dictionary to a new environment by just substituting NETtalk for dict and I get the following error:


Code:
> eh <- new.env(hash = TRUE, size = dim(NETtalk)[1])
> lapply(seq(dim(NETtalk)[1]), function(x){assign(NETtalk[x, 1], NETtalk[x, 2], envir = eh)})
Error in assign(NETtalk[x, 1], NETtalk[x, 2], envir = eh) : 
  invalid first argument
Once I save the dictionary into a new environment how do I call it from within the function?
NETtalk::eh

EDIT: bryangoodrich's environment assignment works (not sure if it's doing the same thing as Dason's, as this is all new stuff to me), but now what do I do with it? How do I make it work in the function? Sorry, this environments business is all Greek to me.

Code:
hash <- function(x) {e <- new.env(hash = TRUE, size = nrow(x)); apply(x, 1, function(col) assign(col[1], col[2], envir = e)); return(e)}
env <- hash(NETtalk)
get("trademark", envir=env)
 

bryangoodrich

Probably A Mammal
#20
Why are you using lapply? Isn't it a data frame? Just use apply like we showed. And you get access to the new environment by first extracting what you want. See ls(eh) to see what's in there. It's just like your current environment. It holds a bunch of objects that are easy and quick to look up. It's no different than if you said

Code:
x <- rnorm(10)
get("x")