Today I Learned: ____

Dason

Ambassador to the humans
I guess. Why do you want vectors as elements in your data frame? It seems like a list is the better data structure for that.
 

bryangoodrich

Probably A Mammal
I absolutely detest having vector elements in your table, but I could see an intuitive appeal to it as a NoSQL motivated data structure. It allows you to do complex nesting of data elements while maintaining semi-structure in other fields to define the tuple, do linking, and such. This can all still be done with lists and I don't see any advantages coming from this data structure because behind the scenes, it's still just a list and none of the frame functions will be designed for vector table elements.
 

trinker

ggplot2orBust
Sometimes it looks better and is easier to interpret if you're trying to view everything as a dataframe. For instance, I append a list of vectors of positive and negative words in qdap's polarity functions. It allows you to see lots of information across people and turns of talk in a tabular format. It's not ideal to work with computationally but it is ideal in some situations to mentally process the data.
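
For anyone following along, here is a minimal sketch of what a vector-per-cell column looks like in base R. The column names (person, pos_words, n_pos) and the words are made up for illustration and are not qdap's actual output; I() is used so that data.frame() keeps the list as a single column.

```r
# Hypothetical data: one row per person, a vector of words per cell
df <- data.frame(
  person = c("sam", "greg"),
  pos_words = I(list(c("good", "great"), character(0))),
  stringsAsFactors = FALSE
)

# Each cell of the list column is an ordinary character vector
df$pos_words[[1]]

# Derive a regular atomic column from the list column
df$n_pos <- vapply(df$pos_words, length, integer(1))
df$n_pos
```

The table still prints row by row, but anything computational has to go through lapply/vapply on the list column, which is the trade-off discussed above.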
 

bryangoodrich

Probably A Mammal
You can create your own display functions to do it better, and you should, as your qdap should have its own class of functions to deal with specifically qdap oriented stuff! :p
 

bryangoodrich

Probably A Mammal
TIL about lookback (usually called lookbehind) and lookahead in Perl regular expressions. So I tried them in R and they work!

Code:
s <- "Of Mice and Men, Goodbye."
ss <- "Of Mole and Men, Goodbye."
sss <- "Of Foo and Bar, Tag!"
regexpr(" and Men", c(s, ss, sss), perl = TRUE)           # 1
regexpr("Mice and Men", c(s, ss, sss), perl = TRUE)       # 2
regexpr("(?<=Mice) and Men", c(s, ss, sss), perl = TRUE)  # 3
regexpr("Of .*(?=and Men)", c(s, ss, sss), perl = TRUE)   # 4
regexpr("Of .* and Men", c(s, ss, sss), perl = TRUE)      # 5
Explanation
  1. Since both s and ss have " and Men" the strings match and at the same position (8), but sss doesn't have that literal string, so it fails to match.
  2. Only s is matched at position 4 after "Of " because it contains that literal string, while ss and sss fail to match.
  3. Only s is matched again, but this time at position 8: the lookback confirms that "Mice" comes immediately before the literal " and Men", but "Mice" itself is not part of the match.
  4. This matches s and ss because both contain a literal "Of " followed by some string ('Mice ' and 'Mole ', respectively) and then, per the lookahead, "and Men", which sss lacks.
  5. This is again matched to s and ss but notice the length of the matches is much longer because it contains the lookahead string used in (4).
Thus, when you want to match a pattern only in relation to something else (behind or ahead of it), you can use these lookback and lookahead assertions (think of "?" as "look" and "<=" and "=" as "back" and "ahead", respectively). The advantage demonstrated here is that your matches won't contain the lookback or lookahead strings.
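
To see what actually ends up in the match, here is a small sketch that pulls the matched text out with regmatches for patterns 3, 4, and 5 on the first string only. The variable names m3, m4, and m5 are just labels for this illustration.

```r
s <- "Of Mice and Men, Goodbye."

# Pattern 3: the lookback text "Mice" is checked but not returned
m3 <- regexpr("(?<=Mice) and Men", s, perl = TRUE)
regmatches(s, m3)            # " and Men"

# Pattern 4: the lookahead text "and Men" is checked but not returned
m4 <- regexpr("Of .*(?=and Men)", s, perl = TRUE)
regmatches(s, m4)            # "Of Mice "

# Pattern 5: no lookahead, so "and Men" is part of the match
m5 <- regexpr("Of .* and Men", s, perl = TRUE)
regmatches(s, m5)            # "Of Mice and Men"

# The match lengths differ accordingly
c(attr(m4, "match.length"), attr(m5, "match.length"))  # 8 15
```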

Use Case

Suppose you want to identify the string in

Code:
someVar = String
Then the regex "(?<=someVar = )String" should match only when 'String' comes right after "someVar = ", while the match itself contains only the String we're interested in.

Code:
s <- "someVar = String"
regex <- "(?<=someVar = )String"
regmatch <- regexpr(regex, s, perl = TRUE)
regmatches(s, regmatch)  # Returns "String"
 

trinker

ggplot2orBust
Probably silly to most of you but I am creating classes and need another function to check if an object is of that class. I was going to make my own is function when I put is in the command line and discovered there already is an is function that allows you to check if an object is of a class.

Code:
class(mtcars) <- c("tt", class(mtcars))
class(mtcars)
is(mtcars, "tt")
Code:
> class(mtcars) <- c("tt", class(mtcars))
> class(mtcars)
[1] "tt"         "data.frame"
> is(mtcars, "tt")
[1] TRUE
 

bryangoodrich

Probably A Mammal
The help page says inherits is also faster in most cases, but it raises the question of why you need to know the class to perform a given action. That action should be contained in a method: if it is a commonly named method (like print), you can just define your own version as print.whatever for whatever class you want. Behind the scenes, R will figure out if it is the right class or not. So if you're doing the leg work there, maybe you're thinking about the problem the wrong way!
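
For reference, a minimal sketch of the inherits approach on the same kind of object. It works on a copy of mtcars so the built-in data set is left alone, and "foo" is a made-up class name that the object does not have.

```r
x <- mtcars                        # work on a copy
class(x) <- c("tt", class(x))

inherits(x, "tt")                  # TRUE
inherits(x, "data.frame")          # TRUE

# which = TRUE reports the position of each name in class(x); 0 means no match
inherits(x, c("foo", "tt"), which = TRUE)
```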
 

trinker

ggplot2orBust
@BG it depends on what you're using class for. Say you assign 2 classes to the same object (a parent and a child) and use a common print function for both but different plot methods for each. The print method then needs to figure out the specific child class to make a small tweak.

It could be done in better ways (i.e., send the information via a list and have the print method make it look like a dataframe that you're trying to print) but the audience for qdap is non-programmers. A list is scarier and more difficult to deal with for a non-programmer than a dataframe. It's especially dangerous to a new R user if it looks like a dataframe because of the print method but the object is actually a list.

If the programming were ideal and for programmers I'd say you're right but the ultimate goal of qdap is to get people in my field to use the package.
 

bryangoodrich

Probably A Mammal
TIL that R handles complex numbers with an 'i' modifier

Code:
5i
# [1] 0+5i
5i * (1+2i)
# [1] -10+5i
Just think of all the things you can do in R now! :p
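
A few of those things, as a quick sketch using only base R functions (the value of z is arbitrary):

```r
z <- 3 + 4i

Re(z)     # real part
Im(z)     # imaginary part
Mod(z)    # modulus
Conj(z)   # complex conjugate

# sqrt() needs a complex input to return a complex result;
# sqrt(-1) on a plain numeric gives NaN with a warning
sqrt(as.complex(-1))
```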



@Trinker, looking at your last comment again, you have something with 2 classes but you want to use the same print method but different plot methods? Just don't make a print.Child method. It'll automatically go to the print.Parent method, in the order of the class vector (faux inheritance). In any case, none of these methods and the data types being used should be revealed to the naive user. They should just have a comfortable time using your class, just like you don't know what "+" is really doing with the gg and ggplot classes returned and used amongst geom_line() and such in your ggplot2 outputs.
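
A rough sketch of that fall-through, with made-up class names (parent, child) and a made-up generic (describe) standing in for plot so nothing gets drawn. There is a print.parent but no print.child, so printing a child object lands on the parent method.

```r
# Shared print method, defined for the parent class only
print.parent <- function(x, ...) {
  cat("<parent object with", length(unclass(x)), "elements>\n")
  invisible(x)
}

# Hypothetical generic with a separate method per class
describe <- function(x, ...) UseMethod("describe")
describe.parent <- function(x, ...) "parent description"
describe.child  <- function(x, ...) "child description"

obj <- structure(list(a = 1, b = 2), class = c("child", "parent"))

print(obj)      # no print.child, so print.parent is used
describe(obj)   # describe.child is found first
```

If the shared print method really does need a child-specific tweak, an inherits(x, "child") check inside print.parent is one way to do it without exposing any of this to the user.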
 

trinker

ggplot2orBust
TIL when you want to switch A for B and B for A there's a function to do that so you don't need a place holder.

How I used to do these substitutions:
Code:
x <- "ABCDBBABDC"

x2 <- gsub("A", "PLACE_HOLDER", x)
x3 <- gsub("B", "A", x2)
gsub("PLACE_HOLDER", "B", x3)
The chartr approach:
Code:
x <- "ABCDBBABDC"
chartr("AB", "BA", x)

y <- "1234567"
chartr("1234", "4321", y)
 

trinker

ggplot2orBust
I was testing a simple use of loop-type functions in R and thought I'd share the results:

The code:
Code:
x <- rnorm(100000)

FOR <- function() {
    m <- rep(NA, 100000)
    for (i in 1:100000) {
        m[i] <- x[i] *100
    }
}

LAPPLY_UNLIST <- function() unlist(lapply(x, "*", 100))
LAPPLY_C <- function() c(lapply(x, "*", 100))  # note: c() on a single list returns the list as-is (not flattened)
VECT <- function() x*100
SAPPLY <- function() sapply(x, "*", 100, USE.NAMES = FALSE)
VAPPLY <- function() vapply(x, "*", 0, 100, USE.NAMES = FALSE)

library(microbenchmark)
microbenchmark( 
    FOR(), 
    LAPPLY_UNLIST(),
    LAPPLY_C(),
    SAPPLY(),
    VAPPLY(),
    VECT(),
times=100L)
The Results:
Code:
## Unit: microseconds
##             expr        min         lq     median         uq        max neval
##            FOR() 215543.498 220728.646 222126.280 224245.356 293300.641   100
##  LAPPLY_UNLIST()   4412.157   4455.774   4511.522   4591.760   8728.681   100
##       LAPPLY_C()   4539.512   4588.261   4635.843   4740.807   8838.308   100
##         SAPPLY()   9059.896   9205.910   9366.387   9718.360  13330.236   100
##         VAPPLY()   4255.879   4313.026   4378.569   4526.682   7950.559   100
##           VECT()     11.196     16.095     18.194     21.926     41.986   100
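
As a sanity check on what is being compared, here is a small sketch (tiny made-up input, not the benchmark data) showing which of these approaches return the same numeric vector. Note that the c(lapply(...)) variant still returns a list, so it is not quite doing the same job as the others.

```r
x <- c(1.5, -2, 0.25)

v_vect   <- x * 100
v_vapply <- vapply(x, function(xi) xi * 100, numeric(1))
v_unlist <- unlist(lapply(x, "*", 100))
v_c      <- c(lapply(x, "*", 100))

identical(v_vect, v_vapply)   # TRUE
identical(v_vect, v_unlist)   # TRUE
is.list(v_c)                  # TRUE: c() did not flatten the list
```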

And mapply too (I need to use you more often):

The code:
Code:
x <- matrix(rnorm(10000), ncol = 100)
y <- matrix(rnorm(10000), ncol = 100)

FOR <- function() {
    m <- matrix(rep(NA, 10000), ncol = 100)
    for (i in 1:100) {
        for (j in 1:100) {
            m[j, i] <- x[j, i] * y[j, i]  # element-wise product, matching mapply("*", x, y)
        }
    }
}


MAPPLY <- function() mapply("*", x, y)

library(microbenchmark)
microbenchmark( 
    FOR(), 
    MAPPLY(),
times=100L)
The Results:
Code:
## Unit: milliseconds
##      expr      min       lq   median       uq      max neval
##     FOR() 23.15566 24.11758 26.30850 26.83961 31.35486   100
##  MAPPLY() 16.56728 16.95471 17.84852 19.89972 79.74163   100
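
One thing worth knowing when comparing these: mapply over two matrices walks the elements in column-major order and hands back a plain vector, so it has to be reshaped to line up with the matrix result. A small sketch with made-up 2 x 2 inputs:

```r
x <- matrix(c(1, 2, 3, 4), ncol = 2)
y <- matrix(c(10, 20, 30, 40), ncol = 2)

m <- mapply("*", x, y)
m                                  # plain numeric vector of length 4

# Reshape to compare with the vectorized element-wise product
identical(matrix(m, ncol = ncol(x)), x * y)
```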
 

trinker

ggplot2orBust
TIL you can test if the output from try is an error. You could do the same thing with tryCatch but still interesting:

Code:
x <- try(sum(letters))
inherits(x, "try-error")
Produces:

Code:
## > x <- try(sum(letters))
## Error in sum(letters) : invalid 'type' (character) of argument
## > inherits(x, "try-error")
## [1] TRUE
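
For comparison, a minimal sketch of the tryCatch version mentioned above. The handler simply returns the condition object so it can be inspected afterwards; silent = TRUE on try just suppresses the printed error.

```r
# tryCatch: return the error condition instead of stopping
res <- tryCatch(sum(letters), error = function(e) e)
inherits(res, "error")       # TRUE
conditionMessage(res)        # the error message as a string

# try: same test as above, without printing the error
x <- try(sum(letters), silent = TRUE)
inherits(x, "try-error")     # TRUE
```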
 

bryangoodrich

Probably A Mammal
That's interesting. I've only barely ever used error handling in R. Really, I just don't like it that much lol. The tryCatch thing just seems awkward, but maybe this try command is more intuitive or flexible. In any case, error handling is very important to incorporate in production quality code.
 

bryangoodrich

Probably A Mammal
TIL (though I may have seen them before) about some nice filename functions

Code:
basename("dir/foo.csv")  # foo.csv
dirname("dir/foo.csv")  # dir
dirname("foo.csv")  # .
This can be handy, especially when dealing with coordinating moving files or naming objects based on file names, etc. I found it used in https://gist.github.com/schaunwheeler/5825002#file-xlsxtor-r which is an XML parser of spreadsheets, not unlike what I mash together for docx (see my github).
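
As a rough sketch of the "naming based on file names" idea, here is one way to build an output path from an input path. The path is made up, and tools::file_path_sans_ext is an extra helper from the tools package that ships with R, not one of the functions mentioned above.

```r
path <- file.path("dir", "foo.csv")

# Strip the directory, then the extension
name <- tools::file_path_sans_ext(basename(path))
name

# Build a sibling file name in the same directory
out <- file.path(dirname(path), paste0(name, ".rds"))
out
```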
 

bryangoodrich

Probably A Mammal
I know I mentioned this in the chatbox, but I always forget about the TIL thread. So even if I'm duplicating, it's worth it!

https://gephi.org/

Gephi is a nice open source network (graph) analysis and visualization tool. Between this and the igraph package in R (the igraph library has many language APIs, including C++, Python, and R), you can do some pretty amazing stuff! I've not had a chance to explore this deeply, and I'm still trying to learn about the Gephi data type and graphs in general, but I'm sure a workflow can be made between the two.
 

bryangoodrich

Probably A Mammal
TIL how awesome dbWriteTable (DBI) is for pushing a CSV into a database from R. Sure, you can use the database command-line utility and import it, but then you need to create the table structure it is going into beforehand, and it can get really messy, especially when your table has 117 fields! I managed to do this in R to make it a lot easier, but then I found dbWriteTable and my problems were solved.

Code:
createMyTable <- function(dbname, data_src, tblname, sep = ";") # Yes, my file is really ";" separated. Old database artifacts!
{
    require(RSQLite)
    
    # Make Cleaner Field Names
    fields <- tolower(strsplit(readLines(data_src, n=1), sep)[[1]])
    fields <- gsub("1st", "first", fields)
    fields <- gsub("2nd", "second", fields)
    fields <- make.names(gsub("-|:|\\.|\\(|\\)", "", fields))  # removes - : . ( and )
    fields <- gsub("\\.\\.", "\\.", fields)  # remove any missed additional '.' (extra spaces and such)
    fields <- gsub("\\.", "_", fields)  # Convert '.' word separation to "_" word separation
    
    con <- dbConnect(SQLite(), dbname = dbname)
    on.exit(expr=dbDisconnect(con), add=TRUE)
    dbWriteTable(conn = con, name = tblname, value = data_src, sep = sep,
                 row.names = FALSE, col.names = fields, header = TRUE)
}
That's it! Stupid easy!!

You can tell it was all database driven, too, because the memory usage was minimal and it went fast. For scale, I have almost a 4 GB file I compiled (using Python) into a single CSV that this script then pushes into a database (result: 2 GB).

The documentation sucks, however. I seriously didn't know half of those options existed! There's nothing about "sep" working or that "col.names" was legitimate. I just assumed it was read.table like, and it worked. The documentation, however, only says "any optional arguments that the underlying database driver supports" and gives examples for row.names, overwrite, and append. Not useful! WTF does the database driver support? How do I find out? Bad documentation.

In any case, it took about 1 minute per GB, so I might try this again with a colClasses argument, building off my earlier script where I generated code to produce a CREATE TABLE statement in SQL, since I want the classes to match up (a lot of the ID columns are now numeric on this import; I don't want that!). So in some sense, I'm still doing the busy work I would have had to do anyway, but I get the semantics and uniformity of doing this all in R. I like DBI. It's my go-to database R package!
 

bryangoodrich

Probably A Mammal
Tried out colClasses ... didn't help. I've seen something for RMySQL that had a field.types argument or something like that, but that starts to look driver specific, maybe? I don't see any documentation for it. That angers me to no end -_-
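
A hedged sketch of the field.types idea, for what it's worth. It assumes a reasonably recent RSQLite, where dbWriteTable for a data frame accepts a named field.types vector; that may not hold for the version used above, and this writes a small made-up data frame rather than a CSV file. The table and column names are invented for illustration.

```r
# Only runs if DBI and RSQLite are installed
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Hypothetical data: IDs kept as character so leading zeros survive
  ids <- data.frame(id = c("001", "002"), n = c(10L, 20L),
                    stringsAsFactors = FALSE)

  # Declare the SQL column types explicitly instead of letting them be guessed
  DBI::dbWriteTable(con, "ids", ids,
                    field.types = c(id = "TEXT", n = "INTEGER"))

  back <- DBI::dbReadTable(con, "ids")
  DBI::dbDisconnect(con)

  back$id
}
```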