Share your functions & code

bryangoodrich

Probably A Mammal
Well, using the example from the other thread:

Closure definitions

Code:
match_between_gen <- function(value) {
    function(x, lookback, lookahead, pattern = ".*") {
        regex <- "(?<=__LOOKBACK__)__PATTERN__(?=__LOOKAHEAD__)"
        regex <- gsub("__LOOKBACK__", lookback, regex)
        regex <- gsub("__LOOKAHEAD__", lookahead, regex)
        regex <- gsub("__PATTERN__", pattern, regex)
        regmatch <- gregexpr(regex, x, perl = TRUE)
        if (!value) regmatch else regmatches(x, regmatch)
    }
}

regex_values  <- match_between_gen(TRUE)
regex_matches <- match_between_gen(FALSE)
Typical definition

Code:
match_between <- function(x, lookback, lookahead, pattern = ".*", value = TRUE) {
    regex <- "(?<=__LOOKBACK__)__PATTERN__(?=__LOOKAHEAD__)"
    regex <- gsub("__LOOKBACK__", lookback, regex)
    regex <- gsub("__LOOKAHEAD__", lookahead, regex)
    regex <- gsub("__PATTERN__", pattern, regex)
    regmatch <- gregexpr(regex, x, perl = TRUE)
    if (!value) regmatch else regmatches(x, regmatch)
}
Then the usage

Code:
html <- c("<blockquote>", "<A NAME=1.1.2>And I in going, madam, weep o'er my father's death[/COLOR]</A><br>", 
"<A NAME=1.1.3>anew: but I must attend his majesty's command, to</A><br>", 
"<A NAME=1.1.4>whom I am now in ward, evermore in subjection.</A><br>", 
"</blockquote>")

open_tag <- "<A NAME=[0-9]\\.[0-9]\\.[0-9]>"
close_tag <- "</A>"

regex_values(html, lookback = open_tag, lookahead = close_tag)
[COLOR="royalblue"]# [[1]]
# character(0)
# 
# [[2]]
# [1] "And I in going, madam, weep o'er my father's death[noparse][/COLOR][/noparse]"
# 
# [[3]]
# [1] "anew: but I must attend his majesty's command, to"
# 
# [[4]]
# [1] "whom I am now in ward, evermore in subjection."
# 
# [[5]]
# character(0)

regex_matches(html, open_tag, close_tag)
[COLOR="royalblue"]# [[1]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"useBytes")
# [1] TRUE
# 
# [[2]]
# [1] 15
# attr(,"match.length")
# [1] 58
# attr(,"useBytes")
# [1] TRUE
# 
# [[3]]
# [1] 15
# attr(,"match.length")
# [1] 49
# attr(,"useBytes")
# [1] TRUE
# 
# [[4]]
# [1] 15
# attr(,"match.length")
# [1] 46
# attr(,"useBytes")
# [1] TRUE
# 
# [[5]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"useBytes")
# [1] TRUE

match_between(html, open_tag, close_tag)  # values
# regex_values(...) results

match_between(html, open_tag, close_tag, value = FALSE)  # matches
# regex_matches(...) results
Now one might say "you can just make wrappers of match_between where the value parameter is set appropriately." That is true, but it requires you to duplicate code manually, potentially introduces errors, and is harder to maintain as the code evolves. Functional programming, as I've experienced it, gives you better control not only over the code and its semantics (if you design them well) but also over the state of variables, which to me is the coolest benefit of closures: controlling a variable within a function environment.

One other thing I like is that with the closure definition, the formal arguments for regex_matches and regex_values are the same, which I think should be expected since they do the same thing; only the return object changes. The complexity is increased by having more functions to remember, but that is balanced by being able to better manage the parameter space and the semantics of those functions.
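
For comparison, the wrapper approach mentioned above would look something like this (hypothetical names; note the duplicated argument lists, which is exactly the maintenance burden I want to avoid):

Code:
regex_values2  <- function(x, lookback, lookahead, pattern = ".*")
    match_between(x, lookback, lookahead, pattern, value = TRUE)

regex_matches2 <- function(x, lookback, lookahead, pattern = ".*")
    match_between(x, lookback, lookahead, pattern, value = FALSE)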
 

bryangoodrich

Probably A Mammal
Well, I used the stuff I learned from Hadley's functional programming material to take two lists, one of measures (sum, mean, load factor, ...) and one of time aggregations (year, quarter, week, ...), and generate all combinations of them as Level_Measure functions. Thus, a person can call read.meter(...meter number...) to read in a meter. Then they can call, say, Quarter_LoadFactor on it to generate a table (data.table) of that measure at that level of aggregation. This can be used directly in ggplot quite easily. I'm thinking of building classes for certain plots so maybe I can do something like "Trend + Week_Mean(x)" and "Profile + Work_Schedule(x)" to do the appropriate ggplot on the appropriate table transformation. That may be a much later thing, however. I have no idea how much work it will take until I start digging into the ggplot code to see how Hadley did what he did. It would be a ton easier for an end-user than having to write Trend(Week_LoadFactor(x)) or something.

Now that I think about it, this may be easier than I thought. "Trend" here would simply be a ggplot object containing all the stuff I need it to contain, and what I'm adding is the data. What I'm wrapping just needs to validate that the data is of the right form, hence maybe my own custom class.

Anyway, read Hadley's stuff and work with it. I'm using this everywhere these days, putting things together much faster than I used to, and the results come out a lot cleaner (semantically).
 

bryangoodrich

Probably A Mammal
Yeah, that was easier than I thought. Turns out ggplot has an infix operator for replacing the data frame. So,

Code:
Trend <- ggplot(data.table(), aes(...)) + geom_line()
Trend %+% Day_Mean(x)
Trend %+% Week_LoadFactor(x)
... the possibilities are endless! Or about 20ish
It's not exactly what I would like. It would take a bit of work to make my own + supersede all this: I'd create my own class returned by these aggregate measure functions, so it would read like "aggregate measure plus trend [plot]", where Trend is a predefined ggplot object. Then the end-user could do Day_Max(x) + Trend or Year_Mean(x) + Trend. It would be easy to use and semantically clear.

I was surprised by how well this works, though.
 

bryangoodrich

Probably A Mammal
Oh! It's so pretty!

Code:
level_measure <- function(Level, Measure) {
    force(Level)
    Measure <- match.fun(Measure)
    function(X) {
        x <- X[, list(kW = Measure(kW)),
          by = list(meter, "Agg" = cut(timestamp, breaks=Level))]
        class(x) <- c("foobar", class(x))
        x
    }
}

args <- expand.grid(
    Level   = list("Year" = "year", "Quarter" = "quarter",
                   "Month" = "month", "Week" = "week", "Day" = "day"),
    Measure = list("Sum" = sum, "Mean" = mean, "Max" = max,
                   "LoadFactor" = function(x) mean(x) / max(x)),
    stringsAsFactors = FALSE)
funs <- lapply(1:nrow(args), function(n) {
    force(n)
    level_measure(args$Level[[n]], args$Measure[[n]])
})
names(funs) <- with(args, paste(names(Level), names(Measure), sep="_"))
list2env(funs, environment())
rm(args, funs)
So this generates a bunch of functions for the different aggregation levels and measures I've talked about. Each generated function also returns an object of class "foobar". Now ...

Code:
`+.foobar` <- function(x,p) p %+% x
Trend <- ggplot(data.frame(), aes(as.numeric(Agg), kW, col=meter)) + geom_line() + theme_bw()
class(Trend) <- c("foobar", class(Trend))
That's it, now I can do any of those combinations

Code:
x <- read_meter(...)
Quarter_LoadFactor(x)
Week_Mean(x) + Trend
Here my data model is a three-column data.table with (meter, timestamp, kW) tuples: meter is an identifier, timestamp is a 15-minute interval reading time, and kW is the value read at that timestamp.
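
If you want to try the generated functions without my meter data, a made-up table in that shape will do (assuming data.table is loaded and timestamp is POSIXct so cut() can break it by day, week, and so on):

Code:
library(data.table)
# two meters, one day of 15-minute readings each (values are made up)
toy <- data.table(
    meter     = rep(c("M1", "M2"), each = 96),
    timestamp = rep(seq(as.POSIXct("2013-01-01"), by = "15 min", length.out = 96), 2),
    kW        = runif(192, 0, 5))

Day_Mean(toy)   # one row per meter per day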

It truly was easy to do. It needs a bit of work to be "good" in a development sense (input validation, among other things), but as a proof of concept it works as expected.

Now I just need to figure out how to build a package from stuff like this. For instance, I can run package.skeleton on it to generate R code files, but those files don't capture the enclosing environments (so they just hard code the names Measure and Level, not their values for each function). I could wing it: take those files, combine them into one, document them as a group, and have the package regenerate the functions when it loads or something. But that seems shoddy.
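
One idea I might try (just a sketch, untested): keep only level_measure() and the args table as static package code, and have .onLoad build the generated functions into the package namespace when it loads, documenting the whole family on one help page.

Code:
# Sketch: generate the Level_Measure functions at load time instead of storing them as files
.onLoad <- function(libname, pkgname) {
    args <- expand.grid(
        Level   = list("Year" = "year", "Quarter" = "quarter", "Month" = "month",
                       "Week" = "week", "Day" = "day"),
        Measure = list("Sum" = sum, "Mean" = mean, "Max" = max,
                       "LoadFactor" = function(x) mean(x) / max(x)),
        stringsAsFactors = FALSE)
    funs <- lapply(seq_len(nrow(args)), function(n)
        level_measure(args$Level[[n]], args$Measure[[n]]))
    names(funs) <- with(args, paste(names(Level), names(Measure), sep = "_"))
    # the namespace isn't sealed yet during .onLoad, so assigning into it works
    for (nm in names(funs)) assign(nm, funs[[nm]], envir = asNamespace(pkgname))
}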
 

bryangoodrich

Probably A Mammal
Since I'm sharing code, here's another one going into my bmisc package

Code:
StringSplit <- function(s, c, i) vapply(strsplit(s, c), "[", "", i)
I hate getting a list back from strsplit when I just want a certain piece of each string. So, this gives me what I want.

Code:
x <- c("2013-03-15 14:30:00", "2013-07-29 09:00:00", "2013-10-19 19:30:00", 
"2013-08-16 08:15:00", "2013-06-11 04:45:00", "2013-12-31 06:45:00", 
"2013-10-06 16:15:00", "2013-01-13 03:00:00", "2013-09-07 09:15:00", 
"2013-03-27 09:15:00")
StringSplit <- function(s, c, i) vapply(strsplit(s, c), "[", "", i)

StringSplit(x, ' ', 1)
#  [1] "2013-03-15" "2013-07-29" "2013-10-19" "2013-08-16" "2013-06-11"
#  [6] "2013-12-31" "2013-10-06" "2013-01-13" "2013-09-07" "2013-03-27"

StringSplit(x, ' ', 2)
#  [1] "14:30:00" "09:00:00" "19:30:00" "08:15:00" "04:45:00" "06:45:00"
#  [7] "16:15:00" "03:00:00" "09:15:00" "09:15:00"
It was useful in my case for parsing timestamp strings, but it is really a general wrapper for strsplit that should (hopefully) be quick at grabbing the element you want from each split. There are reasons strsplit returns a list, so some validation should probably be included to handle the kind of data this is intended to work against, but it wasn't meant to be sophisticated as-is.
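
If I do add validation, it would probably be along these lines (a sketch; StringSplit2 is just a stand-in name):

Code:
StringSplit2 <- function(s, split, i) {
    stopifnot(is.character(s), length(i) == 1L)
    vapply(strsplit(s, split), function(p) {
        # NA instead of an error when a string has fewer than i pieces
        if (length(p) >= i) p[[i]] else NA_character_
    }, "")
}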
 

trinker

ggplot2orBust
This is a bit above my pay grade, but I know you can export environments as data sets (not sure if this helps). Also, I have been working on a package skeleton function (it will never be released to CRAN and is for my own personal use) that may be of use to you. It generates its own roxygen2-templated .R files. A bit fancier than package.skeleton.

I plan to add methods detection as well.

https://github.com/trinker/acc.roxygen2

Specifically the ?package_template function is useful for building templates.
 

bryangoodrich

Probably A Mammal
Yeah, but each data set would have to be linked to the function environment it belongs to (is used in). Unless there's a built-in way to handle that, I'm just not sure. Maybe I'll shoot Hadley a message about this, or his code will give some insight into how he handled it. Something to study later. Since I have a workaround that avoids needing a package to get done what I need to get done, I'll focus on continuing to build this suite of tools. Package development will come later.
 

trinker

ggplot2orBust
BG said:
Package development will come later.
I would strongly recommend against this. I'd include it in a package on GitHub and add roxygen2 documentation as you go. This saves time recalling documentation later, is less boring in small chunks, lets you test as you go and find breaking points, lets others see and test your package, and provides a means of storing the package in a safe place.
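
As a concrete illustration of documenting as you go, a roxygen2 block is just comments above the function, so it costs almost nothing to write while the code is fresh (using the StringSplit function from earlier in the thread as a stand-in example):

Code:
#' Split strings and keep one piece of each
#'
#' @param s A character vector to split.
#' @param c The string to split on.
#' @param i Which piece of each split to keep.
#' @return A character vector the same length as \code{s}.
#' @examples
#' StringSplit(c("a b", "c d"), " ", 2)
#' @export
StringSplit <- function(s, c, i) vapply(strsplit(s, c), "[", "", i)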
 

bryangoodrich

Probably A Mammal
I always document as I go, but the point is that I CAN'T document something that doesn't exist (statically). I can document the generating function (level_measure), but not the offspring I generate from a list of different levels (time cut breaks) and measures (functions that reduce a vector to one value). Nobody else is going to be working on this; I'm the only R programmer here, and it's all closed since it's work specific. I was thinking of using a private Bitbucket repository for versioning, though.
 

Lazar

Phineas Packard
A while back Dason had code on his blog showing how to get the point estimate and uncertainty around a turning point in polynomial regression. I stole the idea for latent growth curve models. The lavaan code is:
Code:
library(lavaan)
m_quadratic <- '
              i =~ 1*W1+1*W2+1*W3+1*W4+1*W5
              s =~ 0*W1+1*W2+2*W3+3*W4+4*W5
              q =~ 0*W1+1*W2+4*W3+9*W4+16*W5
              s ~ b*1
              q ~ c*1 
              tp := -1*(b)/(2 * (c))
            '
fitQ <- growth(m_quadratic, data=careerPros, missing="fiml")
i = intercept
s = linear slope
q = quadratic slope
tp gives the turning point (for a quadratic a + b*t + c*t^2 the turning point is at t = -b/(2*c)) along with a standard error for confidence intervals.
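
To pull tp out with its standard error and confidence interval after fitting, something like this should work (a sketch; it only relies on lavaan's standard parameterEstimates output):

Code:
pe <- parameterEstimates(fitQ)   # includes rows for := defined parameters
pe[pe$label == "tp", c("est", "se", "ci.lower", "ci.upper")]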
 

Dason

Ambassador to the humans
I've been meaning to write a sequel post to that one that does the same thing, except using nonlinear regression instead of the delta method.
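
Roughly the idea (a sketch with made-up data, not the blog post itself): reparameterize the quadratic around its turning point so tp is a parameter of the model and nls reports its standard error directly, no delta method needed.

Code:
# y = a + curv*(t - tp)^2 is the same quadratic, but tp is now a parameter
set.seed(42)
t <- seq(0, 10, by = 0.25)
y <- 2 + 0.8*t - 0.05*t^2 + rnorm(length(t), sd = 0.3)   # true turning point at t = 8
fit <- nls(y ~ a + curv*(t - tp)^2, start = list(a = 5, curv = -0.05, tp = 8))
summary(fit)$coefficients["tp", ]   # estimate and standard error for the turning point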
 
R Tips and Tricks and Graphics

Here are some useful tips and tricks in R. Please share the tips and tricks you know in this thread:

Ctrl + L - clears the screen
Home and Ctrl + K - go to the beginning of the line of code and delete the line
ls("package:base") - list the functions in base R
rm(list=setdiff(ls(), "x")) - remove all objects except "x"
read.csv("data.csv", nrows=10) - import only the first 10 rows of data
read.csv("data.csv", skip=100) - skip the first 100 rows of data and import the rest


To change your working directory to a specific folder permanently, open Notepad++ as administrator (right-click it and choose "Run as administrator"), add setwd("your working directory folder") on line 18 of the "Rprofile.site" file in your R directory, and save.

Attached is a cool 3D interactive graphic from the code below. I just love the "rgl" package:

Code:
install.packages("rgl"); library(rgl)
x <- seq(-3, 3, length=50)
y <- x
f <- function(x, y) {x*exp(-x^2 - y^2)}
z <- outer(x, y, f)
plot3d(x, y, z, box=FALSE, axes=FALSE, xlab="", ylab="", zlab="")
surface3d(x, y, z, col=heat.colors(30))
I admit I learned these from the "R Fundamentals & Graphics" ebook. If you know of any other R graphics ebooks (I don't like paper), please share.
 

Dason

Ambassador to the humans
Re: R Tips and Tricks and Graphics

To change your working directory to a specific folder permanently, open Notepad++ as administrator (right-click it and choose "Run as administrator"), add setwd("your working directory folder") on line 18 of the "Rprofile.site" file in your R directory, and save.
There are better ways to do this that don't require administrator privileges. Just modify the .Rprofile in your home directory. If you don't have a .Rprofile file go ahead and create one. This will only modify things for you as a user. If you modify Rprofile.site you're making changes for *everybody* running on that computer.
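
For example, a minimal ~/.Rprofile could be nothing more than this (the path is a placeholder; use whatever folder you actually want):

Code:
# ~/.Rprofile -- runs at the start of every R session for this user only
.First <- function() {
    setwd("C:/Users/yourname/projects")   # replace with your own folder
}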
 
Re: R Tips and Tricks and Graphics

Using loops takes much more processing time than "apply" family functions, especially if the data set is big. For example, we can use the function "proc.time()" to compare the processing times in assigning bonuses for 20,000 employees by using "for..loop" versus "sapply":

Code:
emp <- data.frame(num=1:20000, service=round(runif(20000, 1, 30)))

{ptm <- proc.time()
for (i in 1:20000) {
    if (emp$service[i] < 10) {
        emp$bonus[i] <- 1000
    } else {
        emp$bonus[i] <- 2000
    }
}
proc.time() - ptm}

{ptm <- proc.time()
emp$bonus <- sapply(emp$service, function(x) {if (x < 10) {x = 1000} else {x = 2000}})
proc.time() - ptm}
 

trinker

ggplot2orBust
Re: R Tips and Tricks and Graphics

Win said:
Using loops takes much more processing time than "apply" family functions, especially if the data set is big. For example, we can use the function "proc.time()" to compare the processing times in assigning bonuses for 20,000 employees by using "for..loop" versus "sapply"
This is not always true. Loops have been optimized for some time now. Generally, benchmarking will tell you which approach is speedier.
 

Dason

Ambassador to the humans
Re: R Tips and Tricks and Graphics

This is not always true. Loops have been optimized for some time now. Generally, benchmarking will tell you which approach is speedier.
Code:
library(microbenchmark)
library(ggplot2)

n <- 2000
emp <- data.frame(num = seq_len(n), service = round(runif(n, 1, 30)))
# preallocate to get rid of any initial assignment cost 
emp$bonus <- 0

for_loop_naive <- function(){
    for(i in 1:nrow(emp)){
        if(emp$service[i] < 10){
            emp$bonus[i] <- 1000
        }else{
            emp$bonus[i] <- 2000
        }
    }
}

using_cut <- function(){
    emp$bonus <- c(1000, 2000)[cut(emp$service, c(0, 10, Inf), right = FALSE)]
}

for_loop_pre <- function(){
    x <- numeric(n)
    for(i in seq_along(x)){
        if(emp$service[i] < 10){
            x[i] <- 1000
        }else{
            x[i] <- 2000
        }
    }
    emp$bonus <- x
}

using_ifelse <- function(){
    emp$bonus <- ifelse(emp$service < 10, 1000, 2000)
}

direct_assign <- function(){
    emp$bonus <- 2000
    emp$bonus[emp$service < 10] <- 1000
}

direct_mult <- function(){
    emp$bonus <- 1000*(1+(emp$service >=10))
}

direct_subset <- function(){
    emp$bonus <- c(1000, 2000)[1 + (emp$service >= 10)]
}

using_sapply_if <- function(){
    emp$bonus <- sapply(emp$service, function(x){if(x < 10){x=1000}else{x=2000}})
}

using_sapply_direct <- function(){
    emp$bonus <- sapply(emp$service, function(x){1000*(1 +(x>=10))})
}


res <- microbenchmark(for_loop_naive(), 
               for_loop_pre(), 
               direct_assign(), 
               direct_mult(),
               direct_subset(),
               using_ifelse(), 
               using_cut(),
               using_sapply_if(),
               using_sapply_direct())

qplot(y=time, data = res, colour = expr) + scale_y_log10()
I get the following:
Code:
> res
Unit: microseconds
                  expr        min          lq     median          uq        max neval
      for_loop_naive() 193834.794 199462.1860 204207.658 222927.8815 415161.006   100
        for_loop_pre()  64657.321  68943.3060  70412.277  75940.6695 205239.318   100
       direct_assign()    320.222    342.6415    360.416    384.1620    789.416   100
         direct_mult()     96.869    106.3680    114.609    135.0035   5492.598   100
       direct_subset()    164.546    179.1430    192.727    221.8505   9447.990   100
        using_ifelse()   1850.933   1919.9375   1964.601   2026.6550   8430.542   100
           using_cut()   1437.124   1653.1780   1691.242   1803.4065   7208.529   100
     using_sapply_if()   4555.747   4710.2360   4902.788   5983.7925  13858.956   100
 using_sapply_direct()   6435.037   6838.5450   7164.041   8026.3005  33321.908   100
Direct assignment is much better in this case - but being tricky and only doing one assignment instead of two is the best way. The cut implementation would probably be faster if I used findInterval - I included it because it's fairly easy to generalize. One thing to keep in mind is that accessing a data.frame is a *slow* operation. So if you're building a loop and you have to access a single element (emp$service) in every iteration then you can easily get improvements by trying to work around that. You also always want to avoid building up your result vector as you go - preallocating can help if you can't avoid a loop. Clearly though for something simple like this you really should just be doing things directly.
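
To make the "work around the data.frame access" point concrete, here is a variant of the preallocated loop that pulls the column out once (same setup as above; I didn't rerun the benchmark with it):

Code:
for_loop_vector <- function(){
    service <- emp$service           # extract the column once; plain vector indexing is cheap
    x <- numeric(length(service))    # preallocate the result
    for(i in seq_along(service)){
        x[i] <- if(service[i] < 10) 1000 else 2000
    }
    emp$bonus <- x
}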

Vectorization is king in R - if you can vectorize something instead of using a loop (or apply family function) then that will most likely net you the best results.

I'm sure if we used data.table we would see some decent results too - but it's hard to beat the vectorized approach for something simple like this.
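
A data.table version of the direct subset idea would be something like this (a sketch; it assumes the emp table from above):

Code:
library(data.table)
setDT(emp)                                           # convert to a data.table by reference
emp[, bonus := c(1000, 2000)[1 + (service >= 10)]]   # one vectorized assignment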
 