# Share your functions & code

#### bryangoodrich

##### Probably A Mammal
Well using the example from the other thread,

Closure definitions

Code:
match_between_gen <- function(value) {
    function(x, lookback, lookahead, pattern = ".*") {
        # template: match PATTERN preceded by LOOKBACK and followed by LOOKAHEAD
        regex <- "(?<=__LOOKBACK__)__PATTERN__(?=__LOOKAHEAD__)"
        regex <- gsub("__LOOKBACK__", lookback, regex)
        regex <- gsub("__LOOKAHEAD__", lookahead, regex)
        regex <- gsub("__PATTERN__", pattern, regex)
        regmatch <- gregexpr(regex, x, perl = TRUE)
        if (!value) regmatch else regmatches(x, regmatch)
    }
}

regex_values  <- match_between_gen(TRUE)
regex_matches <- match_between_gen(FALSE)
Typical definition

Code:
match_between <- function(x, lookback, lookahead, pattern = ".*", value = TRUE) {
    regex <- "(?<=__LOOKBACK__)__PATTERN__(?=__LOOKAHEAD__)"
    regex <- gsub("__LOOKBACK__", lookback, regex)
    regex <- gsub("__LOOKAHEAD__", lookahead, regex)
    regex <- gsub("__PATTERN__", pattern, regex)
    regmatch <- gregexpr(regex, x, perl = TRUE)
    if (!value) regmatch else regmatches(x, regmatch)
}
Then the usage

Code:
html <- c("<blockquote>", "<A NAME=1.1.2>And I in going, madam, weep o'er my father's death[/COLOR]</A><br>",
"<A NAME=1.1.3>anew: but I must attend his majesty's command, to</A><br>",
"<A NAME=1.1.4>whom I am now in ward, evermore in subjection.</A><br>",
"</blockquote>")

open_tag <- "<A NAME=[0-9]\\.[0-9]\\.[0-9]>"
close_tag <- "</A>"

regex_values(html, lookback = open_tag, lookahead = close_tag)
# [[1]]
# character(0)
#
# [[2]]
# [1] "And I in going, madam, weep o'er my father's death[/COLOR]"
#
# [[3]]
# [1] "anew: but I must attend his majesty's command, to"
#
# [[4]]
# [1] "whom I am now in ward, evermore in subjection."
#
# [[5]]
# character(0)

regex_matches(html, open_tag, close_tag)
# [[1]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"useBytes")
# [1] TRUE
#
# [[2]]
# [1] 15
# attr(,"match.length")
# [1] 58
# attr(,"useBytes")
# [1] TRUE
#
# [[3]]
# [1] 15
# attr(,"match.length")
# [1] 49
# attr(,"useBytes")
# [1] TRUE
#
# [[4]]
# [1] 15
# attr(,"match.length")
# [1] 46
# attr(,"useBytes")
# [1] TRUE
#
# [[5]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"useBytes")
# [1] TRUE

match_between(html, open_tag, close_tag)  # values
# regex_values(...) results

match_between(html, open_tag, close_tag, value = FALSE) # matches
# regex_matches(...) results
Now one might say "you can just make wrappers of match_between where the value parameter is set appropriately." That is true, but it requires you to duplicate code manually, potentially introduces errors, and is harder to maintain as the code evolves. Functional programming, as I've experienced it, gives better management of both the code and its semantics (if you design them well), and also of the state of variables, which to me is the coolest benefit of closures: controlling a variable within a function environment.

One other thing I like is that with the closure definition, the formal arguments for regex_matches and regex_values are the same, which I think should be expected since they do the same thing; what changes is the return object. The complexity is increased by having more functions to remember, but that is balanced by being able to better manage the parameter space and the semantics of the functions to be remembered.
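That state-control benefit can be shown in a few lines; a minimal sketch with a hypothetical make_counter (not part of the code above), where the closure owns and updates its own count:

```r
# A closure that controls a variable inside its enclosing environment.
make_counter <- function() {
    count <- 0
    function() {
        count <<- count + 1  # <<- updates count in the enclosing environment
        count
    }
}

tick <- make_counter()
tick()  # 1
tick()  # 2
```

Each call to make_counter() produces an independent counter, and count is invisible to the global environment, which is exactly the kind of state control being described.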

#### bryangoodrich

##### Probably A Mammal
Well, I used what I learned from Hadley's functional programming material to take two lists, one of measures (sum, mean, load factor, ...) and one of time aggregations (year, quarter, week, ...), and generate all permutations of them as Level_Measure functions. Thus, a person can read.meter(...meter number...) to read in a meter number. Then they can call Quarter_LoadFactor and generate a table (data.table) of that measure at that level of aggregation. This can be used directly in ggplot quite easily. I'm thinking of building classes for certain plots so maybe I can do something like "Trend + Week_Mean(x)" and "Profile + Work_Schedule(x)" to do the appropriate ggplot on the appropriate table transformation. That may be a much later thing, however. I have no idea how much work it will take until I start digging into the ggplot code to see how Hadley did what he did. It would be a ton easier for an end user than having to write Trend(Week_LoadFactor(x)) or something.

Now that I think about it, this may be easier than I thought. "Trend" here would simply be a ggplot object containing all the stuff I need it to contain, and what I'm adding is the data. What I'm wrapping just needs to validate that the data is of the right form, hence maybe my own custom class.

Anyway, read Hadley's stuff and work with it. I'm using this everywhere these days, putting together things much faster than I used to and able to appear a lot cleaner (semantically).

#### bryangoodrich

##### Probably A Mammal
Yeah, that was easier than I thought. Turns out ggplot has an infix operator for replacing the data frame. So,

Code:
Trend <- ggplot(data.table(), aes(...)) + geom_line()
Trend %+% Day_Mean(x)
... the possibilities are endless! Or about 20ish
It's not exactly as I would like; it would take a bit of work to make my own + supersede all this. But then I'd create my own class, returned by these aggregate measure functions, so it would read like "aggregate measure plus trend [plot]", where Trend is a predefined ggplot object. Then the end user could write Day_Max(x) + Trend or Year_Mean(x) + Trend. It would be easy to use and semantically clear.

I was surprised how well this works, though.
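For anyone wanting to try the %+% trick without the meter-data setup, here is a minimal sketch using the built-in mtcars data (the plot here is illustrative, not the Trend object above):

```r
library(ggplot2)

# Define the plot spec once...
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# ...then re-point the same spec at different data with %+%
p %+% subset(mtcars, cyl == 4)
```

The returned object is the same plot specification with its data slot swapped out, which is what makes the Trend idea above work.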

#### bryangoodrich

##### Probably A Mammal
Oh! It's so pretty!

Code:
level_measure <- function(Level, Measure) {
    force(Level)
    Measure <- match.fun(Measure)
    function(X) {
        x <- X[, list(kW = Measure(kW)),
               by = list(meter, "Agg" = cut(timestamp, breaks = Level))]
        class(x) <- c("foobar", class(x))
        x
    }
}

args <- expand.grid(
    Level   = list("Year" = "year", "Quarter" = "quarter",
                   "Month" = "month", "Week" = "week", "Day" = "day"),
    Measure = list("Sum" = sum, "Mean" = mean, "Max" = max,
                   "LoadFactor" = function(x) mean(x) / max(x)),
    stringsAsFactors = FALSE)
funs <- lapply(1:nrow(args), function(n) {
    force(n)
    level_measure(args$Level[[n]], args$Measure[[n]])
})
names(funs) <- with(args, paste(names(Level), names(Measure), sep = "_"))
list2env(funs, environment())
rm(args, funs)
So this generates a bunch of functions for different aggregation levels like I've talked about. Each generated function also returns an object of class "foobar." Now ...

Code:
`+.foobar` <- function(x, p) p %+% x
Trend <- ggplot(data.frame(), aes(as.numeric(Agg), kW, col = meter)) + geom_line() + theme_bw()
class(Trend) <- c("foobar", class(Trend))
That's it; now I can do any of those combinations:

Code:
x <- read_meter(...)
Week_Mean(x) + Trend
Here my data model is a three-column data.table of (meter, timestamp, kW) tuples, with meter an identifier, timestamp a 15-minute reading time, and kW the value read at that timestamp.

It truly was easy to do. It requires a bit of work to make it "good" in a development sense (validation of inputs, among other things), but as a proof of concept, I got it to work as expected.

Now I just need to figure out how to build a package off stuff like this. For instance, I can package.skeleton this stuff to generate R code files, but they don't contain the environments they enclose (so each file just hard-codes the names Measure and Level, not their values for each function). I could wing it: take those files, compile them into one file, document them as a group, and have the package generate the functions when it loads or something. But that seems shoddy.

#### bryangoodrich

##### Probably A Mammal
Since I'm sharing code, here's another one going into my bmisc package

Code:
StringSplit <- function(s, c, i) vapply(strsplit(s, c), "[", "", i)
I hate getting a list back from my strings when I just want a certain part of each one. So, this gives me what I want.

Code:
x <- c("2013-03-15 14:30:00", "2013-07-29 09:00:00", "2013-10-19 19:30:00",
"2013-08-16 08:15:00", "2013-06-11 04:45:00", "2013-12-31 06:45:00",
"2013-10-06 16:15:00", "2013-01-13 03:00:00", "2013-09-07 09:15:00",
"2013-03-27 09:15:00")
StringSplit <- function(s, c, i) vapply(strsplit(s, c), "[", "", i)

StringSplit(x, ' ', 1)
#  [1] "2013-03-15" "2013-07-29" "2013-10-19" "2013-08-16" "2013-06-11"
#  [6] "2013-12-31" "2013-10-06" "2013-01-13" "2013-09-07" "2013-03-27"

StringSplit(x, ' ', 2)
#  [1] "14:30:00" "09:00:00" "19:30:00" "08:15:00" "04:45:00" "06:45:00"
#  [7] "16:15:00" "03:00:00" "09:15:00" "09:15:00"
It was useful in my case for parsing timestamp strings, but clearly it is a general wrapper for strsplit that should (hopefully) be quick at grabbing the vector index you want from each split. There are reasons strsplit returns a list, so some validation should probably be included to handle the subset of data this method is intended to work on, but it wasn't meant to be entirely sophisticated as-is.
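For the common fixed-two-field case like these timestamps, a plain sub() call gets each piece without building a list at all; a hedged alternative, not a replacement for StringSplit:

```r
x <- c("2013-03-15 14:30:00", "2013-07-29 09:00:00")

sub(" .*$", "", x)  # date part: drop everything from the first space on
sub("^.* ", "", x)  # time part: drop everything up to the last space
```

StringSplit stays more general, though, since it takes any split character and any index.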

#### trinker

##### ggplot2orBust
This is a bit above my pay grade, but I know you can export environments as data sets (not sure if this helps). Also, I have been working on a package skeleton function (it will never be released to CRAN and is for my own personal use) that may be of use to you. It generates its own roxygen2-templated .R files. A bit fancier than package.skeleton.

I plan to add methods detection as well.

https://github.com/trinker/acc.roxygen2

Specifically the ?package_template function is useful for building templates.

#### bryangoodrich

##### Probably A Mammal
Yeah, but the data sets would have to be linked to the respective function environments in which they are used. Unless there's a built-in way to handle that, I'm just not sure. Maybe I'll shoot Hadley a message about this, or maybe his code will give some insight into how he handled it. Something to study later. Since I have a workaround that lets me get done what I need without a package, I'll focus on continuing to build this suite of tools. Package development will come later.

#### trinker

##### ggplot2orBust
BG said:
Package development will come later.
I would strongly recommend against this. I'd include it in a package on GitHub and add roxygen2 documentation as you go. This saves time recalling documentation later, is less boring in small chunks, lets you test as you go and find breaking points, lets others see your package and test it, and provides a means of storing the package in a safe place.

#### bryangoodrich

##### Probably A Mammal
I always document as I go, but the point is that I CAN'T document something that doesn't exist (statically). I can document the generating function (level_measure), but not its offspring, which I generate from a list of different levels (time cut breaks) and measures (vector functions with one output). Nobody else is going to be working on this, though. I'm the only R programmer here, and it's all closed since it's work specific. I was thinking of using a private Bitbucket repository for versioning, though.

#### trinker

##### ggplot2orBust
@bg, as a student you can get 5 free private repos.

#### Lazar

##### Phineas Packard
A while back Dason had code on his blog on how to get the point estimate and uncertainty around a turning point in polynomial regression. I stole the idea for latent growth curve models. The Lavaan code is:
Code:
library(lavaan)

m_quadratic <- '
  i =~ 1*W1 + 1*W2 + 1*W3 + 1*W4 + 1*W5
  s =~ 0*W1 + 1*W2 + 2*W3 + 3*W4 + 4*W5
  q =~ 0*W1 + 1*W2 + 4*W3 + 9*W4 + 16*W5
  s ~ b*1
  q ~ c*1
  tp := -1*(b)/(2*(c))
'
fitQ <- growth(m_quadratic, data = careerPros, missing = "fiml")
Here i is the intercept, s is the linear slope, and q is the quadratic slope. tp gives the turning point, with standard errors for confidence intervals.
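As a sanity check on the tp := -1*(b)/(2*(c)) definition: for a quadratic trajectory y = a + b*t + c*t^2, setting the derivative b + 2*c*t to zero gives t = -b/(2*c). With illustrative numbers (not from any fitted model):

```r
# Turning point of y = a + b*t + c*t^2 occurs at t = -b / (2*c)
b <- 3
c <- -0.5
tp <- -b / (2 * c)
tp  # 3
```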

#### Dason

I've been meaning to make a sequel post to that one that does the same thing, except using nonlinear regression instead of the delta method.

#### Win

##### New Member
R Tips and Tricks and Graphics

Here are some useful tips and tricks in R. Please share the tips and tricks you know in this thread:

Ctrl+L - clears the console
Home and Ctrl+K - go to the beginning of the line of code and delete the line
ls("package:base") - list the functions in base R
rm(list = setdiff(ls(), "x")) - remove all objects except "x"
read.csv("data.csv", nrows = 10) - import only the first 10 rows of data
read.csv("data.csv", skip = 100) - skip the first 100 rows of data and import the rest
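The rm(list = setdiff(...)) tip is easy to try safely in a scratch environment before pointing it at your workspace; a small sketch:

```r
# Demonstrate "remove everything except x" inside a throwaway environment
e <- new.env()
assign("x", 1, envir = e)
assign("y", 2, envir = e)
assign("z", 3, envir = e)

rm(list = setdiff(ls(envir = e), "x"), envir = e)
ls(envir = e)  # "x"
```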

To change your working directory to a specific folder permanently, right-click "Notepad++" and choose "Run as administrator" and add "setwd ("your working directory folder")" in line 18 of "Rprofile.site" file in your R directory and save.

Attached is a cool 3D interactive graphic from the code below. I just love the "rgl" package:

Code:
install.packages("rgl"); library(rgl)
x <- seq(-3, 3, length = 50)
y <- x
f <- function(x, y) x * exp(-x^2 - y^2)
z <- outer(x, y, f)
plot3d(x, y, z, box = FALSE, axes = FALSE, xlab = "", ylab = "", zlab = "")
surface3d(x, y, z, col = heat.colors(30))
I admit I learned these from the "R Fundamentals & Graphics" ebook. If you know of any other R graphics ebooks (I don't like paper), please share.


#### Dason

Re: R Tips and Tricks and Graphics

Win said:
To change your working directory to a specific folder permanently, right-click "Notepad++" and choose "Run as administrator" and add "setwd ("your working directory folder")" in line 18 of "Rprofile.site" file in your R directory and save.

There are better ways to do this that don't require administrator privileges. Just modify the .Rprofile in your home directory; if you don't have a .Rprofile file, go ahead and create one. This will only modify things for you as a user. If you modify Rprofile.site you're making changes for *everybody* running on that computer.
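For what a user-level file might contain, a hedged sketch (the path and message are placeholders, not recommendations):

```r
# Illustrative ~/.Rprofile contents; .First() runs at startup for this user only.
.First <- function() {
    setwd("~/projects")  # placeholder default working directory
    message("Working directory set to ", getwd())
}
```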

#### Win

##### New Member
Re: R Tips and Tricks and Graphics

Using loops takes much more processing time than "apply" family functions, especially if the data set is big. For example, we can use the function proc.time() to compare the processing times in assigning bonuses for 20,000 employees with a for loop versus sapply:

Code:
emp <- data.frame(num = 1:20000, service = round(runif(20000, 1, 30)))

ptm <- proc.time()
for (i in 1:20000) {
    if (emp$service[i] < 10) {
        emp$bonus[i] <- 1000
    } else {
        emp$bonus[i] <- 2000
    }
}
proc.time() - ptm

ptm <- proc.time()
emp$bonus <- sapply(emp$service, function(x) if (x < 10) 1000 else 2000)
proc.time() - ptm

#### trinker

##### ggplot2orBust
Re: R Tips and Tricks and Graphics

Win said:
Using loops takes much more processing time than "apply" family functions, especially if the data set is big. For example, we can use the function proc.time() to compare the processing times in assigning bonuses for 20,000 employees with a for loop versus sapply
This is not always true. Loops have been optimized for some time now. Generally, benchmarking will tell you which approach is speedier.

#### Dason

##### Ambassador to the humans
Re: R Tips and Tricks and Graphics

trinker said:
This is not always true. Loops have been optimized for some time now. Generally, benchmarking will tell you which approach is speedier.

Code:
library(microbenchmark)
library(ggplot2)

n <- 2000
emp <- data.frame(num = seq_len(n), service = round(runif(n, 1, 30)))
# preallocate to get rid of any initial assignment cost
emp$bonus <- 0

for_loop_naive <- function(){
    for(i in 1:nrow(emp)){
        if(emp$service[i] < 10){
            emp$bonus[i] <- 1000
        } else {
            emp$bonus[i] <- 2000
        }
    }
}

for_loop_pre <- function(){
    x <- numeric(n)
    for(i in seq_along(x))
        if(emp$service[i] < 10){ x[i] <- 1000 } else { x[i] <- 2000 }
    emp$bonus <- x
}

using_cut <- function(){
    emp$bonus <- c(1000, 2000)[cut(emp$service, c(0, 10, Inf), right = FALSE)]
}

using_ifelse <- function(){
    emp$bonus <- ifelse(emp$service < 10, 1000, 2000)
}

direct_assign <- function(){
    emp$bonus <- 2000
    emp$bonus[emp$service < 10] <- 1000
}

direct_mult <- function(){
    emp$bonus <- 1000 * (1 + (emp$service >= 10))
}

direct_subset <- function(){
    emp$bonus <- c(1000, 2000)[1 + (emp$service >= 10)]
}

using_sapply_if <- function(){
    emp$bonus <- sapply(emp$service, function(x){ if(x < 10){ 1000 } else { 2000 } })
}

using_sapply_direct <- function(){
    emp$bonus <- sapply(emp$service, function(x){ 1000 * (1 + (x >= 10)) })
}

res <- microbenchmark(for_loop_naive(), for_loop_pre(), direct_assign(),
                      direct_mult(), direct_subset(), using_ifelse(), using_cut(),
                      using_sapply_if(), using_sapply_direct())
qplot(y = time, data = res, colour = expr) + scale_y_log10()
I get the following:

Code:
> res
Unit: microseconds
                  expr        min          lq     median          uq        max neval
      for_loop_naive() 193834.794 199462.1860 204207.658 222927.8815 415161.006   100
        for_loop_pre()  64657.321  68943.3060  70412.277  75940.6695 205239.318   100
       direct_assign()    320.222    342.6415    360.416    384.1620    789.416   100
         direct_mult()     96.869    106.3680    114.609    135.0035   5492.598   100
       direct_subset()    164.546    179.1430    192.727    221.8505   9447.990   100
        using_ifelse()   1850.933   1919.9375   1964.601   2026.6550   8430.542   100
           using_cut()   1437.124   1653.1780   1691.242   1803.4065   7208.529   100
     using_sapply_if()   4555.747   4710.2360   4902.788   5983.7925  13858.956   100
 using_sapply_direct()   6435.037   6838.5450   7164.041   8026.3005  33321.908   100
Direct assignment is much better in this case, but being tricky and only doing one assignment instead of two is the best way. The cut implementation would probably be faster if I used findInterval; I included it because it's fairly easy to generalize. One thing to keep in mind is that accessing a data.frame is a *slow* operation, so if you're building a loop and you have to access a single element (emp$service) in every iteration, you can easily get improvements by trying to work around that.
You also always want to avoid building up your result vector as you go - preallocating can help if you can't avoid a loop. Clearly though for something simple like this you really should just be doing things directly.

Vectorization is king in R - if you can vectorize something instead of using a loop (or apply family function) then that will most likely net you the best results.

I'm sure if we used data.table we would see some decent results too - but it's hard to beat the vectorized approach for something simple like this.
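Dason's findInterval remark can be sketched on toy data (the numbers here are made up):

```r
# findInterval maps each service value to its bin: [0,10) -> 1, [10,Inf) -> 2
service <- c(3, 12, 9, 25)
bonus <- c(1000, 2000)[findInterval(service, c(0, 10))]
bonus  # 1000 2000 1000 2000
```

Like the direct_subset() version, it does a single vectorized index, but it generalizes to any number of cutpoints.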
