Today I Learned: ____

bryangoodrich

Probably A Mammal
I did a quick benchmark on Jake's code, comparing sapply against vapply(strsplit(x, split = ","), "[", "", 2) with 100,000 replications. sapply came in at 1.637 relative to the vapply baseline. At least for simple vectors (character, numeric, etc.), it's pretty easy to specify the return object. I haven't dealt with matrices and frames yet, myself. Thought I'd throw this out there for consideration. If you want optimal speed and you're returning a simple data type (simple in R terms), you might want to consider using vapply instead.

In this case, unlist(lapply(...)) produces the same result at a relative 1.017, so the vapply approach is better both semantically and in speed, if only slightly, than the unlist(lapply(...)) approach I would usually reach for when speed matters.
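For anyone who wants to try this themselves, here is a rough sketch of that kind of comparison in base R. The sample vector x, the time_it helper, and the repetition count are assumptions made for illustration, not the original benchmark setup. strsplit is also done once up front rather than inside each timed call, so the ratios will not match the numbers above.

```r
# Assumed sample data: 3000 comma-delimited strings
x <- rep(c("a,b,c", "d,e,f", "g,h,i"), times = 1000)
parts <- strsplit(x, split = ",")

# Three ways to grab the 2nd field of every split string
second_sapply <- sapply(parts, "[", 2)
second_vapply <- vapply(parts, "[", "", 2)      # "" declares a character(1) result
second_unlist <- unlist(lapply(parts, "[", 2))

# All three should agree before any timing is worth looking at
stopifnot(identical(second_sapply, second_vapply),
          identical(second_vapply, second_unlist))

# Hypothetical helper: elapsed seconds for `reps` calls of a zero-argument function
time_it <- function(f, reps = 20) {
  system.time(for (i in seq_len(reps)) f())[["elapsed"]]
}

timings <- c(
  sapply        = time_it(function() sapply(parts, "[", 2)),
  vapply        = time_it(function() vapply(parts, "[", "", 2)),
  unlist_lapply = time_it(function() unlist(lapply(parts, "[", 2)))
)

# Timings relative to the vapply baseline (machine dependent)
print(timings / timings[["vapply"]])
```

On a small input like this the elapsed times can be close to zero, so bump reps or the length of x if the ratios look noisy.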
 

Jake

Cookie Scientist
"but it doesn't quite fit perfectly with the example output trinker was giving."
How do you figure? Trinker didn't actually give any example code involving a list that we could "match," but this solution does exactly the task he described.
 

bryangoodrich

Probably A Mammal
I have to agree. I read Trinker as saying something to the effect of "I want to split a vector of strings by some delimiter using strsplit and grab every nth term." This is a basic task of, say, reading in lines of tabular data with a delimiter (say, a comma) and you want to grab the 2nd field value for each row. The algorithm Jake provided using vapply as I did would be the optimal approach (as far as I can see without appealing to alternative languages).

I kind of wonder how this would compare to the all-at-once approach of using read.csv and then specifying the 2nd column vector of the frame. How would it compare with large vs smaller data sets? When does one algorithm prove better than the other?
 

trinker

ggplot2orBust
Yeah, I asked about indexing a vector after collapsing a list of equal-length split vectors. Jake's solution avoids the unlisting altogether. Nice solution.

@BG in my case these vectors of collapsed strings may not be preexisting (i.e., I am the one who created them for convenience's sake), so read.csv on a file may not work, but I suspect you're talking about creating a file internally, like you do with connections and readBin etc.
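A minimal sketch of that in-memory idea, assuming the strings already live in a character vector: read.csv accepts a text argument, so no file on disk is needed. The vector x and the choice of colClasses = "character" are assumptions for illustration; how the two approaches compare in speed on large versus small inputs is left open here.

```r
# Assumed sample data: delimited strings that exist only in memory
x <- c("a,b,c", "d,e,f", "g,h,i")

# All-at-once approach: parse every field, then take the 2nd column.
# colClasses = "character" keeps the fields as plain strings.
df <- read.csv(text = x, header = FALSE, colClasses = "character")
second_csv <- df[[2]]

# Split-and-index approach from earlier in the thread
second_vapply <- vapply(strsplit(x, split = ","), "[", "", 2)

print(identical(second_csv, second_vapply))
```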
 

Dason

Ambassador to the humans
And what I was saying was that the approach of grabbing 'every' whatever element applied to vectors directly. Jake's approach (which I like and I use so don't think I'm attacking it - just pointing out the differences) will give a slightly different output the way I see it. Because the output trinker is giving grabs something like every other element or every 3rd element in a vector. Jake's approach will grab the 2nd element in each of the vectors in a list - so if the vectors aren't the same length or you actually wanted all the even indexed elements in every vector then you don't end up with everything you want.

Code:
> lst <- list(1:20, 1:20)
> 
> sapply(lst, "[", 2)
[1] 2 2
> # different outputs...
> unlist(lst)[c(F,T)]
 [1]  2  4  6  8 10 12 14 16 18 20  2  4  6  8 10 12 14 16 18 20
> # But if we wanted to use *apply along with "["
> unlist(lapply(lst, "[", c(F,T)))
 [1]  2  4  6  8 10 12 14 16 18 20  2  4  6  8 10 12 14 16 18 20
> # It's still not exactly the same though since
> # we could have this...
> lst2 <- list(1:5, 1:4)
> unlist(lst2)[c(F,T)]
[1] 2 4 1 3
> unlist(lapply(lst2, "[", c(F,T)))
[1] 2 4 2 4
I only bring it up because there are times where I want to grab every 12th element out of a vector, and trinker's functions and example output allow for something like that. Jake's don't allow for that directly. Jake's stuff is very useful and has its place, but I was just pointing out that it's not a direct replacement or generalization of what trinker provided. But for the original idea of "I want the 2nd thing in each of the results from strsplit", then yeah, Jake's approach is the way to go.
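The "every nth element" case can be sketched with a small helper. The name every_nth is made up for this example and is not one of trinker's functions; it just keeps the positions that are multiples of n.

```r
# Hypothetical helper: keep elements at positions n, 2n, 3n, ...
every_nth <- function(x, n) x[seq_along(x) %% n == 0]

print(every_nth(1:24, 12))                     # 12 24

# The two behaviours discussed above, using the same lst2 as in the transcript
lst2 <- list(1:5, 1:4)
across_all <- every_nth(unlist(lst2), 2)       # positions counted across the whole vector
within_each <- unlist(lapply(lst2, every_nth, n = 2))  # positions restart in each list element

print(across_all)                              # 2 4 1 3
print(within_each)                             # 2 4 2 4
```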
 

TheEcologist

Global Moderator
TIL: Parallel processing R scripts for the lazy man (but only on a UNIX system).

Step 1) Start by making an R script in your working directory. Here is my silly example (which I assume is saved as foo.r):

Code:
a <- 0
for (i in 1:100) {
  Sys.sleep(0.1)
  a <- a + mean(rnorm(10000000))
}
# normally you would do something sensible here
# and write your end product to a directory on your disk
Now open your favourite terminal emulator and start your script, but don't run it in the foreground; send it to the background by putting "&" at the end of the command:

Code:
R CMD BATCH "/path/to/file/foo.r" &
Retype this to start another and another and another... You can now run as many instances of foo.r as you have cores (it is usually not smart to run more processes than you have processors).

Watch out: if you are doing any random number generation, make sure each process uses a different seed!

Note that there are far more elegant ways to do this, but this is a real no-brainer.

Quick parallel computation for the lazy man.
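One way to handle the seed warning is to give each background instance its own run id and derive the seed from it. This is only a sketch: the helper parse_run_id, the base seed 20000, and the output file name are all made up for illustration, and the smaller loop is just so the example finishes quickly.

```r
# Hypothetical helper: turn the first command-line argument into a run id,
# falling back to 1 when none (or a non-numeric one) is supplied.
parse_run_id <- function(args) {
  id <- suppressWarnings(as.integer(args[1]))
  if (length(args) == 0 || is.na(id)) 1L else id
}

# Assumed invocations, one run id per instance:
#   Rscript foo.r 2 &
#   R CMD BATCH "--args 2" foo.r &
run_id <- parse_run_id(commandArgs(trailingOnly = TRUE))

# Arbitrary base seed plus the run id, so no two instances share a seed
set.seed(20000L + run_id)

a <- 0
for (i in 1:10) {
  a <- a + mean(rnorm(1000))
}

# Write each instance's result to its own file so runs don't overwrite each other
out_file <- file.path(tempdir(), sprintf("foo_result_%03d.rds", run_id))
saveRDS(a, out_file)
```

Distinct seeds make the runs differ from each other; they are not a formal guarantee of independent random streams.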
 

TheEcologist

Global Moderator
TIL that .Platform$dynlib.ext is incredibly useful to know when you want to call C from R regardless of platform.

Use it here:

Code:
dyn.load(file.path(".", paste("MYFILENAME", .Platform$dynlib.ext, sep = "")))
 

Dason

Ambassador to the humans
Indeed. The Writing R Extensions manual even says this is the way you should do it: http://cran.r-project.org/doc/manuals/R-exts.html#dyn_002eload-and-dyn_002eunload. For some reason, following the link takes me to the bottom of the guide, but if you reload the page afterward you get to the place you want (Section 5.3).
 

Dason

Ambassador to the humans
Yeah, floating point arithmetic is interesting. I think we discussed that at some point, but this is my favorite slash easiest way to convey that you need to be careful with decimals in programming languages:
Code:
> sqrt(2)^2 == 2
[1] FALSE
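The usual workaround is to compare with a tolerance instead of ==. A small sketch using base R's all.equal; the explicit 1e-9 threshold in the last line is an arbitrary choice for illustration.

```r
x <- sqrt(2)^2

exact_match <- x == 2                      # exact comparison, trips over rounding error
near_match  <- isTRUE(all.equal(x, 2))     # comparison with all.equal's default tolerance
manual_near <- abs(x - 2) < 1e-9           # hand-rolled tolerance check

print(c(exact = exact_match, all_equal = near_match, manual = manual_near))
```

all.equal returns a character description rather than FALSE when the values differ, which is why it is wrapped in isTRUE() before being used as a condition.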
 

TheEcologist

Global Moderator
TIL that cairo_pdf() will solve many problems when plotting to a PDF on Linux!

Make sure you look at the options (e.g. onefile=TRUE/FALSE).

List of things it solved for me:

  • Correct display of unicode characters
  • No errors in the standard plot character
  • Font styles always correct
  • No transparency issues (when using the alpha parameter in rgb)
  • Colour intensity is exactly the same as the X11 window
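A minimal usage sketch, assuming R was built with cairo support (hence the capabilities() guard). The output path, plot contents, and sizes are made up for illustration; note that in grDevices the multi-page argument is spelled onefile.

```r
# Assumed output location for the demo
out <- file.path(tempdir(), "cairo_demo.pdf")

if (capabilities("cairo")) {
  # onefile = TRUE puts every page into a single PDF
  cairo_pdf(out, width = 6, height = 4, onefile = TRUE)

  # Semi-transparent points via the alpha argument of rgb(),
  # plus a non-ASCII character (Greek alpha) in the title
  plot(1:10, pch = 19, cex = 3,
       col = rgb(0, 0, 1, alpha = 0.5),
       main = "\u03b1 = 0.5")

  invisible(dev.off())
}
```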
 

Dason

Ambassador to the humans
TIL: If a package has a help file for the package itself aliased under something like "stats-package" or "methods-package" typically I would do something like
Code:
?"stats-package"
?"methods-package"
to view those help files. (Although more often than not I start with ?stats-package and end up at the help page for Arithmetic Operators because it thinks I'm asking for help with "-" based on how the interpreter parses things).

BUT! there is a nicer way to get these help files apparently
Code:
package?stats
package?methods
That's right - you can use package?packagename to view those help files (if they exist).
 

Dason

Ambassador to the humans
TIL: About the conflicts() function
Code:
Description:

     ‘conflicts’ reports on objects that exist with the same name in
     two or more places on the ‘search’ path, usually because an object
     in the user's workspace or a package is masking a system object of
     the same name.  This helps discover unintentional masking.
This can help identify functions that have been masked

Code:
> lm <- function(x){print("HAHA YOUR LM IS GONE")}
> conflicts()
[1] "lm"        "body<-"    "kronecker"
> lm
function(x){print("HAHA YOUR LM IS GONE")}
> rm(lm)
> conflicts()
[1] "body<-"    "kronecker"
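conflicts() also takes a detail argument that reports where on the search path each name lives, which helps when you want to know what exactly got masked. A short sketch; the variable names are made up, and the cars data set is used only because it ships with R.

```r
# Mask lm on purpose, as in the transcript above
lm <- function(x) print("HAHA YOUR LM IS GONE")

# detail = TRUE returns a list keyed by search-path entry
masked <- conflicts(detail = TRUE)

# Which search-path entries define something called "lm"?
# (typically ".GlobalEnv" and "package:stats" in an interactive session)
where_lm <- names(masked)[vapply(masked, function(nms) "lm" %in% nms, logical(1))]
print(where_lm)

# The original is still reachable through its namespace
fit <- stats::lm(dist ~ speed, data = cars)

# Remove the mask again
rm(lm)
```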
 

Dason

Ambassador to the humans
"Nice Dason is that newer?"
Code:
[05:50:22][dasonk@Brutus:~/Rdevel/trunk/src/library/base/man](182): svn log conflicts.Rd 
------------------------------------------------------------------------
r61150 | ripley | 2012-11-25 08:27:30 -0600 (Sun, 25 Nov 2012) | 1 line

legibility
------------------------------------------------------------------------
r59039 | ripley | 2012-04-15 05:32:41 -0500 (Sun, 15 Apr 2012) | 1 line

use preferred form of 'R Core Team'
------------------------------------------------------------------------
r56186 | murdoch | 2011-06-19 20:51:45 -0500 (Sun, 19 Jun 2011) | 1 line

Revert r56184 and r56185
------------------------------------------------------------------------
r56184 | murdoch | 2011-06-19 18:58:46 -0500 (Sun, 19 Jun 2011) | 1 line

Remove redundant \alias entries from man pages
------------------------------------------------------------------------
r42333 | ripley | 2007-07-27 05:16:22 -0500 (Fri, 27 Jul 2007) | 3 lines

add copyright/licence header
remove CVS-style $Id fields

------------------------------------------------------------------------
r30915 | ripley | 2004-08-29 10:22:54 -0500 (Sun, 29 Aug 2004) | 3 lines

some corrections.
Mainly layour improvements in \usage and elsewhere.

------------------------------------------------------------------------
r24239 | ripley | 2003-05-08 16:45:56 -0500 (Thu, 08 May 2003) | 2 lines

branch update

------------------------------------------------------------------------
r12976 | pd | 2001-02-27 04:41:45 -0600 (Tue, 27 Feb 2001) | 2 lines

branch update

------------------------------------------------------------------------
r9149 | maechler | 2000-05-10 11:11:14 -0500 (Wed, 10 May 2000) | 2 lines

ex "fixed"

------------------------------------------------------------------------
r9111 | maechler | 2000-05-08 12:14:32 -0500 (Mon, 08 May 2000) | 2 lines

ex

------------------------------------------------------------------------
r2625 | maechler | 1998-10-23 07:57:14 -0500 (Fri, 23 Oct 1998) | 2 lines

new from BDR

------------------------------------------------------------------------
That 1998 at the end tells me that conflicts() has been around for quite some time.
 

TheEcologist

Global Moderator
I remember an old discussion with Dason in the chatbox, where we both didn't really know how to get a list of all the C functions that are available through R. It seemed like such a useful thing to have!

Luckily the R developers thought the same; find it in the "R-sourcecode"/src/main directory, in the file names.c

Looking for math functions specifically? Try /src/include/Rmath.h0.in

Note:
Of course, working in a Unix environment will enable you to quickly find whatever pattern you want with:
Code:
 find path/to/R-sourcecode/ | grep pattern    # file names matching pattern
 grep -rn pattern path/to/R-sourcecode/       # file contents matching pattern
For future reference!

TE
 

trinker

ggplot2orBust
TIL what it means for R to return a function:

Code:
make.power <- function(n) {
    pow <- function(x) {
        x^n
    }
    pow
}

sqrd <- make.power(2)
sqrd(3)
cubd <- make.power(3)
cubd(3)
Not sure yet how I'd use this, but cool nonetheless.
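One common use is building a family of related functions without writing each one by hand. A small sketch, with make.power redefined here so the example stands alone; the force(n) call and the powers list are additions for illustration, not part of the code above.

```r
make.power <- function(n) {
  force(n)             # evaluate n now, so each closure keeps its own exponent
  function(x) x^n
}

# Build squaring, cubing, ... functions in one go
powers <- lapply(1:3, make.power)

# Apply each generated function to the same input
results <- sapply(powers, function(f) f(2))
print(results)         # 2 4 8
```

Each returned function carries its own n along with it, which is what makes sqrd and cubd behave differently even though they share the same body.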