Today I Learned: ____

Jake

Cookie Scientist
Well, you didn't specify sim="permutation", so it's not even pretending to do a permutation test there. But you did get me thinking: boot doesn't even know how to sample without replacement, because we never give it the strata information it would need (the number of observations in each stratum determines the number of possible permutations). So it must not be sampling without replacement after all. Looking through ?boot, I don't see a way to give it this information either. (There is a strata argument, but that just makes it permute within each stratum, which is not what we want.)

This is a little disappointing, because it would not be very hard to add the ability to sample permutations without replacement, assuming the user provided the necessary information.
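For what it's worth, a permutation test is easy to hand-roll, since base R's sample() already samples without replacement by default. A minimal sketch with made-up data (a two-group mean difference, not a boot-based solution):

```r
# Hand-rolled permutation test for a two-group mean difference
# (toy data; sample(g) returns a permutation of the labels)
set.seed(1)
x <- c(rnorm(10), rnorm(10, mean = 1))
g <- rep(c("A", "B"), each = 10)

obs <- mean(x[g == "A"]) - mean(x[g == "B"])

# Each resample permutes the group labels, i.e. samples WITHOUT replacement
perm <- replicate(999, {
    gs <- sample(g)
    mean(x[gs == "A"]) - mean(x[gs == "B"])
})

# Two-sided p-value, counting the observed statistic itself
p <- (sum(abs(perm) >= abs(obs)) + 1) / (999 + 1)
p
```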
 

Dason

Ambassador to the humans
I don't see the problem, though. Unless you're working with very small sample sizes, there are going to be far more permutations than you could ever enumerate. If you happen to draw a few permutations more than once, it doesn't really matter that much; you're still approximating the null distribution asymptotically.
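A quick back-of-the-envelope check of just how fast the number of permutations grows:

```r
# Even a modest sample has astronomically many permutations, so drawing,
# say, 9999 random permutations will almost never repeat one
n <- 20
factorial(n)       # 20! distinct orderings, roughly 2.4e18
choose(2 * n, n)   # even just splitting 40 obs into two groups of 20 is huge
```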
 

Jake

Cookie Scientist
You're right, it basically doesn't matter at all in practice. But still, it would feel nice knowing that it could provide the exact p-values if only in principle :p
 

trinker

ggplot2orBust
I have really been interested in learning how to appropriately use the reshape function from base R rather than relying on Wickham's reshape package. The base function is very flexible and may require less code than the reshape package. I also replicated the same thing using rbind, cbind, do.call, and some fancy indexing.

I made myself a tutorial and thought I'd share. I was going to do this as a statspedia article but you can't just upload a pdf. I've got a lot going on right now so maybe another day.

Hopefully others find this useful as well.
 

trinker

ggplot2orBust
TIL: BEWARE OF THE facet_grid FUNCTION IN GGPLOT2
There is a bug in ggplot2 involving facet_grid and duplicate rows (SEE LINK).

That's why I created the IDer function (LINK) to add an ID column that eliminates the duplicate-row problem (this problem is especially apparent in stacked data). I was having to add an id column repeatedly (I had IDs, but once the data are stacked they're no longer unique), so I made this function to save time:

Code:
library(ggplot2)
set.seed(10)
CO3 <- data.frame(CO2[, 2:3],
    outcome = factor(sample(c('none', 'some', 'lots', 'tons'),
        nrow(CO2), replace = TRUE),
        levels = c('none', 'some', 'lots', 'tons')))

ggplot(CO3, aes(x = outcome)) + geom_bar() +
    facet_grid(Treatment ~ ., margins = TRUE)

CO3 <- IDer(CO3)

ggplot(CO3, aes(x = outcome)) + geom_bar() +
    facet_grid(Treatment ~ ., margins = TRUE)
The IDer function:

Code:
IDer <- function(dataframe, id.name = "id") {
    # prepend a unique id column, then give it the requested name
    DF <- data.frame(id = seq_len(nrow(dataframe)), dataframe)
    colnames(DF)[1] <- id.name
    DF
}
 

trinker

ggplot2orBust
TIL: You can add margins to just one end of a facet_grid plot (thanks to Brian Diggins of SO). Most people (including me until a few days ago) think that facet_grid's margins argument takes only a logical TRUE or FALSE, but you can give it a variable name from the facet_grid formula and it will add a margin only for that variable. This is not documented, but it's pretty nice if you're doing something like faceting a pre and post condition and don't want to sum over the time variable but do over the other variable.

Example:
Code:
library(ggplot2)
set.seed(10)
CO3 <- data.frame(CO2[, 2:3], time = rep(c('pre', 'post'), each = 42),
    outcome = factor(sample(c('none', 'some', 'lots', 'tons'),
        nrow(CO2), replace = TRUE),
        levels = c('none', 'some', 'lots', 'tons')))

CO3 <- data.frame(CO3, id = 1:nrow(CO3))

# The way everyone thinks you have to do the margins
ggplot(CO3, aes(x = outcome)) + geom_bar() +
    facet_grid(Treatment ~ time, margins = TRUE)

# Margin on only one of the faceting variables
ggplot(CO3, aes(x = outcome)) + geom_bar() +
    facet_grid(Treatment ~ time, margins = 'Treatment')

ggplot(CO3, aes(x = outcome)) + geom_bar() +
    facet_grid(Treatment ~ time, margins = 'time')
 

trinker

ggplot2orBust
It's been way too long since this thread got some love. I've learned lots but have been pretty busy, so...

TIL: How to use progress bars inside longer functions using lapply. Here's an example to play around with:

Code:
total <- nrow(mtcars)
progress.bar <- TRUE
type <- FALSE
#progress.bar <- FALSE  # parameter to play with
#type <- 'text'         # parameter to play with

if (progress.bar) {
    if (Sys.info()[['sysname']] == "Windows" && type != "text") {
        # create a Windows progress bar
        pb <- winProgressBar(title = "progress bar", min = 0,
            max = total, width = 300)
        lapply(1:total, function(i) {
                Sys.sleep(.5)
                setWinProgressBar(pb, i,
                    title = paste(round(i/total*100, 0), "% done"))
            }
        )
        close(pb)
    } else {
        # create a text progress bar
        pb <- txtProgressBar(min = 0, max = total, style = 3)
        lapply(1:total, function(i) {
                Sys.sleep(.5)
                setTxtProgressBar(pb, i)
            }
        )
        close(pb)
    }
} else {
    Sys.sleep(total/4)
    "should have used a progress bar"  # return() only works inside a function
}
 

trinker

ggplot2orBust
Yeah, I used it in a function for a package in which I'm lapplying over a vector of text strings to determine parts of speech. This is a long process, and I used exactly this code in the function, but without the Sys.sleep. I think it's pretty nice for functions that take a long time to run (it makes you feel like things aren't frozen).

As far as using it in your workflow, I'd say any time you have a longer process and you know the number of iterations, it could be appropriate. When I get into bigger simulations I may use it there, but this was the first long process where I felt a progress bar would enhance things. Other packages have progress-bar-related functions as well (I think plyr or reshape does), but I wanted to keep it in base.
 

trinker

ggplot2orBust
As I was working through using progress bars the other day, a professor who runs long simulations asked me about them. He asked if they could be used if you passed a vector of values rather than the sequence along the vector. I told him I thought so but wasn't sure how. Here's the solution, with both a standard text progress bar and the Windows version:

Code:
w <- c("raptors are awesome don't you all agree")
y <- unlist(strsplit(w, " "))
n <- length(y)

# STANDARD TEXT VERSION
pb <- txtProgressBar(min = 0, max = n, style = 3) 
lapply(y, function(x){
        z <- nchar(x); Sys.sleep(.5)
        i <- which(y %in% x)
        setTxtProgressBar(pb, i)
        return(z)
    }
)
close(pb)


# WINDOWS VERSION
pb <- winProgressBar(title = "progress bar", min = 0, max = n, width = 300) 
sapply(y, function(x){
        z <- nchar(x); Sys.sleep(.5)
        i <- which(y %in% x)
        setWinProgressBar(pb, i, title=paste(round(i/n*100, 0), "% done"))
        return(z)    
    }
)
close(pb)
 

Dason

Ambassador to the humans
Instead of an expensive which() on every iteration, you could just (*gasp*) use a global assignment to a variable that tells you how far along you are.

Code:
w <- c("raptors are smelly and stupid and should be destroyed don't you all agree")
y <- unlist(strsplit(w, " "))
n <- length(y)

#STANDARD TEXT VERSION
pb <- txtProgressBar(min = 0, max = n, style = 3) 
i <- 0
sapply(y, function(x){
        i <<- i + 1
        z <- nchar(x)
        Sys.sleep(.5)
        setTxtProgressBar(pb, i)
        return(z)
    }
)
close(pb)
 

trinker

ggplot2orBust
TIL a bit more about as.formula, but I'm still learning (I may start including formula arguments in some of my custom functions when I get more comfortable with them).

Here's some of what I was playing with [compliments of (LINK)]:
Code:
form1 <- as.formula("hi ~ lo + mid")
form2 <- as.formula("blue ~ red + green")
form2[[3]] <- form1[[3]]
form2

form2[[1]]
form2[[2]]
form2[[3]]
str(form2[[1]])
str(form2[[2]])
str(form2[[3]])
Anyone know of a good function to tear apart to learn more about formulas? (Thinking reshape2 or aggregate, maybe lm.) Right now I'm not sure whether the formula is turned into a character vector and then used to index the data frame you pass to the function, or whether formulas have some cooler property than that. I'm also not sure how functions recognize that something is a formula: when you enter it as an argument (into aggregate, for instance), it's not like you've used as.formula to announce "I'm feeding this function a formula", so how does aggregate know I've passed it one? Is it because I've only passed it argument x instead of x and y? Anyway, I'll share if I learn more.
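For what it's worth, a formula is its own class, which is presumably how S3 generics like aggregate pick the formula method. A quick console sketch:

```r
f <- y ~ x + z
class(f)               # "formula" -- this is what S3 dispatch sees
inherits(f, "formula") # TRUE; how a function can test what it was given
all.vars(f)            # the variable names: "y" "x" "z"
f[[2]]                 # the left-hand side, as a language object
f[[3]]                 # the right-hand side
```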

PS I finished paper one of two (rough draft) 3 weeks early and am working on an easy 8-pager now, so I decided to reward myself by fooling around with R a bit.
 

Jake

Cookie Scientist
A couple quick notes:

You don't need to write formulas as character vectors and use as.formula; you can just write them directly, like:
Code:
form <- y ~ a*b
To start to get an idea about how formula objects are structured, try calling terms() on some different formulas:
Code:
> terms(y ~ a + b)
y ~ a + b
attr(,"variables")
list(y, a, b)
attr(,"factors")
  a b
y 0 0
a 1 0
b 0 1
attr(,"term.labels")
[1] "a" "b"
attr(,"order")
[1] 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
> terms(y ~ a * b)
y ~ a * b
attr(,"variables")
list(y, a, b)
attr(,"factors")
  a b a:b
y 0 0   0
a 1 0   1
b 0 1   1
attr(,"term.labels")
[1] "a"   "b"   "a:b"
attr(,"order")
[1] 1 1 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
> terms(y ~ (a + b + c)^2 - b:c)
y ~ (a + b + c)^2 - b:c
attr(,"variables")
list(y, a, b, c)
attr(,"factors")
  a b c a:b a:c
y 0 0 0   0   0
a 1 0 0   1   1
b 0 1 0   1   0
c 0 0 1   0   1
attr(,"term.labels")
[1] "a"   "b"   "c"   "a:b" "a:c"
attr(,"order")
[1] 1 1 1 2 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
Notice in particular the "factors" attribute which contains some pretty useful information.

Finally, getS3method("aggregate", "formula") shows what aggregate does with formulas. From eyeballing it, it looks like it basically figures out how to build an appropriate data frame from the formula information and then just calls aggregate.data.frame on that.
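To make the same idea concrete, here is a toy formula-accepting function (mean_by is a made-up name, not part of aggregate) that uses model.frame() to pull the right columns out of the data, roughly the first step aggregate.formula takes:

```r
# Toy sketch: accept a formula, build a data frame from it with model.frame()
# (response in column 1, predictors after), then compute group means
mean_by <- function(formula, data) {
    mf <- model.frame(formula, data)
    tapply(mf[[1]], mf[-1], mean)
}

mean_by(uptake ~ Treatment, CO2)  # mean uptake per Treatment level
```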
 
Re: Today I Learned: ____ R guessing variable names

I noticed something worrying yesterday.

I have in my workspace a data frame with variables:
dataset$mean_of_cor
dataset$Mean

I noticed that calling:
dataset$mean resulted in R displaying the dataset$mean_of_cor variable.

To me this was a big surprise, since I expected that if the name does not match perfectly, R would not display any result.

Example
mean_of_cor <- rnorm(100)
Mean <- rep(1,100)
dataset <- data.frame(mean_of_cor,Mean)

dataset$mean_of_cor == dataset$mean
dataset$mean_of_cor == dataset$Mean
 

Lazar

Phineas Packard
This is covered pretty well in The Art of R Programming.

You can use a shortened call until there is a conflict. For example if you have a dataframe as follows:
Code:
test<- data.frame (gooble_gooble = 1:10)
you can use test$goo to call gooble_gooble. But should you introduce a conflict, such as:
Code:
test$gooble_goop<- 21:30
test$goo will return NULL

Yours had no conflict because one variable had a capital M and the other a lower-case m, and $ matching is case-sensitive.
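If the partial matching makes you nervous: [[ with a character name matches exactly, and there's an option that makes $ warn on partial matches. A quick sketch with the same toy data frame:

```r
test <- data.frame(gooble_gooble = 1:10)

test$goo                       # partial match: returns gooble_gooble
test[["goo"]]                  # [[ matches exactly: returns NULL
test[["goo", exact = FALSE]]   # opt back in to partial matching

# Make R warn whenever $ partial-matches a name
options(warnPartialMatchDollar = TRUE)
test$goo                       # still returns the column, but with a warning
```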
 

trinker

ggplot2orBust
I asked a few days back in the chatbox about the use of L for integers in R, as in 2L, and received a response from Dason and bryangoodrich. Recently someone on the R-help email list asked a similar question. I thought I'd post Brian Ripley's answer, as it gives a little history of where the L came from. His response is to Bill Dunlap (I think):

Prof Brian Ripley said:
On 04/05/2012 00:43, William Dunlap wrote:
> > class(10)
> [1] "numeric"
> > class(10L)
> [1] "integer"
> > class(10i)
> [1] "complex"
>
> Why not 10I for integer? Perhaps because "I" and "l"
> look too similar, perhaps because "i" and "I" sound
> too similar. The "L" does not mean "long": integers
> are 4 bytes long.

Actually it does: this notation dates from the C language on 16-bit
computers where integers were 16-bits and longs were 32-bit (and R has
no 'long' type).

The author of this in R never explained why he chose the notation, but
it is shorter than as.integer(10), and more efficient as the coercion is
done at parse time.

(C has a different convention: 10 is integer, 10. is double. Changing
to that would have altered existing code, since e.g. 4294967296 is not a
valid integer.)

>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
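The points quoted above are easy to check at the console:

```r
class(10)                        # "numeric" (a double)
class(10L)                       # "integer"
identical(10L, as.integer(10))   # TRUE: L is just parse-time coercion
.Machine$integer.max             # R integers are 4 bytes, despite the "L"
```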
 

Dason

Ambassador to the humans
TIL: simulate.lm exists. I probably should have known about it sooner, since a while back I wrote a function that does essentially the same thing. In my defense, I wrote mine to simulate from glms as well, so it goes one step beyond... but still, how did I miss this?

Code:
x <- 1:10
y <- 2 - .7*x + rnorm(10)

o <- lm(y ~ x)

# simulate() draws new responses from the fitted model
simdata <- simulate(o)

plot(x, y)
points(x, simdata[, 1], col = "red")
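As it turns out, simulate() also accepts glm fits (a glm object inherits from lm, and the method uses the fitted family), which covers the glm case too. A small sketch with a made-up Poisson example:

```r
# simulate() on a glm fit: new responses are drawn from the fitted family
set.seed(42)
x <- 1:20
y <- rpois(20, lambda = exp(0.1 * x))
g <- glm(y ~ x, family = poisson)

simdata <- simulate(g, nsim = 3)  # one column per simulated response vector
head(simdata)
```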