Today I Learned: ____

Dason

Ambassador to the humans
TIR (remembered): Bogosort! The best sort ever. You feed in a numeric vector that you want sorted and it checks if the vector is already sorted - if it isn't then it randomly shuffles the vector and checks if it is sorted. Rinse and repeat until your vector is sorted.

Code:
bogosort <- function(x){
    # x is assumed to be a numeric vector

    # While x is not sorted
    while(sum(diff(x) < 0) > 0){
        # Shuffle the values of x
        x <- sample(x)
    }
    return(x)
}

bogosort(rnorm(4))
The expected number of times it needs to run the loop is on the order of n! so... don't use this if you want a fast sort...
 

bryangoodrich

Probably A Mammal
Could make for a fun statistics experiment (like a simple random walk experiment). Use different length vectors and the time it takes to sort them. Produce some sort of visualization of the results and describe the results of your sample and subsequent analysis!
 

jpkelley

TS Contributor
Today I learned about Independent Components Analysis (using fastICA in R) for multivariate analysis. I had been frustrated with PCA for non-normal data, and this was an ideal solution. My model residuals are sweet. I haven't fleshed out everything about ICA, but it looks promising.
 

trinker

ggplot2orBust
TIL: what <<- is for. I've seen it used and knew it had something to do with environments but bryangoodrich discussed it in the chat box and I tried using it. It acts on the parent environment:

Code:
cv <- function(){
mtcars <<- mtcars[, -c(1:5)]
}

cv()

mtcars
 

Dason

Ambassador to the humans
Let's see if you fully understand what's going. Take this little test - try to predict the output for each of these:

Code:
val <- 1

f <- function(){
  val <- 2

  g <- function(){
    val <<- 3
  }

  g()

  return(val)
}

## What should val contain?
val

## What should f return?
f()

## What should val contain now?
val

rm(val)
rm(f)
Now a slightly different version...
Code:
val <- 1

g <- function(){
  val <<- 3
}

f <- function(){
  val <- 2

  g()

  return(val)
}

## What should val contain?
val

## What should f return?
f()

## What should val contain now?
val
 

trinker

ggplot2orBust
The second one was easy. It took me a moment to get the first one. It's because the <<- is nested inside another function so the function is the parent environment and not the .Glob.env. GooD exercise for me Dason.
 

Dason

Ambassador to the humans
It's putting the labels directly on the plot instead of having a legend off to the side. I don't think it's the best way to do it in all circumstances (I think it could be better for the scatterplots) but I prefer this style of labeling to a legend. With a legend you need to take your eyes off the data to actually try to understand what the data is telling you. If the labels are where they logically should be (next to the proper data) you get to see the patterns and it doesn't take as long to tell what's going on. Plus if you're using color to differentiate the groups sometimes its hard to match up the color in the legend and whats on the plot but if you put the label next to the point along with the colors it is a lot easier to understand.
 

trinker

ggplot2orBust
bryan it gets rid of the legend and plots the groups/labels what ever directly on the graph. It may (I don't think this is appropriate in every situation) be easier to interpret your graph as you're eyes aren't going back and forth from legend to plot. It also does the formatting for you so you don't have labels on top of each other. And it seems to play well with all three graphics systems. I think it could save some time and has potential for some circumstances (check out the topo map (contour map) with the numbers right on the graphic. Much more interpretable and professional.
 

ledzep

Point Mass at Zero
I wonder why I didn't use the "cast" function ever before. I am familiar with the melt function but never got around to use "cast" at all.
I used to subset the data to 4 datasets (one for each day) and then bind 'em together to get to my desired output.
Now, what was "Donkey's work", Cast does it very nicely. Very thankful to Hadly Wickham for his time-savers (already a big user of plyr, now add reshape to it).

Code:
require(reshape);
# d.f
> test
   day person freq
1    0      A    1
2    1      A    2
3    2      A    1
4    3      A    2
5    0      B    1
6    1      B    3
7    2      B    1
8    3      B    6
9    0      C    3
10   1      C    6
11   2      C    3
12   3      C    2


> summary<-cast(test,person~day,)
Using freq as value column.  Use the value argument to cast to override this choice
> summary
  person 0 1 2 3
1      A 1 2 1 2
2      B 1 3 1 6
3      C 3 6 3 2
 

Lazar

Phineas Packard
This could be fun to have in your RProfile to play on startup:
Code:
Dave <- function(){
 #Download from  'http://www.palantir.net/2001/tma1/wav/hihal.wav'
  require(tuneR)
  play('hihal.wav')
}
 
Last edited:

trinker

ggplot2orBust
I'm tired of waiting around for Dason to stop fooling around and make us a decent alarm clock:

Code:
alarm.clock <- function(hours=0, minutes=0, seconds=0){
    x <- c(hours*360 + minutes*60 + seconds)
    Sys.sleep(x)
    browseURL("http://www.youtube.com/watch?v=DkJ6OScQshI")
} #end of function

alarm.clock(seconds=5)
 

Dason

Ambassador to the humans
Don't have R on your machine but just want to run a quick simulation or something?

TIL: http://ideone.com/

It's a neat site that actually lets you compile code for a bunch of different languages. R seems to have quite a few packages available to it but the one time I tried plot (in base) it gave an error. Still it's interesting if you don't have access to R.
 

bryangoodrich

Probably A Mammal
oh that could be interesting! The plot may not have worked because it doesn't control graphic devices? Just a guess. I'll definitely have to "hello world" some random languages now lol
 

Jake

Cookie Scientist
TIL: the (kind of subtle) difference between bootstrapping a mean group difference vs. a permutation test, and how to implement both in R using the boot package.

Here is a well-annotated example:
Code:
library(boot)

# make up data
dat <- stack(list(a = rnorm(15, mean=10),
                  b = rnorm(15, mean=11)))
str(dat)
t.test(values ~ ind, data=dat, var.equal=T)

##############################################################
##### Bootstrap of group difference                      #####
##############################################################

# define bootstrap statistic
bootDiff <- function(data, indices, print=F){
  newdata <- data[indices,]
  if(print) print(indices)
  return(diff = diff(tapply(newdata$values, newdata$ind, mean)))
}

# the indices tell us all about how the bootstrap works
# first boot passes in the original indices: seq(nrow(dat))
# this is to get the observed estimate of the statistic
# then, for each resample, boot samples the indices
# from each stratum (group), WITH REPLACEMENT, separately
boot(data=dat, statistic=bootDiff, R=2,
     print=TRUE, strata=rep(1:2, each=15))

# bootstrap!
resamples <- 10000
results1 <- boot(data=dat, statistic=bootDiff,
                 R=resamples, strata=rep(1:2, each=15))
results1
boot.ci(results1)
plot(results1)
# notice that our distribution of resampled group diffs
# is centered around the original estimate.
# the standard error is the SD of this distribution
# and the 95% CI bounds can be taken as the .025 and .975 quantiles
# or in a more complex way such as those given by boot.ci()

##############################################################
##### Permutation test                                   #####
##############################################################

# define permutation test statistic
permDiff <- function(data, indices, print=F){
  if(print) print(indices)
  if(print) print(length(unique(indices)))
  indices <- factor(indices <= 15, labels=c("b","a"))
  if(print) print(indices)
  newdata <- cbind.data.frame(indices, values=data$values)
  return(diff = diff(tapply(newdata$values, newdata$ind, mean)))
}

# as usual, boot first passes in the original indices.
# next, boot randomly "shuffles" the indices (i.e., no replacement)
# such that each value has an equal chance
# of coming from either group a or group b.
# the logic here is that, if the null hypothesis is true,
# then the particular pattern of indices in the observed data
# is arbitrary, and we want to know the distribution of group
# differences that we might have seen with other arbitrary indices
boot(data=dat, statistic=permDiff, R=2,
     print=TRUE, sim="permutation")

# permutate!
# in theory, there are a finite number of permutations of the indices.
# but in practice, it is usually not feasible to exhaust every 
# possibility. even in this smallish example, there are choose(30,15)
# = 155,117,520 possible permutations. so instead we will randomly
# sample 10k of the possible permutations
resamples <- 10000
results2 <- boot(data=dat, statistic=permDiff,
                 R=resamples,sim="permutation")
plot(results2)
# notice that this distribution is centered around 0.
# since the logic behind the permutation test is analogous
# to null hypothesis significance testing, we can calculate
# a p-value based on where our observed statistic lies
# in the distribution of statistics under the null hypothesis.
# e.g., we can do a two-tailed test like this:
p.value <- mean(results2$t > results1$t0) + mean(results2$t < -1*results1$t0)
p.value
 

bryangoodrich

Probably A Mammal
I've set up a randomization (permutation) distribution before. It wasn't easy.

Code:
# Source -- [COLOR="orange"]http://www.bryangoodrich.com/projects/alsm/ch16/[/COLOR]
remainder <- function(x, set) set[!set %in% x]
f <- function(Y, X) {
  Y <- matrix(Y)                                # Turn row-vector into column
  p <- ncol(X);     n <- nrow(X)
  J <- matrix(1, n, n)                          # (5.18)
  H <- X %*% solve(t(X) %*% X) %*% t(X)         # (5.73a)
  SSE <- t(Y) %*% (diag(n) - H) %*% Y           # (5.89b)
  SSR <-  t(Y) %*% (H - (1/n)*J) %*% Y          # (5.89c)
  fstar <- (SSR / (p - 1)) / (SSE / (n - p))    # (6.39b)
}
 
base <- c(1.1, 0.5, -2.1, 4.2, 3.7, 0.8, 3.2, 2.8, 6.3)
t2   <- t12 <- t123 <- list()
y    <- NULL
X    <- cbind(
  X1 = c(1, 1, 1, 0, 0, 0, 0, 0, 0),
  X2 = c(0, 0, 0, 1, 1, 1, 0, 0, 0),
  X3 = c(0, 0, 0, 0, 0, 0, 1, 1, 1)
);
 
t1   <- t(combn(base, 3))
seq6 <- t(combn(base, 3, remainder, set = base))
 
for (i in 1:84)  t2[[i]] <- t(combn(seq6[i, ], 3))
for (i in 1:84) t12[[i]] <- cbind(t1[i, 1], t1[i, 2], t1[i, 3], t2[[i]])
for (i in 1:84)
  t123[[i]] <- cbind(t12[[i]], t(apply(t12[[i]], 1, remainder, set = base)))
for (i in 1:84) y <- rbind(y, t123[[i]])
 
fstar <- apply(y, 1, function(Y) f(Y, X))
 
cbind(y, data.frame(f = fstar))
hist(fstar, freq = FALSE, ylim = c(0, 1), col = "gray90", main = "")
curve(df(x, 2, 6), add = TRUE, lwd = 2)
 
rm(base, fstar, i, remainder, seq6, t1, t2, t12, t123, f, X, y)

And even in this case it wasn't that big. When boot randomly takes out permutations, I assume this is with replacement, right?

Also, in the bootDiff call from boot, you're setting up the strata. If I understand the code correctly, the line in permDiff that sets up the statistic correspondingly is

Code:
indices <- factor(indices <= 15, labels=c("b","a"))
Is that correct?
 

Jake

Cookie Scientist
When boot randomly takes out permutations, I assume this is with replacement, right?
I don't know the answer for sure, but I assume (or rather, hope) that it is without replacement. That way one could in principle (although not usually in practice) exhaust every possible permutation without repeats and thus have an exact p-value.

Also, in the bootDiff call from boot, you're setting up the strata. If I understand the code correctly, the line in permDiff that sets up the statistic correspondingly is

Code:
indices <- factor(indices <= 15, labels=c("b","a"))
Is that correct?
If I understand your question correctly, then yes; this is the line where I take those indices that boot shuffled for me and use them for the group reassignment.
 

Dason

Ambassador to the humans
My guess is that boot samples with replacement. It would be a lot easier and depending on how you specify strata it could be very difficult to guarantee that you're sampling without replacement.

Code:
library(boot)
set.seed(643)
mn <- function(d, i){mean(d[i])}
dat <- rnorm(2)
j <- boot(dat, mn, R = 10, stype = "i")
j$t
The first two samples of the statistic are the same. This wouldn't happen if the samples were without replacement.