bryangoodrich (02-23-2012)
bryan it gets rid of the legend and plots the groups/labels what ever directly on the graph. It may (I don't think this is appropriate in every situation) be easier to interpret your graph as you're eyes aren't going back and forth from legend to plot. It also does the formatting for you so you don't have labels on top of each other. And it seems to play well with all three graphics systems. I think it could save some time and has potential for some circumstances (check out the topo map (contour map) with the numbers right on the graphic. Much more interpretable and professional.
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
bryangoodrich (02-23-2012)
I wonder why I didn't use the "cast" function ever before. I am familiar with the melt function but never got around to use "cast" at all.
I used to subset the data to 4 datasets (one for each day) and then bind 'em together to get to my desired output.
Now, what was "Donkey's work", Cast does it very nicely. Very thankful to Hadly Wickham for his time-savers (already a big user of plyr, now add reshape to it).
Code:require(reshape); # d.f > test day person freq 1 0 A 1 2 1 A 2 3 2 A 1 4 3 A 2 5 0 B 1 6 1 B 3 7 2 B 1 8 3 B 6 9 0 C 3 10 1 C 6 11 2 C 3 12 3 C 2 > summary<-cast(test,person~day,) Using freq as value column. Use the value argument to cast to override this choice > summary person 0 1 2 3 1 A 1 2 1 2 2 B 1 3 1 6 3 C 3 6 3 2
Oh Thou Perelman! Poincare's was for you and Riemann's is for me.
This could be fun to have in your RProfile to play on startup:
Code:Dave <- function(){ #Download from 'http://www.palantir.net/2001/tma1/wav/hihal.wav' require(tuneR) play('hihal.wav') }
Last edited by Lazar; 02-26-2012 at 04:02 AM.
I'm tired of waiting around for Dason to stop fooling around and make us a decent alarm clock:
Code:alarm.clock <- function(hours=0, minutes=0, seconds=0){ x <- c(hours*360 + minutes*60 + seconds) Sys.sleep(x) browseURL("http://www.youtube.com/watch?v=DkJ6OScQshI") } #end of function alarm.clock(seconds=5)
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
Don't have R on your machine but just want to run a quick simulation or something?
TIL: http://ideone.com/
It's a neat site that actually lets you compile code for a bunch of different languages. R seems to have quite a few packages available to it but the one time I tried plot (in base) it gave an error. Still it's interesting if you don't have access to R.
I don't have emotions and sometimes that makes me very sad.
bryangoodrich (03-01-2012)
oh that could be interesting! The plot may not have worked because it doesn't control graphic devices? Just a guess. I'll definitely have to "hello world" some random languages now lol
TIL: the (kind of subtle) difference between bootstrapping a mean group difference vs. a permutation test, and how to implement both in R using the boot package.
Here is a well-annotated example:
Code:library(boot) # make up data dat <- stack(list(a = rnorm(15, mean=10), b = rnorm(15, mean=11))) str(dat) t.test(values ~ ind, data=dat, var.equal=T) ############################################################## ##### Bootstrap of group difference ##### ############################################################## # define bootstrap statistic bootDiff <- function(data, indices, print=F){ newdata <- data[indices,] if(print) print(indices) return(diff = diff(tapply(newdata$values, newdata$ind, mean))) } # the indices tell us all about how the bootstrap works # first boot passes in the original indices: seq(nrow(dat)) # this is to get the observed estimate of the statistic # then, for each resample, boot samples the indices # from each stratum (group), WITH REPLACEMENT, separately boot(data=dat, statistic=bootDiff, R=2, print=TRUE, strata=rep(1:2, each=15)) # bootstrap! resamples <- 10000 results1 <- boot(data=dat, statistic=bootDiff, R=resamples, strata=rep(1:2, each=15)) results1 boot.ci(results1) plot(results1) # notice that our distribution of resampled group diffs # is centered around the original estimate. # the standard error is the SD of this distribution # and the 95% CI bounds can be taken as the .025 and .975 quantiles # or in a more complex way such as those given by boot.ci() ############################################################## ##### Permutation test ##### ############################################################## # define permutation test statistic permDiff <- function(data, indices, print=F){ if(print) print(indices) if(print) print(length(unique(indices))) indices <- factor(indices <= 15, labels=c("b","a")) if(print) print(indices) newdata <- cbind.data.frame(indices, values=data$values) return(diff = diff(tapply(newdata$values, newdata$ind, mean))) } # as usual, boot first passes in the original indices. # next, boot randomly "shuffles" the indices (i.e., no replacement) # such that each value has an equal chance # of coming from either group a or group b. # the logic here is that, if the null hypothesis is true, # then the particular pattern of indices in the observed data # is arbitrary, and we want to know the distribution of group # differences that we might have seen with other arbitrary indices boot(data=dat, statistic=permDiff, R=2, print=TRUE, sim="permutation") # permutate! # in theory, there are a finite number of permutations of the indices. # but in practice, it is usually not feasible to exhaust every # possibility. even in this smallish example, there are choose(30,15) # = 155,117,520 possible permutations. so instead we will randomly # sample 10k of the possible permutations resamples <- 10000 results2 <- boot(data=dat, statistic=permDiff, R=resamples,sim="permutation") plot(results2) # notice that this distribution is centered around 0. # since the logic behind the permutation test is analogous # to null hypothesis significance testing, we can calculate # a p-value based on where our observed statistic lies # in the distribution of statistics under the null hypothesis. # e.g., we can do a two-tailed test like this: p.value <- mean(results2$t > results1$t0) + mean(results2$t < -1*results1$t0) p.value
In God we trust. All others must bring data.
~W. Edwards Deming
trinker (03-05-2012)
I've set up a randomization (permutation) distribution before. It wasn't easy.
Spoiler:
And even in this case it wasn't that big. When boot randomly takes out permutations, I assume this is with replacement, right?
Also, in the bootDiff call from boot, you're setting up the strata. If I understand the code correctly, the line in permDiff that sets up the statistic correspondingly is
Is that correct?Code:indices <- factor(indices <= 15, labels=c("b","a"))
I don't know the answer for sure, but I assume (or rather, hope) that it is without replacement. That way one could in principle (although not usually in practice) exhaust every possible permutation without repeats and thus have an exact p-value.
If I understand your question correctly, then yes; this is the line where I take those indices that boot shuffled for me and use them for the group reassignment.
In God we trust. All others must bring data.
~W. Edwards Deming
My guess is that boot samples with replacement. It would be a lot easier and depending on how you specify strata it could be very difficult to guarantee that you're sampling without replacement.
The first two samples of the statistic are the same. This wouldn't happen if the samples were without replacement.Code:library(boot) set.seed(643) mn <- function(d, i){mean(d[i])} dat <- rnorm(2) j <- boot(dat, mn, R = 10, stype = "i") j$t
I don't have emotions and sometimes that makes me very sad.
Well, you didn't specify sim="permutation", so it's not even pretending to do a permutation test there--but, you did get me thinking, boot doesn't even know how to sample without replacement if we haven't specified the strata (since the number of observations in each strata would determine the number of possible permutations), but we never give it this information! So it must not be sampling without replacement after all. Looking through ?boot, I don't see that there is a way to give it this information either. (There is a strata argument, but this just makes it permutate within each stratum, which is not what we want.)
This is a little disappointing because it would not be very hard to code in the ability to sample permutations without replacement, assuming the user provided the necessary information.
In God we trust. All others must bring data.
~W. Edwards Deming
I don't see the problem though. Unless you're doing something with very small sample sizes then there are going to be more permutations than you could possibly fit easily. If you happen to get a few permutations more than once it doesn't really matter that much. You're still approximating the null distribution asymptotically.
I don't have emotions and sometimes that makes me very sad.
You're right, it basically doesn't matter at all in practice. But still, it would feel nice knowing that it could provide the exact p-values if only in principle
In God we trust. All others must bring data.
~W. Edwards Deming
I have really been interested in learning how to appropriately use the reshape function from base rather than relying on the reshape packages (Wickham). The base function is very flexible and may require less code than the reshape package. I also replicated the same thing using rbind, cbind, do.call, and some fancy indexing.
I made myself a tutorial and thought I'd share. I was going to do this as a statspedia article but you can't just upload a pdf. I've got a lot going on right now so maybe another day.
Hopefully others find this useful as well. Wide to Long Closely examined.pdf
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
GretaGarbo (04-08-2012)
TIL: BEWARE OF THE facet_grid FUNCTION IN GGPLOT2
There is a bug in ggplot two involving facet grid and duplicate rows (SEE LINK)
That's why I created the IDer function (LINK) to add an ID row that eliminates the duplicate row problem (this problem is especially apparent in stacked data). I was having to add an id row repeatedly (I had IDs but when they're stacked they're no longer unique) so I made that function to save time:
THE IDer functionCode:library(ggplot2) set.seed(10) CO3 <- data.frame(CO2[, 2:3], outcome=factor(sample(c('none', 'some', 'lots', 'tons'), nrow(CO2), rep=T), levels=c('none', 'some', 'lots', 'tons'))) ggplot(CO3, aes(x=outcome)) + geom_bar(aes(x=outcome)) + facet_grid(Treatment~., margins=TRUE) CO3 <- IDer(CO3) ggplot(CO3, aes(x=outcome)) + geom_bar(aes(x=outcome)) + facet_grid(Treatment~., margins=TRUE)
Code:IDer <- function(dataframe, id.name="id"){ DF <- data.frame(c=1:nrow(dataframe), dataframe) colnames(DF)[1] <- id.name return(DF) }
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
Tweet |