I like to try to stay in base R when I can. There are a lot of useful tools in the stats package. That said, no one can deny that the reshape package has a lot to offer. The melt function has become particularly useful for me because it is so much more intuitive than using stack or reshape.
Does anyone know how to recreate that intuitive behavior in base R? For instance, I have a data set with 3 factors and 7 numeric fields. melt knows to stack the numeric fields, using their column names as factor levels. Doing the same thing with stack is easy if you only have one non-numeric field. The help files for stack are not much use to me (poor examples and not much detail). I've always had trouble getting reshape to do what I want (because it expects the data to fit a certain form).
Here's a simple data set we can work with:
Code:
df <- data.frame(
  A = gl(2, 2),      # 1 1 2 2
  B = gl(2, 1, 4),   # 1 2 1 2
  C = rnorm(4) * 10,
  D = runif(4, -10, 10)
)
The long form should be as follows (with '#' representing some random number from C or D, and the last column naming which of the two it came from):
Code:
1 1 # C
1 1 # D
1 2 # C
1 2 # D
2 1 # C
2 1 # D
2 2 # C
2 2 # D
The only thing I can really think of is to:
(1) Capture the number of non-factor columns, call it n
(2) Stack the data set (stack drops the factor columns)
(3) Repeat the factors (rbind?) n times
(4) Put together (2) and (3)
(5) Sort according to each factor
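The five steps above can be sketched in base R roughly as follows. This is only an illustration under assumptions: the fixed seed and the names is_num, ids, and out are not from the post, and the numeric columns are selected explicitly before calling stack.

```r
# A sketch of steps (1)-(5); variable names here are illustrative only.
set.seed(1)  # assumed seed, just to make the example repeatable
df <- data.frame(
  A = gl(2, 2),
  B = gl(2, 1, 4),
  C = rnorm(4) * 10,
  D = runif(4, -10, 10)
)

is_num <- vapply(df, is.numeric, logical(1))
n <- sum(is_num)                               # (1) number of non-factor columns
long <- stack(df[is_num])                      # (2) stack only the numeric columns
ids <- df[rep(seq_len(nrow(df)), times = n),   # (3) repeat the factor rows n times
          !is_num, drop = FALSE]
out <- cbind(ids, long)                        # (4) put (2) and (3) together
out <- out[order(out$A, out$B), ]              # (5) sort by each factor
rownames(out) <- NULL                          # renumber the rows from 1
out
```

Repeating the row index with rep() rather than calling rbind n times keeps step (3) to a single subsetting operation.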
My tentative solution is below. I think it is straightforward, and it takes advantage of a couple of features in R. For instance, when I concatenate the data frame, I get each column back as an element of a list. In a way, it is like "exploding" the data frame into a list of manageable pieces, which gives me better control than the matrix-like layout of a data frame. I also take advantage of R's natural recycling to reproduce step (3) above. Thus, everything comes together nicely.
Code:
df <- data.frame(
  A = gl(2, 2),      # 1 1 2 2
  B = gl(2, 1, 4),   # 1 2 1 2
  C = rnorm(4) * 10,
  D = runif(4, -10, 10)
)
df
# A B C D
# 1 1 1 3.509282 -1.508020
# 2 1 2 -1.798883 -7.413565
# 3 2 1 -6.339989 4.443587
# 4 2 2 2.845679 7.858665
df <- myMelt(df)
df[order(df$A, df$B), ]
# A B values ind
# 1 1 1 3.509282 C
# 5 1 1 -1.508020 D
# 2 1 2 -1.798883 C
# 6 1 2 -7.413565 D
# 3 2 1 -6.339989 C
# 7 2 1 4.443587 D
# 4 2 2 2.845679 C
# 8 2 2 7.858665 D
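The body of myMelt is not shown here, so the following is only a guess at the two features described, not the actual function: c() applied to a data frame returns a plain list of its columns, and data.frame() recycles the shorter factor columns to match the stacked values. The name is_factor is a stand-in and may not match the isFactor used in myMelt.

```r
df <- data.frame(
  A = gl(2, 2),
  B = gl(2, 1, 4),
  C = c(3.5, -1.8, -6.3, 2.8),   # fixed values so the result is repeatable
  D = c(-1.5, -7.4, 4.4, 7.9)
)

parts <- c(df)   # "explodes" the data frame into a plain list of columns
is_factor <- vapply(parts, is.factor, logical(1))

# The 4-value factors are recycled to line up with the 8 stacked values.
long <- data.frame(parts[is_factor], stack(parts[!is_factor]))
long
```

Recycling only works here because the stacked length is a whole multiple of the number of original rows.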
The one thing I'm missing is how to take the isFactor and use that to specify the order/sort within myMelt. I could then rename the row.names as a proper sequence and return the final result I want in the first place. Nevertheless, this did work!
Using an external sort defeats the purpose! I know what it should be sorted on. I'm wondering if I could make use of that sortframe we discussed the other day, doing something like
Code:
sortframe(df, df[, 1], df[, 2], ...)
and using the individual values of isFactor to indicate the '1' and '2' or whatever they may be. It wouldn't violate the base solutions at work here. I could also include the id.vars parameter, and basically do the manual search I used whenever it isn't specified. A check to make sure those fields actually are factors should probably still be utilized (not concerned with error checking atm).
The only other thing I can think of is to somehow produce an unevaluated statement or something along those lines (I'm still not familiar with the eval and call, etc. type functions), and use that to set up the ordering.
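For the ordering question, one base R option that avoids eval and hand-built calls is do.call(), which hands a list of columns to order() as if they had been typed out one by one. A minimal sketch, where is_factor and id_cols are illustrative names standing in for whatever isFactor holds:

```r
df <- data.frame(
  A = gl(2, 2),
  B = gl(2, 1, 4),
  C = c(3.5, -1.8, -6.3, 2.8),
  D = c(-1.5, -7.4, 4.4, 7.9)
)
is_factor <- vapply(df, is.factor, logical(1))
id_cols <- names(df)[is_factor]

# Build the unsorted long form: repeated ID rows next to the stacked values.
long <- data.frame(
  df[rep(seq_len(nrow(df)), times = sum(!is_factor)), id_cols, drop = FALSE],
  stack(df[!is_factor]),
  row.names = NULL
)

# Equivalent to order(long$A, long$B), but built from id_cols at run time.
ord <- do.call(order, unname(as.list(long[id_cols])))
sorted <- long[ord, ]
rownames(sorted) <- NULL   # renumber the rows as a proper sequence
sorted
```

Because order() leaves ties in their original order, the C row for each A/B pair stays ahead of the matching D row.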
Just had a thought on the way home: I could make my approach a wrapper for stack, using it as the workhorse for the function. I haven't checked if this works yet, but my idea is this:
Code:
myMelt <- function(df) {
  long <- stack(df)
  vars <- unique(long$ind)
  cbind(subset(df, select = -vars), long)
}
I'll try your approach out tomorrow, trinker. I never could quite understand what each of the parameters needed in order to make it work right. Did you get the same result that I did? I probably should have just used some defined integers so the values didn't change in my example lol
Yes I get the same results but it took some rechecking of my notes I keep on R (I'm at 175 pages I've accumulated) to figure out the parameters. They aren't intuitive. That's why we all love reshape2/1 and plyr so much. I actually think your approach is much more transparent, though I'm betting the reshape (base) approach may be faster.
I also have to reorder the rows in the same way you did as well as give the rows new row names.
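The reshape call being discussed is not reproduced in this part of the thread, so the following is only one plausible way to set up base reshape for the example data, not the code that was actually posted. The values chosen for v.names and timevar are assumptions made to mirror the column names that stack produces.

```r
df <- data.frame(
  A = gl(2, 2),
  B = gl(2, 1, 4),
  C = c(3.5, -1.8, -6.3, 2.8),
  D = c(-1.5, -7.4, 4.4, 7.9)
)

long <- reshape(
  df,
  direction = "long",
  varying   = c("C", "D"),   # wide columns to stack
  v.names   = "values",      # name of the stacked value column
  timevar   = "ind",         # column recording which wide column a value came from
  times     = c("C", "D"),   # labels written into that column
  idvar     = c("A", "B")    # columns identifying each original row
)

# As noted above, the rows still need reordering and fresh row names.
long <- long[order(long$A, long$B), ]
rownames(long) <- NULL
long
```

Note that idvar = c("A", "B") relies on each A/B combination being unique, as it is in this small example.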
It looks to me like reshape gives you more control, and it would be nice if it made some assumptions about factors the way melt (and my function) does. Honestly, I don't even remember all the parameters to melt. I always just use it on a data frame with factors and it works like a charm, and pretty quickly, too. If I understand reshape correctly (will read up some more later), I can probably create a wrapper to make it easy for cases like this (e.g., isFactor can be used to fill in some of those parameters you set). The one thing I haven't tested yet is how it might handle classes like Date, or whether it is appropriate to submit a non-factor ID variable on the assumption that it will be coerced into a factor. I never use the benchmarking facilities R has. Maybe I'll do it and see which process is faster in CPU time.
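A rough way to compare the two in base R is system.time(). The sketch below makes several assumptions: the data size, the repetition count, and the helper names via_stack and via_reshape are all invented for illustration, and a row id column is added for reshape because A and B no longer identify rows uniquely in the larger data set. No timings are shown because they depend on the machine.

```r
set.seed(42)   # assumed seed for repeatability
n <- 2000      # assumed size; must be even for gl() below
big <- data.frame(A = gl(2, n / 2), B = gl(2, 1, n), C = rnorm(n), D = runif(n))

# Hypothetical stack-based version: repeat the ID rows, then bind the stacked values.
via_stack <- function(d) {
  data.frame(d[rep(seq_len(nrow(d)), times = 2), c("A", "B")],
             stack(d[c("C", "D")]),
             row.names = NULL)
}

# Hypothetical reshape-based version with an explicit row id.
via_reshape <- function(d) {
  d$id <- seq_len(nrow(d))
  reshape(d, direction = "long", varying = c("C", "D"), v.names = "values",
          timevar = "ind", times = c("C", "D"), idvar = "id")
}

t_stack   <- system.time(for (i in 1:20) via_stack(big))
t_reshape <- system.time(for (i in 1:20) via_reshape(big))
rbind(stack = t_stack[1:3], reshape = t_reshape[1:3])   # user, system, elapsed
```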
My method above did not work. First off, I needed to use levels(long$ind), and secondly, the -vars in subset doesn't work because vars is a vector of quoted names. Apparently that doesn't work within subset?? Seems stupid. Effectively it is the difference between giving select the bare column names and giving it a character vector of names.
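The behavior described can be reproduced in a few lines. This is a small illustration rather than anything from the original post; it shows the negated character vector failing inside subset and two base alternatives that accept quoted names.

```r
df <- data.frame(
  A = gl(2, 2),
  B = gl(2, 1, 4),
  C = c(3.5, -1.8, -6.3, 2.8),
  D = c(-1.5, -7.4, 4.4, 7.9)
)
vars <- c("C", "D")

# Bare (unquoted) names work because subset maps them to column positions.
keep_bare <- subset(df, select = -c(C, D))

# A negated character vector fails: unary minus is not defined for strings.
failed <- try(subset(df, select = -vars), silent = TRUE)

# Two alternatives that accept a character vector of names.
keep_in   <- df[!names(df) %in% vars]
keep_diff <- df[setdiff(names(df), vars)]
```

Indexing with a single bracket and no comma also keeps the result a data frame even when only one column remains.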
Code:
melt <- function(df, names = NULL) {
  long <- stack(df)
  vars <- levels(long$ind) # I don't know why I thought I needed unique
  long <- data.frame(df[, which(!names(df) %in% vars)], long)
  if (!is.null(names)) {
    names(long) <- names
  } else {
    names(long) <- paste("V", seq(ncol(long)), sep = "")
  } # end if-else
  return(long)
} # end melt
This way the user can specify the names as a character vector; otherwise the result gets the same sort of default names (V1, V2, ...) you get when reading in a table of data without a header.
One thing I need to control (or suppress) is the warning that comes from putting together the factors with the stacked variables, because any time there's more than one factor it spits out a warning about the row names or something. Since it's irrelevant and handled later, it just needs to go.
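One way to sidestep a recycling-related warning, rather than wrapping the call in suppressWarnings(), is to repeat the ID rows explicitly and let data.frame() assign fresh row names. The sketch below is a variation on the function above, not a drop-in replacement: the name melt_base, the id.vars argument, and the default of treating every factor column as an ID variable are all assumptions.

```r
melt_base <- function(df, id.vars = NULL) {
  if (is.null(id.vars)) {
    # Assumed default: treat every factor column as an ID variable.
    id.vars <- names(df)[vapply(df, is.factor, logical(1))]
  }
  measure <- setdiff(names(df), id.vars)

  # Stack only the measured columns so stack never sees the factors.
  long <- stack(df[measure])

  # Repeat the ID rows once per measured column instead of relying on recycling.
  ids <- df[rep(seq_len(nrow(df)), times = length(measure)), id.vars, drop = FALSE]

  # row.names = NULL discards the repeated row names and numbers rows from 1.
  data.frame(ids, long, row.names = NULL)
}

df <- data.frame(
  A = gl(2, 2),
  B = gl(2, 1, 4),
  C = c(3.5, -1.8, -6.3, 2.8),
  D = c(-1.5, -7.4, 4.4, 7.9)
)
out <- melt_base(df)
out
```

Since stack only keeps plain vector columns, measured columns of other classes (Date, for example) would need separate handling.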