# melt function chooses wrong id variable with large datasets in R

#### JoachimPCS

##### New Member
Hello all,

I'm using a large dataset consisting of 2 groups of data: 2 columns in Excel with a header (the group name) and 15 000 rows of data. I would like to compare these data, so I transform the dataset with the melt function to get 1 column of data and 1 column of ID variables; then I can apply different statistical tests. With small datasets this works great: melt automatically chooses the name in row 1 as the ID variable and melts the data, giving me a result with all ID variables in column 1 and the corresponding data in column 2.
With this big dataset, however, it chooses the whole first column as ID variables instead of the first row. Is there a reason why this happens, and how can I make sure the first row is chosen as the ID variable and the lower rows as data?

If I specify that I want the first row to be the ID variable, I also get an error.

Code:
melt(dataset, id.vars = dataset[1, ], na.rm = TRUE)

Are there alternative ways to create a good reshaped dataset?

Kind Regards
Joachim

#### TheEcologist

##### Global Moderator
This is one way to do it in base R, with reshape:

Code:
# example data: one column per group
dd <- data.frame(g1 = rnorm(5), g2 = rnorm(5), g3 = rnorm(5))

# gather every column into a single "stat" column; "time" indexes the
# group (1, 2, 3) and "g" indexes the original row
reshape(dd, idvar = "g", varying = names(dd), v.names = "stat", direction = "long")
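For what it's worth, the original reshape2::melt call should also work once id.vars is dropped (or given column names): id.vars expects the names of columns to keep fixed, not a row of values, which is likely why the call above errored. A minimal sketch, with hypothetical example data standing in for the 2-column, 15 000-row sheet:

```r
library(reshape2)

# hypothetical data standing in for the two-group Excel sheet
dataset <- data.frame(groupA = rnorm(10), groupB = rnorm(10))

# with no id.vars, melt stacks every column: the "variable" column holds
# the group names (from the header row) and "value" holds the data
long <- melt(dataset, na.rm = TRUE)
head(long)
```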

#### trinker

##### ggplot2orBust
Here's a dplyr + tidyr approach (faster than the reshape2 package). It's not as compact as TE's approach above, but to me it's easier to remember because each function does one thing well.

Code:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr)

dd <- data.frame(g1 = rnorm(5), g2 = rnorm(5), g3 = rnorm(5))

dd %>%
  mutate(time = 1:n()) %>%                       # add a row index
  gather(g, stat, -time) %>%                     # stack the g* columns
  mutate(g = as.numeric(gsub("\\D", "", g))) %>% # "g1" -> 1, etc.
  arrange(time)

#### TheEcologist

##### Global Moderator
I personally find reshape easier to comprehend in this case, but note that the dplyr + tidyr approach, though more cumbersome, will be faster on large datasets. Those packages have been highly optimized, so they may be worthwhile to learn if you foresee yourself working with big datasets.

Note: at this point of development, you can still achieve speeds well beyond the dplyr + tidyr approach with custom base code on very large datasets, but it will be MUCH more cumbersome.
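As one illustration of the no-dependency base-R route (a minimal sketch, not the heavily optimized custom code referred to above): stack() reshapes a data frame of numeric columns into a value/ID pair in a single call.

```r
# hypothetical data standing in for the large two-group dataset
dd <- data.frame(g1 = rnorm(5), g2 = rnorm(5), g3 = rnorm(5))

# stack() puts all values in one column ("values") and the source
# column name in another ("ind"), using only base R
long <- stack(dd)
str(long)
```

For truly large data, the custom code TE mentions would likely replace this with lower-level vector operations, but stack() shows the shape of the result without any packages.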