So, I want to start with a blank data frame and add rows to it on the fly. What I do is this:
Code:
# initialize a blank data.frame
df <- data.frame()
# ... some code is run the result of which is a list like:
l1 <- list(name="Jack", sex="M")
df <- rbind(df, l1)
# ... some other code is run the result of which is another list like:
l2 <- list(name="Jill", sex="F")
df <- rbind(df, l2) # gives the warnings below
I get these warnings:
Code:
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "Jill") :
invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, ri, value = "F") :
invalid factor level, NAs generated
How can I get it done, without hitting that error? (I want to use data frames, not use a matrix and then convert to a data frame)
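Internally a data frame is just a list, but that doesn't mean you can treat the two exactly the same. Data frames have certain properties that must hold, so R treats them a little differently than lists. In your case the first rbind() turned the strings into factors (the default before R 4.0), so "Jill" and "F" are not existing factor levels and end up as NA.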
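A minimal sketch of one way around the warnings, assuming the rows arrive as named lists as in the question: wrap each list in a one-row data frame with stringsAsFactors = FALSE before binding, so the columns stay character instead of becoming factors. (On R 4.0 and later this is already the default, so the argument is only there to be explicit.)

```r
# start from a blank data.frame, as in the question
df <- data.frame()

l1 <- list(name = "Jack", sex = "M")
l2 <- list(name = "Jill", sex = "F")

# convert each list to a one-row data frame with character columns,
# then bind it on; no factor levels are involved, so no NAs are generated
df <- rbind(df, data.frame(l1, stringsAsFactors = FALSE))
df <- rbind(df, data.frame(l2, stringsAsFactors = FALSE))

str(df)
```

This still grows the data frame one row at a time, so it is only reasonable for a small number of rows.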
"His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich
Creating data frames like this is generally not a good idea. You should avoid "growing" your data frame whenever you can.
It is far more memory-efficient to first create an "empty" data frame of exactly the size of the end product, so that it is allocated in memory once.
You can then fill in the rows/columns "on the fly".
If you grow it, as above, R makes a copy of the whole data.frame at every iteration, which causes significant efficiency penalties, especially in loops.
Therefore, try to pre-allocate your data frame whenever possible.
This is why:
Code:
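# note: Rprofmem() needs an R build with memory profiling enabled
# pre-allocate the data
dat <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
Rprofmem()       # start logging allocations to "Rprofmem.out"
for(i in 1:1000) {
  dat[i,"x"] <- runif(1)
  dat[i,"y"] <- rnorm(1)
}
Rprofmem(NULL)   # stop logging so the file is complete before reading it
a <- length(readLines("Rprofmem.out"))
# grow the data
dat <- data.frame(x=NULL, y=NULL)
Rprofmem()       # overwrites the previous "Rprofmem.out"
for(i in 1:1000) {
  dat[i,"x"] <- runif(1)
  dat[i,"y"] <- rnorm(1)
}
Rprofmem(NULL)
b <- length(readLines("Rprofmem.out"))
# compare the number of logged allocations
barplot(c(a,b), ylab="instances of memory allocation", names.arg=c("pre-allocating","growing"))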
So in this example, growing results in about twice as many memory allocations as pre-allocating.
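Applied to the original question, pre-allocation could look something like the sketch below. It assumes the number of rows is known up front; `n` and `results` are made-up names standing in for whatever the intermediate code produces.

```r
# hypothetical stand-in for the lists produced by the intermediate code
results <- list(
  list(name = "Jack", sex = "M"),
  list(name = "Jill", sex = "F")
)
n <- length(results)

# allocate the full data frame once, with character (not factor) columns
df <- data.frame(name = character(n),
                 sex  = character(n),
                 stringsAsFactors = FALSE)

# fill in the rows "on the fly"
for (i in seq_len(n)) {
  df$name[i] <- results[[i]]$name
  df$sex[i]  <- results[[i]]$sex
}

print(df)
```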
The true ideals of great philosophies always seem to get lost somewhere along the road..
Yeah, I think if you grow the data frame as you go you're talking about O(n^2) work, because everything gets copied at each step, whereas pre-allocating and storing into that is just O(n). So the problem gets a whole lot worse as the size of the data you're working with increases. For small problems it's not a good idea to grow a data frame, but it's also not the end of the world. For large problems it can be the difference between minutes and hours.
Another thing to consider is the data structure itself. data.table is a lot more efficient in terms of memory management, so if you are working on very large data sets it's worth looking into.
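If the rows arrive as named lists, as in the original question, data.table's rbindlist() can bind them all in a single call instead of one rbind() per row. A small sketch, assuming the data.table package is installed and that the lists are collected first (`rows` is a made-up name for illustration):

```r
library(data.table)

# hypothetical collection of the lists produced along the way
rows <- list(
  list(name = "Jack", sex = "M"),
  list(name = "Jill", sex = "F")
)

# bind all rows at once; the result is a data.table
dt <- rbindlist(rows)

# convert back if a plain data.frame is needed
df <- as.data.frame(dt)
print(df)
```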
Hmm, interesting. I had not heard about data.table before, and I've never liked data.frame, it's so bulky. Look at the source code... it's huge! I'll give data.table a try.
Here are the benchmark results with data.table added. It's an improvement, but not much better:
Code:
# note: Rprofmem() needs an R build with memory profiling enabled
# pre-allocate the data
dat <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
Rprofmem()       # start logging allocations to "Rprofmem.out"
for(i in 1:1000) {
  dat[i,"x"] <- runif(1)
  dat[i,"y"] <- rnorm(1)
}
Rprofmem(NULL)   # stop logging so the file is complete before reading it
a <- length(readLines("Rprofmem.out"))
# grow the data
dat <- data.frame(x=NULL, y=NULL)
Rprofmem()
for(i in 1:1000) {
  dat[i,"x"] <- runif(1)
  dat[i,"y"] <- rnorm(1)
}
Rprofmem(NULL)
b <- length(readLines("Rprofmem.out"))
barplot(c(a,b), ylab="instances of memory allocation", names.arg=c("pre-allocating","growing"))
# now the same pre-allocated loop with data.table instead of data.frame
library(data.table)
dat <- data.table(x=rep(0,1000), y=rep(0,1000))
Rprofmem()
for(i in 1:1000) {
  dat[i, x := runif(1)]   # := assigns by reference
  dat[i, y := rnorm(1)]
}
Rprofmem(NULL)
d <- length(readLines("Rprofmem.out"))
barplot(c(a,b,d), ylab="instances of memory allocation", names.arg=c("pre-allocating","growing","data.tables"))