
Thread: Generating dataframes on the fly

  1. #1
    Generating dataframes on the fly



    So, I want to start with a blank data frame and add rows to it on the fly. What I do is this:

    Code: 
    # initialize a blank data.frame
    df <- data.frame()
    # ... some code is run the result of which is a list like:
    l1 <- list(name="Jack", sex="M")
    df <- rbind(df, l1)
    # ... some other code is run the result of which is another list like:
    l2 <- list(name="Jill", sex="F")
    df <- rbind(df, l2) # gives warnings, and the new row is filled with NA
    I get these warnings:
    Code: 
    Warning messages:
    1: In `[<-.factor`(`*tmp*`, ri, value = "Jill") :
      invalid factor level, NAs generated
    2: In `[<-.factor`(`*tmp*`, ri, value = "F") :
      invalid factor level, NAs generated
    How can I get this done without triggering those warnings? (I want to use data frames throughout, rather than building a matrix and then converting it to a data frame.)

  2. #2
    Bhoot

    If you are looking for a solution, try this:
    Code: 
    df <- data.frame() 
    l1 <- list(name="Jack", sex="M") 
    df <- rbind(df, l1) # for consistency you can use   df <- rbind(df, as.data.frame(l1))
    l2 <- list(name="Jill", sex="F") 
    df <- rbind(df, as.data.frame(l2))
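
    For context, here is a minimal sketch of why the original code warned and one way to sidestep it. In R versions before 4.0.0, strings were converted to factors by default, so the first rbind() fixed the factor levels at "Jack" and "M", and the values in the second row matched no existing level. Passing stringsAsFactors = FALSE keeps the columns as character vectors (this became the default in R 4.0.0). The variable names are the ones used in the thread.

```r
# Rows as produced elsewhere in the code (names taken from the thread).
l1 <- list(name = "Jack", sex = "M")
l2 <- list(name = "Jill", sex = "F")

# Convert each list to a one-row data frame with character columns,
# so there are no factor levels for later rows to conflict with.
df <- data.frame()
df <- rbind(df, as.data.frame(l1, stringsAsFactors = FALSE))
df <- rbind(df, as.data.frame(l2, stringsAsFactors = FALSE))

str(df)  # inspect the column types
```

    With character columns, new names or categories can be appended without any level bookkeeping. Convert to factor at the end if a factor is really needed.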
    In the long run, we're all dead.

  3. #3

    Interesting! I thought data.frame is a bunch of lists together; it appears it is not exactly that!

  4. #4
    Dason

    Internally, a data frame is just a list, but that doesn't mean you can treat the two exactly the same. Data frames have certain properties that must hold (for example, every column has the same length), so R treats them a little differently from plain lists.
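
    A small sketch of that point, using made-up example data: the data frame passes is.list(), but its class attribute and the equal-length requirement are what set it apart from an ordinary list.

```r
df <- data.frame(name = c("Jack", "Jill"), sex = c("M", "F"),
                 stringsAsFactors = FALSE)

is.list(df)   # TRUE: the columns are stored as list elements
class(df)     # "data.frame": the class attribute changes how R treats it
lengths(df)   # one entry per column, all equal to the number of rows

# A plain list may hold elements of different lengths...
ragged <- list(a = 1:2, b = 1:3)

# ...but it cannot become a data frame, because the columns do not line up.
result <- tryCatch(as.data.frame(ragged), error = function(e) "error")
result
```

    This is also why rbind() on a data frame does more work than appending to a list: every column has to be extended consistently.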
    "His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich

  5. #5
    TheEcologist

    Creating data frames like this is generally not a good idea. You should avoid "growing" a data frame whenever you can.
    It is far more memory-efficient to first create an "empty" data frame of the exact size of the end product, so that the memory is allocated once.
    You can then fill in the rows and columns "on the fly".

    If you grow the data frame, as above, R creates a copy of the whole data.frame at every iteration, which causes significant efficiency penalties, especially in loops.

    Therefore, always try to pre-allocate your data frame.

    Here is why:

    Code: 
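    # Note: Rprofmem() needs an R build with memory profiling enabled.

    # pre-allocate the data (numeric NA, so the columns are not re-typed
    # on the first assignment)
    dat <- data.frame(x = rep(NA_real_, 1000), y = rep(NA_real_, 1000))

    Rprofmem("Rprofmem.out")   # start logging memory allocations
    for (i in 1:1000) {
      dat[i, "x"] <- runif(1)
      dat[i, "y"] <- rnorm(1)
    }
    Rprofmem(NULL)             # stop logging before reading the file

    # number of logged allocations
    a <- length(readLines("Rprofmem.out"))

    # grow the data
    dat <- data.frame(x = NULL, y = NULL)

    Rprofmem("Rprofmem.out")
    for (i in 1:1000) {
      dat[i, "x"] <- runif(1)
      dat[i, "y"] <- rnorm(1)
    }
    Rprofmem(NULL)

    b <- length(readLines("Rprofmem.out"))

    barplot(c(a, b), ylab = "instances of memory allocation",
            names.arg = c("pre-allocating", "growing"))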
    So in this example, growing really is twice as bad as pre-allocating, measured by the number of memory allocations.
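
    When the final number of rows is not known in advance, a common alternative is to collect the rows in a list and bind them once at the end. The following is only a sketch of that idea, not part of the benchmark above; n and the column names x and y simply mirror the example.

```r
n <- 1000

# Pre-allocate a list with one slot per row.
rows <- vector("list", n)

for (i in seq_len(n)) {
  # Each iteration produces one row as a small one-row data frame.
  rows[[i]] <- data.frame(x = runif(1), y = rnorm(1))
}

# Combine all rows in a single call instead of n separate rbind() calls.
dat <- do.call(rbind, rows)

dim(dat)
```

    If the rows do not depend on each other, generating whole columns at once, as in data.frame(x = runif(n), y = rnorm(n)), avoids the loop altogether.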
    The true ideals of great philosophies always seem to get lost somewhere along the road..

  6. #6
    Dason

    Yeah, I think if you grow the data frame as you go you're talking about O(n^2), since every append copies all the rows added so far, whereas preallocating and storing into that is just O(n). So the problem gets a whole lot worse as the size of the data you're working with increases. For small problems, growing a data frame is not a good idea, but it's also not the end of the world. For large problems, it can be the difference of minutes or hours.
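
    To make the O(n^2) versus O(n) argument concrete, here is a back-of-the-envelope cost model. It assumes, as a simplification, that the k-th append copies the k - 1 existing rows and writes one new one; the function names are made up for this illustration and real timings will depend on R's internals.

```r
# Rows handled when growing: 1 + 2 + ... + n = n * (n + 1) / 2.
rows_touched_growing <- function(n) n * (n + 1) / 2

# Rows handled when pre-allocating: each row is written exactly once.
rows_touched_prealloc <- function(n) n

sizes <- c(1e3, 1e4, 1e5)
cost <- data.frame(
  n        = sizes,
  growing  = rows_touched_growing(sizes),
  prealloc = rows_touched_prealloc(sizes)
)
cost
```

    Under this model, a tenfold increase in n makes the growing approach roughly a hundred times more expensive, while the pre-allocated version only gets ten times more expensive.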

    Another thing to consider is the data structure itself: data.table is a lot more efficient in terms of memory management, so if you are working on very large data sets it's worth looking into.
    "His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich


  8. #7
    TheEcologist

    Hmm, interesting. I had not heard about data.table before, and I've never liked data.frame; it's so bulky. Look at the source code... it's huge! I'll give data.table a try.

    Here are the benchmark results with data.table added. It is an improvement, but not much better:

    Code: 
    
    # Note: Rprofmem() needs an R build with memory profiling enabled.

    # pre-allocate the data (numeric NA, so the columns are not re-typed
    # on the first assignment)
    dat <- data.frame(x = rep(NA_real_, 1000), y = rep(NA_real_, 1000))

    Rprofmem("Rprofmem.out")   # start logging memory allocations
    for (i in 1:1000) {
      dat[i, "x"] <- runif(1)
      dat[i, "y"] <- rnorm(1)
    }
    Rprofmem(NULL)             # stop logging before reading the file

    a <- length(readLines("Rprofmem.out"))

    # grow the data
    dat <- data.frame(x = NULL, y = NULL)

    Rprofmem("Rprofmem.out")
    for (i in 1:1000) {
      dat[i, "x"] <- runif(1)
      dat[i, "y"] <- rnorm(1)
    }
    Rprofmem(NULL)

    b <- length(readLines("Rprofmem.out"))

    barplot(c(a, b), ylab = "instances of memory allocation",
            names.arg = c("pre-allocating", "growing"))

    # now avoid data.frame altogether and use data.table instead
    library(data.table)
    dat <- data.table(x = rep(0, 1000), y = rep(0, 1000))

    Rprofmem("Rprofmem.out")
    for (i in 1:1000) {
      dat[i, x := runif(1)]
      dat[i, y := rnorm(1)]
    }
    Rprofmem(NULL)

    d <- length(readLines("Rprofmem.out"))

    barplot(c(a, b, d), ylab = "instances of memory allocation",
            names.arg = c("pre-allocating", "growing", "data.tables"))

  9. #8
    TheEcologist

    OK, I have to thank you, Dason. I just found this little gem in the data.table package: set()

    Look at the benchmark results now when using set()!

    Code: 
    
    
    
    # Note: Rprofmem() needs an R build with memory profiling enabled.

    # pre-allocate the data (numeric NA, so the columns are not re-typed
    # on the first assignment)
    dat <- data.frame(x = rep(NA_real_, 1000), y = rep(NA_real_, 1000))

    Rprofmem("Rprofmem.out")   # start logging memory allocations
    for (i in 1:1000) {
      dat[i, "x"] <- runif(1)
      dat[i, "y"] <- rnorm(1)
    }
    Rprofmem(NULL)             # stop logging before reading the file

    a <- length(readLines("Rprofmem.out"))

    # grow the data
    dat <- data.frame(x = NULL, y = NULL)

    Rprofmem("Rprofmem.out")
    for (i in 1:1000) {
      dat[i, "x"] <- runif(1)
      dat[i, "y"] <- rnorm(1)
    }
    Rprofmem(NULL)

    b <- length(readLines("Rprofmem.out"))

    # now avoid data.frame altogether and use data.table instead
    library(data.table)
    dat <- data.table(x = rep(0, 1000), y = rep(0, 1000))

    Rprofmem("Rprofmem.out")
    for (i in 1:1000) {
      dat[i, x := runif(1)]
      dat[i, y := rnorm(1)]
    }
    Rprofmem(NULL)

    d <- length(readLines("Rprofmem.out"))

    # using set() instead
    dat <- data.table(x = rep(0, 1000), y = rep(0, 1000))

    Rprofmem("Rprofmem.out")
    for (i in 1:1000) {
      set(dat, i, 1L, runif(1))   # row i, column 1 (x)
      set(dat, i, 2L, rnorm(1))   # row i, column 2 (y)
    }
    Rprofmem(NULL)

    e <- length(readLines("Rprofmem.out"))

    barplot(c(a, b, d, e), ylab = "instances of memory allocation",
            names.arg = c("pre-allocating", "growing", "data.tables", "set"))
    Looks like we just found what is by far the most efficient of the approaches tried here!
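
    As a possible follow-up, set() also accepts a vector of row indices, so when the values can be generated up front a whole column can be filled in one call. This is only a sketch: it assumes the data.table package is installed (the code is skipped otherwise), and n, x and y mirror the benchmark above.

```r
n <- 1000L

if (requireNamespace("data.table", quietly = TRUE)) {
  dat <- data.table::data.table(x = numeric(n), y = numeric(n))

  # One set() call per column: i is the vector of row indices,
  # j the column name, and value the vector of replacement values.
  # set() modifies dat by reference, so no reassignment is needed.
  data.table::set(dat, i = seq_len(n), j = "x", value = runif(n))
  data.table::set(dat, i = seq_len(n), j = "y", value = rnorm(n))
}
```

    The row-by-row loop is still the right tool when each row depends on earlier results. The vectorised form only applies when all the values are available at once.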

  11. #9


    That was a great finding! http://cran.r-project.org/web/packag...able-intro.pdf is an absolutely helpful resource too.
