
Thread: Evidence that data.table isn't always fastest

#1 trinker (Buffalo, NY)

    Evidence that data.table isn't always fastest




Long story short: Tyler ran his mouth telling the creator of the data.table package that it's slower sometimes (didn't know who I was talking to). See that link here (LINK) in the comment section under Matthew's response. He says prove it. Tyler says here's one link (LINK).

We've discussed here a few times that it's not always faster. I figure if the data.table people will make those things faster, why not point it out? What we need are examples where data.table isn't faster (and even where it is, I'd like examples of the good times to use it too). Then I can link to here, say this is what we've got, and he can do with it what he wants (hopefully make things better).
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

#2 Matthew Dowle

    Re: Evidence that data.table isn't always fastest

Hi. I've read the thread and posted in it. I couldn't see any code using data.table there, so I've asked for that, please. Yes, please do point these things out. More examples showing when data.table is and isn't appropriate would be good, and we'll do our best to make it better. Here's a direct link to my reply (basically: please retry with v1.8.0 and post code).
But data.table isn't a silver bullet; a hash table will always be more appropriate for many tasks. See the fastmatch package, for example - was that tried for the fast dictionary thread?
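For reference, a minimal sketch of the fastmatch approach mentioned here, assuming the NETtalk data loaded as in the posts below (fmatch() builds and caches a hash on the lookup table the first time it's called):
Code: 
library(fastmatch)
load(url("http://dl.dropbox.com/u/61803503/NETtalk.RData"))
idx <- fmatch("stuff", NETtalk$word)  # first call builds and caches a hash on NETtalk$word
NETtalk[idx, ]                        # later fmatch() calls on the same table reuse the cache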
data.table is not bug free, either. Performance issues are usually considered bugs and fixed (if we know about them) - like, by the sounds of it, this one.
    I'll have a word with the data.table marketing department - they should have told me about the discussions on this forum!
    Matthew
    Last edited by Matthew Dowle; 05-19-2012 at 08:08 AM. Reason: Added link to reply in other thread.


#3 Jake (Boulder, CO)

    Re: Evidence that data.table isn't always fastest

    Here's what I was doing in the thread from January:
    Code: 
    > ### get dataset
    > 
    > load(url("http://dl.dropbox.com/u/61803503/NETtalk.RData"))
    > str(NETtalk)
    'data.frame':	20137 obs. of  2 variables:
     $ word     : chr  "hm" "hmm" "hmmm" "hmph" ...
     $ syllables: num  1 1 1 1 2 2 1 1 1 1 ...
    > 
    > ### set up data.table
    > 
    > library(data.table)
    data.table 1.7.10  For help type: help("data.table")
    > NETtalk.table <- data.table(NETtalk, key="word")
    > 
    > ### compare to vector scan
    > 
    > library(rbenchmark)
    > benchmark(vectorScan=NETtalk[NETtalk$word=="stuff",],
    +           dataTable=NETtalk.table[J("stuff"),])
            test replications elapsed relative user.self sys.self
    2  dataTable          100    0.72 5.538462      0.71        0
    1 vectorScan          100    0.13 1.000000      0.12        0
      user.child sys.child
    2         NA        NA
    1         NA        NA
As you can see, this is data.table 1.7.10, and here data.table is markedly slower than the simple vector scan. However, here's what it looks like now that I've just updated to data.table 1.8.0:
    Code: 
    > ### get dataset
    > 
    > load(url("http://dl.dropbox.com/u/61803503/NETtalk.RData"))
    > str(NETtalk)
    'data.frame':	20137 obs. of  2 variables:
     $ word     : chr  "hm" "hmm" "hmmm" "hmph" ...
     $ syllables: num  1 1 1 1 2 2 1 1 1 1 ...
    > 
    > ### set up data.table
    > 
    > library(data.table)
    data.table 1.8.0  For help type: help("data.table")
    > NETtalk.table <- data.table(NETtalk, key="word")
    > 
    > ### compare to vector scan
    > 
    > library(rbenchmark)
    > benchmark(vectorScan=NETtalk[NETtalk$word=="stuff",],
    +           dataTable=NETtalk.table[J("stuff"),])
            test replications elapsed relative user.self sys.self
    2  dataTable          100    0.17   1.0625      0.17        0
    1 vectorScan          100    0.16   1.0000      0.16        0
      user.child sys.child
    2         NA        NA
1         NA        NA
    Now the two methods look pretty much the same. Clearly this is an improvement but shouldn't data.table be faster? Do we need a bigger test dataset for the advantage to emerge?
    “In God we trust. All others must bring data.”
    ~W. Edwards Deming


#4 trinker (Buffalo, NY)

    Re: Evidence that data.table isn't always fastest

I ran this with a hash table as well, using microbenchmark on Windows 7 with the newest version of data.table (1.8.0).

Code: 
### get dataset
load(url("http://dl.dropbox.com/u/61803503/NETtalk.RData"))

### set up packages
library(data.table)
library(microbenchmark)

### build the dictionary as a hashed environment: one entry per word
buildHash <- function(x) {
    e <- new.env(hash = TRUE, size = nrow(x), parent = emptyenv())
    apply(x, 1, function(row) assign(row[1], as.numeric(row[2]), envir = e))
    e
}
env <- buildHash(NETtalk)  # the dictionary lives in env

### the three competitors
NETtalk.table <- data.table(NETtalk, key = "word")
vectorScan <- function() NETtalk[NETtalk$word == "stuff", ]
dataTable  <- function() NETtalk.table[J("stuff"), ]
hash       <- function() get("stuff", envir = env)

(op <- microbenchmark(
    vectorScan(),
    dataTable(),
    hash(),
    times = 1000L))
    Results:
    Code: 
        
    Unit: microseconds
              expr      min       lq   median       uq        max
    1  dataTable() 2603.882 2742.217 2831.331 3121.299 479626.882
    2       hash()    2.333    3.733    9.332   11.664    177.294
    3 vectorScan() 1350.697 1393.154 1423.947 1582.812 439744.716
A visualization of the results: [ggplot2 plot of the microbenchmark timings; image not preserved]
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -


#5 bryangoodrich (Sacramento, California)

    Re: Evidence that data.table isn't always fastest

I gotta say, whenever I can make use of a hash, I may just do that - when it's appropriate, of course.

BTW, thanks Matthew for participating. It's been a while since I made a quick simulation like the one in the relevant thread. There have been some other times in the past several months where data.table didn't work out, and I think there's code, but I can never recall those threads. When I get time, I'll definitely take a look at my comment from back then and see if I can't recreate the situation.

Tyler, I didn't know ggplot had such a visualization. I like it!

#6 Matthew Dowle

    Re: Evidence that data.table isn't always fastest

    Hi,

Thanks for the example. There's a lot to say, but I'll try and keep it short:

In general, beware of finding significant differences in insignificant times. The time for the task in hand seems to be 0.000 seconds vs 0.000 seconds; you have to repeat it 100 times before the compute time exceeds 1 millisecond. That's only a useful benchmark if you really do want to repeat the task over and over again, i.e. if the real task is many individual lookups in separate calls. Otherwise you're just timing very small overhead differences which make no difference in practice. Perhaps it would be better if benchmark()'s default were replications=3 rather than replications=100. If 3 runs take 0.000 seconds, who cares?

The way to scale it up is to increase the size of the data. The size of the dictionary vs the amount of data being looked up in the vectorized call makes a difference, too. Aim for tasks where a single run takes several seconds (or minutes), then take the minimum of just 3 runs. The reason people say to do 3 consecutive runs is to isolate L2 cache effects, which cause the first run to take longer.
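For illustration, a minimal sketch of that timing pattern (bigTask() is a hypothetical stand-in for a workload that takes seconds per run):
Code: 
# take the minimum of 3 single runs of a multi-second task,
# rather than averaging hundreds of sub-millisecond ones;
# bigTask() is a hypothetical stand-in for the real workload
runs <- replicate(3, system.time(bigTask())["elapsed"])
min(runs)  # the minimum discards the cache warm-up in the first run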

That said, I'll try and reduce overhead anyway, just to prove a point for marketing reasons. That's one reason set() was added to data.table 1.8.0. Someone claimed that the structure of data.frame itself made it slow even when manipulated in C, so I added set() to show that data.frame's internal structure is excellent and that claim wasn't true. Even though we'll never loop set() like that in practice, it's there as an option if you do ever want a 'loopable' assign by reference.
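A toy illustration of that loopable assign by reference (the table and values here are made up for the example):
Code: 
library(data.table)
DT <- data.table(x = 1:5, y = 6:10)
for (i in 1:nrow(DT))
    set(DT, i = i, j = 1L, value = DT[[1L]][i] * 10L)  # updates column 1 in place, no copies
DT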

On trinker's example: looking up a single word over and over again in a loop isn't what data.table is for. Not just for the reasons above, but because by default mult is "all", which means it looks for the start and end locations of *groups*. data.table doesn't know whether a key is unique, and assumes it isn't. Try setting mult="first" or mult="last" (doesn't matter which) if *you* know the data is unique. It won't make a difference in this benchmark, though, because this benchmark is dominated by overhead.

So, pass a *vector* of words to i, and be sure to use SJ rather than J to tell data.table to do a binary merge rather than many separate binary searches (i.e. key i as well as x). If you're looping the call to time it, then it may or may not be fair to isolate the time to key i. Those things should get data.table much closer on timings of single runs.

But even then, for dictionaries of single words, a hash table is more appropriate. For that, data.table has its own chmatch(), which operates like a hash lookup without building a hash table (it reuses R's own internal hash table rather than building a new one). See ?chmatch for a comparison to fastmatch. If you time chmatch(), it's important to include the time to build the hash: chmatch doesn't need one, but the others do. The others are faster on subsequent calls, though. It's all quite involved to compare properly.
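A sketch of those three suggestions, using the objects from trinker's post above ('words' is a made-up batch of lookups):
Code: 
words <- c("stuff", "hm", "hmm")           # hypothetical batch of lookups
NETtalk.table[J("stuff"), mult = "first"]  # key known unique: skip the group-range search
NETtalk.table[SJ(words)]                   # keyed i: one binary merge, not separate searches
chmatch(words, NETtalk$word)               # hash-style match reusing R's internal string hash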

We're trying to make data.table more user friendly, to help users avoid falling into these traps. For example, in v1.8.1 mean() is now automatically .Internal()ed, so the user doesn't need to know about the difference between sum() and mean(), or the existence of data.table wiki point 3. More convenience features like that, basically.
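The kind of grouped call that change speeds up, sketched on the thread's data (grouping by first letter is made up for illustration):
Code: 
# grouped mean(); v1.8.1 substitutes the fast internal mean automatically
NETtalk.table[, mean(syllables), by = list(first = substr(word, 1, 1))]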

Hope that helps. I look forward to seeing the new results. Hopefully it'll lead to some improvements to data.table - these discussions usually do!

    Matthew


#7 Matthew Dowle

    Re: Evidence that data.table isn't always fastest


    Quote Originally Posted by Jake View Post
    However, here's what it looks like now that I've just updated to data.table 1.8.0:
[benchmark code and 1.8.0 output snipped; see post #3 above]
    Now the two methods look pretty much the same. Clearly this is an improvement but shouldn't data.table be faster? Do we need a bigger test dataset for the advantage to emerge?
Yes, a much bigger dataset - one where we're not comparing tasks that take 0.0017 seconds vs 0.0016 seconds. The example in vignette("datatable-timing") uses a 10 million row dataset (rather than 20,000) with 2 columns in the key. There, a single run taking 7.6 seconds with a vector scan is reduced to 0.018 seconds with data.table. The point is that as you scale the number of rows up towards 2 billion, you still get under 0.1 seconds with data.table, but the vector scan can take many minutes, or even run out of memory. There is significant time needed to setkey() (which the vector scan doesn't need), but that's typically done once up front, like indexing a database.
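A sketch of a benchmark at that sort of scale (the data here is randomly generated for illustration, not the vignette's dataset):
Code: 
library(data.table)
N  <- 1e7                                  # 10 million rows, generated for illustration
DF <- data.frame(word = paste0("w", 1:N),
                 syllables = sample(1:5, N, replace = TRUE),
                 stringsAsFactors = FALSE)
DT <- data.table(DF)
setkey(DT, word)                           # one-off cost, like building a database index
system.time(DF[DF$word == "w9999999", ])   # vector scan: touches all N rows
system.time(DT[J("w9999999"), ])           # binary search on the key: near-instant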

