Long story short: Tyler ran his mouth telling the creator of the data.table package that it's slower sometimes (didn't know who I was talking to). See that link here (LINK) in the comment section under Matthew's response. He says prove it. Tyler says here's one link (LINK).
We've discussed here a few times that it's not always faster. I figure if the data.table people are willing to make those things faster, why not point them out? What we need are examples where data.table isn't faster (and even when it is, I'd like examples of the good times to use it too). Then I can link here and say this is what we've got, and he can do with it what he wants (hopefully make things better).
"If you torture the data long enough it will eventually confess."
- Ronald Harry Coase
Hi. I've read the thread and posted into it. I couldn't see any code using data.table there, so I've asked for that, please. Yes, please do point these things out. More examples showing when data.table is and isn't appropriate would be good, and we'll do our best to make it better. Here's a direct link to my reply (basically: please retry with v1.8.0 and post code).
But data.table isn't a silver bullet; a hash table will always be more appropriate for many tasks. See the fastmatch package, for example - was that tried for the fast dictionary thread?
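To make the hash-table idea concrete: R's environments are hash tables by default, so a dictionary lookup can be done without any package at all (fastmatch::fmatch() layers match()-like semantics over the same idea). This is a minimal base-R sketch; the words and syllable counts are made up for illustration:

```r
# A minimal base-R sketch of a hash-table dictionary: environments are
# hashed by default, giving O(1) expected lookup per key.
# The words and syllable counts below are invented for illustration.
dict <- new.env(hash = TRUE)
assign("apple",  2L, envir = dict)
assign("banana", 3L, envir = dict)

get("banana", envir = dict)                      # 3
exists("durian", envir = dict, inherits = FALSE) # FALSE
```

Note `inherits = FALSE` on exists(), otherwise it would also search enclosing environments.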
data.table is not bug free, either. Performance issues are usually considered bugs and fixed (if we know about them) - like, by the sound of it, this one.
I'll have a word with the data.table marketing department - they should have told me about the discussions on this forum!
Last edited by Matthew Dowle; 05-19-2012 at 09:08 AM.
Reason: Added link to reply in other thread.
I gotta say, whenever I can make use of a hash, I may just do that, albeit only when it's appropriate.
BTW, thanks Matthew for participating. It's been a while since I made a quick simulation like the one in the relevant thread. There were some other times in the past several months where data.table didn't work out, and I think there's code, but I can never recall those threads. When I get time, I'll definitely take a look at my comment back then and see if I can't recreate the situation.
Tyler, I didn't know ggplot had such visualization. I like it!
Thanks for the example. There's a lot to say but I'll try and keep it short :
In general, beware of finding significant differences between insignificant times. The timing of the task at hand seems to be 0.000 seconds vs 0.000 seconds. You're having to repeat it 100 times before the compute time exceeds 1 millisecond. This is only a useful benchmark if you really do want to repeat the task over and over again; i.e., if the real task is many individual lookups in separate calls. Otherwise you're just timing very small overhead differences which make no difference in practice. Perhaps it would be better if benchmark()'s default were times=3 rather than times=100. If 3 runs take 0.000 seconds, who cares?
The way to scale it up is to increase the size of the data. The size of the dictionary vs the amount of data being looked up in the vectorized call also makes a difference. Aim for tasks where a single run takes several seconds (or minutes), then take the minimum of just 3 runs. The reason people say to do 3 consecutive runs is to isolate L2 cache effects causing the first run to take longer.
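The "minimum of a few large runs" idea above can be sketched in a few lines of base R. min_time() here is a hypothetical helper written for illustration, not part of any package:

```r
# Sketch: time one big run several times and take the minimum, instead of
# averaging 100 repeats of a sub-millisecond task.
# min_time() is a hypothetical helper, invented for this illustration.
min_time <- function(f, reps = 3) {
  min(replicate(reps, system.time(f())[["elapsed"]]))
}

# Make the task large enough that a single run is measurable:
min_time(function() sum(sqrt(abs(rnorm(1e6)))))
```

Passing a function (rather than a bare expression) ensures the work is actually redone on every repeat, instead of R's lazy evaluation caching the result after the first run.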
That said, I'll try and reduce overhead anyway, just to prove a point for marketing reasons. That's one reason why set() was added in data.table 1.8.0. Someone claimed that the structure of data.frame itself made it slow even when manipulated in C, so I added set() to show that data.frame's internal structure is excellent and that claim wasn't true. Even though we'll never loop set() like that in practice, it's there as an option if you ever want a 'loopable' assign by reference.
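A small sketch of set() as a loopable assign by reference, on toy data and assuming data.table is installed:

```r
# Sketch: set() assigns by reference with minimal per-call overhead, so it
# can safely be called inside a loop (unlike DT[i, j := value], which pays
# some dispatch overhead on every call). Toy data for illustration.
library(data.table)

DT <- data.table(a = 1:5, b = letters[1:5])
for (i in 1:5)
  set(DT, i = i, j = "a", value = DT$a[i] * 10L)

DT$a   # 10 20 30 40 50
```

No copy of DT is taken at any point; column "a" is updated in place.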
On trinker's example, looking up a single word over and over again in a loop isn't what data.table is for. Not just for the reasons above, but because by default mult is "all", which means it looks for the start and end location of *groups*. data.table doesn't know whether a key is unique, and assumes it isn't. Try setting mult="first" or mult="last" (doesn't matter which) if *you* know the data is unique. It won't make a difference in this benchmark, though, because this benchmark is dominated by overhead.

So, pass a *vector* of words to i, and be sure to use SJ rather than J to tell data.table to do a binary merge rather than many separate binary searches (i.e., key i as well as x). If you're looping the call to time it, then it may or may not be fair to isolate the time to key i. Those things should get data.table closer to green, on timings of single runs.

But even then, for dictionaries of single words, a hash table is more appropriate. For that, data.table has its own: chmatch(), which operates like a hash without building a hash table (it reuses R's own internal hash table without having to build a new one). See ?chmatch for a comparison to fastmatch. Then if you time chmatch(), it's important to include the time to build the hash: chmatch doesn't need any, but the others do. The others are faster on subsequent calls, though. It's all quite involved to compare properly.
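Putting those points together, the vectorized form might look like this (toy dictionary for illustration, assuming data.table is installed):

```r
# Sketch: one vectorized keyed lookup instead of a loop of single lookups.
# Toy dictionary for illustration.
library(data.table)

DT <- data.table(word      = c("apple", "banana", "cherry"),
                 syllables = c(2L, 3L, 2L),
                 key = "word")

lookup <- c("banana", "cherry")

# SJ() keys i too, so this is one binary merge rather than many separate
# binary searches; mult = "first" tells data.table the key is unique.
DT[SJ(lookup), mult = "first"]

# For single-word dictionaries, chmatch() acts like a hash lookup by
# reusing R's internal character cache:
chmatch(lookup, DT$word)   # 2 3
```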
We're trying to make data.table more user friendly to help users avoid falling into these traps. For example, in v1.8.1, mean() is now automatically .Internal()ed so the user doesn't need to know about the difference between sum() and mean(), or the existence of data.table wiki point 3. More convenience features like that, basically.
Hope that helps. I look forward to seeing the new results. Hopefully it'll lead to some improvements to data.table; these discussions usually do!
However, here's what it looks like now that I've just updated to data.table 1.8.0:
> ### get dataset
'data.frame': 20137 obs. of 2 variables:
$ word : chr "hm" "hmm" "hmmm" "hmph" ...
$ syllables: num 1 1 1 1 2 2 1 1 1 1 ...
> ### set up data.table
data.table 1.8.0 For help type: help("data.table")
> NETtalk.table <- data.table(NETtalk, key="word")
> ### compare to vector scan
        test replications elapsed relative user.self sys.self user.child sys.child
2  dataTable          100    0.17   1.0625      0.17        0         NA        NA
1 vectorScan          100    0.16   1.0000      0.16        0         NA        NA
Now the two methods look pretty much the same. Clearly this is an improvement but shouldn't data.table be faster? Do we need a bigger test dataset for the advantage to emerge?
Yes, a much bigger dataset. One where we're not comparing tasks that take 0.0017 seconds vs 0.0016 seconds. The example in vignette("datatable-timing") shows a 10 million row dataset (rather than 20,000) with 2 columns in the key. That shows a single run taking 7.6 seconds reduced to 0.018 with data.table. The point there is that as you scale the number of rows up towards 2 billion, you still get under 0.1 seconds with data.table, but the vector scan can take many minutes, or even 'out of memory'. There is significant time needed to setkey() (which the vector scan doesn't need) but that's typically done once up front, like a database.
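A sketch of that scaling experiment follows; N is set to 1e6 here so it runs quickly (push it to 1e7 or beyond to see the kind of gap the vignette describes), and data.table is assumed to be installed:

```r
# Sketch: vector scan vs keyed binary search as the row count grows.
library(data.table)
set.seed(1)

N  <- 1e6                                  # use 1e7+ to see a dramatic gap
DT <- data.table(id = sample(sprintf("id%07d", 1:N)), v = rnorm(N))
DF <- as.data.frame(DT)

system.time(ans1 <- DF[DF$id == "id0000042", ])  # vector scan: touches all N rows
system.time(setkey(DT, id))                      # one-off sort cost, like a database index
system.time(ans2 <- DT[J("id0000042")])          # keyed binary search: O(log N)
```

Both approaches return the same row; only the setkey() cost is paid once up front, after which every lookup stays fast no matter how large N gets.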