# Thread: estimating population from hash table fill rate

1. ## estimating population from hash table fill rate

Apologies if this is the wrong forum, but it looks reasonably close.

I have a practical problem: estimating the number of actual entries in a large table based on the fill level of a hash table. The setting is a content addressable file store in which individual files are stored in a path determined by their SHA1 hash value. The storage is a multilevel hierarchy of directories, with level R of the hierarchy keyed by the Rth octet of the hash value.

When there are a lot of files (tens of millions) it is time-consuming to traverse the file tree and count them. I believe there should be a way to estimate the overall population size, maybe to within a factor of 2, based on the fill level of the buckets, assuming the octet values in the SHA1 hash are uniformly distributed, which they should be.

It's a kind of reverse application of the birthday problem, except that I am going to count the number of holes with no pigeons, and I want to estimate the total number of pigeons from the holes that are empty.

The directory tree is 3 levels deep, with 16M possible end nodes and a possibility of many files in each final directory. Although it is impractical to walk the whole tree, I can directly count the top couple of levels to see which of the 2^16 prefixes exist. At lower levels I can sample to get a reasonable approximation of the fill density.
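For the "count the empty holes" idea, the standard balls-in-bins result gives a closed-form estimate: if n uniform hashes land in m buckets, the expected number of occupied buckets is m(1 - (1 - 1/m)^n), which can be inverted to recover n from an observed occupancy count. A minimal sketch (the function name is mine, not from the thread):

```python
import math

def estimate_population(m, occupied):
    """Estimate how many uniformly-hashed items produced `occupied`
    non-empty buckets out of `m` total buckets.

    Inverts E[occupied] = m * (1 - (1 - 1/m)**n) to solve for n:
        n = log(1 - occupied/m) / log(1 - 1/m)
    """
    if occupied >= m:
        raise ValueError("table is saturated; estimate is unbounded")
    return math.log(1.0 - occupied / m) / math.log(1.0 - 1.0 / m)
```

For the two-level case here, m would be 2^16 and `occupied` the number of distinct two-octet prefixes found. The estimate degrades as the table approaches saturation, since the count of empty buckets carries less and less information.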

I don't think the math is too hard, but my counting problem experience is very old now, and I am not familiar with the problem domain. Maybe this is a problem that's been written up somewhere and someone here would know where.

Any ideas or help would be most welcome. Thanks.

2. ## Re: estimating population from hash table fill rate

Hey gbuchana.

How big is the table, per se? SHA1 is the hash, but how big is the key size that you are using for the table itself? Also, it seems you are using a multi-level table, so are you estimating for one table or all tables?

I also take it the fill level is just the number of nodes filled (not the entries which may be added via a linked list).

SHA1 has been studied quite extensively with regards to its collision properties: have you seen anything out there regarding the distributional aspects of SHA1 with respect to collisions?

3. ## Re: estimating population from hash table fill rate

The way this storage table works is that it is implemented as a multilevel hierarchy of directories in a filesystem. So if a chunk of content hashes to "79a9bc37e0ca218d87c4e49a4714242cb9b5215c", it will be stored in the directory path "/79/a9/bc/79a9bc37e0ca218d87c4e49a4714242cb9b5215c". The SHA1 hash of the content is, at once, the unique identifier of the content object and its location. The fact that this is implemented as a filesystem makes it slow to traverse the whole thing and count. And even in situations where you do have to traverse the whole table, you need some up-front guesstimate of the size in order to judge how long the operation is likely to take. This is where my estimating question comes in.

My reasoning was that I can look at the actual occurrences of directory entries at the top two levels -- that would be quick, and would be conceptually the same as looking at the fill rate of a 2^16-element hash table.

So far at least, SHA1 does not have any practical collision attack, and accidental collisions are effectively impossible, so I'm not going to worry about that aspect. But the SHA1 hash values are, by assumption, uniform across my data, and collisions within the first 16 bits are inevitable and uniform.

Since I posted my question, I have implemented a sampling-and-estimating process that is, I believe, reasonably accurate. What I am doing is randomly probing down into the table and counting the fill density at each level. The top level will have all 256 entries, with high probability, any time the overall table holds more than about 1500 items. Looking at the second level in a dozen places shows an average population of 237.8. At the third level, the average population is 2.83; at the bottom level it is 1.01. Multiplying 256 x 237.8 x 2.83 x 1.01, I get an estimate of about 174,000, for an actual population of 174,515. Really not bad for a 300 ms process. I could do some work on my sampling methodology to get a confidence level or error bars on the estimate, but for my needs this is just about right.
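The probe-and-multiply process described above can be sketched roughly as follows. This is my own reconstruction, not the poster's code: it samples a handful of directories at each level, averages their entry counts, and multiplies the per-level means together (the function name and `probes` parameter are hypothetical).

```python
import os
import random

def estimate_by_level_sampling(root, depth, probes=12):
    """Estimate the total file count under `root` by multiplying the
    mean branching factor observed at each of `depth + 1` levels.

    Level 0 counts entries directly under `root`; the final level
    counts files in the sampled bottom-level directories."""
    estimate = 1.0
    frontier = [root]
    for level in range(depth + 1):
        # Probe a random subset of directories at this level.
        sampled = random.sample(frontier, min(probes, len(frontier)))
        counts = [len(os.listdir(d)) for d in sampled]
        estimate *= sum(counts) / len(counts)
        # Descend into the sampled directories for the next level.
        frontier = [os.path.join(d, e) for d in sampled
                    for e in os.listdir(d)]
    return estimate
```

For the layout in this thread, `depth=3` would multiply four per-level means, mirroring the 256 x 237.8 x 2.83 x 1.01 calculation. Empty directories left behind by deletions naturally pull the bottom-level mean below one, which is the behavior the poster describes.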

And a property of the table that I didn't mention makes the method I have chosen more accurate. The thing is that deletions are possible, although relatively rare. My method will include empty directories in the sample and may conclude that the mean fill level at the bottom is less than one. Estimating based on the existence of the top two levels of the directory hierarchy does not account for deletions, since the content file that created a path may no longer be there. On the other hand, some of the long traversal operations take time proportional to the complexity of the file hierarchy, not necessarily the number of actual elements in it, so both types of estimate are useful.

4. ## Re: estimating population from hash table fill rate

Within sampling and estimation there are quite a number of ideas that might interest you, but this is a large field.

One suggestion I have for these kinds of problems (and you are probably very aware of this) is to think about ways to add redundancy to the hash-table information, so that the statistical information needed for these estimates is available quickly.

With regard to deletions, you could store a word recording how many things have been deleted. Similarly, for the average, you can keep a word in the hash structure that is updated every time you make an addition.

This idea can be extended to all kinds of attributes including the number of fills for a table amongst other things.

If you can choose the right variables, maintained as part of the whole structure and updated every time you do something to the table, then estimation becomes a matter of reading those variables rather than re-calculating things and doing more sampling than necessary.

A lot of the statistical properties of the table, including the mean, variance, quartiles, deciles, percentiles, and so on, can be updated instantaneously on any modification to the hash table itself, which means you can pull up the figures straight away.
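The counter idea above can be sketched as a thin wrapper around the store that maintains its statistics on every mutation, so a size query is O(1) instead of a full tree walk. The class and its dictionary-backed store are illustrative stand-ins, not the poster's filesystem implementation:

```python
class CountingStore:
    """Content-addressed store that maintains summary counters on
    every add/delete, so population queries need no traversal."""

    def __init__(self):
        self.entries = {}   # hash digest -> content (stand-in for the tree)
        self.live = 0       # current number of stored objects
        self.deleted = 0    # total deletions ever performed

    def add(self, digest, content):
        # Re-adding existing content is a no-op for the counter.
        if digest not in self.entries:
            self.live += 1
        self.entries[digest] = content

    def delete(self, digest):
        if self.entries.pop(digest, None) is not None:
            self.live -= 1
            self.deleted += 1
```

In the filesystem setting the counters would have to live in a small metadata file and be updated transactionally with each store or delete, which is the real cost of this approach.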
