John Fremlin's blog: Bits of hash to avoid a clash

Posted 2012-01-05 05:40:00 GMT

Suppose you have n objects which you cannot for some reason index sequentially, but for each one you can create a hashed value, and instead of transmitting the objects you want when possible to transmit the hashed values, in order to save communication bandwidth. Now if the hashes have fewer bits than those necessary to encode all the possible objects there will certainly be collisions where a hash value does not uniquely identify an object. Suppose this is acceptable if the probability of collision is less than or equal to p. How many bits of a perfectly distributed hash does one need, or equivalently if the hash can take values from 0 to h-1 inclusive, how large should h be?

This is related to the birthday collision demonstration — in a classroom of children it is expected that two will share the same birthday.

The probability of no collision q = 1-p is 0 if h < n and otherwise the product (1 - k/h) for k from 1 to n. Now if we look at the powers of h, then this is 1 - n(n+1)/2h + a2/h2 - a3/h3 + … where the coefficients ai are functions of n. Now if we say that h > n(n+1) then we can show elementarily that ai+1 < ai by comparing the components of the constituent sum, so q > 1 - n2/2h.

Therefore given p = 1-q, the probability of collision we are willing to tolerate, we have h < n2/2p so we simply choose an h as large as n2/2p to keep us very safe.

Post a comment