A fundamental weakness of hashing:
For any choice of hash function, there exists a bad set of keys that all hash to the same slot.
避免方法:Randomness
The idea is that you choose a hash function at random.
The name of the scheme is universal hashing.
Universal Hashing
Def. Let U be a universe of keys and H be a finite collection of hash functions mapping U to the slots in our hash table.
H is universal: if ∀x:∀y: x∈U ∧ y∈U ∧ x≠y, | {h∈H: h(x) = h(y)} | = |H|/m
Theorem
If we choose h randomly from the set of hash functions H and then we suppose we're hashing n keys into m slots in table T. Then for given key x, the expected number of collision with x = n/m = α(the load factor of the table)
proof
Let Cx be the random variable denoting the total number of collisions of keys in T with x.
Note: E[Cxy] = 1/m and Cx =
Constructing a universal hash function
Let m be prime. Decompose key k into r + 1 digits:
k=<k0, k1, ..., kr> where 0 < ki <= m-1
Pick a = <a0, a1, ..., ar>, each ai is chosen randomly from {0, 1, ..., m-1}
Define ha(k) =
How big is H? |H| = mr+1
Theorem
H is universal
proof
Let x = <x0, x1, ..., xr>
y = <y0, y1, ..., yr> be distinct keys.
They differ in at least one digit, without loss of generality position 0. For how many ha ∈ H do x and y collide?
Must have ha(x) = ha(y).
Thus for any choices of a1, a2, ... , ar, exactly 1 of the m choices for a0 causes x and y to collide, and no collision for other m-1 choices for a0.
has that cause x, y to collides = mr = |H|/m
Number theory fact
Let m be prime for any z∈Zm(integers mod m) such that z 0, ∃ unique z-1 ∈ Zm such that z*z-1 1 (mod(m))
Perfect Hashing
Problem: Given n keys, construct a static hash table of size m = O(n) such that search takes O(1) time in the worst case.
Idea: 2-level scheme with universal hashing at both levels.
No collision at level 2.
The reason why we don't have collisions in the second level:
If there are ni item that hash to level one slot i, then we're going to use mi = ni2 in the level two hash table. Under these circumstances, it's easy to find hash functions such that there are no collisions.
Level 2 analysis
Theorem
Hash n keys into m = n2 slots, using a random hash function in a universal set H. Then the expected number of collision is less than on half.
proof
The probability that two given keys collide under h is 1/(n2).
There are n(n-1)/2 pairs of keys.
E[#collisions] = (n(n-1)/2)*(1/(n2)) = n(n-1)/(2n2) < 1/2
Markov's Inequality
For random variable X 0, Pr{X t} E[X]/t
proof:
E[x] =
Corollary
Pr{no collisions} 1/2
proof Pr{ 1 cooision} E[#collisoins]/1 < 1/2
To find a good level-2 hash function, just test a few at random. Find one quickly, since at least half will work.
Analysis of storage
For level 1, choose m = n, and let ni be the random variable for the number for #keys that hash to slot i in T. Let mi = ni2 slots in each level 2 table S sub i.
E[total storage] = n + E[] = θ(n) by bucket sort analysis.
(⇒⇔↔¬∧∨∀∃∈⊂⊃∪∩→← )