8. Universal Hashing and Perfect Hashing

A fundamental weakness of hashing:
For any fixed choice of hash function, there exists a bad set of keys that all hash to the same slot.
The remedy: randomness.
The idea is to choose the hash function at random, independently of the keys, so that no fixed set of keys is bad for every choice.
This scheme is called universal hashing.

Universal Hashing

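Def. Let U be a universe of keys, and let H be a finite collection of hash functions, each mapping U to the slots {0, 1, ..., m-1} of our hash table.
H is universal if, for all x, y ∈ U with x ≠ y, | {h ∈ H : h(x) = h(y)} | = |H|/m
In other words, if h is chosen uniformly at random from H, the probability that x and y collide is 1/m.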

Theorem

Choose h uniformly at random from a universal set of hash functions H, and use it to hash n keys into the m slots of table T. Then for any given key x in T, the expected number of collisions with x is less than n/m = α (the load factor of the table).

proof

Let C_x be the random variable denoting the total number of collisions of keys in T with x, and let
$C_{xy}=\begin{cases} 1, & h(x) = h(y)\\ 0, & \text{otherwise} \end{cases}$
Note: E[C_xy] = 1/m by universality, and C_x = $\sum_{y\in T\setminus\{x\}}{C_{xy}}$
By linearity of expectation, E[C_x] = $\sum_{y\in T\setminus\{x\}}{E[C_{xy}]}$ = (n-1)/m < n/m

Constructing a universal hash function

Let m be prime. Decompose key k into r + 1 digits in base m:
k = <k₀, k₁, ..., k_r> where 0 ≤ k_i ≤ m-1
Pick a = <a₀, a₁, ..., a_r>, where each a_i is chosen independently and uniformly at random from {0, 1, ..., m-1}
Define h_a(k) = $\left(\sum_{i = 0}^r{a_ik_i}\right) \bmod m$
How big is H = {h_a}? |H| = m^(r+1), one function for each choice of a.
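The construction above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the helper names `decompose` and `make_hash`, the choice m = 7, r = 2, and the fixed seed are all assumptions made for the example.

```python
import random

def decompose(key, m, r):
    """Write a non-negative integer key as r + 1 base-m digits <k_0, ..., k_r>,
    least significant digit first."""
    digits = []
    for _ in range(r + 1):
        digits.append(key % m)
        key //= m
    if key != 0:
        raise ValueError("key does not fit in r + 1 base-m digits")
    return digits

def make_hash(m, r, rng):
    """Pick a = <a_0, ..., a_r> at random and return the function h_a."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(key):
        k = decompose(key, m, r)
        return sum(ai * ki for ai, ki in zip(a, k)) % m

    return h

rng = random.Random(0)            # fixed seed only to make the sketch reproducible
h = make_hash(m=7, r=2, rng=rng)  # m must be prime; keys must be below 7**3
slots = [h(k) for k in range(20)]
```

Each call to `make_hash` draws one member of H; the randomness lives entirely in the choice of `a`, not in the keys.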

Theorem

H is universal

proof

Let x = <x₀, x₁, ..., x_r> and y = <y₀, y₁, ..., y_r> be distinct keys.

They differ in at least one digit; without loss of generality, position 0. For how many h_a ∈ H do x and y collide?
A collision requires h_a(x) = h_a(y).
$\Rightarrow\sum_{i=0}^r{a_ix_i}\equiv\sum_{i=0}^r{a_iy_i} \pmod m$
$\Rightarrow\sum_{i=0}^r{a_i(x_i-y_i)}\equiv 0 \pmod m$
$\Rightarrow a_0(x_0-y_0) + \sum_{i=1}^r{a_i(x_i-y_i)}\equiv 0 \pmod m$
$\Rightarrow a_0(x_0-y_0) \equiv -\sum_{i=1}^r{a_i(x_i-y_i)} \pmod m$
Since x₀ ≠ y₀ and m is prime, (x₀ - y₀) has an inverse mod m (see the number theory fact below), so
$\Rightarrow a_0 \equiv \left(-\sum_{i=1}^r{a_i(x_i-y_i)}\right)\cdot(x_0-y_0)^{-1} \pmod m$
Thus for any choice of a₁, a₂, ..., a_r, exactly 1 of the m choices for a₀ causes x and y to collide, and the other m-1 choices cause no collision.
Number of h_a that cause x and y to collide = m^r · 1 = m^r = |H|/m
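For small parameters the counting argument can be checked by brute force. The sketch below assumes m = 5 and r = 1 (so |H| = 25) and represents keys directly as digit tuples; the function names are illustrative only.

```python
from itertools import product

def h(a, k, m):
    """h_a(k) = (sum of a_i * k_i) mod m for digit tuples a and k."""
    return sum(ai * ki for ai, ki in zip(a, k)) % m

def colliding_functions(x, y, m):
    """Count the vectors a in {0..m-1}^(r+1) for which h_a(x) == h_a(y)."""
    return sum(1 for a in product(range(m), repeat=len(x))
               if h(a, x, m) == h(a, y, m))

m, r = 5, 1
family_size = m ** (r + 1)                      # |H| = 25
keys = list(product(range(m), repeat=r + 1))    # every possible key, as digits

# Collect the collision count over every pair of distinct keys.
counts = {colliding_functions(x, y, m) for x in keys for y in keys if x != y}
print(counts == {family_size // m})             # True: always exactly |H|/m
```

Every pair of distinct keys collides under exactly m^r = 5 of the 25 functions, which matches the proof.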

Number theory fact

Let m be prime. For any z ∈ Z_m (the integers mod m) such that z $\not\equiv$ 0, there exists a unique z^-1 ∈ Z_m such that z·z^-1 $\equiv$ 1 (mod m)
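A quick exhaustive check makes the fact concrete. The sketch below lists every inverse for the prime m = 7, then shows that the claim can fail for a composite modulus (6 is used here purely as a counterexample); `inverses` is a hypothetical helper.

```python
def inverses(z, m):
    """Return every w in Z_m with z * w congruent to 1 (mod m)."""
    return [w for w in range(m) if (z * w) % m == 1]

m = 7
table = {z: inverses(z, m) for z in range(1, m)}
print(table)            # {1: [1], 2: [4], 3: [5], 4: [2], 5: [3], 6: [6]}
print(inverses(2, 6))   # []: 2 has no inverse mod 6, since 6 is not prime
```

Each nonzero z has exactly one inverse mod 7, which is why the proof of universality needs m to be prime.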

Perfect Hashing

Problem: Given n keys, construct a static hash table of size m = O(n) such that search takes O(1) time in the worst case.
Idea: a 2-level scheme with universal hashing at both levels, with no collisions at level 2.
Why level 2 can be collision-free:
If n_i items hash to level-1 slot i, we use m_i = n_i² slots in the corresponding level-2 hash table. With a table this sparse, it is easy to find a hash function that causes no collisions.
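The two-level scheme can be sketched as follows. Note one substitution: level-2 table sizes n_i² are generally not prime, so this sketch does not use the dot-product family from earlier. It uses the well-known family h(k) = ((a·k + b) mod p) mod m, whose collision probability is at most 1/m. The prime `P`, the class name `PerfectHashTable`, and the helper `random_hash` are assumptions of the example, and keys are assumed to be integers in [0, P).

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1; assumes every key lies in [0, P)

def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m with random a != 0 and b."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda k: ((a * k + b) % P) % m

class PerfectHashTable:
    """Static set of integer keys with worst-case O(1) membership tests."""

    def __init__(self, keys, seed=0):
        rng = random.Random(seed)
        keys = list(set(keys))
        self.m = max(1, len(keys))              # level 1 uses m = n slots
        self.h1 = random_hash(self.m, rng)
        buckets = [[] for _ in range(self.m)]
        for k in keys:
            buckets[self.h1(k)].append(k)

        self.level2 = []
        for bucket in buckets:
            size = len(bucket) ** 2             # m_i = n_i^2 slots
            while True:                         # retry until collision-free
                h2 = random_hash(size, rng) if size else None
                slots = [None] * size
                ok = True
                for k in bucket:
                    j = h2(k)
                    if slots[j] is not None:    # collision: draw a new h2
                        ok = False
                        break
                    slots[j] = k
                if ok:
                    break
            self.level2.append((h2, slots))

    def __contains__(self, key):
        h2, slots = self.level2[self.h1(key)]
        return bool(slots) and slots[h2(key)] == key

table = PerfectHashTable([3, 17, 42, 99, 1000])
print(17 in table, 5 in table)  # True False
```

A lookup evaluates two hash functions and makes one comparison, so it takes O(1) time in the worst case. This sketch does not re-draw the level-1 function when the level-2 tables come out large, so its storage is O(n) only in expectation.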

Level 2 analysis

Theorem

Hash n keys into m = n² slots using a random hash function from a universal set H. Then the expected number of collisions is less than one half.

proof

By universality, the probability that two given keys collide under h is 1/m = 1/n².
There are n(n-1)/2 pairs of keys.
By linearity of expectation, E[#collisions] = (n(n-1)/2)·(1/n²) = n(n-1)/(2n²) = (n-1)/(2n) < 1/2

Markov's Inequality

For a random variable X $\geq$ 0 and any t > 0, Pr{X $\geq$ t} $\leq$ E[X]/t
proof (for discrete X taking non-negative integer values):
E[X] = $\sum_{x=0}^\infty{x\cdot\Pr\{X = x\}} \geq \sum_{x=t}^\infty{x\cdot\Pr\{X = x\}}$
$\geq \sum_{x=t}^\infty{t\cdot\Pr\{X = x\}} = t\cdot\Pr\{X \geq t\}$
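As a small sanity check of the inequality, the sketch below compares the exact tail probability with the Markov bound for a fair six-sided die. The die is an arbitrary example chosen here, not part of the original notes.

```python
from fractions import Fraction

# X is the outcome of a fair six-sided die, so X >= 0.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())       # E[X] = 7/2

def tail(t):
    """Exact Pr{X >= t}."""
    return sum(p for x, p in pmf.items() if x >= t)

# Pair each threshold t with (exact tail probability, Markov bound E[X]/t).
checks = {t: (tail(t), mean / t) for t in range(1, 7)}
print(all(exact <= bound for exact, bound in checks.values()))  # True
```

For t = 4, for instance, the exact tail is 1/2 while the bound is 7/8. The bound is loose but always valid.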

Corollary

Pr{no collisions} $\geq$ 1/2
proof: Pr{$\geq$ 1 collision} $\leq$ E[#collisions]/1 < 1/2, by Markov's inequality with t = 1.
To find a good level-2 hash function, just test a few at random. One will be found quickly, since at least half of the functions in H work.
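The claim that at least half of the candidate functions work can be estimated empirically. The sketch below hashes n = 10 keys into n² = 100 slots and measures the fraction of randomly drawn functions that are collision-free. Because 100 is not prime, it assumes the family h(k) = ((a·k + b) mod p) mod m, whose collision probability is at most 1/m; the prime `P`, the seed, the key range, and the trial count are all arbitrary choices for the illustration.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1; assumes every key lies in [0, P)

def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m with random a != 0 and b."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda k: ((a * k + b) % P) % m

def is_collision_free(h, keys):
    """True if h maps all keys to distinct slots."""
    seen = set()
    for k in keys:
        slot = h(k)
        if slot in seen:
            return False
        seen.add(slot)
    return True

rng = random.Random(1)                      # fixed seed for reproducibility
keys = rng.sample(range(10**6), 10)         # n = 10 distinct keys
m = len(keys) ** 2                          # n^2 = 100 slots
trials = 2000
good = sum(is_collision_free(random_hash(m, rng), keys) for _ in range(trials))
fraction = good / trials                    # estimate of Pr{no collisions}
```

The corollary guarantees the true probability is at least 1/2, so the expected number of draws before finding a collision-free function is at most 2.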

Analysis of storage

For level 1, choose m = n, and let n_i be the random variable for the number of keys that hash to slot i in T. Use m_i = n_i² slots in each level-2 table S_i.
E[total storage] = n + E[ $\sum_{i=0}^{m-1}{\Theta(n_i^2)}$ ] = Θ(n), by the same analysis as for bucket sort.
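One way to see the Θ(n) bound, sketched here under the assumption that the level-1 function is drawn from a universal set with m = n: write n_i² = n_i + 2·C(n_i, 2), and note that the sum of C(n_i, 2) over all slots counts the pairs of keys that collide at level 1. Each of the C(n, 2) pairs collides with probability 1/m.

$$
\begin{aligned}
E\left[\sum_{i=0}^{m-1} n_i^2\right]
&= E\left[\sum_{i=0}^{m-1} n_i\right] + 2\,E\left[\sum_{i=0}^{m-1}\binom{n_i}{2}\right] \\
&= n + 2\binom{n}{2}\cdot\frac{1}{m} \\
&= n + (n-1) \qquad (m = n) \\
&= 2n - 1 < 2n
\end{aligned}
$$

So the expected level-2 storage is less than 2n slots, and the expected total storage is Θ(n).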
