READING: Sections 11.1, 11.2, 11.3 (except 11.3.3).
SELF-TEST: Exercises 11.1-1, 11.2-1, 11.2-2.
–
Hashing
–
We saw that the dictionary ADT has an optimal asymptotic worst case on insert, delete and search of Theta(log n). BUT, that holds only if we are restricted to a decision tree model of computation, i.e.,
where we can COMPARE keys, but not use them to index into memory.
If we use a RAM model of computation, where variables can serve as indices into memory, then we can do better! That's the motivation for looking at hashing.
Problem 1: Read a text file, keeping track of the number of occurrences of
each character (ASCII codes 0 to 127).
Solution? Direct-address table: track number of occurrences of each character
in array with 128 positions (one for each character). All operations Theta(1)
and memory usage small.
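A minimal sketch of the direct-address approach in Python (the file name is just a placeholder):

    # Direct-address table: one counter per possible ASCII code (0 to 127).
    counts = [0] * 128

    with open("input.txt", "rb") as f:      # "input.txt" is a placeholder name
        for byte in f.read():
            if byte < 128:                  # skip non-ASCII bytes
                counts[byte] += 1           # Theta(1): the key itself is the index

    print(counts[ord('a')])                 # occurrences of 'a'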
Problem 2: Read a data file, keeping track of the number of occurrences of
each integer value (from 0 to 2^{32}-1).
Solution? Wasteful: use array with 2^{32} positions as above. Time Theta(1)
but memory required is huge, especially when data files are relatively small
(say approx. 10^5 different values, about 400 KB of data, out of 2^{32} (about
4*10^9) possibilities). Instead, allocate an array with 100,000 positions (for
example), and figure out how to map each integer to one of these positions:
hashing.
Definitions:
Universe of keys U (set of all possible keys).
Hash Table T = array with m positions (size m depends on application);
each array location called a slot or bucket.
Hash Function h : U -> {0, 1, ..., m-1} maps keys to array positions.
Intention: to access key k, examine T[h(k)].
When |U| > m, collisions (keys x != y with h(x) = h(y)) unavoidable.
Two issues to discuss: (1) handling collisions; (2) good hash functions.
Chaining (also called open hashing): each location of T stores a linked list of
items that hash to this location. Example: say h(k1) = h(k4) = h(k6) = 2,
h(k2) = h(k5) = 1, and h(k3) = m-1.
Worst-case running times?
      0   |/|
      1   |o|->|k5|o|->|k2|/|
      2   |o|->|k6|o|->|k4|o|->|k1|/|
          ...
    m-2   |/|
    m-1   |o|->|k3|/|

. INSERT(x): search linked list at index h(x.key), plus Theta(1) to
  replace element or insert new node.
. DELETE(x): search linked list at index h(x.key), plus Theta(1) to
  remove node.
. SEARCH(k): search linked list at index h(k); worst case Theta(n).
More precisely for SEARCH: as long as |U| > m(n-1), it is possible to pick n
keys that all hash to the same bucket, so the worst-case time is Theta(n).
(Note: if |U| <= m(n-1), some hash functions put no more than n-1 keys in
each bucket, but if |U| > m(n-1), at least one bucket must contain at
least n keys.)
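A minimal chained-table sketch in Python (Python lists stand in for the linked
lists, and hash(key) % m stands in for h; both are illustrative choices, not
part of the notes):

    class ChainedHashTable:
        def __init__(self, m):
            self.m = m
            self.table = [[] for _ in range(m)]    # one chain (list) per slot

        def _h(self, key):
            return hash(key) % self.m              # placeholder hash function

        def insert(self, key, value):
            bucket = self.table[self._h(key)]
            for i, (k, _) in enumerate(bucket):    # search the chain
                if k == key:
                    bucket[i] = (key, value)       # replace existing element
                    return
            bucket.append((key, value))            # or insert a new node

        def search(self, key):
            for k, v in self.table[self._h(key)]:  # worst case Theta(n)
                if k == key:
                    return v
            return None

        def delete(self, key):
            bucket = self.table[self._h(key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    del bucket[i]                  # Theta(1) with a real linked list
                    return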
Practical consideration: worst-case behaviour very rarely encountered in
real life applications; in fact, hashing works very well (fast) in practice.
How do we reconcile this with worst-case complexity?
Average-case running time for SEARCH with chaining.
To simplify analysis, assume any key is equally likely to hash to any bucket
(called simple uniform hashing):

    Pr[h(x)=i]  =   SUM    Pr[x]  =  1/m      for i = 0, 1, ..., m-1,
                   x in U:
                   h(x)=i

where Pr[x] is the probability of picking key x out of U.
Consequence: expected number of items the same for each bucket.
Define load factor alpha = expected number of items in each bucket
(alpha = n/m under simple uniform hashing, where n = number of items).
Random variables? N(x) = number of elements examined during search for x.
Let L_i = number of elements in bucket i (so n = SUM L_i).
Probability space? Pick x uniformly at random from U.
Under simple uniform hashing, Pr[h(x)=i] = 1/m, so the expected number of
elements examined during a search is

    E[T] =  SUM    ( Pr[x] * N(x) )
           x in U

            m-1
         =  SUM    SUM    ( Pr[x] * N(x) )        # but N(x) <= L_i when h(x) = i
            i=0   x in U:
                  h(x)=i

            m-1                            1    m-1          n
        <=  SUM ( Pr[h(x)=i] * L_i )   =  --- * SUM L_i  =  ---  =  alpha.
            i=0                            m    i=0           m

Concern: In practical applications |U| >> m. So the analysis above computes the
expectation for keys most likely _not_ in the table. Intuitively:
searching for a key not in the table requires traversing one complete
linked list with average size alpha.
Is this _really_ a problem? Runtime Theta(alpha) is already excellent; when
we ensure m > n, it means expected time Theta(1) for search (and other
operations). A more careful analysis for items already in the table can
only improve on this, not worsen it. But for the sake of completeness:
One possibility: change the probability space: pick x uniformly at random
from the elements stored in T. Then Pr[h(x)=i] = L_i/n, and the probability
that x is the j-th element in bucket i (conditioned on h(x)=i) is
simply 1/L_i (also uniform).
    E[T] =  SUM    Pr[x] * N(x)
           x in T

         # Group keys by bucket: if x is the j-th element in bucket i, a search
         # for x examines exactly j elements, so N(x) = j, and
         # Pr[x] = Pr[h(x)=i] * Pr[x is j-th in bucket i | h(x)=i].

            m-1                  L_i
         =  SUM ( Pr[h(x)=i]  *  SUM ( j * Pr[x is j-th in bucket i | h(x)=i] ) )
            i=0                  j=1

            m-1             L_i               1   m-1  L_i
         =  SUM ( L_i/n  *  SUM  j/L_i )  =  --- SUM  SUM  j
            i=0             j=1               n   i=0  j=1

             1   m-1                      1   m-1                 1   m-1
         =  --- SUM L_i(L_i+1)/2    =    --- SUM ((L_i)^2)   +   --- SUM L_i
             n   i=0                     2n   i=0                2n   i=0

             1   m-1
         =  --- SUM ((L_i)^2)  +  1/2          (since SUM L_i = n).
            2n   i=0
Unfortunately, no further simplification is possible in general. However,
under simple uniform hashing we know E[L_i] = alpha = n/m. Approximating each
L_i by this expected value (a simplification that ignores the variance of the
L_i) gives

             1   m-1                        n^2                alpha + 1
    E[T] ~  --- SUM  ((n/m)^2)  +  1/2  =  ----- + 1/2    =   -----------.
            2n   i=0                        2nm                     2
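A quick empirical check of this estimate (a sketch; it assumes Python's
built-in hash of random integer keys spreads them well enough to approximate
simple uniform hashing):

    import random

    m, n = 1000, 2000                          # load factor alpha = n/m = 2
    table = [[] for _ in range(m)]
    for k in random.sample(range(10**9), n):   # n distinct random keys
        table[hash(k) % m].append(k)

    # average number of elements examined over all successful searches
    total = sum(bucket.index(k) + 1 for bucket in table for k in bucket)
    print(total / n, "vs. predicted", (n/m + 1) / 2)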
Hash functions: finding good functions appears essential to the analysis above,
to ensure simple uniform hashing.
wanted: h(x) depends on every part/bit of x even for complex objects
wanted: h(x) spreads out values
wanted: h(x) efficient to compute
In practice, getting all three is difficult. Example: a hash function for
strings that loops over the whole string is more time-consuming than one that
looks at only a few characters of the string.
Non-numerical keys: represent as strings or bit-strings and treat as
large numbers (possibly over a large base, for strings).
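One common way to do this for strings, sketched in Python (base 256 and the
prime 100003 are illustrative choices, not prescribed above):

    def string_hash(s, m):
        # interpret the bytes of s as digits of a base-256 number, reduced mod m
        h = 0
        for byte in s.encode("utf-8"):
            h = (h * 256 + byte) % m        # reduce as we go to keep h small
        return h

    print(string_hash("hashing", 100003))   # 100003: a prime table size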
Multiplication method (good for numerical keys): h(k) = floor(m * frac(k * A)),
where A is a real constant in [0,1) and frac(x) = fractional part of x.
Example: use A = 1/phi, approximately 0.618033; see textbook for details.
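A sketch of the multiplication method in Python, using the golden-ratio
constant mentioned above (the sample keys and table size are arbitrary):

    import math

    def mult_hash(k, m, A=(math.sqrt(5) - 1) / 2):   # A = 1/phi ~ 0.618033...
        frac = (k * A) % 1.0                          # fractional part of k*A
        return math.floor(m * frac)                   # floor(m * frac(k*A))

    for k in (12345, 12346, 12347):
        print(k, "->", mult_hash(k, 1000))            # nearby keys spread apart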
Division method (most common): h(k) = k mod m
Pitfall: when m not prime, keys not obviously spread out.
Example: k mod 10 depends only on last decimal digit of k; similarly,
k mod 2^p depends only on last p bits of k.
In practice, use a prime m, but this constrains the table size.
Variation: h(k) = (a k + b) mod m, for constants a, b in {0, ..., m-1},
where m is prime. Useful in universal hashing (see textbook if
interested).
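A sketch of the division method and its (a k + b) mod m variation (the prime m
and the constants a, b below are arbitrary illustrative choices):

    m = 100003                        # a prime table size (illustrative)
    a, b = 31, 17                     # constants in {0, ..., m-1}

    def div_hash(k):
        return k % m                  # division method

    def affine_hash(k):
        return (a * k + b) % m        # variation used in universal hashing

    print(div_hash(2**20), affine_hash(2**20))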
Other considerations.
Open addressing: store all items directly in T, no chaining.
Collisions? Look for another free spot in some systematic manner.
. Linear probing: examine h(k), h(k) + 1, h(k) + 2, ... (all mod m); see the
  probe-sequence sketch after this list.
  Problem: long runs of occupied slots can build up (clustering).
. Quadratic probing: examine h(k), h(k) + 1, h(k) + 4, h(k) + 9, ... (mod m).
  Problem: keys that hash to the same slot still follow the same probe sequence.
. Double hashing: examine h1(k), h1(k) + h2(k), h1(k) + 2 h2(k), ... (mod m).
  Helps disentangle collisions: keys that collide on h1 usually follow
  different probe sequences.
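A sketch of the three probe sequences (h1(k) = k mod m and
h2(k) = 1 + (k mod (m-1)) are illustrative choices; h2 must never be 0 so the
sequence actually moves):

    m = 11                                   # small prime table size (illustrative)

    def probes_linear(k):
        return [(k % m + i) % m for i in range(m)]

    def probes_quadratic(k):
        return [(k % m + i*i) % m for i in range(m)]

    def probes_double(k):
        h1, h2 = k % m, 1 + (k % (m - 1))
        return [(h1 + i*h2) % m for i in range(m)]

    # keys 3 and 14 collide on the first probe (both hash to slot 3), but
    # double hashing sends them down different probe sequences:
    for k in (3, 14):
        print(k, probes_linear(k), probes_double(k))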
Open addressing under uniform hashing:
. unsuccessful search: expected number of probes at most 1/(1-alpha);
. successful search: expected number of probes at most (1/alpha) ln(1/(1-alpha)).
In practice open addressing works best when alpha below 0.5; chaining
works well with alpha as high as 1 (or even slightly higher).
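Plugging a few load factors into the formulas above shows why alpha below 0.5
is the usual rule of thumb for open addressing (a quick sketch):

    import math

    for alpha in (0.25, 0.5, 0.75, 0.9):
        unsucc = 1 / (1 - alpha)
        succ = (1 / alpha) * math.log(1 / (1 - alpha))
        print(f"alpha={alpha}: unsuccessful ~{unsucc:.2f} probes,"
              f" successful ~{succ:.2f} probes")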