READING: Sections 11.1, 11.2, 11.3 (except 11.3.3).
SELF-TEST: Exercises 11.1-1, 11.2-1, 11.2-2.

——-
Hashing
——-

We saw that the dictionary ADT has an optimal asymptotic worst case of Theta(log n) per INSERT, DELETE and SEARCH (so Theta(n log n) for a sequence of n operations). BUT, that was if we are restricted to a "decision tree" model of computation, i.e.,
where we can COMPARE variables, but not use variables to index into memory.

If we use a RAM model of computation, where variables can serve as indices into memory, then we can do better! That’s the motivation for looking at hashing.

Problem 1: Read a text file, keeping track of the number of occurrences of
each character (ASCII codes 0 – 127).

Solution? Direct-address table: track number of occurrences of each character
in array with 128 positions (one for each character). All operations Theta(1)
and memory usage small.
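A minimal Python sketch of the direct-address solution, assuming the file is read as raw bytes (the path argument is just a placeholder):

    def count_chars(path):
        counts = [0] * 128          # one slot per ASCII code 0..127
        with open(path, "rb") as f:
            for byte in f.read():   # iterating a bytes object yields ints
                if byte < 128:      # ignore any non-ASCII bytes
                    counts[byte] += 1
        return counts

    # Example use: counts[ord('a')] is the number of 'a' characters in the file.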

Problem 2: Read a data file, keeping track of the number of occurrences of
each integer value (from 0 to 2^{32}-1).

Solution? Wasteful: use an array with 2^{32} positions as above. Time Theta(1),
but the memory required is huge (2^{32}, about 4*10^9, counters), especially when
the data file contains relatively few distinct values (say approx. 10^5, which
would fit in roughly 400 KB of 4-byte counters). Instead, allocate an array
with 100,000 positions (for example), and figure out how to map each
integer to one position — "hashing".
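In practice, Python's built-in dict is itself a hash table, so the counting task in Problem 2 can be written directly; a minimal sketch, assuming one unsigned 32-bit integer per line (the file format and path are assumptions):

    def count_values(path):
        counts = {}                          # hash table: value -> occurrences
        with open(path) as f:
            for line in f:
                k = int(line)
                counts[k] = counts.get(k, 0) + 1
        return counts

The rest of these notes look at what such a table does under the hood.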

Definitions:
– Universe of keys U (set of all possible keys).
– Hash Table T = array with m positions (size m depends on application);
each array location called a “slot” or “bucket”.
– Hash Function h : U -> {0,1,…,m-1} maps keys to array positions.

Intention: to access key k, examine T[h(k)].
When |U| > m, “collisions” (keys x != y with h(x) = h(y)) unavoidable.
Two issues to discuss: (1) handling collisions; (2) good hash functions.

– Chaining (also called "open hashing" or "closed addressing"): each location of
  T stores a linked list of the items that hash to this location. Example: say
  h(k1) = h(k4) = h(k6) = 2, h(k2) = h(k5) = 1, and h(k3) = m-1.

Worst-case running times?

        ---
   0   |/|
        ---    -------    -------
   1   |o|-->|k5 | o |-->|k2 | / |
        ---    -------    -------    -------
   2   |o|-->|k6 | o |-->|k4 | o |-->|k1 | / |
        ---
   :
        ---
  m-2  |/|
        ---    -------
  m-1  |o|-->|k3 | / |
        ---    -------

  . INSERT(x): search the linked list at index h(x.key), plus Theta(1) to
    replace the element or insert a new node.

  . DELETE(x): search the linked list at index h(x.key), plus Theta(1) to
    remove the node.

  . SEARCH(k): search the linked list at index h(k); worst case Theta(n).

More precisely for SEARCH: as long as |U| > m(n-1), it is possible to pick n
keys that all hash to the same bucket, giving time Theta(n).
(Note: if |U| <= m(n-1), some hash functions put no more than n-1 keys in
each bucket, but if |U| > m(n-1), at least one bucket must contain at
least n keys.)
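A minimal Python sketch of chaining (class and method names are mine, not from the text); each slot of T holds a list of (key, value) pairs and h is the division-method hash applied to Python's built-in hash:

    class ChainedHashTable:
        def __init__(self, m=100_000):
            self.m = m
            self.T = [[] for _ in range(m)]     # m empty buckets

        def _h(self, key):
            return hash(key) % self.m           # assumed hash function

        def insert(self, key, value):
            bucket = self.T[self._h(key)]
            for i, (k, _) in enumerate(bucket): # search the bucket's list
                if k == key:
                    bucket[i] = (key, value)    # replace existing element
                    return
            bucket.append((key, value))         # or add a new node

        def search(self, key):
            for k, v in self.T[self._h(key)]:
                if k == key:
                    return v
            return None                         # not found

        def delete(self, key):
            bucket = self.T[self._h(key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket.pop(i)               # Theta(1) after the search
                    return

Each operation's cost is dominated by the scan of one bucket, matching the worst cases above.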

Practical consideration: worst-case behaviour very rarely encountered in
“real life” applications; in fact, hashing works very well (fast) in practice.
How do we reconcile this with worst-case complexity?

Average-case running time for SEARCH with chaining.

– To simplify analysis, assume any key is equally likely to hash to any bucket
  (called "simple uniform hashing"):

      Pr[h(x)=i]  =  SUM_{x in U, h(x)=i} Pr[x]  =  1/m      for i = 0, 1, ..., m-1.

# what is Pr[x]? Probability of picking x out of U

Consequence: expected number of items the same for each bucket.

– Define "load factor" alpha = n/m, where n = number of items; under simple
  uniform hashing this equals the expected number of items in each bucket.

– Random variables? N(x) = number of elements examined during search for x.

– Let L_i = number of elements in bucket i (so n = SUM L_i).

– Probability space? Pick x uniformly at random from U.

– Under simple uniform hashing, Pr[h(x)=i] = 1/m, so

# E[T] below is the expected number of elements examined when searching for a key x drawn at random from U (with distribution Pr[x])

    E[T]  =  SUM_{x in U} ( Pr[x] * N(x) )

          =  SUM_{i=0..m-1}  SUM_{x in U, h(x)=i} ( Pr[x] * N(x) )      # but N(x) <= L_i for each i

         <=  SUM_{i=0..m-1} ( Pr[h(x)=i] * L_i )  =  (1/m) SUM_{i=0..m-1} L_i  =  n/m  =  alpha.

– Concern: In practical applications |U| >> m. So the analysis above computes the
  expectation for keys most likely _not_ in the table. Intuitively:
  searching for a key not in the table requires traversing one complete
  linked list with average size alpha.

– Is this _really_ a problem? Runtime Theta(alpha) is already excellent; when
  we ensure m > n, then alpha < 1 and the expected time is Theta(1) for search
  (and other operations). More careful analysis — for items already in the
  table — still gives expected time Theta(1 + alpha). But for the sake of
  completeness…

– One possibility: Change the probability space — pick x uniformly at random
  from the elements stored in T. Then Pr[h(x)=i] = L_i/n, and the probability
  that x is the j-th element in bucket i (conditioned on h(x)=i) is
  simply 1/L_i (also uniform).

    E[T]  =  SUM_{x in T} Pr[x] * N(x)

# explain this step: condition on the bucket i containing x, then on x's
# position j within that bucket; finding the j-th element of a list examines
# exactly j nodes, so N(x) = j there

          =  SUM_{i=0..m-1} ( Pr[h(x)=i] * SUM_{j=1..L_i} ( j * Pr[x is j-th in bucket i] ) )

          =  SUM_{i=0..m-1} ( L_i/n * SUM_{j=1..L_i} j/L_i )

          =  (1/n) SUM_{i=0..m-1} SUM_{j=1..L_i} j

          =  (1/n) SUM_{i=0..m-1} L_i(L_i+1)/2

          =  (1/(2n)) SUM_{i=0..m-1} (L_i)^2  +  (1/(2n)) SUM_{i=0..m-1} L_i

          =  (1/(2n)) SUM_{i=0..m-1} (L_i)^2  +  1/2

Unfortunately, no further simplification is possible in general. However,
under simple uniform hashing, we know E[L_i] = alpha = n/m, so

# Check this first substitution

    E[T]  =  (1/(2n)) SUM_{i=0..m-1} (n/m)^2  +  1/2  =  n^2/(2nm) + 1/2  =  (alpha + 1)/2.

Caution (checking that first substitution): it replaces E[(L_i)^2] by
(E[L_i])^2, which is not valid in general (for alpha < 1 it even gives a value
below 1, impossible for a successful search). If each key hashes independently
and uniformly, L_i is Binomial(n, 1/m), so
E[(L_i)^2] = Var[L_i] + (E[L_i])^2 = (n/m)(1 - 1/m) + (n/m)^2, and the sum
works out to E[T] = 1 + alpha/2 - 1/(2m). Either way, the expected cost of a
successful search is Theta(1 + alpha): Theta(1) whenever m grows in proportion
to n.
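A quick way to sanity-check the corrected value is to simulate it; this is my own sketch with arbitrary parameters, using the division-method hash on random distinct 32-bit keys:

    import random

    def avg_successful_search_cost(n=50_000, m=100_000):
        keys = random.sample(range(2**32), n)   # n distinct random 32-bit keys
        T = [[] for _ in range(m)]
        for k in keys:
            T[k % m].append(k)                  # chaining, division-method hash
        total = 0
        for k in keys:
            total += T[k % m].index(k) + 1      # nodes examined to find k
        alpha = n / m
        return total / n, 1 + alpha / 2 - 1 / (2 * m)   # measured vs. formula

With n = 50,000 and m = 100,000 (alpha = 0.5), both returned numbers come out close to 1.25.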

Hash functions: finding good hash functions is essential for the analysis above,
since it relies on (approximately) simple uniform hashing.

– wanted: h(x) depends on every part/bit of x — even for complex objects
wanted: h(x) “spreads out” values
wanted: h(x) efficient to compute

– In practice, getting all three is difficult. Example: hash functions for
  strings loop over the whole string — this is more time-consuming than
  looking only at a few characters of the string.

– Non-numerical keys: represent as strings or bit-strings and treat as
large numbers (possibly over a large base, for strings).

– Multiplication method (good for numerical keys): h(k) = floor(m * frac(k*A)),
  where A is a real constant with 0 < A < 1 and frac(x) = fractional part of x.
  Example: use A = 1/phi = 0.618033… — see textbook for details, and the code
  sketch after this list.

– Division method (most common): h(k) = k mod m
Pitfall: when m not prime, keys not obviously “spread out”.
Example: k mod 10 depends only on last decimal digit of k; similarly,
k mod 2^p depends only on last p bits of k.
In practice, use prime m — but this constrains table size.

– Variation: h(k) = (a k + b) mod m, for constants a, b in {0,…,m-1},
  where m is prime. Useful in "universal hashing" (see textbook if
  interested).
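A hedged sketch of the three constructions above for integer keys; the values of m, A, a and b are illustrative choices, not prescribed by the text:

    import math

    M = 99_991                      # a prime table size for the division method
    A = (math.sqrt(5) - 1) / 2      # 1/phi = 0.618033..., suggested constant

    def h_division(k, m=M):
        return k % m                # pitfall avoided by choosing m prime

    def h_multiplication(k, m=2**14, a=A):
        frac = (k * a) % 1.0        # fractional part of k*A
        return int(m * frac)        # floor(m * frac(k*A)); m need not be prime

    def h_affine(k, m=M, a=12345, b=6789):
        return (a * k + b) % m      # a, b arbitrary constants in {0,...,m-1}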

Other considerations.

– Open addressing: store all items directly in T — no chaining.
  Collisions? Look for another free slot in some systematic manner (all probe
  indices taken mod m); a linear-probing sketch follows this list.
  . Linear probing: examine h(k), h(k) + 1, h(k) + 2, …
    Problem: long runs of occupied slots build up ("primary clustering").
  . Quadratic probing: examine h(k), h(k) + 1, h(k) + 4, …
    Problem: keys with the same initial slot still share one probe sequence
    ("secondary clustering").
  . Double hashing: examine h1(k), h1(k) + h2(k), h1(k) + 2 h2(k), …
    Helps "disentangle" collisions, since the probe sequence depends on the
    key through both h1 and h2.
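A minimal linear-probing sketch (names are mine); a production table would also need deletion via tombstones and resizing to keep alpha low:

    class LinearProbingTable:
        EMPTY = object()                        # sentinel marking a free slot

        def __init__(self, m=16):
            self.m = m
            self.T = [self.EMPTY] * m

        def _h(self, key, i):
            return (hash(key) + i) % self.m     # i-th probe: h(k) + i, mod m

        def insert(self, key):
            for i in range(self.m):
                j = self._h(key, i)
                if self.T[j] is self.EMPTY or self.T[j] == key:
                    self.T[j] = key             # free slot, or key already there
                    return j
            raise RuntimeError("table full")    # alpha = 1: every probe failed

        def search(self, key):
            for i in range(self.m):
                j = self._h(key, i)
                if self.T[j] is self.EMPTY:
                    return None                 # empty slot ends the probe sequence
                if self.T[j] == key:
                    return j
            return None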

– Open addressing under uniform hashing (alpha < 1):
  . an unsuccessful search expects at most 1/(1-alpha) probes;
  . a successful search expects at most (1/alpha) ln(1/(1-alpha)) probes.
  Example: with alpha = 0.9 that is about 10 probes per unsuccessful search,
  but only about 2.6 per successful search.

– In practice open addressing works best when alpha below 0.5; chaining
works well with alpha as high as 1 (or even slightly higher).
