Hash Functions
The primary desire of a hash function is to distribute all data equally
Helps to avoid hash collisions:
Michael and Toby collide as the function (hash) has the same output for both
Hash functions: properties
Our proof for O(1) average runtime depends on the function output being spread out
Specifically, if the fixed length of the hash has m possible outputs, then two random inputs have probability 1/m of same output
Hash functions: properties
Our name initials hash function does not have this property
James Parker JP
This has m=26*26 = 676 possible outputs (with the English alphabet)
Hash functions: properties
However, names are not equally distributed across this space (Do you know anyone with the initials QQ? Or OI?)
The most common initial in USA is: JR, so this will have more collisions than QQ (thus not a good hash func)
Hash functions: properties
We might also want additional properties like: locality
(formally: locality-sensitive hashing)
This means if the function inputs are close, then so are the outputs Formally: |a-b|<|c-d| means |h(a)-h(b)| < |h(c)-h(d)|Hash functions: propertiesh( ) Hash functions: propertiesh( ) h( ) Hash functions: propertiesh( ) h( ) h( ) Hash functions: properties As hash functions try to put things in buckets (like bucket sort), you can actually use the same method (assuming your key k is 0<=k<1) (m is possible outputs of function)h(k) = floor(k*m) =roundDown(k*m)Hash functions: decimals Two problems with this approach:- Might need to transform numbersto decimals (can be difficult)(for initials BZ = )- Assumes inputs are randomlydistributed between 0 and 1(not true for initials… and would be largely based off first initial)Hash functions: decimals The division hash function usesthe remainder(like example last time)h(k) = k mod 10 (= k % 10)So…h(412) = 412 mod 10 = 2 h(32428374) = 32428374 mod 10 = 4 Hash functions: division While this might work, there is one major issue:- Might only use part of the number for identificationUsing:m=10 only looks at last digit m=2 only looks at odd/evenHash functions: division Depending on your choice of m, you can get very bad resultsYou could accidentally have all anagrams (different word order) have the same hash output… h(tag) = 9 h(gat) = 9h(gta) = 9 h(tga) = 9Hash functions: division Prime numbers (as by definition they are not divisible by many numbers) provide a good mNote: we will talk later about how to efficiently find prime numbers,for now just assume they are fairly easy to findHash functions: division Small example:Bad: m=4 with inputs{0,2,4,6,8,10} h(k) = 0 for {0,4,8}h(k) = 1 for {} (nothing)h(k) = 2 for {2,6,10}h(k) = 3 for {} (nothing)This is collision-tastic!Hash functions: division Small example:Good: m=5 with inputs{0,2,4,6,8,10} h(k) = 0 for {0,10}h(k) = 1 for {6}h(k) = 2 for {2}h(k) = 3 for {8}h(k) = 4 for {4}Much better!Hash functions: division The division function is annoying as your choice for m is pickyThe multiplication hash functionadds another constant to get rid ofthis issue:Abuse of notation! mod 1 means take the decimal part… so: 3.1415 mod 1 = 0.141524598.62345 mod 1 = 0.62345 Hash functions: multiplication Picking A=0.25 runs into much the same issue as our mod 4 division: Example: A=0.25, m=10input = {0,2,4,6,8,10}h(k) = 0 for inputs {0, 4, 8} h(k) = 5 for inputs {2, 6, 10} all other outputs are unusedHash functions: multiplication However, you can pick A to be some irrational number (math def) and this should prevent this issue(This irrational number should have a variety of digits, so not 0.1212…) A full analysis needs number theoryHash functions: multiplication The book suggests:Lets just go with that… m=4, inputs = {0,2,4,6,8,10} h(k) = 0 for inputs {0, 2, 10} h(k) = 1 for inputs {4}h(k) = 2 for inputs {6}h(k) = 3 for inputs {8} Hash functions: multiplication More robust hashing:m = desired hash length (any num) p = prime (where p > m)
Any A and B:
0 <= B < p 1 <= A < pHash functions: multiplication
Reviews
There are no reviews yet.