Microsoft PowerPoint 16- Indexing-p2
2018 A. Alawini & A. Parameswaran
Indexing:
B+ Tree &
Hash Tables
Doris Xin, Abdu Alawini
University of Illinois at Urbana-Champaign
CS411: Database Systems
October 24, 2018
1
2018 A. Alawini & A. Parameswaran
Announcements
HW 3: Due by Friday 10/26 (23:59)
Sign up for PT1 midterm demos: Due by Friday 10/26
(23:59)
https://wiki.illinois.edu/wiki/display/cs411sfa18/Project+Track+1
+Midterm+Demo+Signup
Midterm review session: Friday 10/26 (4:00-4:50) SC 1404
To suggest topicsto discuss in the review session, please fill this
form: https://goo.gl/forms/5fDcm80cDjmtMJ0H3
Please fill the early course feedback form:
https://goo.gl/forms/SC4BYcrDy8dai8PE2
Midterm: 10/29 in class 11-12:15 pm
2
2018 A. Alawini & A. Parameswaran
Todays lecture
Indexing
Continue with B+ Trees
Hash Tables
3
2018 A. Alawini & A. Parameswaran
What is the best part of the class so far?
4
2018 A. Alawini & A. Parameswaran
What is the least useful part of the
course so far?
5
2018 A. Alawini & A. Parameswaran
What would you change to make the
course even better?
6
2018 A. Alawini & A. Parameswaran
B+ Trees (Recap)
Multi-level index on a specific search key
Nodes contain keys and pointers
Internal nodes: point to other nodes
Leaf nodes: point to data records; last pointer to the next leaf node
7
20 60
10
10 15 18 20 30 50
15 18 20 30 50
2018 A. Alawini & A. Parameswaran
B+ Trees (Recap cont)
Each node can contain up to n keys;
All nodes have the same capacity (n max keys)
Degree d = n/2 the minimum # of keys per node
(assume n is even for simplicity)
Each node has between [d, 2d] keys at all times
Except root node, which can have only one key.
In practice, each node has its own file block
For a 4KB block, we can accommodate up to 340 keys (d=170).
In practice, d = 100, 66.5% fill-in factor 133 keys
Visiting one node = one disk read (latency ~ 105-107 ns)
First 3 levels of can be cached in main memory (latency ~ 100 ns) to
reduce disk reads
8
an artificial
requirement to
make B+ Trees
balanced
2018 A. Alawini & A. Parameswaran
B+ Trees (Recap cont)
Do B+ Trees always help?
No. e.g., an array of sorted integers.
Types of queries to answer with a B+ Tree:
Exact key value, e.g., SELECT name FROM people WHERE age=20
Range queries, e.g., SELECT name FROM people WHERE age>=20
and age<=709 2018 A. Alawini & A. Parameswaran 1 0id = 1, name = Kyle, age = 15id = 3, name = Bob, age = 20id = 10, name = Ivan, age = 22id = 12, name = Diana, age = 19id = 13, name = Ed, age = 20id = 15, name = Lucy, age = 19id = 27, name = George, age = 21id = 30, name = Helen, age = 23id = 33, name = Cathy, age = 18id = 35, name = Jennifer, age = 22id = 38, name = Alice, age = 21id = 70, name = Fred, age = 181310121315273033353870dense (entry for every record) d = 2n = 4B+ Tree search key = id 2018 A. Alawini & A. Parameswaran 1 1id = 1, name = Alice, age = 15id = 3, name = Bob, age = 20id = 10, name = Carl, age = 22id = 12, name = Diana, age = 19id = 13, name = Ed, age = 20id = 15, name = Fred, age = 19id = 27, name = Ginger, age = 21id = 30, name = Helen, age = 23id = 33, name = Ivan, age = 18id = 35, name = Jennifer, age = 22id = 38, name = Kyle, age = 21id = 70, name = Louise, age = 181310121315273033353870123080828635dense (entry for every record) d = 2n = 4B+ Tree search key = id 2018 A. Alawini & A. Parameswaran 1 2id = 1, name = Alice, age = 15id = 3, name = Bob, age = 20id = 10, name = Carl, age = 22id = 12, name = Diana, age = 19id = 13, name = Ed, age = 20id = 15, name = Fred, age = 19id = 27, name = Ginger, age = 21id = 30, name = Helen, age = 23id = 33, name = Ivan, age = 18id = 35, name = Jennifer, age = 22id = 38, name = Kyle, age = 21id = 70, name = Louise, age = 18151818191920202121222223dense (entry for every record) d = 2n = 4B+ Tree search key = age1921242522 2018 A. Alawini & A. ParameswaranThink-Pair Share Exercise13Min capacity of a data entry page is 1 entry; Min capacity of an index page is 2 pointers/1 key value. Leaf nodes (data entry pages) point to data pages; each pointer represents a record ID (RID).Write the key value index entries for the root and the intermediate nodes in the tree above. 2018 A. Alawini & A. ParameswaranHandling data changes in B+ Trees14 2018 A. Alawini & A. Parameswaran 15Insertion in a B+ Tree8020 60 100 120 14010 15 18 20 30 40 50 60 65 80 85 9010 15 18 20 30 40 50 60 65 80 85 90Insert K=19Assume d=2.DATA BLOCKS 2018 A. Alawini & A. ParameswaranInsertion in a B+ Tree168020 60 100 120 14010 15 18 19 20 30 40 50 60 65 80 85 9010 15 18 20 30 40 50 60 65 80 85 9019After insertion 2018 A. Alawini & A. ParameswaranInsertion in a B+ Tree178020 60 100 120 14010 15 18 19 20 30 40 50 60 65 80 85 9010 15 18 20 30 40 50 60 65 80 85 9019Now insert 25 2018 A. Alawini & A. ParameswaranInsertion in a B+ Tree188020 60 100 120 14010 15 18 19 20 25 30 40 50 60 65 80 85 9010 15 18 20 25 30 40 60 65 80 85 9019After insertion50 2018 A. Alawini & A. ParameswaranInsertion in a B+ Tree198020 60 100 120 14010 15 18 19 20 25 30 40 50 60 65 80 85 9010 15 18 20 25 30 40 60 65 80 85 9019But now have to split !50 2018 A. Alawini & A. ParameswaranInsertion in a B+ Tree208020 30 60 100 120 14010 15 18 19 20 25 60 65 80 85 9010 15 18 20 25 30 40 60 65 80 85 9019After the split5030 40 50 2018 A. Alawini & A. ParameswaranDeletion from a B+ Tree218020 30 60 100 120 14010 15 18 19 20 25 60 65 80 85 9010 15 18 20 25 30 40 60 65 80 85 9019Delete 305030 40 50 2018 A. Alawini & A. ParameswaranDeletion from a B+ Tree228020 30 60 100 120 14010 15 18 19 20 25 60 65 80 85 9010 15 18 20 25 40 60 65 80 85 9019After deleting 305040 50May change to 40, or not 2018 A. Alawini & A. ParameswaranDeletion from a B+ Tree238020 30 60 100 120 14010 15 18 19 20 25 60 65 80 85 9010 15 18 20 25 40 60 65 80 85 9019Now delete 255040 50 2018 A. Alawini & A. ParameswaranDeletion from a B+ Tree248020 30 60 100 120 14010 15 18 19 20 60 65 80 85 9010 15 18 20 40 60 65 80 85 9019After deleting 25Need to rebalanceRotate5040 50Rotation in general can involve either sibling, but here only the left sibling can help 2018 A. Alawini & A. Parameswaran19 20Deletion from a B+ Tree258020 30 60 100 120 14010 15 18 60 65 80 85 9010 15 18 20 40 60 65 80 85 9019 5040 50 2018 A. Alawini & A. ParameswaranDeletion from a B+ Tree268019 30 60 100 120 14010 15 18 19 20 60 65 80 85 9010 15 18 20 40 60 65 80 85 9019 5040 50 2018 A. Alawini & A. ParameswaranDeletion from a B+ Tree278019 30 60 100 120 14010 15 18 19 20 60 65 80 85 9010 15 18 20 40 60 65 80 85 9019Now delete 405040 50 2018 A. Alawini & A. ParameswaranDeletion from a B+ Tree288019 30 60 100 120 14010 15 18 19 20 60 65 80 85 9010 15 18 20 60 65 80 85 9019After deleting 40Rotation not possible5050Need to merge nodes 2018 A. Alawini & A. ParameswaranDeletion from a B+ Tree298019 60 100 120 14010 15 18 19 20 50 60 65 80 85 9010 15 18 20 60 65 80 85 9019Final tree50 2018 A. Alawini & A. ParameswaranAdvantages of B+TreesBalanced Uniform space utilization Predictable organization Predictable time (logarithmic); unbalanced can be linear in worst case Good for range queries30Can we do better? 2018 A. Alawini & A. ParameswaranOutlineIndexingB+ Trees Hash Tables31 2018 A. Alawini & A. Parameswaran32Hash Tables 2018 A. Alawini & A. Parameswaran33Hash TablesSecondary storage hash tables are much like main memory onesRecall basics:There are n bucketsA hash function f(k) maps a key k to {0, 1, , n-1}Store in bucket f(k) a pointer to record with key kSecondary storage: bucket = blockStore in bucket f(k) any record with key kuse overflow blocks when needed 2018 A. Alawini & A. Parameswaran34Assume 1 bucket (block) stores 2 recordsh(e)=0h(b)=h(f)=1h(g)=2h(a)=h(c)=3Hash Table Exampleebfgac0123 2018 A. Alawini & A. Parameswaran35Search for a:Compute h(a)=3Read bucket (block) 31 disk accessSearching in a Hash Tableebfgac0123Main memory may have an array of pointers (to buckets) accessible by bucket number. 2018 A. Alawini & A. Parameswaran36Place in right bucket (block), if spaceE.g. h(d)=2Insertion in Hash Tableebfgdac0123 2018 A. Alawini & A. Parameswaran37Create overflow block, if no spaceE.g. h(k)=1More over-flow blocksmay be neededInsertion in Hash Tableebfgdac0123k 2018 A. Alawini & A. Parameswaran38Hash Table PerformanceFixed number of bucketsExcellent, if no overflow blocksDegrades considerably when there are many overflow blocks. Might need to go through a chain of overflow blocksCan fix this by allowing the number of buckets to grow 2018 A. Alawini & A. Parameswaran39Extensible Hash Table Array of pointers to blocks instead of array of blocks Size of array is allowed to grow. 2x size when it grows Dont need a block per bucket. Sparse buckets share a block Hash function returns k-bit integers (e.g., k=32) Only use the first i << k bits to determine bucket Number of buckets = 2i Bit counter on each block indicates how much are used for that block0(010)1(011)i = 11101BUCKETSDATA BLOCKSBit counter 2018 A. Alawini & A. Parameswaran40Insertion in Extensible Hash TableInsert 1110 0(010)1(011)1(110)i=1 1101BUCKETS DATA BLOCKS 2018 A. Alawini & A. Parameswaran41Insertion in Extensible Hash Table Now insert 1010 Need to split block and extend bucket array i becomes 2: done in two steps0(010)1(011)1(110), 1(010)i=1 1101BUCKETS DATA BLOCKS 2018 A. Alawini & A. Parameswaran42Insertion in Extensible Hash TableStep 1: Extend the buckets0(010)1(011)1(110)i=1 1101BUCKETS DATA BLOCKS0(010)10(11)11(10)i=2 1100011011BUCKETS DATA BLOCKS 2018 A. Alawini & A. Parameswaran43Insertion in Extensible Hash TableStep 2: Now try to insert 10100(010)10(11)10(10)i=2 120001101111(10) 2BUCKETS DATA BLOCKS 2018 A. Alawini & A. Parameswaran44Insertion in Extensible Hash TableDone0(010)10(11)10(10)i=2 120001101111(10) 2BUCKETS DATA BLOCKS 2018 A. Alawini & A. Parameswaran45Insertion in Extensible Hash Table Now insert 0000: where would it go? Then 0101? Need to split block, but not bucket array0(010)10(11)10(10)i=2 120001101111(10) 2BUCKETS DATA BLOCKS 2018 A. Alawini & A. Parameswaran46Insertion in Extensible Hash Table Now insert 0000: where would it go? Then 0101? Need to split block, but not bucket array0(010)0(000), 0(101)10(11)10(10)i=2 120001101111(10) 2BUCKETS DATA BLOCKS 2018 A. Alawini & A. Parameswaran47Insertion in Extensible Hash Table Now insert 0000: where would it go? Then 0101? Need to split block, but not bucket array10(11)10(10)i=220001101111(10) 2BUCKETS DATA BLOCKS00(10)00(00)201(01) 2 2018 A. Alawini & A. Parameswaran48Performance: Extensible Hash Table No overflow blocks: access always one read for distinct keys BUT: Extensions can be costly and disruptive After an extension bucket table may no longer fit in memory Imagine three records whose keys share the first 20 bits. These three records cannot be in same block (assume two records per block). But a block split would require setting i = 20, i.e., accommodating for 2^20 = 1 million buckets, even though there may be only a few hundred records. 2018 A. Alawini & A. Parameswaran49Linear Hash Table Idea 1: add only one bucket at a timeProblem: n = no longer a power of 2 Let i be # bits necessary to address n buckets. i = ceil(log2 n) After computing h(k), use last i bits: If last i bits represent a number >= n, change msb from 1 to 0 (get a number
< n) Idea 2: allow overflow blocks (not expensive to overflow) Convention: Read from the right (as opposed to the left) 2018 A. Alawini & A. Parameswaran50Linear Hash Table Example N=3 <= 22 = 4 Therefore, only buckets until 10 When inserting 0111, 11 is flipped => 01
(01)00
(10)10
i = 2
n = 3
r = 4
00
01
10
(01)01
(01)11 BIT FLIP
2018 A. Alawini & A. Parameswaran
51
Linear Hash Table Example
Insert 1001:
(01)00
(10)10
00
01
10
(01)01
(01)11
i = 2
n = 3
r = 4
2018 A. Alawini & A. Parameswaran
52
Linear Hash Table Example
Insert 1001: overflow blocks
(01)00
(10)10
00
01
10
(01)01
(10)01
(01)11
i = 2
n = 3
r = 5
2018 A. Alawini & A. Parameswaran
53
Linear Hash Tables
Extend n n+1 when average number of records per bucket exceeds
(say) 80% of total number of records per block
e.g., r/n <= 0.8 * 2 = 1.6 (for block size = 2) Until then, use overflow blocks (cheaper than adding buckets)r/n = 5/3 = 1.67 > 1.6
Time to add a bucket
(01)00
(10)10
00
01
10
(01)01
(10)01
(01)11
i = 2
n = 3
r = 5
2018 A. Alawini & A. Parameswaran
54
Linear Hash Table Extension
From n=3 to n=4
Only need to touch
one block (which one ?)
(01)00
(10)10
00
01
10
(01)01
(10)01
(01)11
00
01
10
11
i = 2
n = 3
r = 5
i = 2
n = 4
r = 5
(01)11
(01)00
(10)10
(01)01
(10)01
(01)11
2018 A. Alawini & A. Parameswaran
55
Linear Hash Table Extension
From n=3 to n=4 finished
(01)01
(10)01
(01)11
00
01
10 (10)10
(01)00
11
i = 2
n = 4
r = 5
r/n = 5/4 = 1.25 < 1.6 2018 A. Alawini & A. Parameswaran56Summary B+ Trees (search, insertion, deletion) Good for point and range queries Log time lookup, insertion and deletion because of balanced tree Hash Tables (search, insertion) Static hash tables: one I/O lookup, unless long chain of overflow Extensible hash tables: one I/O lookup, extension can take long Linear hash tables: ~ one I/O lookup, cheaper extensionNo panacea; dependent on data and use case 2018 A. Alawini & A. Parameswaran57Index 2.0 Learn the best index from the data and queries logs Machine Learning to the rescue! Recall, an index is a function Machine learning are good at learning functions from data Whats your cool idea for a better index?
Reviews
There are no reviews yet.