CSE 1729 Introduction to Principles of Programming November 6, 2016 Problem Set 8
This problem set builds upon this weeks lab, in that we start by getting the frequencies of symbols (words) in a list (text). From there, we use a heap to support finding the least-weighted values. Where this differs from the lab is that we will use the symbol frequencies and heap to build a Huffman tree to encode the symbols. You will then use this Huffman tree to be able to encode and decode texts. This looks fairly complex, but can be broken down into a number of manageable tasks.
For testing, you can use any string, but you should demonstrate all of the tasks using the text of Surfin Bird, a combination of two Rivingtons song recorded by the Trashmen in 1963. The Surfin Bird text (as a Scheme string) is available in HuskyCT with this assignment in file huffman-support.rkt.
The goal of this assignment is to be able to read in a text, turn it into a list of symbols, and use the frequencies of those symbols to develop a Huffman coding scheme. Once you have built the Huffman tree, you should be able to use it to encode inputs into strings of zeroes and ones, and decode strings of zeroes and ones into texts. Some code will be provided, in particular some of the code to deal with converting strings to lists of symbols, and vice-versa. The coding scheme that you will build is a prefix code that associates a list of zeroes and ones to each symbol in its language.
There is code supplied in file huffman-support.rkt that you can use. The main things you need are (string->symbol-list str) which turns a string into a list of symbols (including newlines), and (symbol-list->string lst) which turns a list of symbols into a string. These two functions are nearly inverses of each other. There is also a defined value surfin-bird, which is a relatively long string that can be used for final testing. You will need to use some shorter strings when you are trying to debug your code.
Huffman trees are discussed in your textbook, but there are differences in how they represent the trees and how they represent coded texts (as lists rather than strings, as we do in this assignment). By all means read the Huffman tree section in the book, but be sure that your code matches the tasks given in this assignment rather than the code in the book.
Get going early on this, do lots of testing, and keep your wits about you.
- Write a Scheme procedure (word-frequencies str) that will take a string and return a list of the words and their frequencies in the string. Look familiar? This is the same as the freq-list pro- cedure from Lab 9 except its input will be a string. Hint: use the supplied string->symbol-list function to get something freq-list can act upon. word-frequencies will evaluate to a list of pairs; the first element of each pair will be a symbol corresponding to a word, the second element will be the number of times the word occurs in the string (which is the number of times the symbol occurs in the output of string->symbol-list.
Here is an example:
- In this question you will write a Scheme function which, given a list of symbols and frequencies, constructs a Huffman encoding tree. (You might want to look at Background: Constructing Huff- man trees near the end of this document) You may assume that the symbols and their frequencies are given in a list of pairs: for example, the list ((ron . 57) (doo . 21) (da . 12))
1
> (word-frequencies da doo ron ron ron da doo ron ron) ((ron . 5) (doo . 2) (da . 2))
represents the 3 symbols ron, doo, and da, with frequencies 57, 21, and 12, respectively. This is the form you should expect as output from word-frequencies in the previous question. Given such a list, you wish to compute the tree that results from the algorithm described at the end of the document. Your Huffman trees should be binary trees, where each node has a value, a left subtree, and a right subtree.
In Huffman trees, the symbols are in leaf nodes that have the form
(s () ())
where s is the symbol held by the leaf. Internal nodes do not have meaningful values: for clarity,
use the form
(internal 0-tree 1-tree)
where internal is a symbol that indicates that this is an internal node, and 0-tree and 1-tree
are the two subtrees. You could use the following code to construct an internal node:
(define (make-internal-node 0-tree 1-tree)(make-tree internal 0-tree 1-tree))
Note that when you traverse a Huffman coding tree, you can determine if a given node is an internal node by deciding if the car of the list associated with that node is equal to internal.
So how do I build this tree? Use a heap with pairs like in the lab, where the first element of the pair is a Huffman tree and the second element is its weight. Start out by making a tree (a leaf node) out of every character, make a pair of it and its frequency these will be the initial pairs in the heap. You combine pairs by making a pair where the first element is a tree made of the trees from the two pairs; the weight of this new tree is the sum of the weights. So the process works like this: remove the two minimum-weight pairs from the heap; combine them into a new pair; insert that new pair into the heap. Continue until there is only one pair left in the heap the first element of that pair is your Huffman tree. IMPORTANT: To get credit for this function you need to construct the Huffman tree using a heap as described above, not by using an ordered list as in the textbook.
(a) Write the Scheme function (combine-htree-pairs hp1 hp2) that takes two tree-weight pairs (the first element of each pair is a Huffman tree, the second element is its weight) and combines them into a new pair, where the first element of the pair is a Huffman tree whose root is internal, left subtree is the tree from hp1, and right subtree is the tree from hp2; the second element of this pair is the sum of the weights from pairs hp1 and hp2. This will be a useful function in the process described above.
Example:
> (define hpair1 (cons (make-tree doo () ()) 21)) > hpair1
(( doo ()()) . 21)
> (define hpair2 (cons (make-tree da () ()) 12)) > hpair2
((da ()()) . 12)
> (combine-htree-pairs hpair1 hpair2) ((internal (doo ()()) (da ()())) . 33)
2
(b) Write a Scheme function (build-huffman sf-list) which, given a list of symbols and frequencies, constructs a Huffman encoding tree.
Example:
- Define a Scheme function get-encoding-list that takes, as input, a Huffman coding tree and outputs a list of encodings, one for each leaf in the tree. Each encoding will be a list whose first element is the symbol and whose cdr is its code a list of zeroes and ones. For example, given the tree in the 2(b) example above, your function should return the list
((da 0 0) (doo 0 1) (ron 1))
in some order.
Remember that the code for a symbol is a coding of the path from the root to the symbols node in a Huffman coding tree, where 0 corresponds to left turns and 1 corresponds to right turns. There is more about Huffman coding trees in the Background: Constructing Huffman trees section at the end of the assignment.
- Define a Scheme function encode which takes, as input, a string and a list of symbol encoding lists (as in problem 3) and encodes the string into a list of zeroes and ones using Huffman coding. As you did before, you should use string->symbol-list to convert the string into a list of symbols to use in encoding. You may assume that every symbol in the list you get from the string is indeed the first element of one of the symbol-encoding lists. Getting the coding for the list of symbols will involve building up a list by concatenating the code lists for each symbol (think append).
> (encode doo doo doo da da da da ((da 0 0) (doo 0 1) (ron 1))) (0 1 0 1 0 1 0 0 0 0 0 0 0 0)
- Define a Scheme function (decode zo-list huff-tree which takes, as input, a list containing zeroes and ones and a Huffman coding tree and decodes the list according to the tree. This function returns a string, but it makes sense to decodes the whole list of zeroes and ones into a list of symbols, which is then converted to a string using the supplied symbol-list->string function when it is returned.
- Finally, demonstrate that your code works by running it on the Surfin bird text:
- (a) Run word-frequencies on the Surfin bird string to produce a list of word-frequency pairs.
- (b) Use build-huffman to produce a Huffman tree that encodes the words in the Surfin bird string
- (c) Use get-encoding-list on the Surfin bird Huffman tree to get a list of word-code pairs.
- (d) Use encode on the Surfin bird text to get a zero-one list that encodes it.
- (e) Use decode on the encoded Surfin bird text to get from the 0/1 string back to the original string. Note: if you (display decoded-text) you should get something that matches the original including line breaks.
3
> (build-huffman ((ron . 57) (doo . 21) (da . 12))) (internal (internal (da () ()) (doo () ())) (ron () ()))
Background: Constructing Huffman trees Recall the discussion from class on Huffman trees. In particular, to construct an optimal encoding tree for a family of symbols 1,,k with frequencies
f1,, fk, carry out the following algorithm:
- Place each symbol i into its own tree; define the weight of this tree to be fi.
- If all symbols are in a single tree, return this tree (which yields the optimal code).
- Otherwise, find the two current trees of minimal weight (breaking ties arbitrarily) and combine them into a new tree by introducing a new root node, and assigning the two light trees to be its subtrees. The weight of this new tree is defined to be the sum of the weights of the subtrees. Return to step 2.
As an example, consider Huffman encoding a long English text document:
- You would begin by computing the frequencies of each symbol in the document. This would produce a table, something like the one shown below.
Here the frequency is the number of times the symbol appeared in the document. (If you prefer, you could divide each of these numbers by the total length of the document; in that case, you could think of the frequencies as probabilities. This wont change the results of the Huffman code algorithm.)
- Following this, you can apply the Huffman code algorithm above: this will produce a Huffman code tree. The purpose of this tree is to associate a codeword (list of zeroes and ones) with each symbol. Specifically, the path from the root to a given symbol can be turned into a codeword by treating every step to a left child as a zero and every step to a right child as a one. In Figure 1 below, the symbol doo would be associated with the code (0 1) .
- Finally, you can encode the document by associating each symbol in the document with the corresponding codeword. Note that this will turn the document into a long list of zeroes and ones.
- Likewise, you can decode the encoded version of a document, by reading the encoded version of the document from left to right, and following the path it describes in the Huffman tree. Every time a symbol is reached, the process starts again at the root of the tree.
4
Symbol |
Frequency |
the |
2013 |
be |
507 |
to |
711 |
. |
. |
ron
da doo
Figure 1: A Huffman tree; the codeword associated with doo is (0 1).
Lists and bit-vectors If we really wanted to save space, rather than coding into a list of zeroes and ones we would code into a bit-vector, which is a sequence of bits in contiguous memory, something that is supported in many languages. Scheme does not have a data type for bit-vectors, nor bitwise operators, but it is possible to use positive integers (which are represented as binary numbers) to implement bit sequences.
The file bit-lists.rkt, available with the assignment, allows us to represent lists of zeroes and ones as positive numbers whose sequence of bits matches the sequence in the list. There is a small hack required, as numbers ignore leading zeroes and we need to know how many there are.
In this file, the functions are (bit-list-encode zo-list), which takes a list of zeroes and ones andturnsitintoanintegerwhosebinaryencodingisthatof(cons 1 zo-list),and(bit-list-decode num), which takes a number and turns it into an list of zeroes and ones that correspond to its binary encoding (less the first 1).
These allow us convert back and forth between zero-one lists and numbers, so if you want a compact representation of an encoding you can apply bit-list-encode to it and see how short the encoded version is, then apply bit-list-decode to that to get back the encoding that your decode function above can handle.
You could try these out between steps (d) and (e) of problem 6. The length of the number (in bits) that you get from bit-length-encode is one more than the number of elements in the zero-one list. Or you may just want to play around with this code and see what it can (or cannot) do.
5
Reviews
There are no reviews yet.