For Programming Project 4, you will be generating Huffman codes to compress a given string. A Huffman code uses a set of prefix code to compress the string with no loss of data (lossless).
David Huffman developed this algorithm in the paper A Method for the Construction of Minimum-Redundancy Codes (http://compression.ru/download/articles/huff/ huffman_1952_minimum-redundancy-codes.pdf)
A program can generate Huffman codes from a string using the following steps:
- Generate a list of the frequency in which characters appear in the string using a map
- Inserting the characters and their frequencies into a priority queue (sorted first by the lowest frequency and then lexicographically)
- Until there is one element left in the priority queue
- Remove two characters/frequencies pairs from the priority queue
- Turn them into leaf nodes on a binary tree
- Create an intermediate node to be their parent using the sum of the frequencies for those children
- Put that intermediate node back in the priority queue
- The last pair in the priority queue is the root node of the tree
- Using this new tree, encode the characters in the string using a map with their prefix code by traversing the tree to find where the characters leaf is. When traversal goes left, add a 0 to the code, when it goes right, add a 1 to the code
- With this encoding, replace the characters in the string with their new variable-length prefix codes
In addition to the compress string, you will need to be able to serialize the tree. Without the serialized version of the Huffman tree, you will not be able to decompress the Huffman codes.
Tree serialization will organize the characters associated with the nodes using post order.
During the post order when you visit a node,
- if it the node is a leaf (external node) then you add a L plus the character to the serialize tree string
- if it is a branch (internal node) then you add a B to the serialize tree string
For decompression, two input arguments will be needed. The Huffman Code that was generated by your compress method and the serialized tree string from your serializeTree method. Your Huffman tree will have to be built by deserializing the tree string by using the leaves and branches indicators. After you have your tree back, you can decompress the Huffman Code by tracing the tree to figure out what variable length codes represent actual characters from the original string.
So, for example, if we are compressing the string if a machine is expected to be infallible it cannot also be intelligent:
Our compress algorithm would generate the following codes for the characters:
Character | Frequency | Prefix Code |
(space) | 12 | 00 |
a | 5 | 1011 |
b | 3 | 10101 |
c | 3 | 11000 |
d | 1 | 010000 |
e | 9 | 100 |
f | 2 | 01011 |
g | 1 | 010001 |
h | 1 | 010010 |
i | 8 | 011 |
l | 6 | 1101 |
m | 1 | 010011 |
n | 6 | 1110 |
o | 3 | 11001 |
p | 1 | 010100 |
s | 2 | 10100 |
t | 6 | 1111 |
x | 1 | 010101 |
Our code would be:
011010110010110001001110111100001001001111101000001110100001000101010101001001
1000111110001000000111111001001010110000011111001011101111011101011101011101100
00011111100110001011111011101100111110010111101101001100100101011000001111101111 1001101110101101000110011101111
And our serialize tree would look like:
L LdLgBLhLmBBLpLxBLfBBLiBBLeLsLbBLaBBLcLoBLlBLnLtBBBB
You will need to create one class for this project: HuffmanTree for the compression, decompression, and serialization that uses a linked binary tree. You are given a Heap-based Priority Queue for the sorting. You are allowed to use the STL map, vector, and stack, but not the STL priority queue.
Abstract Class Methods
std::string compress(const std::string inputStr)
Compress the input string using the method explained above. Note: Typically we would be returning a number of bits to represent the code, but for this project we are returning a string
std::string serializeTree() const
Serialize the tree using the above method. We do not need the frequency values to rebuild the tree, just the characters on the leaves and where the branches are in the post order.
std::string decompress(const std::string inputCode, const std::string serializedTree)
Given a string created with the compress method and a serialized version of the tree, return the decompressed original string
Other things in Huffman hpp/cpp
To simplify the process, I have given the full interface and implementation for a class called HuffmanNode. This class has all the basics for a tree node (leaf, branch, root, data members for linking, accessor) and also includes a comparator class for use with the heap. You should not need to alter any of the code for this node.
Examples
Below are some examples of how your code will run HuffmanTree t;
string test = It is time to unmask the computing community as a Secret Society for the Creation and Preservation of Artificial
Complexity;
/*
1000101011110000101001100110000010111011001110111101111100011001001011
0100100001111001110010011101101111010110010101010111110011000001110000 1011011110101100100010111110001100001111111111001011010011001011101001
1111101111001001110011110100111101111110000111001111111111010101110110 1001100111001001110110100110010011100101011000101100111100101001110000
0111010000000100111010100111001001000110010101100010110011110101110101
1110100010001000110001010110001111000001011001011101001101011001010101
010010111101000111000011111111 */ string code = t.compress(test);
/* LiLmLnBBLrLaBLtBBLPLdBLgLkBBLALIBLvLxBBBLhLlBLCLSBBBLsLpLfBBLoBBL LeLcLuLyBBBBBB */
string tree = t.serializeTree();
/* It is time to unmask the computing community as a Secret Society for the Creation and Preservation of Artificial Complexity */ string orig = t.decompress(code, tree);
Deliverables
Please submit complete projects as zipped folders. The zipped folder should contain:
- cpp (Your written HuffmanTree class)
- hpp (Your written HuffmanTree class)
- hpp (The given Heap Priority Queue using an array/vector class)
- cpp (The provided base class and helper class)
- hpp (The provided base class and helper class)
- cpp (Test file)
- hpp (Test file)
- hpp (Catch2 Header)
And any additional source and header files needed for your project.
Hints
For the encoding step where you translate characters using your Huffman Tree, this is essentially a preordering of the tree and can be done recursively.
Remember, when you are deserializing, you are going from post ordering back to the full tree. This is very similar to the postfix to infix conversion you did in PP2, but now building a tree instead of an expression.
For decoding the characters, you just follow the tree down the branches until you hit the leaf with the character, adding a zero for a left move and adding a 1 for a right move.
I suggest implementing a recursive method to destroy nodes for your destructor.
For the branching nodes, I suggest using the null character just to hold a spot since this should not be popping up in text you are compressing.
The test cases are based on using the standard library header <map>, not the unordered map.
Additional resources: https://en.wikipedia.org/wiki/Huffman_coding https://www.youtube.com/watch?v=0kNXhFIEd_w
Reviews
There are no reviews yet.