[Solved] SENG265 C Assignment2-

$25

File Name: SENG265_C_Assignment2_.zip
File Size: 207.24 KB

SKU: [Solved] SENG265 C Assignment2- Category: Tag:
5/5 - (1 vote)

Use the C programming language to write your implementation of the

Lempel-Ziv-Welch compression/decompression algorithm, in a file named LZW.c

  • Test your code against the provided test data, and other test data you create, to ensure its working as expected. The instructor will also be testing your program against test data that isnt provided to you. Think creatively about how to break and strengthen your code!

This assignment: LZW.c

In this assignment, we will implement LempelZivWelch (LZW), a universal, lossless data compression algorithm. LZW is the algorithm of the Unix file compression utility, compress, and is used in the GIF image format. Lossless means that the compression operation preserves 100% of the information in the original format: with a correctly implemented algorithm, you can always recover the identical original data given its compressed form, even though the compressed form may be significantly smaller.

While the (glory) days when you could develop a company based on a compression algorithm are long-gone, there is still lasting practical value in knowing how to manipulate files of all types while implementing algorithms of increasing complexity, not to mention continuing to gain proficiency with array-based implementations we can achieve all of the above with a deeper dive into compression via LZW.

Broadly speaking, compression is a process of recognizing recurring patterns in some piece of information to be communicated, and replacing multiple duplicate patterns with shorter references to entries in a list of such patterns. For example, imagine you have a list of numbers on a piece of paper, and you need to read it to a friend over the phone. As youre going through the list, if you see the same number repeated enough times, youre likely to say something like 48, 3011 ten times, 189, and rely on a shared understanding of natural language grammar to distinguish between the in-band data (the numbers) and the out-of-band data (the instructions about patterns). Unfortunately, digital data doesnt have a universal grammar

general-purpose data compression algorithms that are expected to operate on any type of data must be able to learn the repeated sequences in particular files automatically. As in life, There aint no such thing as a free lunch. For files that contain random data, or patterns that are too diverse or too sparsely distributed, a general-purpose compression algorithm might even produce compressed data thats larger than the original!

LZW is a method for adaptive / dynamic compression it starts with an initial model, reads data piece by piece and then updates the model while encoding the data as it proceeds through the input file. LZW can be used for binary and text files it works on any type of file that said, it is known for performing well on files or data streams with (any type of) repeated substrings (e.g. text files, which often contain frequently used words like an, the, there, etc.).

With English text, the algorithm can be expected to offer 50% compression levels or greater. Binary data files, depending on structuring are capable of decent compression levels as well for this assignment, file size is not part of the grading criteriain fact, the evaluation relies on your code producing exact replicas of the expected compressed files so dont get too creative!

As in the previous assignment, we are interested in round-trip encoding and decoding, translating from the original format to encoded and back to original. I hope that some of the lessons from the previous assignment will become useful sources for your reuse. While the compression technique may be new, other details remain similar. Implementing a more complex algorithm shows how we might layer on complexity once we establish base skills.

Dictionary Data Structure

A dictionary is a general-purpose data structure for storing a group of objects. A dictionary has a set of keys and each key has a single associated value the elements in this abstract data structure are commonly called, key-value pairs . Dictionaries are usually implemented using hash tables, and are often known as dict, hash, map, or sometimes hashmap in different programming languages.

Binary Coding Schemes

The LZW algorithm described by Welchs 1984 article [1] encodes variable-length sequences of 8-bit bytes as fixed-length 12-bit codes.

As an example of a binary code, this toy coding scheme uses three-bit codes to represent a choice of eight possible symbols (the letters A through H). The choice of symbols is arbitrarywe could have used eight different emoji. As long as the sender and receiver have a shared understanding of what the codes mean, they can communicate using compressed data.

binary code symbol
000 A
001 B
010 C
011 D
100 E
101 F
110 G
111 H

In a binary code, each additional bit added to the code scheme doubles the number of distinct symbols that can be encodedfor an n-bit code, 2n different symbols can be represented. In the classic LZW algorithm, the choice of a 12-bit coding scheme allows for 4096 distinct symbols with binary codes between 000000000000 and 111111111111. Each binary code can also be represented as three hexadecimal digits, e.g. 000 to FFF.

The dictionary data structure to be used for this assignment has a capacity of 4096 entries (212), of which:

  • the first 256 entries, numbered 0 through 255, are reserved for every possible 1-byte value (i.e. entry #0 is for the ASCII NUL character, entry #32 is for the ASCII space character, entry #65 is for the capital letter A look at man ascii ! (press q to exit the man page reader))
  • the entries numbered 256 through 4094 are assigned dynamically by the compression algorithm to represent multi-byte sequences in the input data.
  • the last entry, numbered 4095, is reserved for a padding marker

The code that refers to an entry in the dictionary is simply the entrys numeric position, encoded as a 12-bit binary integer.

Additional Notes

Both encoding and decoding programs refer to the dictionary countless times during the algorithm. You may want to reference the hints in the Provided Materials section of this document.

The encoder looks up indexes in the dictionary by using strings. That is, to aid the look-up of indexes, we use a dictionary of strings i.e. an array of strings. With each string having some maximum number of characters, you might consider using as your data structure, a multidimensional array as discussed in Unit 3, Slide 29, this has a form similar to: e.g., X[row][column], recall that we access elements of the array through integer indexes.

Why select such an approach? A data structure with 0(1), Order-1 time complexity, can be useful in practice.

Algorithm and Example

LZW Encoding 1. Start with a standard initial dictionary of 4096 entries (each entry can be identified using a 12-bit number)2. Set w to 3. As long as there are more characters to reada. read a character k b. if wk is in the dictionary set w to wk elseoutput the code for w add wk to the dictionary, if its not full set w to k 4. output the code for w Note #1: Step 2 is: Set w to It is not:Set w to That is, you start off with w set to nothing, not a space character. So if you append t to , you end up with t (not t).Note #2: The first 256 entries (0-255) in the LZW dictionary should be the 1-byte codes whose values are 0-255. That is, dict[42][1] is 42 (and dict[42][0] is 1) dict[228][1] is 228 (and dict[228][0] is 1)In general, for 0 <= i <= 255: dict[i][1] is i (and dict[i][0] is 1)
LZW Decoding 1. Start with a standard initial dictionary of 4096 entries (the first 256 are standard ASCII, the rest are empty).2. Read a code k from the encoded file3. Output dict[k]4. Set w to dict[k]5. As long as there are more codes to read in the input filea. read a code kb. if k is in the dictionary output dict[k]add w + first character of dict[k] to the dict elseadd w + first char of w to the dict and output itc. Set w to dict[k]

The programs LZW.c takes as its first two inputs, the name of an input file as the first command line argument, and the name of an output file as the second command line argument.

If the second passed command line argument is an e, then the program creates a new file with the same name as the file in the command line, but with the extension .LZW. The new file contains the LZW encoding of the original file.

If the second passed command line argument is a d, then the program takes the name of compressed file as a command line argument (a file ending in extension, .LZW). The program interprets the original file name by dropping the .LZW file extension and then decodes the LZW encoding back to the original. (actually a copy).

The program should open both the input file and output file as binary files and make no assumption about what byte values appear in the files. Heres what their behaviour looks like:

~ /assignment2 > cat tests/encode_short_string/input.txt this_is_his_history~ /assignment2 > ./tools/b2x ./tests/encode_short_string/input.txt746869735F69735F6869735F686973746F72790A~ /assignment2 > ./LZW ./tests/encode_short_string/input.txt e~ /assignment2 > ./tools/b2x ./tests/encode_short_string/input.txt.LZW07406806907305F10205F10110310707406F07207900AFFF# different hex code means the input file and the encoded file are different~ /assignment2 > diff ./tests/encode_short_string/compare.txt.LZW./tests/encode_short_string/input.txt.LZW# no output means the files are the same# rename our file so we dont overwrite it (harder to determine if program executed correctly) mv ./tests/encode_short_string/input.txt ./tests/encode_short_string/input.old# now lets decode our compressed file, and make sure it matches the original input ~ /assignment2 > ./LZW ./tests/encode_short_string/input.txt.LZW d~ /assignment2 > cat ./tests/encode_short_string/input.txt this_is_his_history~/assignment2 > diff ./tests/encode_short_string/input.txt ./tests/encode_short_string/input.old # no output means theyre the same!~ /assignment2 >

This session shows the contents of the file input.txt. The file contains the string this_is_his_history.

The hexadecimal values of the ASCII codes are also shown. Running the program LZW (with the encode flag) on input.txt produces a file input.ext.LZW. The file input.txt.LZW contains 12-bit values for the LZW codes:

074 = t

068 = h

069 = i

073 = s

05F = _

102 = is

05F = _

101 = hi 103 = s_

etc.). The decode version of the program, LZW <filename> d, then correctly recovers the original file name from the given input, dropping .LZW from the input file name, and recovers the original data.

Requirements

The requirements for this assignment are as follows:

  1. Use arrays (and not dynamic memory allocation!) as an aggregate data type for file processing (encoding and decoding).
  2. Create a main function that:
  3. Takes two command line arguments:
    1. A string indicating the name of the file to operate on (the input file);
    2. A one-character flag indicating whether the program should operate in encoding or decoding mode. Use e for encoding mode, or d for decoding mode.

If the second argument is not provided, or is not either e or d, iii. print Invalid Usage, expected: LZW {input_file} [e | d] to the console

  1. terminate the program with exit code 4.
  2. Opens the input file indicated by the first argument.

If there is no filename specified:

  1. print Error: No input file specified! to the console
  2. terminate the program with exit code 1.

If the input filename specified doesnt exist or cant be read for any reason: iii. print Read error: file not found or cannot be read to the console iv. terminate the program with exit code 2.

  1. Opens an output file using the naming convention as specified
  2. Depending on the value of the second command-line argument, passes the string read from the input file to either the encode or decode function, as described below
  3. Create a second function called encode with the following characteristics:
    1. Takes two C files as function parameters:
      1. an input file
      2. an output file
  4. Encodes the provided file using the algorithm described above.
  5. If the file was encoded successfully:
    1. create the encoded representation to the file system ii. terminate the program with exit code 0, indicating success
  1. If the file could not be encoded for any reason (e.g. a particular input fails):
    1. print Error: File could not be encoded to the console ii. terminate the program with exit code
  1. Create a third function called decode with the following characteristics:
    1. Take two C files as function parameters:
      1. an input file
      2. an output file
  2. Decode the provided file using the algorithm described above

If your program encounters invalid, corrupted or badly formatted data that cannot be decoded according to the specified algorithm, the program should:

  1. print Error: Invalid format to the console ii. terminate the program with exit code 3.
  2. If the file was decoded successfully:
    1. write the resulting decoded representation to a file with the expected name
    2. terminate the program with exit code 0, indicating success
  1. If the file could not be decoded for any reason (other than those stated above):
    1. print Error: File could not be decoded to the console
    2. terminate the program with exit code 5

By now, youre developing your own tests to evaluate your software (we dont get what we expect, we get what we inspect) consequently, we would like to see at least one test input from YOU use the testing convention described in this assignment i.e. if your test input is expected to successfully execute, then it should follow the happy path of the algorithm. If the test input is expected to fail, then it should show a brief error message followed by exiting with error code 6.

You may also think of tweaks to the algorithms that will make the encoder/decoder run faster, or make the compression better. If so, implement these ideas, and compare. But, this is not necessary. You are encouraged to devote time to testing your program.

Provided Materials

We have created a git repository for you it contains a starter file, LZW-starter.c, a tool, b2x, a limited test suite, and these assignment instructions. There are also some hints in this section

LZW-starter.c

A starter kit for LZW.c for you to reuse in your implementation. Copy it to LZW.c and edit it, or copy and paste parts of it. The starter-file contains a few functions you are invited to use in your solution these are read12(), write12() and flush12(). It is not your aim to understand the implementation of these functions (although youre welcome to explore as you see fit) it IS your aim to understand how you might apply. these functions in your solution if you choose to apply them, it should have the effect of simplifying the solution. The use of flush12() might have an analogous application to the use of fflush(stdin) while were on the subject: fflush() might have some use in your solution as well.

b2x Tool

We created a directory /tools and placed a source file, b2x.c, that reads a binary file and writes the binhex representation of a file to the screen binhex is short for binary-to-hexadecimal its yet another encoding mechanism you can compile and use this tool if useful (it is not required). You can also use the common unix hexdump utility for similar purposes.

Test Suite

We will supply a limited suite of test cases that you can use to evaluate your own program, and get a feel for the types of inputs you can expect. You will see the tests include both text and binary files. The test suite has the same structure as the previous assignment a set of directories, with one test case provided per directory. The directory, as previously, is named according to the type of test e.g. encode_text or decode_image, where the first word denotes the mode the program should be run in while running this test.

Things are not identical to the previous assignment for example, in the case of encoding:

  • the filename passed may have any name and extension, e.g.ext (your program is expected to read that file as its input, with ext replaced, e.g. with txt or jpg) NOTE: Your program does not need to locate the appropriate file to operate on you can look at the files in the test case directory, and pass the input filename as a command line parameter!
  • Continuing the example above, if your file is expected to run successfully given its input, the directory will contain a file whose name isext. LZW where ext is the original uncompressed filename extension

NOTE: Why compare.ext? Well, we give you the input file for testing purposes; if properly run, you should get input.ext.LZW as your output if we provided you with this properly named file as output, then you could perhaps overwrite it accidentally, which would remove your ability to compare hence, we prefix the outputs with compare.

  • Your programs output is expected to precisely (bit-for-bit!) match the contents of this output file.
  • If your program is expected to terminate with a non-zero (unsuccessful) error code when given the provided input file, the directory will contain one of the following:
    • a file named err.txt, where n is the expected exit code, containing text that your programs output is expected to precisely match, or;
    • a file named err, where n is the expected exit code. In this case your output doesnt need to match any specific error message, but your program should still exit with error code n
    • NOTE: for YOUR test(s) that you provide, should the test(s) exit with an error code, the error code must be 6 .
    • In the case of decoding, we provide input.ext.LZW and also provide compare.ext again, you can derive the resulting output filename from the given input, and then compare accordingly.

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[Solved] SENG265 C Assignment2-
$25