15-212-ML Homework 5

Introduction

Handin

Due Tuesday, 10-Nov-1998, at 12:00 noon (electronically)
Maximum Points: 100 (85 correctness, 15 style)
Hand in a single file to /afs/andrew/scs/cs/15-212-ML/studentdir/$USER/ass5/ass5.sml
Late homeworks will be accepted only until the start of lecture on Thursday, with a 25% penalty.

Send questions to:

Philip Wickline designed and wrote the model solution for the LZW [de]compressor
Adam Megacz wrote the prose that you're reading right now as well as the autograder.

Electronic Grading

To make life easier for the TAs, and more predictable for the students, 85 of the 100 points on this assignment will be graded electronically.
At any time after noon on Friday, 30-Oct-1998, you can run the grading script on your own code and get an "unofficial" report on how many of the first 85 points you can expect to receive. You may do this by copying your assignment to the handin directory and invoking the script /afs/andrew/scs/cs/15-212-ML/bin/autograde-ass5. You must be logged into one of the unix**.andrew.cmu.edu machines under your own userid to run this script.
There is nothing secret about the autograding script; feel free to examine it.
We doubt it will be necessary, but we reserve the right to alter the grading script after the assignments are turned in.
Your code MUST run properly under SML/NJ on unix10.andrew.cmu.edu. Feel free to develop your code with another computer/os/interpreter, but make sure that your code works under SML/NJ before you turn it in.
Absolutely no correctness points will be awarded aside from those awarded by the grading script. If your code does not compile, the grading script will not run, and you will get a 0/85. If your code compiles but doesn't work and the grading script says you deserve a 5/85, that will be your score; we will not award "effort points".
The remaining 15/100 will be awarded based on style. Sample style criteria include:
- Elegance
- Conciseness
- Proper documentation of all invariants not enforced by the compiler
- Do not rewrite code that is in the SML library and has been introduced in class or recitation

The Magic of LZW

History Lesson

LZW is a simple, adaptive, dictionary-based compression algorithm named after its inventors, Lempel, Ziv, and Welch. Today it is used in several common applications, including GIF images and the unix compress program.

Data Compression 15-999

The general concepts behind most data compression programs are:

- Removal of redundancy ("5 zebras plus 3 zebras plus 2 zebras is the same as 5 zebras plus 5 zebras if addition is commutative over zebras" becomes "//z=zebras//p=plus//5 z p 3 z p 2 z is the same as 5 z p 5 z if addition is commutative over z")
- Elimination of unused elements of the character space. A file full of the binary representation of 1000 integers is smaller than the ASCII text file containing human-readable representations of those integers.

LZW focuses on the first compression goal. It maintains a dictionary which maps from integers (called "codes") to the strings they represent; initially this dictionary has 256 entries (numbered 0..255), where the integer n maps to the one-character string consisting of the character with code n.
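As a small illustration, the initial dictionary can be written down directly. This sketch assumes a plain association list in place of the provided dict.sml library, and the names initialDict and entry97 are hypothetical.

```sml
(* Assumption: an association list stands in for the provided dictionary. *)
(* Code n maps to the one-character string whose character has code n.    *)
val initialDict : (int * string) list =
  List.tabulate (256, fn n => (n, str (chr n)))

(* Example lookup: code 97 is paired with the string "a". *)
val entry97 = List.find (fn (code, _) => code = 97) initialDict
```

Here entry97 evaluates to SOME (97, "a").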

LZW's Adaptive Dictionary

However, instead of sending a dictionary ahead of time, as the zebra example does above, LZW infers the dictionary from the stream that it is compressing or decompressing.

How does it do this? The compressor translates a stream of characters to a stream of integers (codes), maintaining a dictionary D. We use D(a) to denote the numerical index for the string a in the dictionary and write ab for the concatenation of a and b (where each is either a character or a string). In addition to the dictionary, the compressor maintains a string called w and a character K. Initially, the first character of the data stream is placed into w, and the second character into K.

The LZW algorithm

The compression loop proceeds as follows:

Invariant: w is always present in the dictionary.
If wK is in the dictionary, then we append K to w and place the next character of the input stream in K.
If wK is not in the dictionary, output D(w) on the output stream. Enter wK into the dictionary; its index should be the numerically smallest unused index. Now let w be equal to K and K be equal to the next character to be compressed.
When the end of the stream is reached, output D(w) to the output stream.

For this assignment we will limit the size of the dictionary to 2¹⁶ entries; if the dictionary fills up, your code should simply not make any more entries into the dictionary.

This process will yield as its output a sequence of codes, representing the compressed data stream.
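For illustration only, the loop above can be sketched over plain lists, with an association list standing in for the dictionary. The names lookup and compressList are hypothetical, and the sketch deliberately avoids the provided stream and dictionary libraries, so it is not in the form the assignment asks for.

```sml
(* Hypothetical helper: find the index of string s in an association list. *)
fun lookup (_ : string, []) = NONE
  | lookup (s, (k, v) :: rest) = if s = k then SOME v else lookup (s, rest)

(* dict must already contain every one-character string that occurs in
   input; next is the numerically smallest unused index. *)
fun compressList (dict : (string * int) list, next : int)
                 (input : char list) : int list =
  let
    val limit = 65536                          (* at most 2^16 entries *)
    (* Invariant: w is always present in d, so valOf cannot fail. *)
    fun code (d, w) = valOf (lookup (w, d))
    fun loop (d, _, w, []) = [code (d, w)]     (* end of input: output D(w) *)
      | loop (d, n, w, k :: ks) =
          let
            val wk = w ^ str k
          in
            case lookup (wk, d) of
              SOME _ => loop (d, n, wk, ks)    (* wK is known: extend w *)
            | NONE =>
                let
                  (* Enter wK unless the dictionary is already full. *)
                  val (d', n') =
                    if n < limit then ((wk, n) :: d, n + 1) else (d, n)
                in
                  code (d, w) :: loop (d', n', str k, ks)
                end
          end
  in
    case input of
      [] => []
    | c :: cs => loop (dict, next, str c, cs)
  end

(* A three-character alphabet with "a"=0, "b"=1, and "c"=2. *)
val abc = [("a", 0), ("b", 1), ("c", 2)]
val codes = compressList (abc, 3) (explode "aabababaaa")
```

With this alphabet, codes evaluates to [0, 0, 1, 4, 6, 3].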

Practice

Try these examples by hand. It is important that you completely master the process of LZW compression by hand before you write your code. All examples are over a 3-character alphabet, with "a"=0, "b"=1, and "c"=2 initially in the dictionary.

Uncompressed	Compressed
aabababaaa	0,0,1,4,6,3
abcabbbbbcab	0,1,2,3,1,7,4,3
cabcabcababcabc	2,0,1,3,5,4,4,6,2

Decompression

Decompression is just a bit more difficult. At the beginning of the decompression, initialize the dictionary just as you did for compression.

Read a code off the input stream. Call this code B.
Call the previous code read off the input stream A.
Let L(n) denote the entry in the dictionary with index n.

If B is a valid index into the dictionary:

Output: Output L(B).
Dictionary Update: Define a new entry into the dictionary occupying the numerically smallest unused index. The string entered into the dictionary will be L(A)b, where b is the first character of L(B). If B is the first code read off the input stream, do not make any entry into the dictionary.

If B is not a valid index into the dictionary ("KwKwK case"):

Invariant: B must be equal to the numerically smallest unused index in the dictionary.
Inference: If the first character of L(A) is a, then make a new entry into the dictionary with index B. The entry should be L(A)a.
Output: output L(A)a onto the output stream.

Note: the description of the algorithm is imperative in nature in order to make it easy to understand. However, the implementation of this section should not use any mutable data structures.
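The two cases above can be sketched without mutable state by threading the dictionary through a recursive function. This is a sketch over lists and strings rather than the required streams; the names entry and decompressList are hypothetical, and an association list again stands in for the provided dictionary.

```sml
exception Error of string

(* Hypothetical helper: L(i), the string stored at index i, if any. *)
fun entry (_ : int, []) = NONE
  | entry (i, (k, s : string) :: rest) =
      if i = k then SOME s else entry (i, rest)

fun decompressList (dict : (int * string) list, next : int)
                   (codes : int list) : string =
  let
    val limit = 65536                          (* at most 2^16 entries *)
    fun add (d, n, s) = if n < limit then ((n, s) :: d, n + 1) else (d, n)
    (* prev is L(A), the string for the previously read code. *)
    fun loop (_, _, _, []) = []
      | loop (d, n, prev, b :: bs) =
          (case entry (b, d) of
             SOME s =>                         (* B is a valid index *)
               let
                 val (d', n') = add (d, n, prev ^ str (String.sub (s, 0)))
               in
                 s :: loop (d', n', s, bs)
               end
           | NONE =>                           (* KwKwK case *)
               if b <> n orelse n >= limit then raise Error "invalid code"
               else
                 let
                   val s = prev ^ str (String.sub (prev, 0))
                   val (d', n') = add (d, n, s)
                 in
                   s :: loop (d', n', s, bs)
                 end)
  in
    case codes of
      [] => ""
    | b :: bs =>
        (* The first code makes no dictionary entry. *)
        (case entry (b, dict) of
           SOME s => String.concat (s :: loop (dict, next, s, bs))
         | NONE => raise Error "invalid first code")
  end

val abcCodes = [(0, "a"), (1, "b"), (2, "c")]
val text = decompressList (abcCodes, 3) [0, 0, 1, 4, 6, 3]
```

Here text evaluates to "aabababaaa"; the code 6 in that input exercises the KwKwK case.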

Variable-length encoding

The final step is to use variable-length encoding to output the stream of integers we have produced. At any point in the stream, both the compressor and the decompressor know the number of entries in the dictionary. If the dictionary holds n entries, its largest index is n - 1, and the number of bits required to represent that index is ceil(log₂ n). To conserve space, when we output an index, we output only that many bits. Note: it is important to realize that the number of bits output is determined by the largest index in the dictionary, not by the index being output. This feature will be for extra credit.
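One way to compute the bit count without floating-point logarithms is to count the binary digits of the largest index. The function name bitsNeeded is hypothetical; this is a sketch, not part of the provided libraries.

```sml
(* Number of bits needed to write any index in the range 0..maxIndex. *)
fun bitsNeeded (maxIndex : int) : int =
  if maxIndex <= 1 then 1 else 1 + bitsNeeded (maxIndex div 2)

val eight = bitsNeeded 255   (* largest index of the initial dictionary *)
val nine  = bitsNeeded 256   (* after the first new entry is added *)
```

So a dictionary whose largest index is 255 needs 8 bits per code, and one whose largest index is 256 needs 9.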

Problem 0: Setup (0/85)

We provide libraries for streams (stream.sml), stream-based file I/O (stream-io.sml), dictionaries (dict.sml), and bit vectors (bit-vector.sml). You should include use statements for them at the beginning of your code:

    use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/stream.sml";
    use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/stream-io.sml";
    use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/bit-vector.sml"; (* for extra credit *)
    use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/dict.sml";

Problem 1: Stream Transducers (15/85)

Write a structure Transducer conforming to this signature (which can be found in file transducer.sml):

    signature TRANSDUCER =
    sig
	type byte = Word8.word
	type 'a stream                (* Use STREAM, not BASIC_STREAM *)
	exception OverFlow
	exception Error of string

	val byteStreamToCharStream    : byte stream -> char stream      

	val intStreamToByteStream     : int stream -> byte stream
	(* Raises exception OverFlow if an int does not fit *)
	(* in 16 bits. Outputs each 16-bit int as two 8-bit *)
	(* bytes, least significant byte first (that is,    *)
	(* little-endian, not network byte order)           *)

	val byteStreamToIntStream     : byte stream -> int stream
	(* reverse of intStreamToByteStream                 *)
	(* Raises exception Error with a description if     *)
	(* there are an odd number of bytes in the stream   *)

	val charStreamToByteStream    : char stream -> byte stream      
    end;

Implement the byteStreamToCharStream, byteStreamToIntStream, intStreamToByteStream, and charStreamToByteStream functions which convert the elements of their input streams into a different format, yielding an output stream. Style warning! There is an extremely concise way to write two of these functions, and we will deduct style points if you fail to recognize it. Think: what do these functions have in common?
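The arithmetic behind the int/byte conversion can be sketched on a single value, leaving the stream plumbing aside. The helper names intToBytes and bytesToInt are hypothetical. The sketch follows the "least significant byte first" wording of the signature comment and assumes that any int outside 0..65535 counts as an overflow.

```sml
exception OverFlow

(* Split one int into (low byte, high byte).
   Assumption: anything outside 0..65535 is treated as an overflow. *)
fun intToBytes (n : int) : Word8.word * Word8.word =
  if n < 0 orelse n > 65535 then raise OverFlow
  else (Word8.fromInt (n mod 256), Word8.fromInt (n div 256))

(* Inverse: rebuild the int from (low byte, high byte). *)
fun bytesToInt (lo : Word8.word, hi : Word8.word) : int =
  Word8.toInt lo + 256 * Word8.toInt hi

val pair = intToBytes 258             (* 258 = 1 * 256 + 2 *)
val back = bytesToInt pair
```

Here pair is (0w2, 0w1) and back is 258 again.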

Problem 2: De/Compression (60/85)

For this section you will be implementing an LZW compressor structure Compression that meets this signature (which can be found in file compress.sml):

    signature COMPRESSION =
    sig
	type 'a stream                (* Use STREAM, not BASIC_STREAM *)
	structure intDict    : DICT where type key = int
	structure stringDict : DICT where type key = string

        exception Error of string

	val compress                  : char stream -> int stream
	val compressAndShowDict       : char stream -> (int * int stringDict.dict) stream

        (* may raise Error on invalid input *)
	val decompress                : int stream -> char stream
	val decompressAndShowDict     : int stream -> (char * string intDict.dict) stream
    end;

Note that you do not have to implement variable-length code words for this problem.

Problem 2.1: Compression (30/85)

Implement Compression.compress and Compression.compressAndShowDict. The latter function should yield a stream of not just the compression codes but also the state of the dictionary after each new output code is written. This function is required for full credit and is essential in order for us to grant you partial credit on section 2.1; it allows us to watch you perform the compression step-by-step so we can determine what you did right and give you partial credit for it.

Problem 2.2: Decompression (30/85)

Implement Compression.decompress and Compression.decompressAndShowDict. The latter function should yield a stream of not just the decompressed characters but also the state of the dictionary after each new output character is written. This function is required for full credit and is essential in order for us to grant you partial credit on section 2.2; it allows us to watch you perform the decompression step-by-step so we can determine what you did right and give you partial credit for it.

Problem 3: De/Compressing Files (10/85)

Write a structure Lzw conforming to the following signature (which can be found in file lzw.sml):

     signature LZW = 
     sig
       exception Error of string
       (* takes a file and writes the compressed version to file.mlZ *)
       val compress : string -> unit

       (* decompresses file.mlZ into file and raises Error("not an mlZ file")
	  if parameter does not end in ".mlZ" *)
       val decompress : string -> unit

       (* reads the file in and runs it through both the compressor and
	  decompressor, then writes it back to file.muZ. Hopefully the
	  original and munged files will be identical (if done right) *)
       val munge : string -> unit
     end;
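The file-name check described in the decompress comment might look like the following sketch. The helper name stripMlZ is hypothetical; String.isSuffix and String.substring come from the Standard ML Basis Library.

```sml
exception Error of string

(* Hypothetical helper: drop a trailing ".mlZ", or raise Error if the
   name does not end in ".mlZ". *)
fun stripMlZ (name : string) : string =
  if String.isSuffix ".mlZ" name
  then String.substring (name, 0, size name - 4)
  else raise Error "not an mlZ file"

val original = stripMlZ "notes.txt.mlZ"
```

Here original is "notes.txt", while stripMlZ "notes.txt" raises Error.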

Problem 4: Variable-length code words (+15 points extra credit)

One problem with the compression system described above is that 16 bits are used for each code word all the time, even if we don't need 16 bits to distinguish that code from all other valid codes that might be seen at a particular point.

For extra credit, write structures VarCompress and VarLzw conforming to COMPRESSION and LZW respectively, which implement compression and decompression using variable-length code words. You should start with code words of length 9, and go up to a maximum of 16 bits. You may use the BitVector implementation provided.
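To make the idea of a fixed-width code word concrete, here is a sketch that writes a code as a list of bits of a chosen width. The name toBits is hypothetical, a bool list stands in for the provided BitVector (whose interface is not shown here), and the most-significant-bit-first order is an assumption rather than something the assignment specifies.

```sml
(* Write code as exactly width bits, most significant bit first.
   Assumption: code fits in width bits; higher bits are silently dropped. *)
fun toBits (width : int) (code : int) : bool list =
  let
    fun go (0, _, acc) = acc
      | go (w, c, acc) = go (w - 1, c div 2, (c mod 2 = 1) :: acc)
  in
    go (width, code, [])
  end

val nineBits = toBits 9 258          (* 258 is 100000010 in binary *)
```

Here nineBits is [true, false, false, false, false, false, false, true, false].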