15-212-ML Homework 5
Introduction
Handin
- Due Tuesday, 10-Nov-1998, at 12:00 noon (electronically)
- Maximum Points: 100 (85 correctness, 15 style)
- Hand in a single file to
/afs/andrew/scs/cs/15-212-ML/studentdir/$USER/ass5/ass5.sml
- Late homeworks will be accepted only until the start of lecture on Thursday, with a 25% penalty.
Send questions to:
- Philip Wickline designed
and wrote the model solution for the LZW [de]compressor
- Adam Megacz wrote the prose
that you're reading right now as well as the autograder.
Electronic Grading
- To make life easier for the TAs, and more predictable
for the students, 85 of the 100 points on this assignment
will be graded electronically.
- At any time after noon on Friday, 30-Oct-1998, you can
run the grading script on your own code and get an
"unofficial" report on how many of the first 85 points
you can expect to receive. You may do this by copying your
assignment to the handin directory and invoking the script
/afs/andrew/scs/cs/15-212-ML/bin/autograde-ass5. You must
be logged into one of unix**.andrew.cmu.edu under your
own userid to run this script.
- There is nothing secret about the autograding script; feel
free to examine it.
- We doubt it will be necessary, but we reserve the right
to alter the grading script after the assignments are
turned in.
- Your code MUST run properly under SML/NJ on
unix10.andrew.cmu.edu. Feel free to develop your code
with another computer/os/interpreter, but make sure that
your code works under SML/NJ before you turn it in.
- Absolutely no correctness points will be awarded aside
from those awarded by the grading script. If your code
does not compile, the grading script will not run, and
you will get a 0/85. If your code compiles but doesn't
work and the grading script says you deserve a 5/85, that
will be your score; we will not award "effort
points".
- The remaining 15/100 will be awarded based on
style. Sample style criteria include:
- Elegance
- Conciseness
- Proper documentation of all invariants not enforced by the compiler
- Do not rewrite code that is in the SML library and has been introduced
in class or recitation
The Magic of LZW
History Lesson
LZW is a simple adaptive dictionary-based compression algorithm
named after its inventors, Lempel, Ziv, and Welch. Today it is
used in several common applications, including GIF images and the
unix compress program.
Data Compression 15-999
The general concept behind most data compression programs is:
- The removal of redundancy ("5 zebras plus 3 zebras
plus 2 zebras is the same as 5 zebras plus 5 zebras if
addition is commutative over zebras" becomes
"//z=zebras//p=plus//5 z p 3 z p 2 z is the same as 5
z p 5 z if addition is commutative over z")
- Elimination of unused elements of the character space.
A file full of the binary representation of 1000 integers is
smaller than the ASCII text file containing human-readable
representations of those integers.
LZW focuses on the first compression goal. It maintains a dictionary
which maps from integers (called "codes") to the strings they
represent; initially this dictionary has 256 entries (numbered
0..255), where the integer n maps to the one-character string
consisting of the character with code n.
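As a rough illustration of that starting state, the sketch below builds the 256 initial entries as a plain list of (code, string) pairs. The name initialDict and the list representation are assumptions made for this example only; your solution should use the provided dictionary library instead.

```sml
(* Hypothetical stand-in for the initial dictionary: code n maps to the
   one-character string whose character has code n, for n = 0 .. 255. *)
val initialDict : (int * string) list =
  List.tabulate (256, fn n => (n, String.str (Char.chr n)))

(* Code 97 is the character "a" in ASCII. *)
val entry97 = List.nth (initialDict, 97)
```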
LZW's Adaptive Dictionary
However, instead of sending a dictionary ahead of time, as the zebra
example does above, LZW infers the dictionary from the stream that
it is compressing or decompressing.
How does it do this? The compressor translates a stream of
characters to a stream of integers (codes), maintaining a dictionary D. We
use D(a) to denote the numerical index for the string a in the
dictionary and write ab for the concatenation of a and b
(where each is either a character or a string).
In addition to the dictionary, the compressor maintains a string called
w and a character K. Initially, the first character of the data
stream is placed into w, and the second character into K.
The LZW algorithm
The compression loop proceeds as follows:
- Invariant: w is always present in the dictionary.
- If wK is in the dictionary, then we append K to w and place
the next character of the input stream in K.
- If wK is not in the dictionary, output D(w) on the output
stream. Enter wK into the dictionary; its index should be
the numerically smallest unused index. Now let w
be equal to K and K be equal to the next character
to be compressed.
- When the end of the stream is reached, output D(w) to the
output stream.
For this assignment we will limit the size of the dictionary to
2^16 (65,536) entries; if the dictionary fills up, your code
should simply not make any more entries into the dictionary.
This process will yield as its output a sequence of codes,
representing the compressed data stream.
Practice
Try these examples by hand. It is important that you completely master
the process of LZW compression by hand before you write your
code. All examples are over a 3-character alphabet, with "a"=0,
"b"=1, and "c"=2 initially in the dictionary.
Uncompressed | Compressed |
aabababaaa | 0,0,1,4,6,3 |
abcabbbbbcab | 0,1,2,3,1,7,4,3 |
cabcabcababcabc | 2,0,1,3,5,4,4,6,2 |
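To check your hand-worked answers, here is a minimal sketch of the compression loop over plain lists. It is not the model solution: the names dict, lookup, abcDict and compressList are hypothetical, the dictionary is an association list rather than the provided library, the input is a char list rather than a stream, and the 2^16 size limit is left out for brevity.

```sml
(* Hypothetical dictionary from strings to codes, as an association list. *)
type dict = (string * int) list

fun lookup (d : dict, s : string) : int option =
  Option.map (fn (_, code) => code) (List.find (fn (k, _) => k = s) d)

(* Initial dictionary for the practice alphabet. *)
val abcDict : dict = [("a", 0), ("b", 1), ("c", 2)]

(* compressList d0 cs compresses cs starting from dictionary d0.
   Assumes every character of cs is in d0 and that d0 uses the
   codes 0 .. length d0 - 1. The dictionary size limit is omitted. *)
fun compressList (d0 : dict) (cs : char list) : int list =
  let
    (* Invariant: w is in d, and next is the smallest unused code. *)
    fun loop (d, _, w, []) = [valOf (lookup (d, w))]
      | loop (d, next, w, k :: rest) =
          let
            val wk = w ^ String.str k
          in
            case lookup (d, wk) of
                SOME _ => loop (d, next, wk, rest)
              | NONE =>
                  valOf (lookup (d, w))
                  :: loop ((wk, next) :: d, next + 1, String.str k, rest)
          end
  in
    case cs of
        [] => []
      | c :: rest => loop (d0, length d0, String.str c, rest)
  end

val firstExample = compressList abcDict (String.explode "aabababaaa")
```

Evaluating firstExample should give the codes in the first row of the table; the other two rows can be checked the same way.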
Decompression
Decompression is just a bit more difficult. At the beginning of the
decompression, initialize the dictionary just as you did for
compression.
- Read a code off the input stream. Call this code B.
- Call the previous code read off the input stream A.
- Let L(n) denote the entry in the dictionary with index
n.
If B is a valid index into the dictionary:
- Output: Output L(B).
- Dictionary Update: Define a new entry into the
dictionary occupying the numerically smallest unused
index. The string entered into the dictionary will be
L(A)b, where b is the first character of L(B). If B is
the first code read off the input stream, do not make any
entry into the dictionary.
If B is not a valid index into the dictionary ("KwKwK case"):
- Invariant: B must be equal to the
numerically smallest unused index in the dictionary.
- Inference: If the first character of L(A) is a,
then make a new entry into the dictionary with index
B. The entry should be L(A)a.
- Output: output L(A)a onto the output stream.
Note: the description of the algorithm is imperative in nature in order
to make it easy to understand. However, the implementation of this
section should not use any mutable data structures.
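The two cases above can be sketched without mutation as follows. This is only an illustration under stated assumptions: entry, abcCodes and decompressList are hypothetical names, the dictionary is an association list, codes arrive as an int list instead of a stream, the local Error exception merely mirrors the one in the assignment's signatures, and the dictionary size limit is omitted.

```sml
exception Error of string  (* local stand-in for this sketch *)

(* Hypothetical dictionary from codes to strings, as an association list. *)
fun entry (d : (int * string) list, n : int) : string option =
  Option.map (fn (_, s) => s) (List.find (fn (k, _) => k = n) d)

(* Initial dictionary for the practice alphabet. *)
val abcCodes = [(0, "a"), (1, "b"), (2, "c")]

(* decompressList d0 codes rebuilds the text from codes.
   Assumes d0 uses the codes 0 .. length d0 - 1. *)
fun decompressList (d0 : (int * string) list) (codes : int list) : string =
  let
    (* prev is L(A); next is the smallest unused index. *)
    fun loop (_, _, _, []) = []
      | loop (d, next, prev, b :: rest) =
          (case entry (d, b) of
               SOME s =>
                 (* B is valid: output L(B), add L(A) ^ first char of L(B). *)
                 s :: loop ((next, prev ^ String.substring (s, 0, 1)) :: d,
                            next + 1, s, rest)
             | NONE =>
                 if b = next then
                   (* KwKwK case: the new entry is L(A) ^ first char of L(A). *)
                   let
                     val s = prev ^ String.substring (prev, 0, 1)
                   in
                     s :: loop ((next, s) :: d, next + 1, s, rest)
                   end
                 else raise Error "invalid code")
  in
    case codes of
        [] => ""
      | b :: rest =>
          (case entry (d0, b) of
               (* First code: output it, make no dictionary entry. *)
               SOME s => String.concat (s :: loop (d0, length d0, s, rest))
             | NONE => raise Error "invalid code")
  end

val roundTrip = decompressList abcCodes [0, 0, 1, 4, 6, 3]
```

In the practice sequence 0,0,1,4,6,3 the code 6 arrives before the decompressor has defined entry 6, so it exercises the KwKwK branch.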
Variable-length encoding
The final step is to use variable-length encoding to output the
stream of integers we have produced. At any point in the stream,
both the compressor and decompressor know the number N of
entries in the dictionary, so the largest index is N - 1. The
number of bits required to represent it is then ceil(log2 N).
To conserve space, when we output an index, we only output
that many bits. Note: it is important to realize that the
number of bits output is determined by the largest
index in the dictionary, not the index being output. This feature
will be for extra credit.
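One way to compute that width without floating-point logarithms is sketched below. The name bitsFor is hypothetical, and the function takes the number of dictionary entries n (so the indices run from 0 to n - 1) and assumes n >= 1.

```sml
(* bitsFor n: the smallest number of bits that can represent every
   index in 0 .. n - 1, found by doubling a capacity until it
   reaches n. Hypothetical helper; assumes n >= 1. *)
fun bitsFor (n : int) : int =
  let
    fun go (bits, capacity) =
      if capacity >= n then bits else go (bits + 1, capacity * 2)
  in
    go (0, 1)
  end

(* 256 entries fit in 8 bits; adding one more entry needs a 9th bit. *)
val widths = map bitsFor [256, 257, 65536]
```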
Problem 0: Setup (0/85)
We provide libraries for streams (stream.sml), stream-based file
I/O (stream-io.sml), dictionaries (dict.sml), and bit vectors
(bit-vector.sml). You should include use statements for them at
the beginning of your code:
use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/stream.sml";
use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/stream-io.sml";
use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/bit-vector.sml"; (* for extra credit *)
use "/afs/andrew/scs/cs/15-212-ML/assignments/ass5/dict.sml";
Problem 1: Stream Transducers (15/85)
Write a structure Transducer conforming to this signature (in file transducer.sml):
signature TRANSDUCER =
sig
type byte = Word8.word
type 'a stream (* Use STREAM, not BASIC_STREAM *)
exception OverFlow
exception Error of string
val byteStreamToCharStream : byte stream -> char stream
val intStreamToByteStream : int stream -> byte stream
(* Raises exception OverFlow if an int is 2^16 or more *)
(* Outputs each 16-bit int as two 8-bit bytes, least *)
(* significant byte first (little-endian order) *)
val byteStreamToIntStream : byte stream -> int stream
(* reverse of intStreamToByteStream *)
(* Raises exception Error with a description if *)
(* there are an odd number of bytes in the stream *)
val charStreamToByteStream : char stream -> byte stream
end;
Implement the byteStreamToCharStream, byteStreamToIntStream,
intStreamToByteStream, and charStreamToByteStream functions which
convert the elements of their input streams into a different
format, yielding an output stream. Style warning! There is
an extremely concise way to write two of these functions, and we
will deduct style points if you fail to recognize it. Think:
what do these functions have in common?
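For the element-level arithmetic only, here is a small sketch of splitting one code into two bytes and joining them again, with the least significant byte first as the signature comment says. The names intToBytes and bytesToInt are hypothetical, the OverFlow exception is a local stand-in, and the stream plumbing is deliberately left out.

```sml
exception OverFlow  (* local stand-in for this sketch *)

(* Split a 16-bit value into (low byte, high byte). *)
fun intToBytes (n : int) : Word8.word * Word8.word =
  if n < 0 orelse n >= 65536 then raise OverFlow
  else (Word8.fromInt (n mod 256), Word8.fromInt (n div 256))

(* Inverse of intToBytes. *)
fun bytesToInt (lo : Word8.word, hi : Word8.word) : int =
  Word8.toInt lo + 256 * Word8.toInt hi

(* 258 = 1 * 256 + 2, so the low byte is 2 and the high byte is 1. *)
val split = intToBytes 258
```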
Problem 2: De/Compression (60/85)
For this section you will be implementing an LZW compressor
structure Compression that meets this signature
(which can be found in file compress.sml):
signature COMPRESSION =
sig
type 'a stream (* Use STREAM, not BASIC_STREAM *)
structure intDict : DICT where type key = int
structure stringDict : DICT where type key = string
exception Error of string
val compress : char stream -> int stream
val compressAndShowDict : char stream -> (int * int stringDict.dict) stream
(* may raise Error on invalid input *)
val decompress : int stream -> char stream
val decompressAndShowDict : int stream -> (char * string intDict.dict) stream
end;
Note that you do not have to implement variable-length code words here.
Problem 2.1: Compression (30/85)
Implement Compression.compress and Compression.compressAndShowDict. The
latter function should yield a stream of not just the compression
codes but also the state of the dictionary after each new output
code is written. This function is required for full credit and is
essential for partial credit on section 2.1: it lets us watch you
perform the compression step by step and see what you did right.
Problem 2.2: Decompression (30/85)
Implement Compression.decompress and
Compression.decompressAndShowDict. The latter function should yield a
stream of not just the decompressed characters but also the state of
the dictionary after each new output character is written. This
function is required for full credit and is essential for partial
credit on section 2.2: it lets us watch you perform the decompression
step by step and see what you did right.
Problem 3: De/Compressing Files (10/85)
Write a structure Lzw conforming to the following signature
(which can be found in file lzw.sml):
signature LZW =
sig
exception Error of string
(* takes a file and writes the compressed version to file.mlZ *)
val compress : string -> unit
(* decompresses file.mlZ into file and raises Error("not an mlZ file")
if parameter does not end in ".mlZ" *)
val decompress : string -> unit
(* reads the file in and runs it through both the compressor and
decompressor, then writes it back to file.muZ. Hopefully the
original and munged files will be identical (if done right) *)
val munge : string -> unit
end;
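The file-name conventions in the comments above might be handled along these lines. This is a sketch only: compressedName and originalName are hypothetical helpers, the Error exception is a local stand-in for the one in the signature, and no file I/O is performed.

```sml
exception Error of string  (* local stand-in for this sketch *)

val suffix = ".mlZ"

(* Name of the compressed output for a given input file. *)
fun compressedName (file : string) : string = file ^ suffix

(* Name of the decompressed output; rejects names without the suffix. *)
fun originalName (file : string) : string =
  if String.isSuffix suffix file
  then String.substring (file, 0, size file - size suffix)
  else raise Error "not an mlZ file"

val restored = originalName (compressedName "notes.txt")
```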
Problem 4: Variable length code words (+15 points ec)
One problem with the compression system described above is that 16
bits are used for each code word all the time, even if we don't need
16 bits to distinguish that code from all other valid codes that
might be seen at a particular point.
For extra credit write structures VarCompress and VarLzw
conforming to COMPRESSION and LZW respectively, which implement
compression and decompression using variable length code words. You
should start with code words of length 9, and go up to a maximum of
16 bits. You may use the BitVector implementation provided.