Normally we store data using a fixed-length code (such as ASCII). This is simple, but giving shorter codes to more frequent characters saves space.
We examine a compression algorithm called Huffman encoding.
We can regard ASCII as a tree.
    [Figure: ASCII drawn as a binary tree of depth 8. Edges to left children are labeled 0 and edges to right children are labeled 1; the leaves include, in order, ... 0 1 2 3 ... @ A B C ...]

        0 <=> 00110000
        A <=> 01000001
Because every character is a leaf, no character's code is the beginning of another character's code. We call such a code a prefix code.
For tree T it takes depth[T](x) bits to represent x.
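To see why the prefix property matters, here is a minimal Python sketch of decoding. The code table is a small made-up example, not one from these notes, and `decode` is a hypothetical helper name. Because no code is the beginning of another, the decoder can emit a character as soon as the bits read so far match a code.

```python
# Toy prefix code (hypothetical, for illustration only).
CODES = {"a": "0", "b": "10", "c": "110", "d": "111"}

def decode(bits, codes):
    """Decode a bit string by matching one code at a time, left to right."""
    by_code = {code: ch for ch, code in codes.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in by_code:      # reached a leaf of the code tree
            out.append(by_code[current])
            current = ""            # start again at the root
    if current:
        raise ValueError("bit string ends in the middle of a code")
    return "".join(out)

encoded = "".join(CODES[ch] for ch in "badcab")
print(encoded)
print(decode(encoded, CODES))
```

Each character costs as many bits as its code is long, which is its depth in the corresponding tree.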
Given some text, compute the number of occurrences of each letter x. Call this freq(x).
PGSS is exhausting but exhilarating.
    _ 4    s 2    r 1
    i 4    n 2    l 1
    t 3    h 2    b 1
    a 3    g 2    P 1
    x 2    e 2    G 1
    u 2    S 2    . 1

(Here _ stands for a space.)
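The counting step can be sketched in a few lines of Python. This assumes the text is held in a string and uses the standard library's `Counter`; the variable names are only illustrative.

```python
from collections import Counter

text = "PGSS is exhausting but exhilarating."
freq = Counter(text)

# Print characters from most to least frequent, showing a space as "_".
for ch, n in freq.most_common():
    print("_" if ch == " " else ch, n)
```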
A prefix code tree T takes

    bits[T] = sum over characters c of freq(c) depth[T](c)

bits to represent the string.
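As a rough illustration, the sum can be computed directly once the depths are known. In this sketch `bits` is a hypothetical helper that takes two dictionaries, and the example tree is the ASCII tree, where every character sits at depth 8.

```python
from collections import Counter

def bits(freq, depth):
    """Total bits: the sum over characters c of freq[c] * depth[c]."""
    return sum(freq[c] * depth[c] for c in freq)

text = "PGSS is exhausting but exhilarating."
freq = Counter(text)

# In the ASCII tree every character is a leaf at depth 8.
ascii_depth = {c: 8 for c in freq}
print(bits(freq, ascii_depth))
```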
The Huffman encoding finds an optimal T, that is, one that minimizes bits[T].
Start with one single-node tree per character, with weight given by freq.
    _ i t a x u s n h g e S r l b P G .
    4 4 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1
Combine the two lightest trees into one tree whose weight is the sum of theirs.
                            /\
    _ i t a x u s n h g e S G . r l b P
    4 4 3 3 2 2 2 2 2 2 2 2  2  1 1 1 1
Repeat until only one tree is left.
                            /\  /\  /\
    _ i t a x u s n h g e S G . b P r l
    4 4 3 3 2 2 2 2 2 2 2 2  2   2   2

          /\
         /  \    /\
        /\  /\  S /\  /\  /\  /\
    _ i b P r l   G . g e n h u s t a x
    4 4   4      4    4   4   4   3 3 2
              -------*--------
             /                \
           --*--          ----*-----
          /     \        /          \
        --*-    *      --*--       -*-
       /    \  / \    /     \     /   \
      -*-   *  _ i   -*-    *-    *   *
     /   \ / \      /   \  /  \  / \ / \
     *   t a x      *   *  S  *  g e n h
    / \            / \ / \   / \
    u s            b P r l   G .
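The merging procedure above can be sketched with a priority queue. This is a minimal sketch, not necessarily how the notes' trees were produced: `huffman_codes` and `tiebreak` are hypothetical names, a tree is represented as either a character or a `(left, right)` pair, and `freq` is assumed to be non-empty. Ties between equal weights may be broken differently than in the figures, so individual codes can differ, but every tree built this way uses the same total number of bits.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(freq):
    """Return {character: bit string} for a Huffman tree built from freq."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    # A tree is either a character (leaf) or a (left, right) pair.
    heap = [(w, next(tiebreak), ch) for ch, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # lightest tree
        w2, _, t2 = heapq.heappop(heap)   # second lightest tree
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
    _, _, root = heap[0]

    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")   # left branch
            walk(tree[1], prefix + "1")   # right branch
        else:
            codes[tree] = prefix or "0"   # one-character alphabet
    walk(root, "")
    return codes

text = "PGSS is exhausting but exhilarating."
codes = huffman_codes(Counter(text))
total = sum(len(codes[ch]) for ch in text)
print(total)
```

The counter in each heap entry is a design choice: without it, two entries of equal weight would fall back to comparing a string with a tuple, which Python refuses to do.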
Reading each left branch as 0 and each right branch as 1, the text is encoded as follows.

    10001 10110 1010 1010 010 011 00001 010 1101 0011 1111 0010 00000
    P     G     S    S    _   i   s     _   e    x    h    a    u

    00001 0001 011 1110 1100 010 10000 00000 0001 010 1101 0011 1111 011
    s     t    i   n    g    _   b     u     t    _   e    x    h    i

    10011 0010 10010 0010 0001 011 1110 1100 10111
    l     a    r     a    t    i   n    g    .

The string takes 288 bits with ASCII (36 characters at 8 bits each), but 146 bits with the Huffman encoding: a 49% savings!
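The comparison can be checked mechanically. In the sketch below, the `CODES` table is a reading of the final tree with left branches as 0 and right branches as 1 (so S, the sibling of the G/. subtree, gets 1010), and `encode` is a hypothetical helper name.

```python
# Code table read off the final tree (left branch = 0, right branch = 1).
CODES = {
    " ": "010", "i": "011", "t": "0001", "a": "0010", "x": "0011",
    "u": "00000", "s": "00001", "S": "1010", "g": "1100", "e": "1101",
    "n": "1110", "h": "1111", "b": "10000", "P": "10001", "r": "10010",
    "l": "10011", "G": "10110", ".": "10111",
}

def encode(text, codes):
    """Concatenate the code of each character in order."""
    return "".join(codes[ch] for ch in text)

text = "PGSS is exhausting but exhilarating."
encoded = encode(text, CODES)
fixed_bits = 8 * len(text)   # fixed-length code, 8 bits per character
print(len(encoded), fixed_bits)
print(f"savings: {1 - len(encoded) / fixed_bits:.0%}")
```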
Lemma: If x and y occur least frequently, some optimal tree has x and y as siblings.
Let T be an optimal tree, and say a and b are siblings at the deepest level of T.
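    [Figure: the tree T, with leaves x and y somewhere in the tree and sibling leaves a and b at the deepest level.]

Then freq(x) <= freq(a), because x is one of the two least frequent characters, and depth[T](x) <= depth[T](a), because a is at the deepest level. Let T' be T with a and x swapped. We show that T' is at least as good as T.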
    bits[T] - bits[T'] = freq(a) (depth[T](a) - depth[T](x)) + freq(x) (depth[T](x) - depth[T](a))
                       = (freq(a) - freq(x)) (depth[T](a) - depth[T](x))
                       >= 0

So bits[T'] <= bits[T].
By the same reasoning we can swap y and b to make x and y siblings.
Lemma: Say T is optimal for alphabet A. Choose sibling leaves x and y in T with parent z. Let T' be T without x and y, so that z is a leaf of T', and let freq(z) = freq(x) + freq(y). Then T' is optimal for A-{x,y}+{z}.
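    bits[T'] = bits[T] - freq(x) depth[T](x) - freq(y) depth[T](y) + freq(z) depth[T](z)
             = bits[T] - freq(z) depth[T](x) + freq(z) depth[T](z)
             = bits[T] - freq(z)

The second step uses depth[T](x) = depth[T](y) and freq(x) + freq(y) = freq(z); the last uses depth[T](z) = depth[T](x) - 1.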
Suppose U' is better than T' for A-{x,y}+{z}, that is, bits[U'] < bits[T']. Replace z in U' with

     /\
    x  y

to get a tree U for A. We see that

    bits[U] = bits[U'] + freq(z)
            < bits[T'] + freq(z)
            = bits[T]

(The first step follows from the same relationship as that between bits[T] and bits[T'].) But bits[U] < bits[T] contradicts the optimality of T, so U' does not exist, and T' is optimal.
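Together the two lemmas say that merging the two lightest trees never loses optimality. As a sanity check rather than a proof, the sketch below compares the greedy cost with an exhaustive search over every binary tree on a small alphabet. Both function names are hypothetical. `huffman_cost` relies on the relation bits[T] = bits[T'] + freq(z) from the second lemma: each merge adds the merged weight to the total.

```python
import heapq
from functools import lru_cache
from itertools import combinations

def huffman_cost(weights):
    """bits[T] for a Huffman tree: the sum of all merged weights."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

def brute_force_cost(weights):
    """Minimum bits[T] over every binary tree with these leaves."""
    weights = tuple(weights)

    @lru_cache(maxsize=None)
    def best(leaves):
        if len(leaves) == 1:
            return 0
        first, rest = leaves[0], leaves[1:]
        options = []
        # Try every split of the leaves into a left and a right subtree;
        # keeping `first` on the left avoids trying mirror images twice.
        for k in range(len(rest)):
            for picked in combinations(rest, k):
                left = (first,) + picked
                right = tuple(i for i in rest if i not in picked)
                options.append(best(left) + best(right))
        # Hanging both subtrees under a new root adds 1 to every depth.
        return min(options) + sum(weights[i] for i in leaves)

    return best(tuple(range(len(weights))))

weights = [4, 4, 3, 3, 2, 2, 1, 1]
print(huffman_cost(weights), brute_force_cost(weights))
```

The exhaustive search grows exponentially with the alphabet size, so it is only practical for a handful of characters.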