QUESTION 1

(A) Need to send the probabilities as well as the compressed data.
Might also have to send the original ordering of chars if not fixed.

(B) .97 bpc (using arithmetic coding)
   1.4  bpc (using Huffman)
   (I had asked for arithmetic coding, but many of you gave me Huffman)

(C)  0 bpc (using arithmetic coding) 
     1 bpc (using Huffman)

(D) Yes, adapts well to changes in type of text

(E) Yes, works better using pairs.  For example I ran it on the
    final itself (the .tex file) and got
       5.04 bpc  using single char
       4.46 bpc  using pairs
    This does not include the space for the probabilities, but the
    question said "long messages".

(F) Easy to decrypt.  Basically a substitution cipher which can be
    cracked by using probabilities of chars in the English language.


QUESTION 2

(A) The set of columns corresponding to a cycle are not linearly
    independent since they will sum to a zero vector.  This means that any
    square matrix that includes them will not be invertible.
    The basis variables for simplex must correspond to an 
    invertible set of columns.

(B) Because of the result of (A) the basis variables 
    must correspond to a tree and must include the root edge 
    (so that the total number of basis variables adds up to n).
    Only basis variable can be nonzero.

(C) A step of simplex corresponds to identifying an edge that is not
    in the current tree (i.e. that forms a cycle) and then removing
    another edge from the cycle it forms by adding it.
    There is a specialized method for selecting these edges, which is
    called the network simplex method and is significantly more efficient
    than the general simplex method.


QUESTION 3:

x_ijl  if path from i to j of length l or less
z_ijkl  if path from i to j of length l or less that first goes through k

Minimize  x_ij1 * w_ij

Subject to:
   x_ijl = x_jil                            i,j in V
   x_ijl <= x_ij(l-1) + \sum_{k=0} z_ijkl   i,j in V    l in [2,n]
   z_ijkl <= x_ik1 + x_kj(l-1) - 1          i,j,k in V  l in [2,n]
   x_ij1 = 0                                (i,j) not in E

The first equation is for symmetry.

The second says that there is a path from i of j of length l or less
only if there is either a path of length less than l or there is
a path through a neighboring node of length l or less

The third checks each neighboring node for whether there is a path
of length l or less through that vertex.

The fourth makes sure that we do not have paths of length 1 between
vertices that are not connected.

(I appologize that 1 and l look so similar in plaintext)


QUESTION 4:

Put points evenly around a circle (or arc).  Add point to middle.


QUESTION 5:

Log n - 1
(I will accept 2 log n if you did binary cuts, as in CK method)

QUESTION 6:

(A) Documents 1 and 4 (I will accept 2,3 and 5 as the converse answer)
(B) We want to solve for x in
      w = x \Sigma_2 V_2^T
    where w = [0,1,0,0,1]
    Assuming V^T V = I, we can multiply both sides by V_2 Sigma_2^{-1}
    and get
        w V_2 \Sigma_2^{-1} = x \Sigma_2 V_2^T V_2 \Sigma_2^{-1}
                            = x
    plugging in, we get x = [.1304, -.3333]
    I will also accept answers that solve directly without V^T T = I 
      assumption.
(C) The original query could be very sparse (only a couple terms) while
    the transformed query is likely to be dense (include nonzero 
    vals in all dimensions).


QUESTION 7:

(A) For splitting on variable i, the compressed space if we split on it is:
       Space(i) = H(i) + Sum_j H(j|i)
    where the j are the other remaining variables.
    Find the var that minimizes space.
    (Many of you forgot the H(i) term)

(B) Stop splitting when
       Sum_j H(j) + Space(probs) < Space(i) + Space(cprobs)
    where Space(i) is the space of the best split,
       Space(probs) is the extra space needed to store the probabilities
       for each variable, and Space(cprobs) is the space needed to
       store the probabilities of var i and the conditional probabilities
       of the other variables.
    In fact we could have included the space of the probabilities in
       part (A), but it is much more important here otherwise we would
       never stop since Sum_j H(j) < Space(i) is never true.
       Also, Space(cprobs) adds up to about the same for 
       each split choice so it would not effect which var to split on.
    Since we only have binary variables we can approximate the space
       for the probabilities as:
         Space(probs) = 2 * k * v
         Space(cprobs) = 2 * k + 4 k (v - 1)
       where k is the space to store each probability (a better measure
       might be log of the number of objects) and v is the number of
       variables.  The reason it is 4 for the conditional probs is that
       we have to store probabilities for the four cases: 00,01,10,11
    If we plug this into the above equation, we will stop splitting when
       Sum_j H(j) < Space(i) + 2 k (v - 1)
    Most of you did not consider the space needed to store
      the probabilities, which becomes significant for a small number of
      objects.
    I will give partial credit for other reasonable answers.