QUESTION 1 (A) Need to send the probabilities as well as the compressed data. Might also have to send the original ordering of chars if not fixed. (B) .97 bpc (using arithmetic coding) 1.4 bpc (using Huffman) (I had asked for arithmetic coding, but many of you gave me Huffman) (C) 0 bpc (using arithmetic coding) 1 bpc (using Huffman) (D) Yes, adapts well to changes in type of text (E) Yes, works better using pairs. For example I ran it on the final itself (the .tex file) and got 5.04 bpc using single char 4.46 bpc using pairs This does not include the space for the probabilities, but the question said "long messages". (F) Easy to decrypt. Basically a substitution cipher which can be cracked by using probabilities of chars in the English language. QUESTION 2 (A) The set of columns corresponding to a cycle are not linearly independent since they will sum to a zero vector. This means that any square matrix that includes them will not be invertible. The basis variables for simplex must correspond to an invertible set of columns. (B) Because of the result of (A) the basis variables must correspond to a tree and must include the root edge (so that the total number of basis variables adds up to n). Only basis variable can be nonzero. (C) A step of simplex corresponds to identifying an edge that is not in the current tree (i.e. that forms a cycle) and then removing another edge from the cycle it forms by adding it. There is a specialized method for selecting these edges, which is called the network simplex method and is significantly more efficient than the general simplex method. QUESTION 3: x_ijl if path from i to j of length l or less z_ijkl if path from i to j of length l or less that first goes through k Minimize x_ij1 * w_ij Subject to: x_ijl = x_jil i,j in V x_ijl <= x_ij(l-1) + \sum_{k=0} z_ijkl i,j in V l in [2,n] z_ijkl <= x_ik1 + x_kj(l-1) - 1 i,j,k in V l in [2,n] x_ij1 = 0 (i,j) not in E The first equation is for symmetry. The second says that there is a path from i of j of length l or less only if there is either a path of length less than l or there is a path through a neighboring node of length l or less The third checks each neighboring node for whether there is a path of length l or less through that vertex. The fourth makes sure that we do not have paths of length 1 between vertices that are not connected. (I appologize that 1 and l look so similar in plaintext) QUESTION 4: Put points evenly around a circle (or arc). Add point to middle. QUESTION 5: Log n - 1 (I will accept 2 log n if you did binary cuts, as in CK method) QUESTION 6: (A) Documents 1 and 4 (I will accept 2,3 and 5 as the converse answer) (B) We want to solve for x in w = x \Sigma_2 V_2^T where w = [0,1,0,0,1] Assuming V^T V = I, we can multiply both sides by V_2 Sigma_2^{-1} and get w V_2 \Sigma_2^{-1} = x \Sigma_2 V_2^T V_2 \Sigma_2^{-1} = x plugging in, we get x = [.1304, -.3333] I will also accept answers that solve directly without V^T T = I assumption. (C) The original query could be very sparse (only a couple terms) while the transformed query is likely to be dense (include nonzero vals in all dimensions). QUESTION 7: (A) For splitting on variable i, the compressed space if we split on it is: Space(i) = H(i) + Sum_j H(j|i) where the j are the other remaining variables. Find the var that minimizes space. (Many of you forgot the H(i) term) (B) Stop splitting when Sum_j H(j) + Space(probs) < Space(i) + Space(cprobs) where Space(i) is the space of the best split, Space(probs) is the extra space needed to store the probabilities for each variable, and Space(cprobs) is the space needed to store the probabilities of var i and the conditional probabilities of the other variables. In fact we could have included the space of the probabilities in part (A), but it is much more important here otherwise we would never stop since Sum_j H(j) < Space(i) is never true. Also, Space(cprobs) adds up to about the same for each split choice so it would not effect which var to split on. Since we only have binary variables we can approximate the space for the probabilities as: Space(probs) = 2 * k * v Space(cprobs) = 2 * k + 4 k (v - 1) where k is the space to store each probability (a better measure might be log of the number of objects) and v is the number of variables. The reason it is 4 for the conditional probs is that we have to store probabilities for the four cases: 00,01,10,11 If we plug this into the above equation, we will stop splitting when Sum_j H(j) < Space(i) + 2 k (v - 1) Most of you did not consider the space needed to store the probabilities, which becomes significant for a small number of objects. I will give partial credit for other reasonable answers.