RECITATION NOTES            15-451 Algorithms            10/06/04

Topics:
 * search trees vs hashing
 * augmenting search tree data structures (if didn't do it last time)
 * another way of getting universal hashing [if time]

==========================================================================

Search trees versus hashing:
Hashing is very convenient if you just need to do inserts and lookups,
but some operations are better suited to search trees:

 - lookup-by-prefix: e.g., type in the first few letters of someone's
   name and have it output all names in the database that begin with
   those letters.
 - find-in-range[key1, key2]: output all keys between key1 and key2.
 - a version of lookup(x) that returns the closest key to x when x is
   not in the database.

#############if didn't do this last time#############

Now, what if we want to do the following operations:
 - output the kth smallest element (let's assume all keys are distinct)
 - a version of lookup(x) that tells us the rank of x (how many keys
   are smaller than x)

If we had a sorted array, these would be easy: to output the kth
smallest element, just output A[k]; to get the rank of x, binary search
for it and output which location it's in.  So this brings us to the
question: how could we do this with search trees?

Auxiliary information in search trees:
=====================================
By adding extra information to the nodes of a binary search tree, it's
often possible to dramatically expand what can be done.  However, the
data you add has to have two properties:
 (1) it is sufficient to solve the problem you want to solve, and
 (2) the values of the data in a node can be maintained efficiently.

So, here would be a *bad* way of solving the problem: have each node
store the rank of its key.  Why is this bad?  Because if you insert a
new key (that, say, is smaller than everything else) you may have to
update *everyone's* rank.
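As an aside, the sorted-array version mentioned above really is just a
few lines.  This is only a sketch of that comparison point (using
Python's standard bisect module), with the O(n) insertion cost noted as
the reason we want a tree version at all:

```python
import bisect

A = [3, 8, 15, 22, 40]          # sorted array of distinct keys

def kth_smallest(A, k):
    """kth smallest element, with k 1-indexed as in 'the kth smallest'."""
    return A[k - 1]

def rank(A, x):
    """Number of keys in A that are smaller than x."""
    return bisect.bisect_left(A, x)

print(kth_smallest(A, 2))   # -> 8
print(rank(A, 15))          # -> 2   (keys 3 and 8 are smaller)

# The catch: inserting a new key into a sorted array takes O(n) time
# to shift elements over, which is exactly what the search-tree version
# is meant to avoid.
```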
So storing ranks directly fails property (2).  What's a better way?
One thing we can do is store in each node the size of the subtree
rooted at that node.
 - How do we use this to solve the problems?
 - How do we maintain these sizes when we do an insertion or a tree
   rotation?

===================================================

Here's another method of doing universal hashing that requires less
computation than the one in lecture, and is closer to what you likely
coded up in 15-211.

The hashing scheme: assume keys come from {0,1,...,U-1}, and we hash
down to {0,1,...,M-1}.  Pick a prime p >= U (or, think of U as prime).
Define

    h_{a,b}(x) = ( (a*x + b) mod p ) mod M.

H: pick a at random from {1,...,p-1}, and b at random from {0,...,p-1}.

Claim: H is a 2-universal family.

Proof idea: the inner part spreads things out nicely in the range
0..p-1.  In particular, given x != y, (1) x is equally likely to go
anywhere in {0,...,p-1}, and (2) given where x goes, y is equally
likely to go to each of the p-1 places x didn't go (there are no
collisions at the mod-p level).  This then lets us show things stay
nice when we take the results mod M.

Actual proof: fix x != y and two locations r and s, and look at

    Pr[ (a*x + b = r mod p) AND (a*y + b = s mod p) ].

 -> Subtracting, a*x - a*y = r - s (mod p), so a = (r-s)/(x-y) mod p,
    which has exactly one solution mod p since Z_p is a field.  [We
    needed p >= U to ensure that x != y mod p.]
 -> If r = s, this forces a = 0, which isn't allowed.  So Pr = 0.
 -> If r != s, there is a 1/(p-1) chance that a has the right value.
    Given this value of a, we need b = r - a*x (mod p), and there is a
    1/p chance b gets this value, so the overall probability is
    1/[p(p-1)].

Now, the probability that x and y collide equals 1/[p(p-1)] times the
number of pairs r != s in {0,...,p-1} such that r = s (mod M).  We have
p choices for r, and then at most ceiling(p/M) - 1 choices for s (the
"-1" since we disallow s = r).
Since ceiling(p/M) <= (p + M - 1)/M, the product is at most
p*(ceiling(p/M) - 1) <= p(p-1)/M.  Putting this all together,

    Pr[ (a*x + b mod p) mod M = (a*y + b mod p) mod M ]
        <= [p(p-1)/M] * [1/(p(p-1))] = 1/M.                        QED
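The scheme above is only a few lines of code.  A minimal sketch (the
concrete prime p = 65537 >= U = 2^16 and the helper name make_hash are
my choices here, not from the notes), with a quick empirical look at
the 1/M collision bound:

```python
import random

def make_hash(p, M, rng=random):
    """Sample h_{a,b}(x) = ((a*x + b) mod p) mod M from the family H."""
    a = rng.randrange(1, p)   # a in {1,...,p-1}: a = 0 is disallowed
    b = rng.randrange(0, p)   # b in {0,...,p-1}
    return lambda x: ((a * x + b) % p) % M

# p = 65537 is prime, so this handles 16-bit keys (U = 65536 <= p).
P, M = 65537, 100
h = make_hash(P, M)

# Empirical sanity check of the bound for one fixed pair x != y:
# draw fresh (a, b) many times and count how often x and y collide.
x, y = 12345, 54321
trials = 2000
collisions = sum(1 for _ in range(trials)
                 if (g := make_hash(P, M))(x) == g(y))
print(collisions / trials)   # should come out around 1/M = 0.01 or below
```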