15-451 Algorithms                                              9/23/04

* Search trees                              (Quiz next Tues.  30 min.
   - simple binary search trees              1 sheet of notes allowed)
   - B-trees
   - treaps
=====================================================================

For the next few lectures we'll be looking at several important
data structures.  A data structure is a method for storing data so
that the operations you care about can be performed quickly.

Defn: a DICTIONARY data structure is a DS that lets you perform the
operations insert and lookup (and maybe delete).

E.g., maybe we are the phone company and we have a database of people
and their phone numbers (plus addresses, billing information, and so
on).  As new people come in, we'd like to be able to insert them into
our database.  And, given a name x, we'd like to be able to do
lookup(x) to find their phone number and other info.

Specifically, the operations in a Dictionary data structure are:

  - insert(key, object)  // e.g., word and definition, name and telephone #, etc.
  - lookup(key)          // returns the object
  - delete(key)          // might or might not care about this

One option: a sorted array.  Then lookup takes O(log n) time using
binary search.  But insert may take Omega(n) time in the worst case,
because we have to shift everything to the right in order to make room
for the new key.  How about an unsorted array?  Then insert is O(1)
but lookup may take Omega(n).  Last time we saw a DS that consisted of
an unsorted *set* of sorted arrays, where insert was O(log n)
amortized and search was O(log^2 n).

Today: search trees.  A binary search tree is a binary tree in which
each node stores a (key, object) pair such that all descendants to the
left have smaller keys and all descendants to the right have larger
keys (let's not worry about the case of multiple equal keys).  To do a
lookup operation you simply walk down from the root, going left or
right depending on whether the query is smaller or larger than the key
in the node, until you get to the correct key or walk off the tree.
We will also talk about non-binary search trees that potentially have
more than one key in each node, and whose nodes may have more than two
children.

For the rest of this discussion, we'll ignore the "object" part of
things.  We'll just worry about the keys, since that's all that
matters for understanding the data structures.

SIMPLE BINARY SEARCH TREES
==========================
The simplest way to maintain a binary search tree is to implement the
insert operation as follows.

  insert(x): walk down the tree as if doing a lookup, and insert x
             into a new leaf at the end.

E.g., how about inserting:  C A R N E G I E'   (where E' > E).

  plusses: easy to do (deletes are slightly painful)

  minuses: bad worst-case behavior.  In fact, it behaves exactly like
           quicksort using the leftmost element as the pivot.  If the
           elements arrive in sorted order, we get a very unbalanced
           tree in which operations take time Omega(n).
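As a concrete illustration, here is a minimal Python sketch of this
simple (unbalanced) insert and lookup.  The names are made up for this
example, and the repeated E in "CARNEGIE" plays the role of E'.

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def insert(root, key):
        # Walk down as in a lookup and hang the new key off the spot we fall out of.
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:                      # ties go right, so the second E acts like E' > E
            root.right = insert(root.right, key)
        return root

    def lookup(root, key):
        while root is not None and root.key != key:
            root = root.left if key < root.key else root.right
        return root is not None

    root = None
    for ch in "CARNEGIE":
        root = insert(root, ch)
    print(lookup(root, "G"), lookup(root, "Z"))    # -> True False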
Today: two ways to fix this problem.  The first is a particularly nice
method used in database applications, called B-trees.  We'll then look
at another called treaps, which is a lot like randomized quicksort,
but trickier since the keys are coming in one at a time.

IMPORTANT IDEA: the problem with the basic BST was that we weren't
maintaining balance in the tree.  On the other hand, if we try to
maintain a perfectly balanced BST, rearranging things becomes too
expensive.  So, we want to give ourselves some slack.  In AVL trees,
you give yourself slack by allowing the tree to be "almost balanced".
In the median-finding algorithm, we gave ourselves slack by allowing
the pivot to be "near" the middle.  For B-trees, we will make the tree
perfectly balanced, but give ourselves slack by allowing some nodes to
have a higher degree (number of children) than others.

B-trees and 2-3-4 trees
=======================
A B-tree is a search tree where, for some pre-specified t >= 2,

 * each node has between t-1 and 2t-1 keys in it (except the root can
   have fewer).  Keys are stored in a sorted array.

 * each non-leaf has degree = (# keys)+1.  So degrees are in the range
   [t, 2t], except the root's degree is in [2, 2t].  E.g., if the
   array is [a1, a2, ..., a10], then there is one child for things
   less than a1, one child for things between a1 and a2, ..., and one
   child for things bigger than a10.

 * all leaves are at the same depth.

(The case t=2 is called a 2-3-4 tree, since the degrees are 2, 3, or 4.)

E.g., here is a tree for t=3 (a 3-6 tree; the max number of keys in a
node is 5):

                      [H  M  R]
              /       |      |       \
      [A B C D]     [K L]  [N O]   [T Y Z]

The idea is that by using this flexibility in the sizes and degrees of
nodes, we'll be able to keep the tree perfectly balanced.  The rules
for search and insert turn out to be pretty easy:

  Search: just do binary search in the array at the root, then go to
          the appropriate child and recurse.

  Insert: 1) act like you're doing a search, but if you ever encounter
             a FULL node (FULL means it has the maximum 2t-1 keys in
             it), pull its median element up to the parent and split
             the node.  This takes O(t) time, since we need to insert
             into the parent and move half of the keys into a new array.
          2) then insert into the leaf.

Example: insert "E" above, then insert "F":
  - when doing part (1) for F, we pull the "C" up to the root and
    split the node.  End result:

               [ C    H    M    R ]
             /      |     |     \      \
        [A B]  [D E F]  [K L]  [N O]  [T Y Z]

Question: do we maintain the requirement of at least t-1 keys per
non-root node?  Yes, because we split the node at its median.

Question: can the parent get *over-full*?  Answer: no, since if it had
been full we would have already split it on the way down.

Now insert "S", "U", "V":

             [ C    H    M    R    U ]
           /      |     |     |     \      \
      [A B]  [D E F]  [K L]  [N O]  [S T]  [V Y Z]

Now insert "P".  Doing this brings M up to a new root:

                          [ M ]
                    /              \
              [ C  H ]            [ R  U ]
             /    |    \         /    |     \
       [A B]  [D E F]  [K L]  [N O P]  [S T]  [V Y Z]

Question: is the tree always height-balanced (all leaves at the same
depth)?  Answer: yes, since we only grow the tree "up", by adding a
new root when the old root splits.

So, we have maintained the properties.  What about the running time?
Say we do binary search at each node to find the position:

   time for search = O(depth * log(t)).

What is the maximum depth?  O(log_t(n)).  E.g., if t = 1000 and
n = 1 billion, then the depth is at most 3 (or 2 if you count starting
at 0).  The interesting thing is that the t's cancel in the expression
for the search time:

   time for search = O(log_t(n) * log(t)) = O(log n).

Inserts: same as search, but we also need to split nodes, which takes
O(t) time each.  In the worst case, we need to do this at every step
on the way down:

   worst-case time for insert
      = search time + splitting time + time to insert into the leaf,

where the splitting time is O(depth * t) and the leaf insert is O(t).
If t is a constant, this is O(log n).

The interesting thing is that even if t is large, amortized analysis
comes to our rescue.  In particular, if we create a tree from n
inserts, I claim we can have made at most O(n/t) splits total in the
process.  Why?  Because each split creates a new node, and there are
O(n/t) nodes total.  So the *total* time spent on splits over all n
inserts is O(n), which means we spend only O(1) amortized time per
insert on splits.  So, the amortized time per insert is:

   O(log n) + O(1) + O(t) = O(log n + t).
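For concreteness, here is a minimal Python sketch of these search and
insert rules (the class and method names are just for illustration,
not from the lecture).  Insert splits every full node it meets on the
way down, which is exactly why a parent can never become over-full.

    import bisect

    class BTreeNode:
        def __init__(self, leaf=True):
            self.keys = []        # sorted array of keys; at most 2t-1 of them
            self.kids = []        # children; len(kids) == len(keys)+1 when non-leaf
            self.leaf = leaf

    class BTree:
        def __init__(self, t=3):
            self.t = t                        # nodes hold between t-1 and 2t-1 keys
            self.root = BTreeNode(leaf=True)

        def lookup(self, key):
            node = self.root
            while True:
                i = bisect.bisect_left(node.keys, key)   # binary search within the node
                if i < len(node.keys) and node.keys[i] == key:
                    return True
                if node.leaf:
                    return False
                node = node.kids[i]

        def _split_child(self, parent, i):
            # Child i of parent is full (2t-1 keys): pull its median key up into
            # parent and split the remaining keys into two nodes of t-1 keys each.
            t = self.t
            child = parent.kids[i]
            right = BTreeNode(leaf=child.leaf)
            median = child.keys[t-1]
            right.keys, child.keys = child.keys[t:], child.keys[:t-1]
            if not child.leaf:
                right.kids, child.kids = child.kids[t:], child.kids[:t]
            parent.keys.insert(i, median)
            parent.kids.insert(i+1, right)

        def insert(self, key):
            # If the root is full, grow the tree "up" by one level before descending.
            if len(self.root.keys) == 2*self.t - 1:
                old_root, self.root = self.root, BTreeNode(leaf=False)
                self.root.kids = [old_root]
                self._split_child(self.root, 0)
            node = self.root
            while not node.leaf:
                i = bisect.bisect_left(node.keys, key)
                if len(node.kids[i].keys) == 2*self.t - 1:  # full node on the way down
                    self._split_child(node, i)
                    if key > node.keys[i]:                  # median moved up; pick side
                        i += 1
                node = node.kids[i]
            bisect.insort(node.keys, key)                   # step (2): insert into leaf

    tree = BTree(t=3)
    for ch in "HMRABCDKLNOTYZEF":
        tree.insert(ch)
    print(tree.lookup("E"), tree.lookup("Q"))   # -> True False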
More facts about B-trees:

 * They are used a lot in databases in practice because they fit in
   nicely with the memory hierarchy.  Basically, you define t so that
   a node fits into a page.  Keep the root in fast memory (and maybe
   even its children), so in the example above each insert or lookup
   would use only 1 or 2 disk accesses.

 * If you use t=2, you have a 2-3-4 tree.  What's special about 2-3-4
   trees is that they can be implemented efficiently as binary trees
   using "red-black trees".

=======================================================

Treaps
======
Going back to binary search trees: we saw how a standard BST is like
quicksort using the leftmost element as the pivot, with all of its
worst-case problems.  Question: can we come up with a method that's
instead like *randomized* quicksort?  The problem is that we don't
have all the elements at the start, so it's not so obvious.  But it
turns out we *can*, and the idea is called a "treap".

Treaps: the idea for a treap is that when an element x is inserted, we
also give it a random *priority* value.  Think of the priorities as
giving the order in which the elements are supposed to be chosen as
pivots.  (And think of them as real numbers so we don't get any ties.)
In particular, the property we will require is that if v is a child of
u, then priority(v) > priority(u).  For example:

            N:5
           /   \
        C:10    R:7
        /
      A:12

So, the keys are search-tree ordered and the priorities are
heap-ordered, which is why this is called a "treap".

You might ask: is it always possible to have the keys in BST order and
the priorities in heap order at the same time?  Answer: yes -- just
think of the standard BST you would get if the nodes came in order of
priority.

Question: how do you do an insert?  Answer: just do the usual BST
insert, putting the new element at a leaf, and then rotate it up the
tree as long as its parent has a larger priority value.  Remember how
tree rotations work:

           x                  y
         /   \              /   \
        T1    y   <----->  x    T3
            /   \         /  \
           T2   T3       T1   T2

For instance, say we are inserting C A R N E G I E', and so far we
have the tree above.  We now insert E with priority 15 (so no
rotations are needed), and then G with priority 8:

        N:5                 N:5                 N:5
       /   \               /   \               /   \
    C:10    R:7   --->  C:10    R:7   --->   G:8    R:7
    /   \               /   \                /
  A:12   E:15         A:12    G:8         C:10
            \                 /            /   \
             G:8           E:15         A:12   E:15

We need to prove that this maintains the treap property.  First, the
search-tree property on the keys is maintained, since it isn't
affected by rotations.  What about the heap property?  The proof is
that initially all descendant relations are OK (if v is a descendant
of u, then priority(v) > priority(u)) except possibly those where v is
the new node (note: the heap property is defined in terms of children,
but it implies the same for descendants by transitivity).  Now, when
we do a rotation (see the generic picture above), the only new
descendant relations we add are that x and T1 become descendants of y.
But since priority(x) > priority(y), and priority(T1) > priority(x) by
assumption, these are all OK.  So we maintain the invariant.  Finally,
once the new node's priority is greater than its parent's, all
descendant relations are OK and we're done.

So, an insert can be done in time O(time for a search), since in the
worst case the number of rotations equals the number of steps on the
way down.  (One can actually show that the *expected* number of
rotations per insert is O(1).)
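Here is a minimal Python sketch of treap insertion, assuming
priorities are random real numbers drawn at insert time (all names are
illustrative).  Since smaller priorities sit closer to the root under
the convention above, the new node is rotated up while its priority is
smaller than its parent's.

    import random

    class TreapNode:
        def __init__(self, key):
            self.key = key
            self.pri = random.random()     # random real priority, chosen at insert time
            self.left = self.right = None

    def rotate_right(y):
        # Left child x moves above y; x's old right subtree (T2) is handed to y.
        x = y.left
        y.left, x.right = x.right, y
        return x

    def rotate_left(x):
        # Mirror image: right child y moves above x.
        y = x.right
        x.right, y.left = y.left, x
        return y

    def insert(root, key):
        # Usual BST insert at a leaf, then rotate the new node up while its
        # priority is smaller than its parent's (smaller priority = nearer the root).
        if root is None:
            return TreapNode(key)
        if key < root.key:
            root.left = insert(root.left, key)
            if root.left.pri < root.pri:
                root = rotate_right(root)
        else:
            root.right = insert(root.right, key)
            if root.right.pri < root.pri:
                root = rotate_left(root)
        return root

    root = None
    for ch in "CARNEGIE":          # the second E plays the role of E'
        root = insert(root, ch)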
Now we're almost done.  Inserts and searches are both O(depth of the
tree).  What about the depth of the tree?  When we analyzed randomized
quicksort, we showed that in expectation the total number of
comparisons is O(n log n).  This means that in expectation the sum of
all node depths is O(n log n).  But you can actually show something a
lot stronger: in fact, with high probability, the maximum node depth
is O(log n).  [This also implies that the quicksort bound holds "with
high probability".]

Here is a sketch of one way to prove it.

Proof: let's go back to our "dart-throwing" argument for quicksort.
Line the elements up in sorted order, and pick one element X that we
care about.  We can think about the depth of this node like this: we
throw a dart at random into this sorted list, representing whatever
element is at the root of the tree, and wherever it lands, it cuts the
list and the part that X isn't on disappears.  We do it again,
representing the next node on the path to X, and keep going until only
X is left.  Now, if you think about it, whether the dart lands to the
left or to the right of X, it has a 50/50 chance of deleting at least
half of that side of the list.  This can happen at most lg(n) times to
the left of X, and at most lg(n) times to the right.  So we can think
of it like this: each dart/pivot is like a coin toss with a 50% chance
of heads, and we have to stop sometime before getting 2*lg(n) heads.

There's a nice bound called "Hoeffding's inequality" that says that if
you flip a fair coin t times, the chance of getting fewer than t/4
heads is at most e^{-t/8}.  So, if we flip 8*lg(n) times, the chance
of getting fewer than 2*lg(n) heads is at most

   e^{-lg(n)} = e^{-ln(n)/ln(2)} = 1/n^{1.44}.

Even if you multiply this by n, to account for the fact that we want
this to hold for *every* node X, you still get that with high
probability every element has depth < 8*lg(n).
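To get a feel for this bound, here is a small, purely illustrative
Python sketch of the dart game for a single element X: each dart is a
uniformly random pivot in the interval still containing X, and the
side of the dart that X is not on disappears.  The worst depth
observed over many trials is compared against the 8*lg(n) bound.

    import math
    import random

    def dart_depth(n, x):
        # Number of darts (ancestors of x) thrown before x is isolated or hit.
        lo, hi = 0, n              # surviving interval is [lo, hi)
        depth = 0
        while hi - lo > 1:
            pivot = random.randrange(lo, hi)
            if pivot == x:         # x itself becomes a pivot; stop counting ancestors
                break
            depth += 1
            if pivot < x:
                lo = pivot + 1     # everything left of the dart disappears
            else:
                hi = pivot         # everything right of the dart disappears
        return depth

    n = 10**6
    worst = max(dart_depth(n, random.randrange(n)) for _ in range(1000))
    print(worst, "vs bound", 8 * math.log2(n))   # observed max is well under 8*lg(n)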