15-451 Algorithms                                              09/28/04

Digit-based sorting/data-structures
  Quiz
  * Radix sort
  * Tries
=========================================================================

So far, we have looked at sorting and storing items whose keys can be
anything at all, so long as we have a notion of "<".  Today we will
look at methods for the case that keys are natural things like strings
or integers.

Say you have n objects to sort, whose keys are all integers in a small
range: 1,...,r (e.g., say I have a list of movies and want to sort by
how many stars they got).  What is a fast way to sort?

--> Make an array of size r of "buckets" (maybe implemented as linked
    lists).
--> Make 1 pass through the records.  Insert the object with key K
    into bucket K.
--> Make a pass through the array of buckets, reading off as you go.

Total time: O(n+r).  This is called "bucket sort".  One interesting
thing to notice is that by using the magic of indirect addressing, we
are making an r-way decision at each step (something you can't do in
the comparison model).  Unfortunately, bucket sort is only good if the
range of keys is small.

RADIX SORT
----------
What if we have keys in a larger range, but the keys are numbers (or
strings) that can be viewed as a sequence of digits (or characters)?
In that case, there are two natural generalizations: one that's
conceptually easier, called Most-Significant-First radix sort, where
we go top-down digit by digit (or byte by byte), and another that's
trickier to think of but easy to code, called Least-Significant-First
radix sort, where we go in the other direction.  The "radix" is the r:
e.g., r might be 10 or 256.

Most-Significant-First radix sort:
----------------------------------
Sort by the most significant character/digit first into buckets.  We
get a bucket for 'a', 'b', ...

    ---- a's ---- | ---- b's ---- | ---- c's ---- | ...

Now, recursively sort each bucket, starting with the next most
significant digit/character, and so on (sort of like quicksort).
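The bucket sort described above can be sketched in Python as follows.
This is my own illustration, not code from the notes; the function
name and the `key`/`r` parameters are choices of mine.

```python
def bucket_sort(items, key, r):
    """Sort items whose key is an integer in 0..r-1, in O(n + r) time.

    One pass scatters each item into bucket key(item); a second pass
    reads the buckets off in order.  Items within a bucket keep their
    input order, so the sort is stable.
    """
    buckets = [[] for _ in range(r)]       # array of r buckets
    for x in items:                        # insert object with key K into bucket K
        buckets[key(x)].append(x)
    return [x for b in buckets for x in b] # concatenate buckets in order
```

For instance, sorting hypothetical movies by star rating:
`bucket_sort([("A", 3), ("B", 1), ("C", 3)], key=lambda m: m[1], r=6)`
keeps "A" before "C" because equal keys stay in input order.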
Keep doing this for any bucket that has more than one element.  If we
ignore the time spent scanning empty buckets (the "+r" part), then we
just spend O(n) time at each level.  So, if the strings have length L,
the total time is O(nL).  In fact, it may end a lot earlier if, e.g.,
we can distinguish all the strings from just their first few
characters.

How does this relate to our lower bound?  One thing to think about:
how small can L be, if all n keys are distinct?  Ans: it has to be at
least log_r(n).  So, in the case that all strings are roughly this
length, our time is only n*log_r(n) rather than n*log_2(n), because we
are making an r-way split rather than a 2-way split at each step.

What about the empty buckets?  That can actually be a problem if we
get to a case in the recursion where the number of elements in a
bucket is small compared to r (e.g., "algorithm" and "algorithmic"),
so at that point you might want to switch to a different algorithm or
do something trickier.  But if we view r as a constant, then overall
it's just O(nL), which is how long it takes to read the input anyway.

Least-significant-first radix sort:
-----------------------------------
Another idea -- easier to implement but trickier to see why it works:
First bucket-sort by the LEAST significant digit.  That is, sort the
items into buckets by the ones digit and then concatenate the buckets.
Now bucket-sort by the tens digit, and so on.  This sounds weird, but
the claim is that if we do this using a STABLE bucket sort in each
step ("stable" means that it doesn't rearrange the order of items that
go into the same bucket), then this will end up correctly sorting the
items.  Let's see what happens on an example:

Example: 28, 15, 24, 17.
  -> sort by least significant digit: 24, 15, 17, 28
  -> sort by next-least:              15, 17, 24, 28   (since STABLE)

Why does the algorithm work?  Let's prove by induction that after the
ith pass, the items are correctly sorted according to their least
significant i digits.  This is clearly true for the base case i=1.
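The LSF procedure just described -- one stable bucket sort per digit,
least significant first -- can be sketched in Python as below.  This
is a sketch of mine, not code from the notes; the name and the base-r
digit extraction via integer division are my choices.

```python
def lsf_radix_sort(nums, L, r=10):
    """Least-significant-first radix sort of non-negative integers
    that have at most L base-r digits.  Each pass is a STABLE bucket
    sort on one digit, so total time is O(L*(r+n))."""
    for j in range(L):                          # pass j sorts on the jth digit from the right
        buckets = [[] for _ in range(r)]
        for x in nums:
            buckets[(x // r**j) % r].append(x)  # stable: input order kept per bucket
        nums = [x for b in buckets for x in b]  # concatenate the buckets
    return nums
```

On the example from the notes, `lsf_radix_sort([28, 15, 24, 17], L=2)`
first produces 24, 15, 17, 28 (ones digit) and then 15, 17, 24, 28.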
Now, let's do the general case.  We know that after the ith pass, the
items that differ in the ith digit will be in the desired order with
respect to each other (because we just sorted them by that digit!),
but what about the items that are equal in this digit?  Well, by
induction, these were in the desired order *before* we began the ith
pass, and since our bucket sort is STABLE, they remain in the correct
order afterwards.  So we are done.

Running time is O(L*(r+n)) = O(Ln) if r = O(n).  Advantage: easy to
implement, since there is no need to keep buckets separate or even to
do recursion.  We just have a loop that for j = L down to 1 calls
bucketsort(A,j), which does a bucket sort using the jth character of
each string as the key.  If we have a mix of big numbers and small
numbers, we could add a check that identifies the part of the array
that's already done, so we don't keep re-sorting it.

[Technical point: if the keys are strings of different lengths, then
to match the usual notion of "sorted in alphabetical order", we should
pad them to the right with blanks that are defined to be less than
'a'.  E.g., {car, card} should be viewed as {car_, card}.]

Tries
=====
(TRIE comes from reTRIEval.  Tries are also called digital search
trees.)  Tries are to MSF radix sort as binary search trees are to
quicksort.  You can think of a trie as an adaptive version of MSF
radix sort.  In a trie, you put letters on the edges, and you walk
down the tree reading off the word.  In particular, each node has an
array of size r (e.g., r=26 or r=256) of child pointers.  To store a
string, you just follow down the associated child for each letter,
from first to last.  For instance, say we wanted to store the words:
and, bat, add, bag, cat, car.  When doing an insert, we end each
string with "$" (a special end character) in case some strings are
prefixes of others.  To do a lookup, we just walk down the tree
letter by letter and then see if the node we get to at the end has a
"$" in it.
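The trie just described -- an array of r child pointers per node, with
a "$" end marker -- might look like the sketch below.  This is my own
illustration under some assumptions: keys are lowercase words, and I
take r=27 so that "$" can be folded in as a 27th character; the class
and function names are mine.

```python
R = 27  # 'a'..'z' plus the special end character '$'

def char_index(c):
    """Map a character to its slot in a node's child array."""
    return 26 if c == '$' else ord(c) - ord('a')

class TrieNode:
    def __init__(self):
        self.children = [None] * R      # array of size r of child pointers

def trie_insert(root, word):
    """Walk down letter by letter, creating nodes as needed; the
    trailing '$' marks where the word ends."""
    node = root
    for c in word + '$':
        i = char_index(c)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]

def trie_lookup(root, word):
    """Walk down letter by letter; a null pointer means the key
    cannot be in the trie.  Time is O(length of key)."""
    node = root
    for c in word + '$':
        i = char_index(c)
        if node.children[i] is None:
            return False
        node = node.children[i]
    return True
```

With the notes' example words inserted, looking up "apple" fails at
the second step ('p' is a null child of the 'a' node), and looking up
"ca" fails because the node reached has no "$" child.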
(If we ever get to a null pointer, then we know the key is not there
-- e.g., if we look up "apple", then we will notice in the 2nd step
that it can't possibly be in the trie.)

Time to do a search is only O(length of key).  Same for doing an
insert, if we view r as a constant.  (If r is really big, then we
should worry about the time spent allocating the arrays of child
pointers.)  So, this is really nice.  Also, prefix searching is
especially easy (e.g., if you wanted to make a text editor that did
word completion).

The main drawback is that the overhead is high, because you have so
much pointer-following to do.  E.g., what if we added "automobile" to
the above trie?  Also, you have a factor of r extra space, since you
need an array of size r in each node.  Since the design is so nice
conceptually, people have thought about ways of making it more
practical.  In particular, some things you can do are:

--> compress paths that don't branch into a single edge.
--> only split when needed.

So, for instance, the set {add, automobile, and, another} would look
like:

                        *
                       a|
                        *
            dd$/       |n       \utomobile$
              *        *          *
                   d$/   \other$
                     *     *

This is called a "Patricia tree" (Practical Algorithm To Retrieve
Information Coded In Alphanumeric).
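For completeness, the MSF radix sort from the start of the lecture --
the algorithm that tries adaptively mirror -- can be sketched in
Python as well.  This is my own sketch, assuming lowercase keys; per
the technical point above, a string that runs out of characters is
treated as padded with a blank that sorts before 'a'.

```python
def msf_radix_sort(strings, d=0):
    """Most-significant-first radix sort of lowercase strings,
    currently splitting on character position d."""
    if len(strings) <= 1:
        return strings                        # bucket with <= 1 element: done
    done = []                                 # strings exhausted at position d sort first
    buckets = [[] for _ in range(26)]         # one bucket per letter 'a'..'z'
    for s in strings:
        if d >= len(s):
            done.append(s)                    # conceptually padded with a blank < 'a'
        else:
            buckets[ord(s[d]) - ord('a')].append(s)
    for b in buckets:
        done += msf_radix_sort(b, d + 1)      # recursively sort each bucket
    return done
```

Note the {car, card} case: at d=3, "car" is exhausted and so sorts
before "card", matching alphabetical order.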