Homework 2: Integer Data Structures


This is just the writeup for the spicy Integer Data Structures problem in hw2.
Integer Data Structures
This is a fun and instructive exercise that combines some elements from 15-112, 15-122, 15-150, and 15-251. We will use normal Python integers to represent lists, sets, maps, strings, and finite state machines. While there are more efficient ways to represent all those things, it is fascinating that arbitrary-length Python integers alone are sufficient for all of that and more!

Note: don't be daunted by the long writeup. You can do this!

Since this problem has many parts, and it would not be fair if you only got credit if you completed ALL of them, there are several checkpoints that you can complete to get chunks of the 4 points available from this problem:

  1. Background
    Since we plan to include multiple integers in a single integer, we will start by encoding integers with a prefix of the number of digits in that integer. For example, we might encode 25 as 225, where the first 2 says there are 2 digits. But then we'd have to encode 1234567890 with a count of 10. But then we'd have to know how many digits our count itself has! And we are now recursing. Oh no!

    Our solution, which is not fully general but which is good enough for our purposes, is to first have a count-count of the number of digits in the count. That first count-count will always be exactly one digit, so the count itself can be between 1 and 9 digits. So the largest digit count is 999,999,999. So this approach does not allow numbers with one billion or more digits. We can live with that restriction.

    We also have to deal with the sign (+/-) of the number. Normally this might be 0 for positive and 1 for negative, but leading 0's can cause confusion, so we'll use 1 for positive and 2 for negative.

    Thus, we will encode numbers as such: [sign-digit] [count-count] [count] [number]. So, for example, to encode 789, the sign-digit is 1 (positive). The count is 3, so the count-count is 1. Thus, the entire encoding of 789 is 113789.

    For another example, to encode -1234512345, the sign-digit is 2 (negative). The count is 10, so the count-count is 2. Thus, the entire encoding of -1234512345 is 22101234512345.

  2. lengthEncode(n)
    Time to write some code! Write the function lengthEncode(n) that takes a possibly-negative Python integer and returns the length prefix encoded integer as described above. Here is a test function:
    def testLengthEncode():
        print('Testing lengthEncode()...', end='')
        assert(lengthEncode(789) == 113789)
        assert(lengthEncode(-789) == 213789)
        assert(lengthEncode(1234512345) == 12101234512345)
        assert(lengthEncode(-1234512345) == 22101234512345)
        assert(lengthEncode(0) == 1110)
        print('Passed!')
    Hint: while not required, we found it very helpful to write the helper function intCat(n, m) that took two non-negative integers and returned their concatenation. So, for example, intCat(123, 45) returns 12345.

  3. lengthDecode(n)
    Now write lengthDecode(n) that performs the inverse operation of lengthEncode(n), so:
    def testLengthDecode():
        print('Testing lengthDecode()...', end='')
        assert(lengthDecode(113789) == 789)
        assert(lengthDecode(213789) == -789)
        assert(lengthDecode(12101234512345) == 1234512345)
        assert(lengthDecode(22101234512345) == -1234512345)
        assert(lengthDecode(1110) == 0)
        print('Passed!')
  4. Hint: while not required, we found it very helpful to write a helper function that could extract a "substring" of an integer, i.e. if I wanted to extract digits 2 through 4 of 1511242 it would return 112.

  5. lengthDecodeLeftmostValue(n)
    The whole point of using the length prefix was to allow us to place multiple integers one after another inside a single larger integer. For example, we encode 2 as 1112 and we encode 789 as 113789, so we can encode a 2 followed by a 789 as simply 1112113789.

    Thus, we need a way to decode just the leftmost value. For that, write the function lengthDecodeLeftmostValue(n) that takes a number with one-or-possibly-more encoded values and decodes just the leftmost value. This function has to return two values: the decoded value, and also the rest of the encoded values! So lengthDecodeLeftmostValue(1112113789) returns (2, 113789), meaning that 2 was the leftmost value, and after removing the 2, 113789 still remains.

    Thus, we call this function using this pattern:
    encodedValues, nextValue = lengthDecodeLeftmostValue(encodedValues)
    With that in mind, write lengthDecodeLeftmostValue(n) so that the following test function passes:
    def testLengthDecodeLeftmostValue():
        print('Testing lengthDecodeLeftmostValue()...', end='')
        assert(lengthDecodeLeftmostValue(111211131114) == (2, 11131114))
        assert(lengthDecodeLeftmostValue(112341115) == (34, 1115))
        assert(lengthDecodeLeftmostValue(111211101110) == (2, 11101110))
        assert(lengthDecodeLeftmostValue(11101110) == (0, 1110))
        print('Passed!')
    Hint: you may want to think carefully about the case where the second-to-leftmost value is a 0. The last two tests above deal with that case.
    PS: We highly reccomend completing this function BEFORE lengthDecode

  6. Lists
    We can now use a sequence of length-prefix-encoded integers to represent a list of integers. We will just prefix the entire list with the length of that list. So, for example, the Python list [9, 8888]. This list is of length 2, so we will encode it as the values 2, 9, 8888. Since these are encoded as 1112, 1119, and 1148888, respectively, we will encode the entire list as 111211191148888. Note that an empty list is just a list of length 0, and so it is just the length-prefix-encoded value of 0, which is 1110 (the result of calling lengthEncode(0)).

    With that, here are the functions you need to write:
    • newIntList(): takes no argument, returns an empty list.
    • intListLen(L): takes a list, returns its length (the leftmost encoded value).
    • intListGet(L, i): takes a list and an index, and returns the decoded value at that index. Returns the string 'index out of range' as needed.
    • intListSet(L, i, v): takes a list and an index, and returns a new list with the value at the given index changed to be the encoded value of v. Returns the string 'index out of range' as needed.
    • intListAppend(L, v): takes a list and a value, and returns a new list with that value (encoded) appended.
    • intListPop(L): takes a list and removes the last value ("pops" it). It then returns two values: the list without that popped value, and that popped value itself.

    And here is the test function to pass:
    def testIntList():
        print('Testing intList functions...', end='')
        a1 = newIntList()
        assert(a1 == 1110) # length = 0, list = []
        assert(intListLen(a1) == 0)
        assert(intListGet(a1, 0) == 'index out of range')
    
        a1 = intListAppend(a1, 42)
        assert(a1 == 111111242) # length = 1, list = [42]
        assert(intListLen(a1) == 1)
        assert(intListGet(a1, 0) == 42)
        assert(intListGet(a1, 1) == 'index out of range')
        assert(intListSet(a1, 1, 99) == 'index out of range')
    
        a1 = intListSet(a1, 0, 567)
        assert(a1 == 1111113567) # length = 1, list = [567]
        assert(intListLen(a1) == 1)
        assert(intListGet(a1, 0) == 567)
    
        a1 = intListAppend(a1, 8888)
        a1 = intListSet(a1, 0, 9)
        assert(a1 == 111211191148888) # length = 2, list = [9, 8888]
        assert(intListLen(a1) == 2)
        assert(intListGet(a1, 0) == 9)
        assert(intListGet(a1, 1) == 8888)
    
        a1, poppedValue = intListPop(a1)
        assert(poppedValue == 8888)
        assert(a1 == 11111119) # length = 1, list = [9]
        assert(intListLen(a1) == 1)
        assert(intListGet(a1, 0) == 9)
        assert(intListGet(a1, 1) == 'index out of range')
    
        a2 = newIntList()
        a2 = intListAppend(a2, 0)
        assert(a2 == 11111110)
        a2 = intListAppend(a2, 0)
        assert(a2 == 111211101110)
        print('Passed!')

  7. Sets
    A set is similar to a list, only with the restriction that values can only appear once, even if they are added multiple times. While there are more efficient approaches, we can represent sets using lists. In particular, when you add a value, we can first check if it is already in the set, and if so, do nothing. So we only add values that are not already in the set. That's it!

    With that, here are the functions you need to write:
    • newIntSet(): returns a new empty set (which is the same as a new empty list, which is the same as the length-prefix-encoded value for 0).
    • intSetAdd(s, v): takes a set and a value, and returns the same set if the value is already in it, or a new set that appends the value to the end of the set s.
    • intSetContains(s, v): takes a set and a value, and returns True if that set contains that value, and False otherwise.

    And here is the test function to pass:
    def testIntSet():
        print('Testing intSet functions...', end='')
        s = newIntSet()
        assert(s == 1110) # length = 0 
        assert(intSetContains(s, 42) == False)
        s = intSetAdd(s, 42)
        assert(s == 111111242)  # length = 1, set = [42]
        assert(intSetContains(s, 42) == True)
        s = intSetAdd(s, 42) # multiple adds --> still just one
        assert(s == 111111242) # length = 1, set = [42]
        assert(intSetContains(s, 42) == True)
        print('Passed!')

  8. Strings
    We can convert individual characters into numbers using the Python function ord(c), and we can undo that and convert those numbers back into characters using chr(n). Thus, we can store a Python String as a single integer by creating an encoded list of those ord values.

    With that, here are the functions you need to write:
    • encodeString(s): takes a Python string s and returns a length-prefixed list of the ord values of the characters in s.
    • decodeString(L): takes a length-prefixed list L and returns the corresponding Python string.

    And here is the test function to pass:
    def testEncodeDecodeStrings():
          print('Testing encodeString and decodeString...', end='')
          assert(encodeString('A') == 111111265) # length = 1, str = [65]
          assert(encodeString('f') == 1111113102) # length = 1, str = [102]
          assert(encodeString('3') == 111111251) # length = 1, str = [51]
          assert(encodeString('!') == 111111233) # length = 1, str = [33]
          assert(encodeString('Af3!') == 1114112651131021125111233) # length = 4, str = [65,102,51,33]
          assert(decodeString(111111265) == 'A')
          assert(decodeString(1114112651131021125111233) == 'Af3!')
          assert(decodeString(encodeString('WOW!!!')) == 'WOW!!!')
          print('Passed!')

  9. Maps
    A map is a data structure that converts keys into values (in Python this is called a "dictionary"). For example, we could use a map m to store country capitals. We could call intMapSet(m, 'France', 'Paris') to set the capital of 'France' to 'Paris'. Then, we could retrieve the capital by calling intMapGet(m, 'France'), which should return 'Paris'. Maps convert keys to values. So in this example, 'France' is the key, and 'Paris' is the value.

    Here, we are only using integers, but the idea is the same: a map will convert keys to values. And while there are more efficient approaches, we will store a map as a list of alternating values: key1, value1, key2, value2,...

    Thus, an empty map is just an empty list. Then, if we map the value 42 to 73, we get the list [42, 73], which is encoded as 11121124211273. Then, if we re-map the value 42 to 98765, since 42 can only be present once as a key, we change the 73 to instead be 98765, so we get the list [42, 98765].

    With that, here are the functions you need to write:
    • newIntMap(): takes no arguments and returns an empty map, which is an empty list.
    • intMapContains(m, key): takes a map and a key, and returns True if that map contains that key (as a key, not a value), and False otherwise.
    • intMapGet(m, key): takes a map and a key, and returns the value associated with that key in that map, or the string 'no such key' as appropriate.
    • intMapSet(m, key, value): takes a map, a key and a value, and returns a new map which is the same as m only with the given key mapped to the given value. If the key was already in m, this involves modifying the value the key maps to. Otherwise, this involves adding the key and value to the end of the list.

    And here is the test function to pass:
    def testIntMap():
        print('Testing intMap functions...', end='')
        m = newIntMap()
        assert(m == 1110) # length = 0
        assert(intMapContains(m, 42) == False)
        assert(intMapGet(m, 42) == 'no such key')
        m = intMapSet(m, 42, 73)
        assert(m == 11121124211273) # length = 2, map = [42, 73]
        assert(intMapContains(m, 42) == True)
        assert(intMapGet(m, 42) == 73)
        m = intMapSet(m, 42, 98765)
        assert(m == 11121124211598765) # length = 2, map = [42, 98765]
        assert(intMapGet(m, 42) == 98765)
        m = intMapSet(m, 99, 0)
        assert(m == 11141124211598765112991110) # length = 4, map = [42, 98765, 99, 0]
        assert(intMapGet(m, 42) == 98765)
        assert(intMapGet(m, 99) == 0)
        print('Passed!')

  10. Finite State Machines (FSM's)
    A Finite State Machine (FSM) is a theoretical machine that is useful for modeling certain kinds of limited computation. You can read about them online. The basic ideas are these:
    • An FSM has states. Ours will be numbered.
    • An FSM has a special start state. Ours will be state 1.
    • An FSM has transitions, which are rules about how to move from one state to another. A transition is specified by 3 values -- a startState, a symbol, and an endState. If the FSM is in the startState when it sees that symbol, it will transition to the endState. In our case, the startState and the endState will both be integers, and the symbol will be a single digit.
    • An FSM has accepting states. If the FSM finishes reading symbols and it is in an accepting state, we say that the FSM accepts those symbols. Our FSM will work in this way.


    We will represent our FSM as a list of two values: a map of transitions, and a set of accepting states. Each transition itself maps a startState to yet another map, that second map mapping symbols (digits) to endStates. This is a bit complicated, and you may need to read this a few times, think about it, and carefully study the examples in the test code below to fully understand it.

    With that, here are the functions you need to write:
    • newIntFSM(): takes no arguments and returns an empty FSM, which contains an empty map of transitions and an empty set of accepting states.
    • isAcceptingState(fsm, state): takes an fsm and a state, and returns True if that state is in the set of accepting states, and False otherwise.
    • addAcceptingState(fsm, state): takes an fsm and a state, and returns a new FSM that is the same except that the given state is added to the set of accepting states.
    • setTransition(fsm, fromState, digit, toState): returns a new FSM that is the same as the given fsm except with this new transition added. This involves updating two maps -- the outer map (mapping each fromState to its own map), and the inner map (mapping each digit to its toState). Look closely at the examples in the test function for details.
    • getTransition(fsm, fromState, digit): returns the toState that is mapped to by this fromState on this digit, or the string 'no such transition' as appropriate.
    • accepts(fsm, inputValue): takes an fsm and an inputValue, and returns True if the FSM accepts that inputValue, and False otherwise. To do this, it starts the FSM in its startState (1), then it transitions on each symbol from left-to-right in the inputValue. When it is done with all the values, it checks if it is in an accepting state. Note that if any transition is missing, this is not an error, it just means that the FSM does not accept the input.
    • states(fsm, inputValue): takes an fsm and an inputValue, and does basically the same thing as accepts(fsm, inputValue), only here instead of returning True or False, this returns a list (length-encoded-prefix list) of all the states that the FSM visited while computing the solution. This is mainly for testing purposes, so the autograder can verify your machine is working properly. Again, see the test code for details.

    And here is the first test function to pass:
    def testIntFSM():
        print('Testing intFSM functions...', end='')
        fsm = newIntFSM()
        assert(fsm == 111211411101141110) # length = 2, [empty stateMap, empty startStateSet] 
        assert(isAcceptingState(fsm, 1) == False)
    
        fsm = addAcceptingState(fsm, 1)
        assert(fsm == 1112114111011811111111)
        assert(isAcceptingState(fsm, 1) == True)
    
        assert(getTransition(fsm, 0, 8) == 'no such transition')
        fsm = setTransition(fsm, 4, 5, 6)
        # map[5] = 6: 111211151116
        # map[4] = (map[5] = 6):  111211141212111211151116
        assert(fsm == 1112122411121114121211121115111611811111111)
        assert(getTransition(fsm, 4, 5) == 6)
    
        fsm = setTransition(fsm, 4, 7, 8)
        fsm = setTransition(fsm, 5, 7, 9)
        assert(getTransition(fsm, 4, 5) == 6)
        assert(getTransition(fsm, 4, 7) == 8)
        assert(getTransition(fsm, 5, 7) == 9)
    
        fsm = newIntFSM()
        assert(fsm == 111211411101141110) # length = 2, [empty stateMap, empty startStateSet] 
        fsm = setTransition(fsm, 0, 5, 6)
        # map[5] = 6: 111211151116
        # map[0] = (map[5] = 6):  111211101212111211151116
        assert(fsm == 111212241112111012121112111511161141110)
        assert(getTransition(fsm, 0, 5) == 6)
    
        print('Passed!')

    That test function verifies that the FSM is setup properly. This next test function verifies that it runs properly:
    def testAccepts():
        print('Testing accepts()...', end='')
        fsm = newIntFSM()
        # fsm accepts 6*7+8 => See the diagram
        fsm = addAcceptingState(fsm, 3)
        fsm = setTransition(fsm, 1, 6, 1) # At state 1, receive 6, move to state 1
        fsm = setTransition(fsm, 1, 7, 2) # At state 1, receive 7, move to state 2
        fsm = setTransition(fsm, 2, 7, 2) # At state 2, receive 7, move to state 2
        fsm = setTransition(fsm, 2, 8, 3) # At state 2, receive 8, move to state 3
        assert(accepts(fsm, 78) == True)
        assert(states(fsm, 78) == 1113111111121113) # length = 3, list = [1,2,3]
        assert(accepts(fsm, 678) == True)
        assert(states(fsm, 678) == 11141111111111121113) # length = 4, list = [1,1,2,3]
    
        assert(accepts(fsm, 5) == False)
        assert(accepts(fsm, 788) == False)
        assert(accepts(fsm, 67) == False)
        assert(accepts(fsm, 666678) == True)
        assert(accepts(fsm, 66667777777777778) == True)
        assert(accepts(fsm, 7777777777778) == True)
        assert(accepts(fsm, 666677777777777788) == False)
        assert(accepts(fsm, 77777777777788) == False)
        assert(accepts(fsm, 7777777777778) == True)
        assert(accepts(fsm, 67777777777778) == True)
        print('Passed!')

    Here is a diagram explaining this FSM:

    And another diagram visualizing the integer data structure form of the FSM and its relationship to the various states and transitions.

  11. In Summary
    11241112421124211242112321128911311111311711232113109112971131001131011123211310511311611233112321127111311411310111297113116112321131061131111129811233112321128711311111311111310411311111311111233112331123311232112421124211242

That's it. Have fun!!!