CMU 15-112: Fundamentals of Programming and Computer Science
Homework 11 (Due Saturday 31-Jul at 8pm)


Important notes:
  • To start:
      • Create a folder named 'hw11'
      • Download these files into that folder:
        1. hw11.py
        2. cs112_n21_week4_linter.py
        3. Please also download and unzip these datasets. Place the text files in the same directory as hw11.py for easy access, BUT don't submit these files to Autolab! (Note also that the text files should be at the same folder level as hw11.py, i.e. your hw11 folder shouldn't contain any other folders.)
      • Edit hw11.py using VSCode
      • When you have completed and fully tested hw11, submit hw11.py to Autolab. You may submit up to 5 times and, as usual, only your last submission counts.
  • Do not use recursion.
  • Do not hardcode the test cases in your solutions.

    1. Facebook friends [100pts]
      When performing tasks on large amounts of data, we frequently need to format the dataset in a way that allows us to perform our task efficiently. For example, if we want to study social networks to find mutual friendships between specific users, we may want to format our data in such a way where we don't need to iterate over the entire dataset for every single search. This is especially true if we need to perform many searches. Perhaps we have recently discussed a data type that is well suited to this task...?

      For this problem, we will provide a text file of hundreds of anonymized Facebook friendships. Please download them here if you haven't already, move the unzipped text files to the same directory as hw11.py, and then open the files in a text editor to see how the data is represented. Individual users are identified by a unique integer. Each line consists of a pair of integers representing a "friendship" between two users. For example, the line "123 234" indicates that User #123 lists User #234 as a friend. See below for a very short example of how these text files are formatted:

      403 401
      403 402
      401 402
      402 400
      401 399
      402 403
      403 399
      401 403
      402 403
      402 401
      400 402
      399 401
      403 402
      399 403
      Given a specific user, we want to be able to efficiently search for all of the userIds listed as friends by that user. If we have to perform lots and lots of searches, an inefficient approach would be to iterate over every line of the text file for each query, appending the friend's ID to a list whenever a line begins with the user's ID, and returning the list after the entire file has been searched. This is potentially very slow for large datasets and many queries! We don't want to have to iterate over the entire dataset every time we want to know who one person is friends with.

      Let's do better than that. We'll format the dataset once in a way where we can search for a specific user's friends easily, and then we'll create some functions that will quickly perform those searches on our formatted dataset. We should be able to call these functions many times without having to worry about how big our dataset is.

      1. First, write the function formatDataset(filename), which reads the specified file and returns friendData, which is the dataset represented in a format of your choosing. For example, you might choose to return friendData as a string, though this is likely a poor choice. You could also return friendData as some sort of list or set or dictionary. It's up to you! This function does not have to be exceptionally efficient. Just make sure formatDataset(filename) runs in O(n^3) time or better, and does not time out the Autograder (i.e., it should run in under 1 minute). You should be able to do better than O(n^3) pretty easily, as our solution function runs in O(n). This function will only be called once, and once we have friendData, we'll use that as an input for our queries.

        Hint: In this problem, n scales with the number of unique users in the input file, which (in this problem) means that it also scales with the number of lines. For large enough datasets, the number of friends each user has is fairly constant (e.g., if user 123 has about 10 friends in a dataset of 100 userIds, user 123 would also have about 10 friends in a dataset of 1000 userIds).
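        To make the file format concrete, here is a minimal sketch of one reasonable approach: a dictionary mapping each userId to the set of friends that user lists. (This is only one possible choice of format, not the required one; the filename sampleFriends.txt and the tiny demo file written below are made up for illustration.)

        ```python
        def formatDataset(filename):
            friendData = dict()
            with open(filename) as f:
                for line in f:
                    parts = line.split()
                    if len(parts) != 2:
                        continue  # skip blank or malformed lines
                    user, friend = int(parts[0]), int(parts[1])
                    # record that `user` lists `friend` as a friend
                    friendData.setdefault(user, set()).add(friend)
            return friendData

        # Tiny demo file (hypothetical name) matching the handout's format:
        sample = "403 401\n403 402\n401 402\n402 400\n401 399\n402 403\n"
        with open("sampleFriends.txt", "w") as f:
            f.write(sample)

        friendData = formatDataset("sampleFriends.txt")
        ```

        Note that this reads each line exactly once, so it runs in O(n), and all userIds are stored as integers, as required.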


      2. Now, write the function friendsOfUser(userId, friendData) where userId is the integer representing a specific user, and friendData contains the stored output of your previous function. (Do not loop over the original input file for each query!) Your function should return the set of userIds identified by the input user as Facebook friends.

        For example, if friendData is the formatted data from our previous example, friendsOfUser(401, friendData) should return:
        {402, 399, 403}
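        Assuming friendData is a dict mapping each userId to a set of friend userIds (one reasonable choice, not the required one), the query itself can be a single average-case O(1) dictionary lookup; the small friendData literal below is just the handout's short example hand-built for demonstration:

        ```python
        def friendsOfUser(userId, friendData):
            # dict lookup is O(1) on average, so this never scans the dataset
            return friendData.get(userId, set())

        # hand-built friendData matching the short example above
        friendData = {401: {402, 399, 403},
                      403: {401, 402, 399},
                      402: {400, 403, 401},
                      400: {402},
                      399: {401, 403}}
        ```

        Using .get with a default of set() also handles a userId that never appears in the file.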


      3. Finally, write the function mutualFriends(userId1, userId2, friendData) where userId1 and userId2 are integers representing two specific users, and friendData contains the stored output of formatDataset(filename). (Again, do not loop over the original input file for each query!) Your function should return the set of userIds identified by BOTH users as Facebook friends.

        For example, if friendData is the formatted data from our previous example, mutualFriends(401, 403, friendData) should return:
        {402, 399}
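        If friendData maps each userId to a set of friends (again, just one reasonable format), mutual friends fall out of a set intersection, which runs in time proportional to the smaller set rather than the whole dataset. The friendData literal is the handout's short example hand-built for demonstration:

        ```python
        def mutualFriends(userId1, userId2, friendData):
            # intersect the two friend sets; each dict lookup is O(1) on average
            friends1 = friendData.get(userId1, set())
            friends2 = friendData.get(userId2, set())
            return friends1 & friends2

        # hand-built friendData matching the short example above
        friendData = {401: {402, 399, 403},
                      403: {401, 402, 399},
                      402: {400, 403, 401},
                      400: {402},
                      399: {401, 403}}
        ```

        The & operator on sets is equivalent to friends1.intersection(friends2).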


      Here's the catch: friendsOfUser(userId, friendData) and mutualFriends(userId1, userId2, friendData) must run efficiently! The autograder is going to test these functions many times with different userIds. For full credit, friendsOfUser(userId, friendData) and mutualFriends(userId1, userId2, friendData) must run in O(n) time or better (where n is the number of unique userIds in the input file)! Only partial credit is given for a function that runs in O(n^2), and no credit for any function that times out the autograder. This means you can't simply check the entire text file for every combination of users. Remember that formatDataset(filename) does not have to be exceedingly efficient, so think carefully about how you create friendData to make searches most efficient!

      Hint 2: When testing your functions, first make sure everything works on a very small dataset, like the example given above. Then try it on successively larger datasets from the zip file. Does the time increase proportionally to the size of the file / number of unique userIds? Or is it too slow to run on the largest dataset? Use these files to empirically approximate the efficiency of your functions. Do they match your expectation?
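      One way to do this empirical check is to time a fixed number of repeated queries with the time module and compare the elapsed time across dataset sizes. The helper below is a rough sketch (timeManyQueries is a made-up name, and the placeholder friendsOfUser and synthetic friendData stand in for your own):

      ```python
      import time

      def friendsOfUser(userId, friendData):  # placeholder query to time
          return friendData.get(userId, set())

      def timeManyQueries(friendData, queryCount):
          # wall-clock seconds for queryCount repeated queries
          userIds = list(friendData)
          start = time.time()
          for i in range(queryCount):
              friendsOfUser(userIds[i % len(userIds)], friendData)
          return time.time() - start

      # synthetic dataset: 1000 users, each with one friend
      friendData = {u: {u + 1} for u in range(1000)}
      elapsed = timeManyQueries(friendData, 10000)
      ```

      If your queries are O(1), elapsed should stay roughly flat as friendData grows; if it grows in step with the dataset size, your queries are still scanning the data.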

      Note: With the exception of the short example above, the datasets used in this assignment are real! Their source is Snap Datasets: Stanford Large Network Dataset Collection (Jure Leskovec and Andrej Krevl, Jun. 2014).

      Note: Don't use "id" as a variable name anywhere! It's a disallowed token in the linter. Also, in case you weren't reading carefully, make sure all userIds are integers and not strings.

      Last step: In the function facebookDataAnalysis(), answer the short-answer questions by replacing the 42 on each line with the userId that answers the question. Note: If you downloaded the starter file before 12:15pm, please change the name of your function from someMoreDataAnalysis() to facebookDataAnalysis()!