Homework 6: Erdős Numbers



Due Monday, December 2 @ 8:00pm on Gradescope

Notes:

1. Introduction

Paul Erdős was a famous mathematician who published over 1500 research papers in his lifetime, collaborating with over 500 other authors while doing do. As a tribute to him, and as an interesting study of collaborative distance between researchers, the concept of the Erdős number was born.

An author’s Erdős number is a measure of how many “hops” it takes to move from that author to Paul Erdős by following a chain of shared authorship. This is best understood by example:

This means that John Mackey has an Erdős number of at most three. We can also see that Mikhail Lavrov has an Erdős number of at most two, and Alexandr V. Kostochka at most one. If you don’t understand how this number is calculated, go back and look through the example again, paying careful attention to the common authors between some of the papers.

Determining someone’s Erdős number is just a shortest-path search problem in a graph. Imagine a graph where the nodes are authors and the edges are publications in which those two authors collaborated. Given that graph, you could find someone’s Erdős number by finding the length of the shortest path between Paul Erdős and that person.

2. Your Tasks

Download a pre-prepared Eclipse Project hw6.zip and and import it into Eclipse using these instructions. You will make all of your changes inside of this project.

2.1. Writing the Publication class

A publication should store information about a single publication. The relevant information that needs to be stored is the title of the publication and each of its authors.

public Publication(String title, String[] authors)
public String getTitle()
public String[] getAuthors()
Publication [title=Improved upper and lower bounds on a geometric Ramsey problem., authors=[Mikhail Lavrov, Mitchell Lee, John Mackey]]

2.2. Writing the ErdosCalculator class

The ErdosCalculator class is used to load papers from a text file and determine the shortest path between authors. Here some sample code that shows how it can be used to find the Erdős number (and a corresponding path) for both John Mackey and Ryan Riley:

ErdosCalculator e = new ErdosCalculator();
System.out.print("Loading papers and building graph...");
e.loadPapers("dblp-partial.txt");
System.out.println("done.");

System.out.print("Calculating the shortest path to Paul Erdös...");
e.preBuildShortestPath("Paul Erdös");
System.out.println("done.");

String[] people = { "Ryan Riley", "John Mackey" };
for (String person : people) {
    Publication[] pubs = e.getShortestPathFrom(person);
    System.out.println("Shortest Path from " + person);
    for (Publication p : pubs) {
        System.out.println(p);
 }
    System.out.println();
}
public void loadPapers(String filename)

Load the research papers from the specified file. This should load the papers and use them to build a graph.


public void preBuildShortestPath(String author)

Use Dijkstra’s shortest path algorithm to calculate the shortest path between the specified author and every other author in the graph.


public Publication[] getShortestPathFrom(String author)

Return the shortest path (an array of publications) from a specified author to whichever author preBuidlShortestPath was called on previously.


2.3. Other Classes

You will almost certainly need to write other classes to help you complete this project. Here are some recommendations.

Node

The will represent a node in the graph. Each node represents one author. Each author should have only one node in the graph.

Edge

An edge between two nodes. This will represent a paper. Each paper could be represented by multiple edges in the graph.

Graph

Most of the hard work in this project could be contained here. You will likely need methods for the following operations:

2.4. The File Format

A sample file of publications is provided. You should be able to figure out its format by looking at it.

3. Implementing Dijkstra’s Shortest Path Algorithm

In this section, we will provide tips for how to implement Dijkstra’s shortest path algorithm in Java. To follow this, you need to make sure you understand and can work the algorithm on paper. If you can’t do that, then there is no hope of understanding how to implement it.

Want a refresher from the graph algorithms lecture? Check out this video of the Dijkstra’s algorithm example:

The Distance List

The distance list is best implemented as two different maps: 1. A map between the node name (author) and the cost (length of shortest path seen so far). 2. A map between the node name (author) and the previous node on the shortest path seen so far.

The Unvisited List

The unvisited list needs to allow you to quickly determine the entry with the smallest cost. To accomplish this, a priority queue is a good choice. Each entry in the priority queue needs to contain two items: The name of the node, and its cost. Java has a built-in priority queue class, feel free to use it. To store two items in each entry, you should probably just create an object with both of those items and put that object into the queue. Here is some sample code demonstrating this:

public class PQDemo {
    private static class Pair implements Comparable<Pair> {
        String name;
        int value;

        public Pair(String name, int value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public int compareTo(Pair other) {
            return this.value - other.value;
        }
    }
    public static void main(String[] args) {
        PriorityQueue<Pair> pq = new PriorityQueue<Pair>();
        pq.add(new Pair("Bob", 10));
        pq.add(new Pair("Ahmed", 7));
        pq.add(new Pair("Fred", 12));
        pq.add(new Pair("Fatima", 2));

        while (!pq.isEmpty()) {
            Pair p = pq.poll();
            System.out.println(p.name + ": " + p.value);
        }
    }
}

In most descriptions of Dijkstra’s algorithm, the cost of each item on the unvisited list is updated as needed. However, a priority queue does not do well with updates, so we just won’t bother. If an existing entry in the queue needs to be updated with a new, lower cost then just add a new entry with that new, lower cost. Even though this results in multiple entries with the same name and different costs, it won’t matter in the end. (Because we’ll process the lower-cost entries first.)

The Algorithm

Now that you can represent the distance list (DL) and the unvisited list (UL), the shortest past algorithm for a given author (A) can be implemented as follows:

  1. Add an entry to the UL showing A has a cost of 0.
  2. Add an entry to the DL showing that the shortest known path to A is zero.
  3. As long as the UL has entries…
    • Remove the lowest cost item from the UL. Let’s call this Cur.
    • Find the length of the shortest path from A to Cur. (Think: Where would this be stored?) If we don’t know, then just use the largest possible value.
    • For each edge leaving Cur, see if the path going through Cur is shorter than the previously known path. If it is, update the DL and add it to the UL.
  4. At this point the DL should store the length of the shortest path from A to each node. It should also store the previous node in each path.

4. Grading and Submission

Important Note: Do not submit dblp-partial.txt or any other large text files to Gradescope. If you do, the autograder will give you an immediate 0. When exporting your project, make sure to uncheck the box next to dblp-partial.txt or any other large input files that you may have created.

There are multiple parts of the grading of this assignment:

  1. For the first 90 points, your submission will be auto-graded based on your implementation.

  2. For the next 5 points, your submission will be manually graded to check for good implementation methodologies. (Did you use a good approach to solving the problems?)

  3. For the next 5 points, your submission will be manually graded to check for good testcases that you include in the main method. (Do you have 2-3 of your own testcases for each method, and do they all execute automatically?)

  4. Your code will also be checked for style. The parts of style that can be checked automatically (things like spacing, indentation, the use of CamelCase, etc.) are automatically checked by the autograder. Other parts of style, such as choosing good variable names, will be checked manually. Autograded style guide violations can result in, at most, -10 points. Manually checked style guide violations can result in, at most, -5 points.

You will submit your program to Gradescope. Log in to the system and you will see the homework. Once there, you need to submit a zip file containing your code. Lucky for you, however, Eclipse can create this zip file for you. Check out these instructions for exporting. On Gradescope, you’ll submit that exported zip file. On the page that follows your submission, you will see your live score. If you receive a lower score than you expect, you should study the autograder output to see which testcases you failed.

For this assignment you have 5 submissions. You should write your own testcases while you work so that you don’t waste submissions. After you have used your submissions, you may continue to submit, but your submission will be penalized 4 points for every extra submission. (So, your 6th submission receives a -4, your 7th submission a -8, etc.)

5. Notes/Hints

6. Important Notes