Homework 6: Erdős Numbers

1. Introduction
2. Your Tasks
3. Implementing Dijkstra’s Shortest Path Algorithm
4. Grading and Submission
5. Notes/Hints
6. Important Notes

Due Monday, December 2 @ 8:00pm on Gradescope

Notes:

This homework is challenging and requires some thought. You really need to start early.
Pay careful attention to the note under Grading and Submission that mentions not turning in the large input file.

1. Introduction

Paul Erdős was a famous mathematician who published over 1500 research papers in his lifetime, collaborating with over 500 other authors while doing do. As a tribute to him, and as an interesting study of collaborative distance between researchers, the concept of the Erdős number was born.

An author’s Erdős number is a measure of how many “hops” it takes to move from that author to Paul Erdős by following a chain of shared authorship. This is best understood by example:

John Mackey is an author of the paper:
Improved upper and lower bounds on a geometric Ramsey problem by Mikhail Lavrov, Mitchell Lee, and John Mackey.
Mikhail Lavrov is an author of the paper:
Conditions for a Bigraph to be Super-Cyclic by Alexandr V. Kostochka, Mikhail Lavrov, Ruth Luo, and Dara Zirlin.
Alexandr V. Kostochka is an author of the paper:
Intersection Statements for Systems of Sets by Walter A. Deuber, Paul Erdös, David S. Gunderson, Alexandr V. Kostochka, and A. G. Meyer.

This means that John Mackey has an Erdős number of at most three. We can also see that Mikhail Lavrov has an Erdős number of at most two, and Alexandr V. Kostochka at most one. If you don’t understand how this number is calculated, go back and look through the example again, paying careful attention to the common authors between some of the papers.

Determining someone’s Erdős number is just a shortest-path search problem in a graph. Imagine a graph where the nodes are authors and the edges are publications in which those two authors collaborated. Given that graph, you could find someone’s Erdős number by finding the length of the shortest path between Paul Erdős and that person.

2. Your Tasks

Download a pre-prepared Eclipse Project hw6.zip and and import it into Eclipse using these instructions. You will make all of your changes inside of this project.

2.1. Writing the `Publication` class

A publication should store information about a single publication. The relevant information that needs to be stored is the title of the publication and each of its authors.

The constructor of Publication should have the following prototype:

public Publication(String title, String[] authors)

There should be an appropriately named getter for both the title and the authors:

public String getTitle()
public String[] getAuthors()

The toString of Publication should produce output in the following format:

Publication [title=Improved upper and lower bounds on a geometric Ramsey problem., authors=[Mikhail Lavrov, Mitchell Lee, John Mackey]]

2.2. Writing the `ErdosCalculator` class

The ErdosCalculator class is used to load papers from a text file and determine the shortest path between authors. Here some sample code that shows how it can be used to find the Erdős number (and a corresponding path) for both John Mackey and Ryan Riley:

ErdosCalculator e = new ErdosCalculator();
System.out.print("Loading papers and building graph...");
e.loadPapers("dblp-partial.txt");
System.out.println("done.");

System.out.print("Calculating the shortest path to Paul Erdös...");
e.preBuildShortestPath("Paul Erdös");
System.out.println("done.");

String[] people = { "Ryan Riley", "John Mackey" };
for (String person : people) {
    Publication[] pubs = e.getShortestPathFrom(person);
    System.out.println("Shortest Path from " + person);
    for (Publication p : pubs) {
        System.out.println(p);
 }
    System.out.println();
}

`public void loadPapers(String filename)`

Load the research papers from the specified file. This should load the papers and use them to build a graph.

Parameters: filename —

`public void preBuildShortestPath(String author)`

Use Dijkstra’s shortest path algorithm to calculate the shortest path between the specified author and every other author in the graph.

Parameters: author — The “base” author for all future shortest path searches.

`public Publication[] getShortestPathFrom(String author)`

Return the shortest path (an array of publications) from a specified author to whichever author preBuidlShortestPath was called on previously.

Parameters: author — The author to determine the shortest path to.
Returns: An array of all of the publications in the path.

2.3. Other Classes

You will almost certainly need to write other classes to help you complete this project. Here are some recommendations.

`Node`

The will represent a node in the graph. Each node represents one author. Each author should have only one node in the graph.

`Edge`

An edge between two nodes. This will represent a paper. Each paper could be represented by multiple edges in the graph.

`Graph`

Most of the hard work in this project could be contained here. You will likely need methods for the following operations:

Adding a node (author) to the graph and ensuring that no duplicate nodes are added.
Adding an edge (publication) to the graph that goes between two nodes.
Running Dijkstra’s shortest path algorithm to create the visited list which will store the cost (distance) and previous between each author and the target author (usually Erdős).
Determine the actual path from an author to the target author (Erdős) by analyzing the visited list calculated in the previous step.

2.4. The File Format

A sample file of publications is provided. You should be able to figure out its format by looking at it.

3. Implementing Dijkstra’s Shortest Path Algorithm

In this section, we will provide tips for how to implement Dijkstra’s shortest path algorithm in Java. To follow this, you need to make sure you understand and can work the algorithm on paper. If you can’t do that, then there is no hope of understanding how to implement it.

Want a refresher from the graph algorithms lecture? Check out this video of the Dijkstra’s algorithm example:

The Distance List

The distance list is best implemented as two different maps: 1. A map between the node name (author) and the cost (length of shortest path seen so far). 2. A map between the node name (author) and the previous node on the shortest path seen so far.

The Unvisited List

The unvisited list needs to allow you to quickly determine the entry with the smallest cost. To accomplish this, a priority queue is a good choice. Each entry in the priority queue needs to contain two items: The name of the node, and its cost. Java has a built-in priority queue class, feel free to use it. To store two items in each entry, you should probably just create an object with both of those items and put that object into the queue. Here is some sample code demonstrating this:

public class PQDemo {
    private static class Pair implements Comparable<Pair> {
        String name;
        int value;

        public Pair(String name, int value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public int compareTo(Pair other) {
            return this.value - other.value;
        }
    }
    public static void main(String[] args) {
        PriorityQueue<Pair> pq = new PriorityQueue<Pair>();
        pq.add(new Pair("Bob", 10));
        pq.add(new Pair("Ahmed", 7));
        pq.add(new Pair("Fred", 12));
        pq.add(new Pair("Fatima", 2));

        while (!pq.isEmpty()) {
            Pair p = pq.poll();
            System.out.println(p.name + ": " + p.value);
        }
    }
}

In most descriptions of Dijkstra’s algorithm, the cost of each item on the unvisited list is updated as needed. However, a priority queue does not do well with updates, so we just won’t bother. If an existing entry in the queue needs to be updated with a new, lower cost then just add a new entry with that new, lower cost. Even though this results in multiple entries with the same name and different costs, it won’t matter in the end. (Because we’ll process the lower-cost entries first.)

The Algorithm

Now that you can represent the distance list (DL) and the unvisited list (UL), the shortest past algorithm for a given author (A) can be implemented as follows:

Add an entry to the UL showing A has a cost of 0.
Add an entry to the DL showing that the shortest known path to A is zero.
As long as the UL has entries…
- Remove the lowest cost item from the UL. Let’s call this Cur.
- Find the length of the shortest path from A to Cur. (Think: Where would this be stored?) If we don’t know, then just use the largest possible value.
- For each edge leaving Cur, see if the path going through Cur is shorter than the previously known path. If it is, update the DL and add it to the UL.
At this point the DL should store the length of the shortest path from A to each node. It should also store the previous node in each path.

4. Grading and Submission

Important Note: Do not submit dblp-partial.txt or any other large text files to Gradescope. If you do, the autograder will give you an immediate 0. When exporting your project, make sure to uncheck the box next to dblp-partial.txt or any other large input files that you may have created.

There are multiple parts of the grading of this assignment:

For the first 90 points, your submission will be auto-graded based on your implementation.
For the next 5 points, your submission will be manually graded to check for good implementation methodologies. (Did you use a good approach to solving the problems?)
For the next 5 points, your submission will be manually graded to check for good testcases that you include in the main method. (Do you have 2-3 of your own testcases for each method, and do they all execute automatically?)
Your code will also be checked for style. The parts of style that can be checked automatically (things like spacing, indentation, the use of CamelCase, etc.) are automatically checked by the autograder. Other parts of style, such as choosing good variable names, will be checked manually. Autograded style guide violations can result in, at most, -10 points. Manually checked style guide violations can result in, at most, -5 points.

You will submit your program to Gradescope. Log in to the system and you will see the homework. Once there, you need to submit a zip file containing your code. Lucky for you, however, Eclipse can create this zip file for you. Check out these instructions for exporting. On Gradescope, you’ll submit that exported zip file. On the page that follows your submission, you will see your live score. If you receive a lower score than you expect, you should study the autograder output to see which testcases you failed.

For this assignment you have 5 submissions. You should write your own testcases while you work so that you don’t waste submissions. After you have used your submissions, you may continue to submit, but your submission will be penalized 4 points for every extra submission. (So, your 6th submission receives a -4, your 7th submission a -8, etc.)

5. Notes/Hints

For many authors, there are multiple shortest paths. In that situation, you may return any of the shortest paths. (Our autograder is checking that you return any valid shortest path, not that you return the exact one that our implementation chose.)
The provided sample file, dblp-partial.txt is about 20% of all Computer Science and Mathematics publications. Even with that only being a partial set of publications, any inefficiencies in your program may cause significant increases in runtime. Our solution takes about 10 seconds to load that entire file and calculate Dijkstra’s on it.

6. Important Notes

If you have questions, please post to Discord. The course staff, including the instructor, monitor and answer questions there.
When posting to Discord, if you include any code make sure your question is posted privately to the course staff, and not to the entire class.
Do not change the names of any of the provided methods or files. You may, however, add additional methods as needed.
After you submit to Gradescope, make sure you check your score. If you aren’t sure how to do this, then ask.
There is no partial credit on Gradescope testcases. Your Gradescope score is your Gradescope score.
Read the last bullet point again. Seriously, we won’t go back later and increase your Gradescope score for any reason. Even if you worked really hard and it was only a minor error…
You can submit to Gradescope multiple times. So, after you submit you should check your score. If it isn’t a 90 you can keep working until it is.
Don’t forget to make sure your code conforms to the style guide. We’ll be taking a look at that!
The style guide has a bunch of small rules with regard to indentation and spacing that are hard to follow perfectly. Luckily, Eclipse can handle it for you! Go to Source -> Format in Eclipse and watch all of your spacing and indentation problems get fixed automatically. (You should do this before every submission.)