The purpose of this assignment is to give you experience with the abstract data type known as a graph. The graph is the most frequently used data structure in Artificial Intelligence. As discussed in class, graphs are collections of nodes in which various pairs of nodes are connected by line segments. The nodes are usually called vertices and the line segments are called edges. The vertices and edges in a graph can hold various pieces of information. For example, the vertices could be cities (from assignment #3), and the edges could represent the distances between those cities.
In this assignment, you will be writing the graph data structure for a web-crawling program. The vertices will be the URL addresses of various web sites. The edges will represent the links between sites and information about the context of the link will be stored in each edge. For example, if there is a link on SiteA's page to SiteB's page, then there will be a directed edge in the graph going from the SiteA vertex to the SiteB vertex. Stored along with this edge, will be the html text surrounding the link.
Your first task in this assignment is to complete the code skeleton that I have given you for the Graph class. The representation of a single vertex (class Vertex), and the representation of a list of edges (class Edge_List) has been provided. Missing from this code skeleton are the methods for inserting a new Vertex into the graph (insertVertex), and inserting a directed edge into the graph (insertEdgebyNumber and insertEdgebyName). See the Assign4.java template for a more detailed explanation of each of these methods.
Once your Graph class is complete, you are ready to fill the graph with information. Your second task in the assignment is to write a method (iterative or recursive) that does a breadth first expansion of the web given a starting point. The method will be sent a String pageAddress by Assign4::main that will serve as the starting point for your web-crawler. Given this starting point, your method must enter the pageAddress into the graph as a new Vertex. Then, your function must call LinkHandler::getListOfLinks(String pageAddress). The LinkHandler class (provided) will handle the downloading of web pages and the extraction of links for you. LinkHandler::getListOfLinks(String pageAddress) will return a vector of links that has been extracted from the website with the pageAddress URL.
For each unique Link object in the vector, you must add a vertex in the graph. In addition, you must create a directed edge in the graph that represents the link between this site and the original pageAddress site. Within this directed edge representation you will also store the context of the link (the Link objects supplied by getListOfLinks contain this information). After adding all of the vertices and edges from the vector of links from the original page, you will (recursively or iteratively) repeat the graph expanding process by calling LinkHandler::getListOfLinks() on each distinct site in the vector (thus creating new vectors of links which you can add to the graph and expand, and so on). You will stop your breadth first expansion when your graph contains a maximum number of unique vertices (MAX_VERTICES). See the Assign4.java template for more details.
The final task in the assignment will be to sort the words in the context string before they are stored in an edge of the graph. Your sorting algorithm must break up the input string into distinct words (you can use a tokenizer) and place the words into Vector bins. Each bin will correspond to a starting letter (a, b, c, d, e, f ...). Thus, you will have 26 bins (you may discard any string of characters that does not begin with a letter, and you may convert the words to lowercase). When your sorting algorithm is complete, you must use it to convert the context string into this sorted Vector structure which should be stored in each edge along with the original String context.
The main function is Assign4.java will serve as the driver for the program. The syntax for running the program is:
java Assign4 [starting url] [max_vertices]
For example:
java Assign4 http://www.cmu.edu 100
Good starting sites for this web-crawler are the following:
www.cmu.edu
www.cs.cmu.edu
www.mellon.com
www.javasoft.com
www.javasoft.com/100percent/latestlist.html
Good luck!!
Revised on Thursday, November 13, 1997