In this assignment, you will build a primitive intelligent agent that models (in a vague sense) an infant's language acquisition. In contrast to previous assignments, you will not be supplied with the basic data structures or algorithms to use.
The intelligent agent must be able to train on given examples of the English language, and it must be able to use this training to generate English sentences. In the training phase, your intelligent agent will be able to examine, one at a time, 1000 grammatically correct sentences (48 distinct sentences, repeated). Each is a sentence that an American infant is likely to hear. Moreover, each sentence is specifically targeted at one category of an infant's life (EATING, SLEEPING, PLAYING, or MAINTENANCE).
By examining the training sentences, you will slowly build a graph which contains the language knowledge you are able to extract with the following restriction:
After examining 100, 500, and all 1000 sentences, you will be asked to generate sentences based on the knowledge you have gained from the examples. Your graph representation of knowledge should be able to generalize some of the properties of the presented language (in fact, it will over-generalize in many cases).
For example, given the following simple sentences (not all unique):
DO YOU WANT TO EAT
ARE YOU HUNGRY
DO YOU WANT TO PLAY
ARE YOU THIRSTY
DO YOU WANT TO EAT
DO YOU WANT TO PLAY
Your graph may look like the following:
            THIRSTY       EAT
           /             /
          /             /
DO --- YOU -- WANT -- TO
      /   \             \
     /     \             \
  ARE       HUNGRY        PLAY
Given this graph, it is possible to generate the following sentences (if they are generated strictly from left to right -- which they don't have to be):
DO YOU THIRSTY
DO YOU WANT TO EAT
DO YOU WANT TO PLAY
DO YOU HUNGRY
ARE YOU THIRSTY
ARE YOU WANT TO EAT
ARE YOU WANT TO PLAY
ARE YOU HUNGRY
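The example above can be reproduced with a very small adjacency structure. The following is only a sketch under stated assumptions, not the expected solution: the class name `WordGraph`, its methods, and the `maxWords` limit are hypothetical and do not come from the assignment template, and sentences are assumed to be whitespace-separated tokens. Each word stores the set of words seen directly after it, and generation walks strictly left to right from every sentence-initial word.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class; nothing here is taken from the assignment template.
class WordGraph {
    // Each word maps to the set of words seen immediately after it.
    private final Map<String, Set<String>> next = new LinkedHashMap<>();
    // Words that have started at least one training sentence.
    private final Set<String> starts = new LinkedHashSet<>();

    // Record one training sentence (assumed to be whitespace-separated words).
    void train(String sentence) {
        if (sentence.trim().isEmpty()) {
            return;
        }
        String[] words = sentence.trim().split("\\s+");
        starts.add(words[0]);
        for (int i = 0; i < words.length; i++) {
            Set<String> followers =
                next.computeIfAbsent(words[i], k -> new LinkedHashSet<>());
            if (i + 1 < words.length) {
                followers.add(words[i + 1]);
            }
        }
    }

    // Enumerate every left-to-right path from a start word to a word that
    // has no followers. Paths longer than maxWords are abandoned, which
    // keeps the walk finite if the graph contains a cycle.
    List<String> generateAll(int maxWords) {
        List<String> out = new ArrayList<>();
        for (String s : starts) {
            walk(s, s, 1, maxWords, out);
        }
        return out;
    }

    private void walk(String word, String soFar, int length, int maxWords,
                      List<String> out) {
        Set<String> followers = next.get(word);
        if (followers.isEmpty()) {
            out.add(soFar);
            return;
        }
        if (length >= maxWords) {
            return;
        }
        for (String f : followers) {
            walk(f, soFar + " " + f, length + 1, maxWords, out);
        }
    }
}
```

Trained on the six example sentences, `generateAll(10)` yields the same eight sentences listed above, including the ungrammatical ones. One simplification to be aware of: a word that ends one sentence but has followers in another is never treated as a stopping point here.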
Notice that the generalized nature of the graph caused us to generate sentences that are not grammatically correct. This happens in real infants' language acquisition as well. By keeping more information in the graph (for example, the relationship with the related context word, or frequency data), the grammaticality of generated sentences can improve. You are allowed to be creative with your graph! Because we want to generalize the language information (not just regurgitate sentences), we expect you to generate grammatically flawed sentences.
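One possible way to keep both kinds of extra information is to key each edge on the previous word as well as the current one, and to count how often each follower was seen. This is a sketch of one option among many, not a required design: the class `ContextGraph`, its method names, and the `<s>` start-of-sentence marker are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class illustrating context words plus frequency data.
class ContextGraph {
    // Assumed marker standing in for "no previous word".
    static final String START = "<s>";

    // Key: "previous current" word pair; value: follower -> times seen.
    private final Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();

    // Record one training sentence (assumed to be whitespace-separated words).
    void train(String sentence) {
        if (sentence.trim().isEmpty()) {
            return;
        }
        String[] w = sentence.trim().split("\\s+");
        for (int i = 0; i + 1 < w.length; i++) {
            String prev = (i == 0) ? START : w[i - 1];
            counts.computeIfAbsent(prev + " " + w[i], k -> new LinkedHashMap<>())
                  .merge(w[i + 1], 1, Integer::sum);
        }
    }

    // How often follower was seen after the pair (prev, current).
    int count(String prev, String current, String follower) {
        Map<String, Integer> followers = counts.get(prev + " " + current);
        if (followers == null) {
            return 0;
        }
        return followers.getOrDefault(follower, 0);
    }

    // Most frequent follower of (prev, current); ties go to the follower
    // seen first in training. Returns null if the pair was never seen.
    String mostLikelyNext(String prev, String current) {
        Map<String, Integer> followers = counts.get(prev + " " + current);
        if (followers == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : followers.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

With the six example sentences, the pair (DO, YOU) is only ever followed by WANT and the pair (ARE, YOU) only by HUNGRY or THIRSTY, so DO YOU HUNGRY can no longer be produced. The trade-off is that more context moves the agent closer to regurgitating its training data, so how much context to keep is a design decision.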
The majority (60) of the points for this assignment will be earned just by being able to generate sentences (after 100, 500, and 1000 training examples). In the second phase of the assignment, testing, you will be given 4 category words from the BabyLife tree (e.g., CARROTS, CRIB). For each category word, you must generate an appropriate sentence (a sentence that is targeted at the word's category). In addition, you will be given 4 sentences and asked to generate the name of the category they are targeted at (for example, DO YOU WANT TO EAT falls in the EATING category). These sentences may not have been included in the training set, but should be identifiable by your intelligent agent. See the grading page for more information.
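For the sentence-to-category half of the test, one simple idea is to let each known word vote for the categories it was seen with during training. The sketch below assumes that the category of each training sentence is available to the agent while training, which this page does not state explicitly; the class `CategoryGuesser` and its methods are hypothetical names, and the BabyLife tree is not modeled at all.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; assumes each training sentence arrives with its category.
class CategoryGuesser {
    // word -> (category -> number of training sentences pairing them)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();

    // Record every word of the sentence as evidence for the given category.
    void train(String sentence, String category) {
        for (String w : sentence.trim().split("\\s+")) {
            if (w.isEmpty()) {
                continue;
            }
            wordCounts.computeIfAbsent(w, k -> new HashMap<>())
                      .merge(category, 1, Integer::sum);
        }
    }

    // Sum the votes of all known words and return the category with the
    // highest total, or null if no word in the sentence was seen in training.
    String classify(String sentence) {
        Map<String, Integer> votes = new HashMap<>();
        for (String w : sentence.trim().split("\\s+")) {
            Map<String, Integer> perCategory = wordCounts.get(w);
            if (perCategory == null) {
                continue; // unknown word: contributes nothing
            }
            for (Map.Entry<String, Integer> e : perCategory.entrySet()) {
                votes.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                best = e.getKey();
                bestVotes = e.getValue();
            }
        }
        return best;
    }
}
```

Because unseen words are skipped, a sentence that was not in the training set can still be classified as long as some of its words were. Raw counts do let frequent words such as YOU and TO dilute the vote, so weighting words by how strongly they favor one category may work better in practice.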
You have been given a main method which will serve as the driver for the program. In addition, you have been given the file manipulation and tokenizing methods to read the data files. Your task is to create any classes and methods that you need to meet the objective and to fill in the train and test function skeletons in the Assign6 class. See the assignment template for more information.
The syntax for running the program is:
java Assign6 [datafile] [testfile]
For example:
java Assign6 language.dat test.dat
Good luck on the assignment!
Revised on Thursday, December 4, 1997