| Classification | How many made? (Rarity) | How old is the coin? (Age) | How much wear is on the coin? (Wear) |
|---|---|---|---|
| Positive | Rare | New | Low |
| Positive | Rare | Old | Low |
| Positive | Common | Old | Low |
| Negative | Rare | Old | High |
| Negative | Common | New | Low |
| Negative | Common | New | High |
From the above table, it should be apparent that the wear on the coin carries the most information for this classification, as the value for wear is "Low" in all of the positive cases and "High" appears only in negative ones. The actual information calculation looks like this:
First, I(p/(p+n), n/(p+n)) can be calculated as I(3/6, 3/6) = -0.5*lg(0.5) - 0.5*lg(0.5) = 1. For the case of I(1, 0), the 0*lg(0) term would hit the logarithmic singularity at lg(0); by convention the term is taken as its limit, zero, so I(1, 0) = 0.
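That zero-by-convention rule is easy to make concrete. Below is a minimal Python sketch of the I() measure; the function name and its two-fraction signature simply mirror the notation above and are not from any particular library:

```python
from math import log2

def I(x, y):
    """I(x, y) = -x*lg(x) - y*lg(y): the information, in bits, of a
    Boolean split with positive fraction x and negative fraction y."""
    def term(t):
        # t*lg(t) -> 0 as t -> 0, so the lg(0) singularity is
        # defined away by taking the term to be zero.
        return 0.0 if t == 0 else -t * log2(t)
    return term(x) + term(y)

print(I(3/6, 3/6))  # 1.0
print(I(1, 0))      # 0.0
```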
Next, the gain from each attribute can be computed as:
Gain(Rarity) = 1 - [(3/6)*I(2/3, 1/3) + (3/6)*I(1/3, 2/3)] = 0.0817
Gain(Age) = 1 - [(3/6)*I(2/3, 1/3) + (3/6)*I(1/3, 2/3)] = 0.0817
Gain(Wear) = 1 - [(4/6)*I(3/4, 1/4) + (2/6)*I(0, 1)] = 0.4591
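These gains can be reproduced mechanically from the (positive, negative) counts for each attribute value, read straight off the table. A short self-contained Python sketch, with hypothetical helper names info and gain_from_counts:

```python
from math import log2

def info(p, n):
    """I(p/(p+n), n/(p+n)) in bits, with the 0*lg(0) term taken as 0."""
    total = p + n
    return sum(-x / total * log2(x / total) for x in (p, n) if x)

def gain_from_counts(total_info, splits, total):
    """Information gain: the starting information minus the weighted
    information remaining after the split. `splits` holds one
    (positive, negative) count pair per attribute value."""
    remainder = sum((p + n) / total * info(p, n) for p, n in splits)
    return total_info - remainder

start = info(3, 3)  # 1.0 for the coin table
print(gain_from_counts(start, [(2, 1), (1, 2)], 6))  # Rarity -> 0.0817
print(gain_from_counts(start, [(2, 1), (1, 2)], 6))  # Age    -> 0.0817
print(gain_from_counts(start, [(3, 1), (0, 2)], 6))  # Wear   -> 0.4591
```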
As expected, we choose Wear to be the first branch of our decision tree. Since both test coins have high wear, they will be classified as Negative, meaning that they do not have much collector's value. The tree would branch again (using one of the remaining attributes) to handle any test cases with a low wear attribute; however, since there are no possible cases beyond those already in the training and test sets, that step is not shown here. The next example demonstrates it.
| Classification | Bind type (Bind) | Style of book (Style) | Color pictures? (Color) | Is the book well known? (Popularity) | Length of book (Length) |
|---|---|---|---|---|---|
| Positive | Hardcover | Novel | Nocolor | Popular | Long |
| Positive | Softcover | Textbook | Nocolor | Popular | Long |
| Negative | Softcover | Novel | Nocolor | Popular | Short |
| Positive | Hardcover | Textbook | Color | Popular | Short |
| Positive | Hardcover | Photojournal | Color | Unknown | Short |
| Negative | Softcover | Textbook | Nocolor | Unknown | Short |
| Positive | Hardcover | Photojournal | Color | Popular | Long |
| Negative | Softcover | Novel | Color | Unknown | Short |
The calculation of the gain from each attribute is just as it was above. First, I(p/(p+n), n/(p+n)) can be computed as I(5/8, 3/8) = -(5/8)*lg(5/8) - (3/8)*lg(3/8) = 0.954434.
Next, the gain from each attribute can be computed:
Gain(Bind) = 0.954434 - [(4/8)*I(1, 0) + (4/8)*I(1/4, 3/4)] = 0.548795
Gain(Style) = 0.954434 - [(3/8)*I(1/3, 2/3) + (3/8)*I(2/3, 1/3) + (2/8)*I(1, 0)] = 0.265712
Gain(Color) = 0.954434 - [(4/8)*I(2/4, 2/4) + (4/8)*I(3/4, 1/4)] = 0.048795
Gain(Popularity) = 0.954434 - [(5/8)*I(4/5, 1/5) + (3/8)*I(1/3, 2/3)] = 0.158868
Gain(Length) = 0.954434 - [(3/8)*I(1, 0) + (5/8)*I(2/5, 3/5)] = 0.347590
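With eight records and five attributes, it is less error-prone to compute the gains directly from the table rows rather than from hand-tallied counts. A self-contained Python sketch; the record layout and the names BOOKS, ATTRS, info_of, and gain are our own choices for illustration:

```python
from math import log2

# Training records from the book table above.
BOOKS = [
    # (classification, bind, style, color, popularity, length)
    ("Positive", "Hardcover", "Novel",        "Nocolor", "Popular", "Long"),
    ("Positive", "Softcover", "Textbook",     "Nocolor", "Popular", "Long"),
    ("Negative", "Softcover", "Novel",        "Nocolor", "Popular", "Short"),
    ("Positive", "Hardcover", "Textbook",     "Color",   "Popular", "Short"),
    ("Positive", "Hardcover", "Photojournal", "Color",   "Unknown", "Short"),
    ("Negative", "Softcover", "Textbook",     "Nocolor", "Unknown", "Short"),
    ("Positive", "Hardcover", "Photojournal", "Color",   "Popular", "Long"),
    ("Negative", "Softcover", "Novel",        "Color",   "Unknown", "Short"),
]
# Column index of each attribute within a record.
ATTRS = {"Bind": 1, "Style": 2, "Color": 3, "Popularity": 4, "Length": 5}

def info_of(records):
    """I(p/(p+n), n/(p+n)) for a set of records, in bits."""
    total = len(records)
    counts = [sum(1 for r in records if r[0] == c)
              for c in ("Positive", "Negative")]
    return sum(-k / total * log2(k / total) for k in counts if k)

def gain(records, attr):
    """Information gain from splitting `records` on `attr`."""
    idx, total = ATTRS[attr], len(records)
    remainder = 0.0
    for value in {r[idx] for r in records}:
        subset = [r for r in records if r[idx] == value]
        remainder += len(subset) / total * info_of(subset)
    return info_of(records) - remainder

for attr in ATTRS:
    print(f"Gain({attr}) = {gain(BOOKS, attr):.6f}")
# Gain(Bind) = 0.548795, Gain(Style) = 0.265712, Gain(Color) = 0.048795,
# Gain(Popularity) = 0.158868, Gain(Length) = 0.347590
```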
Based on these computed gains, Bind is chosen as the first branch in the decision tree. Looking at the training data, a hardcover binding always leads to a Positive classification. To deal with the remaining softcover cases, we eliminate the hardcover rows and start over:
| Classification | Bind type (Bind) | Style of book (Style) | Color pictures? (Color) | Is the book well known? (Popularity) | Length of book (Length) |
|---|---|---|---|---|---|
| Positive | Softcover | Textbook | Nocolor | Popular | Long |
| Negative | Softcover | Novel | Nocolor | Popular | Short |
| Negative | Softcover | Textbook | Nocolor | Unknown | Short |
| Negative | Softcover | Novel | Color | Unknown | Short |
Here, I(p/(p+n), n/(p+n)) = I(1/4, 3/4) = -(1/4)*lg(1/4) - (3/4)*lg(3/4) = 0.8112781.
As above, the gains for the remaining attributes are computed:
Gain(Style) = 0.8112781 - [(2/4)*I(1/2, 1/2) + (2/4)*I(0, 1)] = 0.3112781
Gain(Color) = 0.8112781 - [(3/4)*I(1/3, 2/3) + (1/4)*I(0, 1)] = 0.1225562
Gain(Popularity) = 0.8112781 - [(2/4)*I(1/2, 1/2) + (2/4)*I(0, 1)] = 0.3112781
Gain(Length) = 0.8112781 - [(1/4)*I(1, 0) + (3/4)*I(0, 1)] = 0.8112781
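Reusing the gain helper from the sketch above on just the four softcover rows reproduces these numbers:

```python
# Restrict to the softcover rows and recompute the remaining gains.
softcover = [r for r in BOOKS if r[1] == "Softcover"]
for attr in ("Style", "Color", "Popularity", "Length"):
    print(f"Gain({attr}) = {gain(softcover, attr):.6f}")
# Gain(Style) = 0.311278, Gain(Color) = 0.122556,
# Gain(Popularity) = 0.311278, Gain(Length) = 0.811278
```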
From these new gains, the length of the book is clearly the next branch in our tree. As it turns out, within this softcover subset, short books get a Negative classification and long books get a Positive classification. Since all data are classified at this level, no further decision branches are needed. If any cases remained uncategorized by the tree, this process would simply repeat on each remaining subset, as the sketch below shows.
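That repetition is ordinary recursion on each impure subset. Continuing from the book-data sketch above, a simplified ID3-style builder (it assumes, as holds in this example, that no subset comes up empty and no branch exhausts its attributes) reconstructs exactly the tree described:

```python
def build_tree(records, attrs):
    """Recursively split on the highest-gain attribute until every
    subset is uniformly classified (a simplified ID3 sketch)."""
    classes = {r[0] for r in records}
    if len(classes) == 1:                      # pure subset -> leaf
        return classes.pop()
    best = max(attrs, key=lambda a: gain(records, a))
    idx = ATTRS[best]
    rest = [a for a in attrs if a != best]
    return (best, {v: build_tree([r for r in records if r[idx] == v], rest)
                   for v in {r[idx] for r in records}})

def classify(tree, record):
    """Walk the tree for an unlabeled record laid out like a BOOKS row
    without its leading classification (so each index shifts by one)."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[record[ATTRS[attr] - 1]]
    return tree

tree = build_tree(BOOKS, list(ATTRS))
print(tree)
# Structure (branch order may vary):
# ('Bind', {'Hardcover': 'Positive',
#           'Softcover': ('Length', {'Long': 'Positive',
#                                    'Short': 'Negative'})})
```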
Now, consider the test data:

| Bind | Style | Color | Popularity | Length |
|---|---|---|---|---|
| Softcover | Photojournal | Color | Popular | Short |
| Hardcover | Novel | Nocolor | Unknown | Long |
| Hardcover | Textbook | Nocolor | Unknown | Long |
The first book, being a softcover, is categorized by its length; since it is short, it is classified as Negative. The second and third books, being hardcovers, are classified as Positive.
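Continuing once more from the sketch above, walking the constructed tree reproduces these classifications:

```python
# The three unlabeled test records, laid out
# (bind, style, color, popularity, length).
tests = [
    ("Softcover", "Photojournal", "Color",   "Popular", "Short"),
    ("Hardcover", "Novel",        "Nocolor", "Unknown", "Long"),
    ("Hardcover", "Textbook",     "Nocolor", "Unknown", "Long"),
]
for record in tests:
    print(record[0], record[1], "->", classify(tree, record))
# Softcover Photojournal -> Negative
# Hardcover Novel -> Positive
# Hardcover Textbook -> Positive
```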