CS 4804 Homework #5

Date Assigned: March 19, 2003
Date Due: March 28, 2003, in class, before class starts
  1. (10 points) Problem 18.4 of your textbook.

  2. (10 points) Consider a boolean function written in disjunctive normal form (DNF, not CNF). What is the relationship between the number of clauses and the size of the decision tree needed to represent this function?

  3. (80 points) Take a look at the long list of machine learning datasets available at the UCI Machine Learning Repository (isn't it amazing?) We will be experimenting with the Mushrooms dataset, where the goal is to predict if a mushroom is poisonous or edible. Scroll down the webpage to access this dataset; access the ".names" file (which gives background and format details), and the ".data" file (which gives the actual data).

    For this question, you will implement a decision tree algorithm and unleash it on the mushroom dataset. First divide up the dataset into two parts: a training set (containing 2/3 of the edible mushrooms and 2/3 of the poisonous mushrooms), and a test set (containing the remaining 1/3 of each). Use only the training set to construct the decision tree. Next, notice that some attribute values are missing. This is a characteristic of real-life datasets. To overcome this problem, use the strategy described in Problem 18.12 of your textbook.

    As the tree is being constructed, at every stage measure the tree's performance on both the training set and the test set. i.e., after the root node is decided, assume that this is the final tree and evaluate how many mushrooms it correctly classifies (on both datasets). For every mushroom misclassified, count it as one mistake, and total up the mistakes on both datasets. Express the error as a percentage (of the size of the dataset) and plot it as the tree grows. So you should have two curves (y axis) plotted against the number of nodes in the decision tree (x axis).

    What do you observe? Make interesting observations from your study. Compare the decision tree with the rules mentioned in the ".names" file. How does your decision tree algorithm fare?


Return Home