CS 4804 Homework #5
Date Assigned: March 19, 2003
Date Due: March 28, 2003, in class, before class starts
- (10 points) Problem 18.4 of your textbook.
- (10 points) Consider a boolean function written in disjunctive
normal form (DNF, not CNF). What is the relationship between the number of
clauses
and the size of the decision tree needed to represent this function?
- (80 points) Take a look at the long list of machine learning datasets
available at the UCI Machine Learning Repository (isn't it amazing?) We will be experimenting with
the Mushrooms dataset, where the goal is to predict if a mushroom is
poisonous or edible. Scroll down the webpage to access this dataset;
access the ".names" file
(which gives background and format details), and the ".data" file (which
gives the actual data).
For this question, you will implement a decision tree algorithm and unleash
it on the mushroom dataset. First divide up the dataset into two parts:
a training set (containing 2/3 of the edible mushrooms and 2/3 of the
poisonous mushrooms), and a test set (containing the remaining 1/3 of each).
Use only the training set to construct the decision tree.
Next, notice that some attribute values are missing. This is a characteristic
of real-life datasets. To overcome this problem, use the strategy
described in Problem 18.12 of your textbook.
As the tree is being constructed, at every stage measure the tree's performance
on both the training set and the test set. i.e., after the root node is
decided, assume that this is the final tree and evaluate how many mushrooms
it correctly classifies (on both datasets). For every mushroom misclassified,
count it as one mistake, and total up the mistakes on both datasets. Express
the error as a percentage (of the size of the dataset)
and plot it as the tree grows. So you should
have two curves (y axis) plotted against the number of nodes in
the decision tree (x axis).
What do you observe? Make interesting observations from your study. Compare
the decision tree with the rules mentioned in the ".names" file. How
does your decision tree algorithm fare?