CS 4804 Final Project
Date Assigned: November 21, 2003
Date Due: December 17, 2003, 10am, in Torg 2160L
- (60 points) For your project, you will implement Monte Carlo and temporal difference
algorithms to play the Donut game, also called "Voracious Steve". You
might have encountered it if you are on the ACM programming team; however,
familiarity with this problem will not really help you, since we are
approaching it from the viewpoint of reinforcement learning.
There are two players who are given a box containing a certain number of
donuts (call it "x"). The players can decide who goes first. They
alternately take some positive number of donuts from the box, but no more
than some fixed integer (call it "y"). Each player's donuts are gathered on
the player's side. The player who empties the box (i.e., takes the last
donut(s)) eats his donuts, while the other player puts his donuts back into
the box, and the game continues with the "loser" starting. The game goes on
until all the donuts are eaten. The goal of the game is to eat the most
donuts; the person who eats the most (and gets a stomach upset) is declared
the winner. So, depending on the values of x and y, the game is different.
Design an epsilon-soft on-policy Monte Carlo control algorithm for playing
this game, i.e., the goal is to find a policy "pi" for winning the game.
Start with small values of x and y to get the hang of the game. Assume that
the reinforcement learning agent is one of the players (the one learning
the Q values for the various state-action combinations). The other player
will be simulated by your program; this can be just a black box that makes
random moves or some more sophisticated program (it could even be the agent
again!). Use the template given in class as a guide. Define the states,
actions, rewards, and values yourself; a rough sketch of one possible
formulation is given at the end of this problem.
Each time the policy is improved, evaluate its performance by pitting it
against another player for, say, 100 new games (this is just to evaluate
performance and is not used by the learning algorithm). Plot a graph of the
number of games won versus each iteration of policy improvement; an
evaluation sketch along these lines also appears at the end of this problem.
Run your program on different choices of x and y.
When the dust settles, inspect the final Q values and see if you can
state, in English, the policy learnt by your algorithm (i.e., what is the
rule for winning at this game?). For full credit, give a careful definition of
how you formulated the RL problem (what are the states, actions, rewards, and
how is the value function defined), a printout/URL for your code,
graph(s), and a set of observations.
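
For concreteness, here is a minimal Python sketch of one possible
epsilon-soft on-policy Monte Carlo control setup. It deliberately
simplifies the formulation: the state is just the number of donuts left in
the box, the agent always moves first, the opponent is a black box making
uniformly random legal moves, an episode is a single round, and the
terminal reward is +1 if the agent empties the box and -1 otherwise. The
names X, Y, EPSILON, and the helper functions are illustrative only, not
prescribed; your own formulation may differ.

import random
from collections import defaultdict

X = 10          # donuts initially in the box (the "x" of the problem)
Y = 3           # maximum donuts a player may take per turn (the "y")
EPSILON = 0.1   # exploration rate for the epsilon-soft policy

Q = defaultdict(float)          # Q[(state, action)] -> estimated return
returns_count = defaultdict(int)

def legal_actions(box):
    return list(range(1, min(Y, box) + 1))

def epsilon_soft_action(box):
    """Greedy action w.p. 1-epsilon, otherwise a uniformly random legal one."""
    actions = legal_actions(box)
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(box, a)])

def generate_episode():
    """Play one round with the agent moving first against a random opponent.
    Returns the (state, action) pairs visited by the agent and the terminal
    reward: +1 if the agent emptied the box, -1 if the opponent did."""
    box = X
    trajectory = []
    while True:
        a = epsilon_soft_action(box)        # agent's move
        trajectory.append((box, a))
        box -= a
        if box == 0:
            return trajectory, +1
        box -= random.choice(legal_actions(box))   # opponent's random move
        if box == 0:
            return trajectory, -1

def mc_control(num_episodes=20000):
    """Every-visit epsilon-soft on-policy Monte Carlo control."""
    for _ in range(num_episodes):
        trajectory, reward = generate_episode()
        # undiscounted return: every visited (s, a) receives the terminal reward
        for (s, a) in trajectory:
            returns_count[(s, a)] += 1
            Q[(s, a)] += (reward - Q[(s, a)]) / returns_count[(s, a)]

if __name__ == "__main__":
    mc_control()
    for box in range(1, X + 1):
        best = max(legal_actions(box), key=lambda a: Q[(box, a)])
        print(f"box={box:2d}: take {best} donut(s)")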
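
A companion sketch of the evaluation step, again only an illustration:
after each batch of policy improvement the greedy policy is frozen and
played against a random opponent for 100 fresh games, and the win counts
are collected so they can be plotted against the iteration number. It is
meant to be appended to the same file and reuses Q, X, legal_actions, and
mc_control from the sketch above.

def greedy_action(box):
    return max(legal_actions(box), key=lambda a: Q[(box, a)])

def play_one_game():
    """One round, agent first, greedy agent vs. random opponent.
    Returns True if the agent empties the box (wins the round)."""
    box = X
    while True:
        box -= greedy_action(box)
        if box == 0:
            return True
        box -= random.choice(legal_actions(box))
        if box == 0:
            return False

def evaluate(num_games=100):
    return sum(play_one_game() for _ in range(num_games))

wins_per_iteration = []
for _ in range(50):
    mc_control(num_episodes=500)      # one "iteration" of policy improvement
    wins_per_iteration.append(evaluate())
# wins_per_iteration can now be plotted against the iteration index,
# e.g. with matplotlib.pyplot.plot(wins_per_iteration).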
- (40 points) This question builds on the previous one and addresses two
further aspects:
- (20 points)
Inspect your Q-values and determine whether it is beneficial to go first in
this game (see the short sketch after this question).
- (20 points) Implement the TD (temporal differences) algorithm and see
if it improves convergence. You have to be careful here because this is a
two-player game and only certain temporal differences should be backed up
when updating the Q function; a sketch of one way to arrange these backups
appears below. Pick some x and some y, and plot the learning curves for
both Monte Carlo and TD. Which is better? For full credit, give a printout
of your code, graph(s), a description of how you set up the algorithm, and
a list of observations.
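
One small, illustrative way to read the "should I go first?" question off
the learned Q table, assuming the Q, X, and legal_actions names from the
Monte Carlo sketch in the previous problem; treat it only as a starting
point for your analysis.

# With terminal rewards of +1/-1, the greedy value of the full-box state
# approximates (win probability - loss probability) against the simulated
# opponent when the agent moves first.  Re-running the training with the
# opponent moving first gives the corresponding figure for moving second,
# and the two numbers can then be compared.
value_of_moving_first = max(Q[(X, a)] for a in legal_actions(X))
print("greedy value of the start state when moving first:", value_of_moving_first)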
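
Finally, a minimal Python sketch of one way to arrange the TD backups in
this two-player setting. The point it illustrates is that a temporal
difference is backed up only between the agent's own successive decision
points (the opponent's intervening move is folded into the transition),
with the opponent's game-ending move producing a terminal loss backup. The
state representation, the random opponent, and the parameter names are the
same simplifying assumptions as in the Monte Carlo sketch, and this
particular update is a Q-learning-style rule; an on-policy Sarsa-style
variant works similarly.

import random
from collections import defaultdict

X, Y = 10, 3
ALPHA, EPSILON = 0.1, 0.1

Q = defaultdict(float)

def legal_actions(box):
    return list(range(1, min(Y, box) + 1))

def epsilon_greedy(box):
    acts = legal_actions(box)
    if random.random() < EPSILON:
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(box, a)])

def td_episode():
    """One round of the game with TD updates at the agent's decision points."""
    box = X
    while True:
        s, a = box, epsilon_greedy(box)
        box -= a
        if box == 0:                      # agent emptied the box: terminal win
            Q[(s, a)] += ALPHA * (+1 - Q[(s, a)])
            return
        box -= random.choice(legal_actions(box))   # opponent's random move
        if box == 0:                      # opponent emptied the box: terminal loss
            Q[(s, a)] += ALPHA * (-1 - Q[(s, a)])
            return
        # non-terminal: back up from the agent's next decision state
        target = max(Q[(box, b)] for b in legal_actions(box))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

if __name__ == "__main__":
    for _ in range(20000):
        td_episode()
    for box in range(1, X + 1):
        best = max(legal_actions(box), key=lambda a: Q[(box, a)])
        print(f"box={box:2d}: take {best} donut(s)")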