CS 4804 Final Project
Date Assigned: November 21, 2003
Date Due: December 17, 2003, 10am, in Torg 2160L
- (60 points) For your project, you will implement Monte Carlo and temporal difference
algorithms to play the Donut game, also called "Voracious Steve". You
might have encountered it if you are on the ACM programming team; however,
familiarity with this problem will not really help you, since we are
approaching it from the viewpoint of reinforcement learning.
There are two players who are given a box containing a certain number of
donuts (call it "x"). The players can decide who goes first. They
alternately take some positive number of donuts from the box, but no more
than some fixed integer (call it "y"). Each player's donuts are gathered on
the player's side. The player who empties the box (i.e., takes the last
donut(s)) eats his donuts, while the other player puts his donuts back into
the box, and the game continues with the "loser" starting. The game goes on
until all the donuts are eaten. The goal of the game is to eat the most
donuts; the person who eats the most (and gets a stomach upset) is declared
the winner. So, depending on the values of x and y, the game is different.
Design an epsilon-soft on-policy Monte Carlo control algorithm for playing
this game, i.e., the goal is to find a policy "pi" for winning the game.
Start with small values of x and y to get the hang of the game. Assume that
the reinforcement learning agent is one of the players (the one learning
the Q values for the various state-action combinations). The other player
will be simulated by your program; this can be just a black box that makes
random moves or some more sophisticated program (it could even be the agent
again!). Use the template given in class as a guide. Define the states,
actions, rewards, and values yourself; a rough sketch of one possible
formulation is given at the end of this problem.
Each time the policy is improved, evaluate its performance by pitting it
against another player for, say, 100 new games (this is just to evaluate
performance and is not used by the learning algorithm). Plot a graph of the
number of games won versus each iteration of policy improvement; an
evaluation sketch along these lines also appears at the end of this problem.
Run your program on different choices of x and y.
When the dust settles, inspect the final Q values and see if you can
state, in English, the policy learnt by your algorithm (i.e., what is the
rule for winning at this game?). For full credit, give a careful definition of
how you formulated the RL problem (what are the states, actions, rewards, and
how is the value function defined), a printout/URL for your code,
graph(s), and a set of observations.
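
For concreteness, here is a minimal Python sketch of one possible
epsilon-soft on-policy Monte Carlo control setup. It deliberately
simplifies the formulation: the state is just the number of donuts left in
the box, the agent always moves first, the opponent is a black box making
uniformly random legal moves, an episode is a single round, and the
terminal reward is +1 if the agent empties the box and -1 otherwise. The
names X, Y, EPSILON, and the helper functions are illustrative only, not
prescribed; your own formulation may differ.

import random
from collections import defaultdict

X = 10          # donuts initially in the box (the "x" of the problem)
Y = 3           # maximum donuts a player may take per turn (the "y")
EPSILON = 0.1   # exploration rate for the epsilon-soft policy

Q = defaultdict(float)          # Q[(state, action)] -> estimated return
returns_count = defaultdict(int)

def legal_actions(box):
    return list(range(1, min(Y, box) + 1))

def epsilon_soft_action(box):
    """Greedy action w.p. 1-epsilon, otherwise a uniformly random legal one."""
    actions = legal_actions(box)
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(box, a)])

def generate_episode():
    """Play one round with the agent moving first against a random opponent.
    Returns the (state, action) pairs visited by the agent and the terminal
    reward: +1 if the agent emptied the box, -1 if the opponent did."""
    box = X
    trajectory = []
    while True:
        a = epsilon_soft_action(box)        # agent's move
        trajectory.append((box, a))
        box -= a
        if box == 0:
            return trajectory, +1
        box -= random.choice(legal_actions(box))   # opponent's random move
        if box == 0:
            return trajectory, -1

def mc_control(num_episodes=20000):
    """Every-visit epsilon-soft on-policy Monte Carlo control."""
    for _ in range(num_episodes):
        trajectory, reward = generate_episode()
        # undiscounted return: every visited (s, a) receives the terminal reward
        for (s, a) in trajectory:
            returns_count[(s, a)] += 1
            Q[(s, a)] += (reward - Q[(s, a)]) / returns_count[(s, a)]

if __name__ == "__main__":
    mc_control()
    for box in range(1, X + 1):
        best = max(legal_actions(box), key=lambda a: Q[(box, a)])
        print(f"box={box:2d}: take {best} donut(s)")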
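
A companion sketch of the evaluation step, again only an illustration:
after each batch of policy improvement the greedy policy is frozen and
played against a random opponent for 100 fresh games, and the win counts
are collected so they can be plotted against the iteration number. It is
meant to be appended to the same file and reuses Q, X, legal_actions, and
mc_control from the sketch above.

def greedy_action(box):
    return max(legal_actions(box), key=lambda a: Q[(box, a)])

def play_one_game():
    """One round, agent first, greedy agent vs. random opponent.
    Returns True if the agent empties the box (wins the round)."""
    box = X
    while True:
        box -= greedy_action(box)
        if box == 0:
            return True
        box -= random.choice(legal_actions(box))
        if box == 0:
            return False

def evaluate(num_games=100):
    return sum(play_one_game() for _ in range(num_games))

wins_per_iteration = []
for _ in range(50):
    mc_control(num_episodes=500)      # one "iteration" of policy improvement
    wins_per_iteration.append(evaluate())
# wins_per_iteration can now be plotted against the iteration index,
# e.g. with matplotlib.pyplot.plot(wins_per_iteration).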
- (40 points) This question builds on the previous one and addresses two
further aspects:
- (20 points)
Inspect your Q-values and determine whether it is beneficial to go first in
this game (see the short sketch after this question).
- (20 points) Implement the TD (temporal differences) algorithm and see
if it improves convergence. You have to be careful here because this is a
two-player game and only certain temporal differences should be backed up
when updating the Q function; a sketch of one way to arrange these backups
appears below. Pick some x and some y, and plot the learning curves for
both Monte Carlo and TD. Which is better? For full credit, give a printout
of your code, graph(s), a description of how you set up the algorithm, and
a list of observations.
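
One small, illustrative way to read the "should I go first?" question off
the learned Q table, assuming the Q, X, and legal_actions names from the
Monte Carlo sketch in the previous problem; treat it only as a starting
point for your analysis.

# With terminal rewards of +1/-1, the greedy value of the full-box state
# approximates (win probability - loss probability) against the simulated
# opponent when the agent moves first.  Re-running the training with the
# opponent moving first gives the corresponding figure for moving second,
# and the two numbers can then be compared.
value_of_moving_first = max(Q[(X, a)] for a in legal_actions(X))
print("greedy value of the start state when moving first:", value_of_moving_first)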
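
Finally, a minimal Python sketch of one way to arrange the TD backups in
this two-player setting. The point it illustrates is that a temporal
difference is backed up only between the agent's own successive decision
points (the opponent's intervening move is folded into the transition),
with the opponent's game-ending move producing a terminal loss backup. The
state representation, the random opponent, and the parameter names are the
same simplifying assumptions as in the Monte Carlo sketch, and this
particular update is a Q-learning-style rule; an on-policy Sarsa-style
variant works similarly.

import random
from collections import defaultdict

X, Y = 10, 3
ALPHA, EPSILON = 0.1, 0.1

Q = defaultdict(float)

def legal_actions(box):
    return list(range(1, min(Y, box) + 1))

def epsilon_greedy(box):
    acts = legal_actions(box)
    if random.random() < EPSILON:
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(box, a)])

def td_episode():
    """One round of the game with TD updates at the agent's decision points."""
    box = X
    while True:
        s, a = box, epsilon_greedy(box)
        box -= a
        if box == 0:                      # agent emptied the box: terminal win
            Q[(s, a)] += ALPHA * (+1 - Q[(s, a)])
            return
        box -= random.choice(legal_actions(box))   # opponent's random move
        if box == 0:                      # opponent emptied the box: terminal loss
            Q[(s, a)] += ALPHA * (-1 - Q[(s, a)])
            return
        # non-terminal: back up from the agent's next decision state
        target = max(Q[(box, b)] for b in legal_actions(box))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

if __name__ == "__main__":
    for _ in range(20000):
        td_episode()
    for box in range(1, X + 1):
        best = max(legal_actions(box), key=lambda a: Q[(box, a)])
        print(f"box={box:2d}: take {best} donut(s)")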