(40 points) In this question, we will implement a Monte Carlo
reinforcement learning algorithm for playing the Rocky Road game. Rocky
Road is played by two players with a rectangular bar of gridded
chocolate. For instance, the following is a 5-by-2 gridded chocolate
bar:
-----------
|    |    |
|    |    |
-----------
|    |    |
|    |    |
-----------
|    |    |
|    |    |
-----------
|    |    |
|    |    |
-----------
|    |    |
|    |    |
-----------
In general, a chocolate bar can have any number of rows and any number
of columns, but it is always rectangular. A Rocky Road bar can only be
broken along one of the grid lines, i.e., along a complete row or a
complete column; it cannot be broken in any other manner.
The players decide who goes first and then take turns; in each turn, a
player breaks off a piece along some row or some column (not both) and
eats it. A player cannot eat the whole bar in one turn and cannot
"pass" on their turn (i.e., they must break the chocolate). It is easy
to see that with these rules, there will come a point when the bar is
of size 1-by-1 and it is some player's turn. At this point, that player
loses (so the one who is left to eat the last piece is the loser).
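For concreteness, one possible way to encode the game dynamics is
sketched below in Python (a sketch only; the state representation and
function names are illustrative assumptions, not part of the problem
statement): a state is the (rows, cols) of the remaining bar, and an
action says how many rows or columns remain after the break.

    def legal_actions(state):
        """All breaks available from this state."""
        rows, cols = state
        actions = [("row", r) for r in range(1, rows)]   # keep r rows, eat the rest
        actions += [("col", c) for c in range(1, cols)]  # keep c columns, eat the rest
        return actions

    def step(state, action):
        """The acting player eats the broken-off piece; return the remaining bar."""
        rows, cols = state
        kind, size = action
        return (size, cols) if kind == "row" else (rows, size)

    def is_terminal(state):
        # The player who faces a 1-by-1 bar on their turn loses.
        return state == (1, 1)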
Design an epsilon-soft on-policy Monte Carlo control algorithm for
playing this game, i.e., the goal is to find a policy "pi" for winning
the game. Use an 8-by-5 chocolate bar as the starting state.
Assume that the reinforcement learning agent is one of the players
(the one learning the Q values for the various state-action combinations). The
other player will be simulated by your program; this can be just a
black box that makes random moves or some more sophisticated program (it
could even be the agent itself!). Use the template given in class as a guide.
Define the states, actions, rewards, and values yourself.
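As a starting point, here is a minimal sketch of epsilon-soft
(epsilon-greedy) on-policy first-visit Monte Carlo control under the
formulation sketched above. It assumes the agent moves first, the
opponent is a black box making random moves, the only reward is +1/-1
at the end of an episode, and there is no discounting; all names are
illustrative, not required.

    import random
    from collections import defaultdict

    EPSILON, START = 0.1, (8, 5)
    Q = defaultdict(float)       # Q[(state, action)] -> estimated return
    N = defaultdict(int)         # visit counts for incremental averaging

    def epsilon_soft(state):
        """Greedy with respect to Q, except explore with probability EPSILON."""
        acts = legal_actions(state)
        if random.random() < EPSILON:
            return random.choice(acts)
        return max(acts, key=lambda a: Q[(state, a)])

    def generate_episode():
        """Play one game; return the agent's (state, action) pairs and +1/-1."""
        state, trajectory = START, []
        while True:
            action = epsilon_soft(state)
            trajectory.append((state, action))
            state = step(state, action)
            if is_terminal(state):              # opponent faces 1-by-1 and loses
                return trajectory, +1
            state = step(state, random.choice(legal_actions(state)))
            if is_terminal(state):              # agent faces 1-by-1 and loses
                return trajectory, -1

    def improve(num_episodes=1000):
        """One iteration of policy improvement: update Q from sampled episodes."""
        for _ in range(num_episodes):
            trajectory, G = generate_episode()
            seen = set()
            for (s, a) in trajectory:           # first-visit update
                if (s, a) not in seen:
                    seen.add((s, a))
                    N[(s, a)] += 1
                    Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]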
Each time the policy is improved, evaluate its performance by
pitting it against another player for, say, 100 new games (these games are
only for evaluating performance and are not used by the learning algorithm). Plot
a graph of the number of games won versus the iteration of policy improvement.
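The evaluation step could look like the following (greedy play with the
current Q values against a random opponent; again only an illustrative
sketch built on the hypothetical functions above):

    def greedy(state):
        return max(legal_actions(state), key=lambda a: Q[(state, a)])

    def evaluate(num_games=100):
        """Play num_games greedily against a random opponent; return the wins."""
        wins = 0
        for _ in range(num_games):
            state = START
            while True:
                state = step(state, greedy(state))
                if is_terminal(state):
                    wins += 1
                    break
                state = step(state, random.choice(legal_actions(state)))
                if is_terminal(state):
                    break
        return wins

    # Alternate improvement and evaluation to obtain the points for the plot:
    # wins_per_iteration = []
    # for _ in range(50):
    #     improve()
    #     wins_per_iteration.append(evaluate())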
When the dust settles, inspect the final Q values and see if you can state,
in English, the policy learnt by your algorithm (i.e., what is the rule
for winning this game?). For full credit, give a careful definition of
how you formulated the RL problem (what the states, actions, and
rewards are, and how the value function is defined), a printout/URL for your
code, the graph(s), and
a set of observations.