CS 4804 Homework #8 (and final)

Date Assigned: April 16, 2003
Date Due: May 2, 2003, 4pm
  1. (20 points) In the Gambler's Ruin problem, the agent is the gambler, who makes bets on the outcomes of coin flips. The gambler has "s" dollars at the beginning of the game and wants to end up with 100 dollars. On each step, the gambler can wager at most the minimum of "s" and "100 - s". For instance, if the gambler has 26 dollars in hand, he can wager at most 26 dollars; if he has 64 dollars, he can wager at most 36 dollars. He must wager some non-zero amount at each step, until he either goes bankrupt (has zero dollars, in which case he loses the game) or reaches 100 dollars (in which case he wins). So he cannot simply sit still and pocket the money he currently has.

    The coin is a Las Vegas coin, meaning it is loaded. With probability "p" the coin comes up heads and the gambler receives the amount he wagered; with probability "1 - p" the coin comes up tails and the gambler loses his wager. Assume that "p" is known, say 0.3.

    Our goal is to formulate this situation as a reinforcement learning problem, solve it by value iteration or policy iteration, and find a good policy for the gambler, i.e., given that he is in a certain state (has a certain amount of money), how much should he wager? Here are some ideas and hints to help you proceed.

    • First determine what the states are. Then, see how many actions each state has (what are the actions, by the way?). Use this to define the states and actions in your reinforcement learning program.

    • Convince yourself that this is the correct formulation. Here's how: remember the gambler's policy must recommend to him how much to stake at each step in the game. In reinforcement learning, a policy is a mapping from states to actions, or the probability of taking a given action in a certain state. Make sure that the states and actions you defined allow a suitable policy to be captured.

    • Then set up the probability transition matrix. Are the transitions deterministic or probabilistic?

    • Set up the reward matrix. The rewards should reflect the goal of ending the game with 100 dollars; it does not matter how long the game takes, as long as we end up with the 100 dollars.

    • Fix a suitable discount factor (it can't be zero, why?) and use value iteration and policy iteration to find the best policy. What does the policy look like? What do the optimal values look like? Plot graphs.
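
    Following the hints above, here is a minimal value-iteration sketch in Python. It is only an illustration under assumed choices, not the required solution: it assumes a reward of +1 delivered only on reaching 100 dollars, a discount factor of 0.9, and it covers value iteration only (you still need policy iteration and the graphs).

    import numpy as np

    # States are the dollar amounts 0..100 (0 and 100 are terminal); an action is a wager.
    GOAL = 100
    P_HEADS = 0.3        # probability of heads, as in the example above
    GAMMA = 0.9          # assumed discount factor
    THETA = 1e-9         # convergence threshold

    V = np.zeros(GOAL + 1)
    while True:
        delta = 0.0
        for s in range(1, GOAL):
            best = -np.inf
            for a in range(1, min(s, GOAL - s) + 1):          # non-zero wagers only
                reward = 1.0 if s + a == GOAL else 0.0        # assumed reward scheme
                q = P_HEADS * (reward + GAMMA * V[s + a]) + (1 - P_HEADS) * GAMMA * V[s - a]
                best = max(best, q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < THETA:
            break

    # Greedy policy: for each amount in hand, the wager with the highest backed-up value.
    policy = {s: max(range(1, min(s, GOAL - s) + 1),
                     key=lambda a: P_HEADS * ((1.0 if s + a == GOAL else 0.0) + GAMMA * V[s + a])
                                   + (1 - P_HEADS) * GAMMA * V[s - a])
              for s in range(1, GOAL)}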

    For full credit, describe in detail your answers to each of the above parts. Repeat the experiment for different values of "p". Look at how the graphs change and make meaningful observations about how the gambler's policy changes depending on the loading of the coin.

  2. (40 points) In this question, we will implement a Monte Carlo reinforcement learning algorithm for playing the Rocky Road game. Rocky Road is played by two players with a rectangular bar of gridded chocolate. For instance, the following is a 5-by-2 gridded chocolate bar:
    
    -----------
    |    |    |
    |    |    |
    -----------
    |    |    |
    |    |    |
    -----------
    |    |    |
    |    |    |
    -----------
    |    |    |
    |    |    |
    -----------
    |    |    |
    |    |    |
    -----------
    
    In general, chocolate bars can have any number of rows and any number of columns, but they are always rectangular. A Rocky Road bar can only be broken along one of the grid lines, i.e., along a row or along a column; you cannot break the chocolate in any other manner.

    The players decide who goes first and then take turns; in each turn, a player breaks off a piece along some row or some column (not both) and eats it. A player cannot eat the whole remaining bar and cannot "pass" when it is his or her turn (i.e., he or she has to break the chocolate). It is easy to see that, with these rules, there will come a point when the remaining bar is of size 1-by-1 and it is some player's turn. At this point, that player loses (so the one left to eat the last piece is the loser).

    Design an epsilon-soft on-policy Monte Carlo control algorithm for playing this game, i.e., the goal is to find a policy "pi" for winning the game. Use an 8-by-5 chocolate bar as the starting state. Assume that the reinforcement learning agent is one of the players (the one learning the Q values for the various state-action combinations). The other player will be simulated by your program; this can be just a black box that makes random moves, or some more sophisticated program (it could even be the agent again!). Use the template given in class as a guide. Define the states, actions, rewards, and values yourself.
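
    As a starting point only, here is a hedged skeleton of epsilon-soft on-policy Monte Carlo control for this game. Everything in it is an assumed design choice, not part of the assignment: the state is the (rows, columns) shape of the remaining bar, an action is encoded as the shape left after the break, the reward is +1 for a win and -1 for a loss delivered at the end of an episode, the agent always moves first, and the opponent is the random black box mentioned above.

    import random
    from collections import defaultdict

    START = (8, 5)              # the 8-by-5 bar from the assignment
    EPSILON = 0.1               # assumed exploration rate
    EPISODES = 20000            # assumed training length

    def legal_moves(state):
        # Breaking along a row leaves (r', c) with 1 <= r' < r; along a column, (r, c').
        r, c = state
        return [(rp, c) for rp in range(1, r)] + [(r, cp) for cp in range(1, c)]

    Q = defaultdict(float)      # Q[(state, action)] -> estimated return
    visits = defaultdict(int)   # visit counts for incremental averaging

    def epsilon_greedy(state):
        moves = legal_moves(state)
        if random.random() < EPSILON:
            return random.choice(moves)
        return max(moves, key=lambda a: Q[(state, a)])

    def opponent(state):
        return random.choice(legal_moves(state))      # black-box random opponent

    for _ in range(EPISODES):
        state, episode = START, []
        while True:
            if state == (1, 1):                       # agent is left with the last piece: loss
                ret = -1.0
                break
            action = epsilon_greedy(state)
            episode.append((state, action))
            state = action                            # the agent's move
            if state == (1, 1):                       # opponent faces the last piece: win
                ret = 1.0
                break
            state = opponent(state)                   # the opponent's reply
        for sa in episode:                            # MC update toward the final return
            visits[sa] += 1                           # (states never repeat in an episode,
            Q[sa] += (ret - Q[sa]) / visits[sa]       #  so first-visit and every-visit agree)

    Because actions are always chosen epsilon-greedily with respect to the current Q, the policy-improvement step is implicit in this skeleton.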

    Each time the policy is improved, evaluate its performance by pitting it against another player for, say, 100 new games (these games are only for evaluating performance and are not used by the learning algorithm). Plot a graph of the number of games won versus each iteration of policy improvement.
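
    For these evaluation runs, a hypothetical helper along the following lines (reusing the Q, START, legal_moves, and opponent names from the sketch above) may be convenient; playing greedily, i.e., with epsilon set to zero, is an assumption about how you choose to measure performance.

    def evaluate(n=100):
        # Play n greedy games against the black-box opponent and count wins;
        # these games are for measurement only and feed nothing back into learning.
        wins = 0
        for _ in range(n):
            state = START
            while state != (1, 1):                    # agent left with the last piece: loss
                state = max(legal_moves(state), key=lambda a: Q[(state, a)])
                if state == (1, 1):                   # opponent left with the last piece: win
                    wins += 1
                    break
                state = opponent(state)
        return wins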

    When the dust settles, inspect the final Q values and see if you can state, in English, the policy learnt by your algorithm (i.e., what is the rule for winning at this game?). For full credit, give a careful definition of how you formulated the RL problem (what are the states, actions, rewards, and how is the value function defined), a printout/URL for your code, graph(s), and a set of observations.

  3. (40 points) This question builds on the previous question and addresses many interesting aspects:

    • (10 points) Inspect your Q-values and determine if it is beneficial to go first in the Rocky Road game.

    • (10 points) Change the chocolate size to 9-by-4 and then to 10-by-6 and rerun your Monte Carlo algorithm. Do you notice a general pattern?

    • (20 points) Implement the TD (temporal differences) algorithm and see if it improves convergence. You have to be careful here, because this is a two-player game and only certain temporal differences should be backed up when updating the Q function. Pick a chocolate size and plot the learning curves for both Monte Carlo and TD. Which is better? For full credit, give a printout of your code, graph(s), a description of how you set up the algorithm, and a list of observations.
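
    For illustration, here is a hedged Sarsa-style TD(0) sketch that reuses the names from the Monte Carlo skeleton in Question 2; the step size ALPHA, the undiscounted backup, and the +1/-1 terminal rewards are assumptions. The caution in the last item above is visible in the structure: each backup skips over the opponent's reply and runs from one of the agent's (state, action) pairs to its next one.

    ALPHA = 0.1                                       # assumed step size

    def td_episode():
        # One on-policy TD episode; backups run only between the agent's own
        # successive (state, action) pairs, skipping the opponent's intervening move.
        state = START
        prev = None                                   # the agent's previous (state, action)
        while True:
            if state == (1, 1):                       # agent is left with the last piece: loss
                if prev is not None:
                    Q[prev] += ALPHA * (-1.0 - Q[prev])
                return
            action = epsilon_greedy(state)
            if prev is not None:                      # intermediate backup (reward 0, no discount)
                Q[prev] += ALPHA * (Q[(state, action)] - Q[prev])
            prev = (state, action)
            state = action                            # the agent's move
            if state == (1, 1):                       # opponent faces the last piece: win
                Q[prev] += ALPHA * (1.0 - Q[prev])
                return
            state = opponent(state)                   # the opponent's reply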
