CS 4804 Homework #8
Date Assigned: November 21, 2003
Date Due: December 10, 2003, in class, before class starts
- (20 points) Recall the reinforcement learning formulation of
the Tower-of-Hanoi problem presented in class. Let us consider three
ways to set up the rewards (the state, action, and state-transition-table
formulations are as usual):
- Every legal move has an associated reward of -1, except for the
goal state from which all moves have a reward of zero. Recall that
all transitions out of the goal state lead back to the goal state, since
it is an absorbing state. The discount factor (gamma) is 1.
- Every legal move has an associated reward of 0, except for transitions
into the goal state, which have a reward of +10. Once again, the
goal state is absorbing (and all rewards out of it are zero), and the discount
factor is 1.
- Every legal move has an associated reward of -1, except for transitions
into the goal state (which have a reward of +10). All transitions out of
the goal state lead back to the goal state, with a reward of 0. The discount
factor is 1.
What is the essential difference between these formulations? What is
the optimal policy under each? (A small value-iteration sketch you can
adapt for this comparison follows below.)
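The following is a minimal value-iteration sketch (in Python) that you can adapt
to compare the three formulations; only the reward table changes between them.
The names states, legal_actions, transition, and reward are placeholders for
whatever tables you built from the in-class formulation, not part of the assignment.

    def value_iteration(states, legal_actions, transition, reward,
                        gamma=1.0, sweeps=1000):
        # Bellman optimality backup: V(s) = max over a of [ r(s, a) + gamma * V(s') ]
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for s in states:
                V[s] = max(reward(s, a) + gamma * V[transition(s, a)]
                           for a in legal_actions(s))
        return V

Under the first formulation, for example, reward(s, a) would return -1 for every
move out of a non-goal state; the sketch assumes transition(s, a) returns the
(deterministic) next state.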
- (30 points) Solve problem 17.4 of your textbook (all three
parts). To understand
what this question is asking:
- Replace "MDP" with "reinforcement learning problem".
- Notice that rewards here are associated with states, not with (state, action)
pairs as we are used to. All this means is that the reward for a state is the
same irrespective of how you got there; the reward for state 1 is always
-1 no matter which state you came from. So, using the state-transition
table (given in the question), you can fill in the rewards table (see the
sketch below).
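As one possible convention (an assumption for illustration, not part of the
question), the sketch below fills a (state, action) rewards table by crediting
the reward of the state the action lands in. The dictionaries state_reward and
next_state are hypothetical names for the tables given in the question; adjust
the convention if your class notes attach the reward to the departing state instead.

    def build_reward_table(state_reward, next_state):
        # next_state maps (s, a) -> s'; state_reward maps s -> r(s).
        # Reward for taking action a in state s is the reward of the landing state.
        R = {}
        for (s, a), s_prime in next_state.items():
            R[(s, a)] = state_reward[s_prime]
        return R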
- (40 points) Consider an 8x8 empty chessboard (there are thus 64 states).
Assume you (the agent) are a knight who can make only "knightly" moves. The
goal state (terminal state) is the top-right corner of the chessboard. The
start state can be any position on the chessboard. The purpose of the game
is to get to the goal state using only knightly moves.
Formulate this problem as a reinforcement learning problem and solve
for V* by either policy iteration or value iteration.
- Formulate the actions carefully. How many are there?
- Set up the state-transition table. Are the transitions deterministic
or probabilistic?
- Set up the rewards table. It is your job to figure out what the entries
must be; use the statement of the problem to formulate them suitably.
- What is a good discount factor for this problem? Again, use the statement
of the problem.
- Ensure that the problem is set up so that values don't grow infinitely
large (or small).
- Choose an initial policy (for policy iteration) or initial values (for value
iteration). Run for several hundred (or thousand) iterations; a sketch follows
this list. After you are done with this puzzle, inspect the learned V values
and determine if there are initial states from which the puzzle is unsolvable.
Also inspect the values and see if they suggest some manner in which the
initial states can be grouped.
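The listing below is an illustrative value-iteration sketch (in Python) for this
puzzle, not a prescribed solution: the -1 reward per move, the discount factor
of 0.9, the treatment of off-board moves (the knight stays in place), and the
coordinate convention for the top-right corner are all placeholder assumptions
that the problem asks you to choose and justify yourself.

    # Placeholder assumptions: (row, col) coordinates with row 0 at the top,
    # so the top-right corner is (0, 7); off-board moves leave the knight in place.
    GOAL = (0, 7)
    MOVES = [(1, 2), (2, 1), (2, -1), (1, -2),
             (-1, -2), (-2, -1), (-2, 1), (-1, 2)]   # the 8 knight moves

    def step(state, move):
        # Deterministic transition; off-board moves leave the knight where it is.
        r, c = state
        nr, nc = r + move[0], c + move[1]
        return (nr, nc) if 0 <= nr < 8 and 0 <= nc < 8 else state

    def value_iteration(gamma=0.9, sweeps=500):
        states = [(r, c) for r in range(8) for c in range(8)]
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for s in states:
                if s == GOAL:
                    continue        # absorbing terminal state keeps value 0
                # Bellman backup with an assumed reward of -1 per move
                V[s] = max(-1.0 + gamma * V[step(s, m)] for m in MOVES)
        return V

    V = value_iteration()
    # Print the value of each square to inspect how the starting squares group.
    for r in range(8):
        print(" ".join("%6.2f" % V[(r, c)] for c in range(8)))

Inspecting the printed grid is one way to check whether any starting squares
cannot reach the goal and whether the values partition the starting squares
into natural groups.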
- (10 points) How would your observations change if it were a
requirement to get to the goal state as quickly as possible?