CS 4804 Homework #8
Date Assigned: November 21, 2003
Date Due: December 10, 2003, in class, before class starts
- (20 points) Recall the reinforcement learning formulation of
the Tower-of-Hanoi problem presented in class. Let us consider three
ways to set up the rewards (the state, action, and state-transition-table
formulations are as usual):
- Every legal move has an associated reward of -1, except for the
goal state from which all moves have a reward of zero. Recall that
all transitions out of the goal state lead back to the goal state, since
it is an absorbing state. The discount factor (gamma) is 1.
- Every legal move has an associated reward of 0, except for transitions
into the goal state, which have a reward of +10. Once again, the
goal state is absorbing (and all rewards out of it are zero), and the discount
factor is 1.
- Every legal move has an associated reward of -1, except for transitions
into the goal state (which have a reward of +10). All transitions out of
the goal state lead back to the goal state, with a reward of 0. The discount
factor is 1.
What is the essential difference between these formulations? What is
the optimal policy under each? (A small value-iteration sketch you can
adapt for this comparison follows below.)
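The following is a minimal value-iteration sketch (in Python) that you can adapt
to compare the three formulations; only the reward table changes between them.
The names states, legal_actions, transition, and reward are placeholders for
whatever tables you built from the in-class formulation, not part of the assignment.

    def value_iteration(states, legal_actions, transition, reward,
                        gamma=1.0, sweeps=1000):
        # Bellman optimality backup: V(s) = max over a of [ r(s, a) + gamma * V(s') ]
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for s in states:
                V[s] = max(reward(s, a) + gamma * V[transition(s, a)]
                           for a in legal_actions(s))
        return V

Under the first formulation, for example, reward(s, a) would return -1 for every
move out of a non-goal state; the sketch assumes transition(s, a) returns the
(deterministic) next state.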
- (30 points) Solve problem 17.4 of your textbook (all three
parts). To understand
what this question is asking:
- Replace "MDP" with "reinforcement learning problem".
- Notice that rewards here are associated with states, not with (state, action)
pairs as we are used to. All this means is that the reward for a state is the
same irrespective of how you got there; the reward for state 1 is always
-1 no matter which state you came from. So, using the state-transition
table (given in the question), you can fill in the rewards table (see the
sketch below).
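As one possible convention (an assumption for illustration, not part of the
question), the sketch below fills a (state, action) rewards table by crediting
the reward of the state the action lands in. The dictionaries state_reward and
next_state are hypothetical names for the tables given in the question; adjust
the convention if your class notes attach the reward to the departing state instead.

    def build_reward_table(state_reward, next_state):
        # next_state maps (s, a) -> s'; state_reward maps s -> r(s).
        # Reward for taking action a in state s is the reward of the landing state.
        R = {}
        for (s, a), s_prime in next_state.items():
            R[(s, a)] = state_reward[s_prime]
        return R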
- (40 points) Consider an 8x8 empty chessboard (there are thus 64 states).
Assume you (the agent) are a knight who can make only "knightly" moves. The
goal state (terminal state) is the top-right corner of the chessboard. The
start state can be any position on the chessboard. The purpose of the game
is to get to the goal state using only knightly moves.
Formulate this problem as a reinforcement learning problem and solve
for V* by either policy iteration or value iteration.
- Formulate the actions carefully. How many are there?
- Set up the state-transition table. Are the transitions deterministic
or probabilistic?
- Set up the rewards table. It is your job to figure out what the entries
must be; use the statement of the problem to formulate them suitably.
- What is a good discount factor for this problem? Again, use the statement
of the problem.
- Ensure that the problem is set up so that values don't grow infinitely
large (or small).
- Choose an initial policy (for policy iteration) or initial values (for value
iteration). Run for several hundred (or thousand) iterations; a sketch follows
this list. After you are done with this puzzle, inspect the learned V values
and determine if there are initial states from which the puzzle is unsolvable.
Also inspect the values and see if they suggest some manner in which the
initial states can be grouped.
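The listing below is an illustrative value-iteration sketch (in Python) for this
puzzle, not a prescribed solution: the -1 reward per move, the discount factor
of 0.9, the treatment of off-board moves (the knight stays in place), and the
coordinate convention for the top-right corner are all placeholder assumptions
that the problem asks you to choose and justify yourself.

    # Placeholder assumptions: (row, col) coordinates with row 0 at the top,
    # so the top-right corner is (0, 7); off-board moves leave the knight in place.
    GOAL = (0, 7)
    MOVES = [(1, 2), (2, 1), (2, -1), (1, -2),
             (-1, -2), (-2, -1), (-2, 1), (-1, 2)]   # the 8 knight moves

    def step(state, move):
        # Deterministic transition; off-board moves leave the knight where it is.
        r, c = state
        nr, nc = r + move[0], c + move[1]
        return (nr, nc) if 0 <= nr < 8 and 0 <= nc < 8 else state

    def value_iteration(gamma=0.9, sweeps=500):
        states = [(r, c) for r in range(8) for c in range(8)]
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for s in states:
                if s == GOAL:
                    continue        # absorbing terminal state keeps value 0
                # Bellman backup with an assumed reward of -1 per move
                V[s] = max(-1.0 + gamma * V[step(s, m)] for m in MOVES)
        return V

    V = value_iteration()
    # Print the value of each square to inspect how the starting squares group.
    for r in range(8):
        print(" ".join("%6.2f" % V[(r, c)] for c in range(8)))

Inspecting the printed grid is one way to check whether any starting squares
cannot reach the goal and whether the values partition the starting squares
into natural groups.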
- (10 points) How would your observations change if it were a
requirement to get to the goal state as quickly as possible?