Nov 12, 2003
-------------

- Making things more interesting: gridworld2
    - five states: S1-S4 in a 2x2 grid, plus a hidden mine, S5

          S1 | S2
          S3 | S4
          -------
            S5      (a hidden mine)

    - S2 and S5 are absorbing states
        - S2 is the goal state
        - S5 is a bad state to be in
    - four actions: up, down, left, right
    - state transition table
        - one of these entries is itself probabilistic
    - rewards matrix
        - bumping into a wall is not good (-2)
        - just roaming around is not good (-1)
        - moving into S2 gets a reward of +5
        - moving into S5 gets a reward of -50
        - from S2, and from S5, the agent stays there, with reward 0
    - discounting of 0.9
    (a code sketch of this setup appears after these notes)

- Take a simple policy: Agent 006
    - S1: down
    - S2: right
    - S3: up
    - S4: left
    - S5: down
    - Find V^pi (see the evaluation sketch after these notes)
        - looks like Agent 006 is not such a good agent!

- Improving the policy
    - look at the just-computed V^pi values
    - change the policy at each state to be "greedy" w.r.t. V^pi
        - i.e., do the action that appears to be best using just
          one-step lookahead
    - what if many actions are "equally best"?
        - sprinkle the probabilities among them, any way you like!

- Basic idea: policy iteration (sketched in code after these notes)
    - fix pi
    - find V^pi
    - find a new pi
    - find V^pi (again)
    - find a new pi, and so on

- Improving agents
    - from 006
    - to 006.5
    - to James Bond! (play theme music)

- Recap: the two major types of "games" thus far
    - continuing tasks
        - the task goes on and on
        - the objective of the game is "living"
    - episodic tasks
        - the objective is to get to a goal state fast!
    - How can you 'spot' what type of game it is?
        - look at the R matrix (rewards)
        - see how the V function is formulated
            - expected cumulative sum of discounted rewards
              (written out as a formula after these notes)
            - what does maximizing this mean?
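
A minimal Python sketch of gridworld2 and of evaluating Agent 006's policy. The notes fix the rewards and the discount but not the full transition table, so the grid geometry, the choice of which entry is probabilistic, and all names here (P, reward, evaluate_policy) are illustrative assumptions, not the course's official code.

```python
# gridworld2 as a finite MDP, following the notes.
# Assumptions (not in the notes): exact grid geometry, and which
# transition entry is the probabilistic one.

GAMMA = 0.9  # discounting of 0.9
STATES = ["S1", "S2", "S3", "S4", "S5"]
ACTIONS = ["up", "down", "left", "right"]

# Assumed layout:  S1 | S2
#                  S3 | S4   (mine S5 hidden below the grid)
grid = {"S1": (0, 0), "S2": (0, 1), "S3": (1, 0), "S4": (1, 1)}
pos_to_state = {pos: s for s, pos in grid.items()}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def reward(s, s2):
    """Reward for moving from s to s2, as given in the notes."""
    if s2 == "S2":
        return 5.0     # into S2: +5
    if s2 == "S5":
        return -50.0   # into S5: -50
    if s2 == s:
        return -2.0    # bumping is not good
    return -1.0        # just roaming around is not good

# P[(s, a)] = list of (probability, next_state, reward)
P = {}
for s, (r0, c0) in grid.items():
    for a, (dr, dc) in MOVES.items():
        s2 = pos_to_state.get((r0 + dr, c0 + dc), s)  # off-grid: bump, stay
        P[(s, a)] = [(1.0, s2, reward(s, s2))]

for s in ("S2", "S5"):            # absorbing: stay there, reward 0
    for a in ACTIONS:
        P[(s, a)] = [(1.0, s, 0.0)]

# The one probabilistic entry (assumed): "down" from S4 sometimes
# drops through onto the hidden mine.
P[("S4", "down")] = [(0.5, "S4", -2.0), (0.5, "S5", -50.0)]

# Agent 006's policy, straight from the notes.
PI_006 = {"S1": "down", "S2": "right", "S3": "up", "S4": "left", "S5": "down"}

def evaluate_policy(pi, theta=1e-8):
    """Iterative policy evaluation: Bellman backups until V^pi converges."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, pi[s])])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

print(evaluate_policy(PI_006))
```

Under these assumptions 006 just shuttles between S1 (down) and S3 (up) forever, so V^pi(S1) = V^pi(S3) = -10, and S4 (left, into S3) comes out at -10 as well; hence "not such a good agent."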
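
A sketch of the policy-iteration loop from the notes, built on evaluate_policy above; q_value and greedy_policy are hypothetical helper names. On ties it keeps the first maximizer, which is one of the "any way you like" choices the notes allow.

```python
# Policy iteration: fix pi, find V^pi, find a new pi, find V^pi again, ...

def q_value(V, s, a):
    """One-step lookahead: expected value of doing a in s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])

def greedy_policy(V):
    """In each state, pick an action that appears best using just
    one-step lookahead.  Ties broken by taking the first maximizer."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}

def policy_iteration(pi):
    while True:
        V = evaluate_policy(pi)    # fix pi, find V^pi
        new_pi = greedy_policy(V)  # change pi to be greedy w.r.t. V^pi
        if new_pi == pi:           # pi is greedy w.r.t. its own V: done
            return pi, V
        pi = new_pi

pi_bond, V_bond = policy_iteration(PI_006)  # 006 -> 006.5 -> ... -> Bond
print(pi_bond, V_bond)
```

On this little MDP the loop converges in a couple of sweeps: the improved agent heads straight for S2 and avoids the mine.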
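
Finally, the V function from the recap written out; this is just standard notation for the verbal definition in the notes, with gamma the discount factor and r_{t+1} the reward received at step t+1:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right]
```

Since roaming costs -1 per step and only the goal pays +5, maximizing this sum pushes the agent toward the goal quickly in an episodic task, which is exactly how the R matrix lets you 'spot' the type of game.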