Nov 12, 2003
-------------

- Making things more interesting: gridworld2
    - five states: S1-S4 in a 2x2 grid, plus a hidden mine, S5

          S1 | S2
          S3 | S4
          -------
            S5      (a hidden mine)

    - S2 and S5 are absorbing states
        - S2 is the goal state
        - S5 is a bad state to be in
    - four actions: up, down, left, right
    - state transition table
        - one of these entries is itself probabilistic
    - rewards matrix
        - bumping into a wall is not good (-2)
        - just roaming around is not good (-1)
        - moving into S2 gets a reward of +5
        - moving into S5 gets a reward of -50
        - from S2, and from S5, the agent stays there, with reward 0
    - discounting of 0.9
    (a code sketch of this setup appears after these notes)

- Take a simple policy: Agent 006
    - S1: down
    - S2: right
    - S3: up
    - S4: left
    - S5: down
    - Find V^pi (see the evaluation sketch after these notes)
        - looks like Agent 006 is not such a good agent!

- Improving the policy
    - look at the just-computed V^pi values
    - change the policy at each state to be "greedy" w.r.t. V^pi
        - i.e., do the action that appears to be best using just
          one-step lookahead
    - what if many actions are "equally best"?
        - sprinkle the probabilities among them, any way you like!

- Basic idea: policy iteration (sketched in code after these notes)
    - fix pi
    - find V^pi
    - find a new pi
    - find V^pi (again)
    - find a new pi, and so on

- Improving agents
    - from 006
    - to 006.5
    - to James Bond! (play theme music)

- Recap: the two major types of "games" thus far
    - continuing tasks
        - the task goes on and on
        - the objective of the game is "living"
    - episodic tasks
        - the objective is to get to a goal state fast!
    - How can you 'spot' what type of game it is?
        - look at the R matrix (rewards)
        - see how the V function is formulated
            - expected cumulative sum of discounted rewards
              (written out as a formula after these notes)
            - what does maximizing this mean?
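
A minimal Python sketch of gridworld2 and of evaluating Agent 006's policy. The notes fix the rewards and the discount but not the full transition table, so the grid geometry, the choice of which entry is probabilistic, and all names here (P, reward, evaluate_policy) are illustrative assumptions, not the course's official code.

```python
# gridworld2 as a finite MDP, following the notes.
# Assumptions (not in the notes): exact grid geometry, and which
# transition entry is the probabilistic one.

GAMMA = 0.9  # discounting of 0.9
STATES = ["S1", "S2", "S3", "S4", "S5"]
ACTIONS = ["up", "down", "left", "right"]

# Assumed layout:  S1 | S2
#                  S3 | S4   (mine S5 hidden below the grid)
grid = {"S1": (0, 0), "S2": (0, 1), "S3": (1, 0), "S4": (1, 1)}
pos_to_state = {pos: s for s, pos in grid.items()}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def reward(s, s2):
    """Reward for moving from s to s2, as given in the notes."""
    if s2 == "S2":
        return 5.0     # into S2: +5
    if s2 == "S5":
        return -50.0   # into S5: -50
    if s2 == s:
        return -2.0    # bumping is not good
    return -1.0        # just roaming around is not good

# P[(s, a)] = list of (probability, next_state, reward)
P = {}
for s, (r0, c0) in grid.items():
    for a, (dr, dc) in MOVES.items():
        s2 = pos_to_state.get((r0 + dr, c0 + dc), s)  # off-grid: bump, stay
        P[(s, a)] = [(1.0, s2, reward(s, s2))]

for s in ("S2", "S5"):            # absorbing: stay there, reward 0
    for a in ACTIONS:
        P[(s, a)] = [(1.0, s, 0.0)]

# The one probabilistic entry (assumed): "down" from S4 sometimes
# drops through onto the hidden mine.
P[("S4", "down")] = [(0.5, "S4", -2.0), (0.5, "S5", -50.0)]

# Agent 006's policy, straight from the notes.
PI_006 = {"S1": "down", "S2": "right", "S3": "up", "S4": "left", "S5": "down"}

def evaluate_policy(pi, theta=1e-8):
    """Iterative policy evaluation: Bellman backups until V^pi converges."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, pi[s])])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

print(evaluate_policy(PI_006))
```

Under these assumptions 006 just shuttles between S1 (down) and S3 (up) forever, so V^pi(S1) = V^pi(S3) = -10, and S4 (left, into S3) comes out at -10 as well; hence "not such a good agent."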
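
A sketch of the policy-iteration loop from the notes, built on evaluate_policy above; q_value and greedy_policy are hypothetical helper names. On ties it keeps the first maximizer, which is one of the "any way you like" choices the notes allow.

```python
# Policy iteration: fix pi, find V^pi, find a new pi, find V^pi again, ...

def q_value(V, s, a):
    """One-step lookahead: expected value of doing a in s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])

def greedy_policy(V):
    """In each state, pick an action that appears best using just
    one-step lookahead.  Ties broken by taking the first maximizer."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}

def policy_iteration(pi):
    while True:
        V = evaluate_policy(pi)    # fix pi, find V^pi
        new_pi = greedy_policy(V)  # change pi to be greedy w.r.t. V^pi
        if new_pi == pi:           # pi is greedy w.r.t. its own V: done
            return pi, V
        pi = new_pi

pi_bond, V_bond = policy_iteration(PI_006)  # 006 -> 006.5 -> ... -> Bond
print(pi_bond, V_bond)
```

On this little MDP the loop converges in a couple of sweeps: the improved agent heads straight for S2 and avoids the mine.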
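
Finally, the V function from the recap written out; this is just standard notation for the verbal definition in the notes, with gamma the discount factor and r_{t+1} the reward received at step t+1:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right]
```

Since roaming costs -1 per step and only the goal pays +5, maximizing this sum pushes the agent toward the goal quickly in an episodic task, which is exactly how the R matrix lets you 'spot' the type of game.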