Nov 12, 2007
-------------

- Reinforcement learning
  - grandest topic in AI
  - can subsume all of AI itself

- General setting
  - states
  - actions
  - state transition table
  - rewards table

- A simple problem: gridworld1
  - three states, organized as

      S1 - S2 - S3

  - four possible actions
    - 1l
    - 2l
    - 1r
    - 2r
  - bumping into wall keeps state same, but
    - gives reward of -5 or -10 (depends on force of impact)
  - zero rewards for legal actions

- Policies
  - a facet of the agent
  - give mapping from states to actions
  - have value functions
    - V^pi(s)
    - expected cumulative sum of discounted rewards, if you
      start from state "s" and apply policy "pi"

- Discounting: gamma
  - between 0 and 1
  - If 1, V^pi(s) becomes straight sum
  - If 0, V^pi(s) becomes a short sighted agent
  - Use of gamma
    - makes infinite sums converge!

- Working out V^pi(s) for a given agent (from its policy)
  - is this the best policy possible?
  - how would you improve it?

- Different agents
  - have different policies

- A more complicated policy
  - agent has non-zero probability of taking two actions
    in a given state

- Note
  - two different policies can have same value(s)

- Making things more interesting: gridworld2
  - four states

      S1 | S2
      S3 | S4
      -------
      S5 (a hidden mine)

  - S2 and S5 are absorbing states
    - S2 is goal state
    - S5 is bad state to be in
  - four actions
    - up, down, left, right
  - state transition table
    - one of these entries itself is probabilistic
  - rewards matrix
    - bumping is not good (-5)
    - just roaming around is not good (-1)
    - into S2 gets reward of +5
    - into S5 gets reward of -50
    - from S2, and from S5, stays there, with reward 0
  - discounting of 0.9

- Take a simple policy: Agent 006
  - S1: down
  - S2: right
  - S3: right
  - S4: left
  - S5: down
  - Find V^pi
    - looks like Agent 006 is not such a good agent!
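
The "Find V^pi" step for Agent 006 can be sketched numerically. The code below is only an illustration under assumptions that the notes do not state: the layout is taken to be S1 S2 on the top row and S3 S4 on the bottom row, and the three moves Agent 006 actually makes (S1 down, S3 right, S4 left) are assumed to be deterministic, i.e. the one probabilistic entry of the transition table is assumed to lie elsewhere. The names AGENT_006 and evaluate_policy are hypothetical helpers, not part of the lecture. The rewards and gamma = 0.9 come from the notes.

```python
# Iterative policy evaluation for a fixed deterministic policy.
# ASSUMPTIONS (not from the notes): the 2x2 layout S1 S2 / S3 S4, and that
# the three moves Agent 006 actually makes are deterministic.

GAMMA = 0.9  # discount factor given in the notes

# For each state: (next state, reward) under Agent 006's chosen action.
AGENT_006 = {
    "S1": ("S3", -1.0),  # down: just roaming around
    "S2": ("S2", 0.0),   # absorbing goal state, stays with reward 0
    "S3": ("S4", -1.0),  # right: just roaming around
    "S4": ("S3", -1.0),  # left: just roaming around
    "S5": ("S5", 0.0),   # absorbing mine state, stays with reward 0
}


def evaluate_policy(step, gamma, tol=1e-10, max_sweeps=10_000):
    """Approximate V^pi by repeating V(s) <- r + gamma * V(s')."""
    values = {s: 0.0 for s in step}
    for _ in range(max_sweeps):
        # One synchronous sweep over all states.
        new_values = {s: r + gamma * values[s2] for s, (s2, r) in step.items()}
        delta = max(abs(new_values[s] - values[s]) for s in step)
        values = new_values
        if delta < tol:  # stop once a sweep changes nothing noticeable
            break
    return values


if __name__ == "__main__":
    for state, value in sorted(evaluate_policy(AGENT_006, GAMMA).items()):
        print(f"V({state}) = {value:.3f}")
```

Under these assumptions the agent never reaches S2: it bounces between S3 and S4 forever, collecting -1 per step, so V^pi for S1, S3 and S4 comes out at -1 / (1 - 0.9) = -10, while the absorbing states S2 and S5 have value 0. That agrees with the verdict that Agent 006 is not such a good agent.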