Nov 21, 2003
-------------

- Doing RL without P and R matrices
    - think of a "simulator"
        - takes a state and an action as input
        - produces a new state and a reward as output
        => you do not have access to its internal workings!

- Revisiting the simple example of a three-state problem
    - s1, s2, s3 placed horizontally
    - actions: 1l, 2l, 1r, 2r
    - fix the initial policy to be: 1r in every state
    - generate "episodes" using this policy
    - assess the Q values
    - improve the policy using the Q values

- How do you model a two-player game?
    - the simulator "hides" the second player
    - "fake" the other player and factor him into the stochastics of the environment

- A serious problem
    - what if your original policy did not have the "good" actions in it?
    - how can improvement ever come up with the best policy?
    - Ans: it cannot!
    - this is the basic conflict in RL: "exploration vs. exploitation"

- Solution: use epsilon-soft policies only!
    - make your policy always explore a little
    - outline of an epsilon-soft algorithm
        - do the best thing 90% of the time
        - pick a random action 10% of the time (this facilitates exploration)
    - this is what you need for the final project

- Pseudocode for epsilon-soft policies (a runnable sketch follows below)

    Initialize all Q(s,a) to some number, or zero
    Initialize Returns(s,a) to an empty list
    Initialize pi to be an arbitrary epsilon-soft policy

    L1: Generate an episode using pi

        for each (s,a) occurring in the episode
            // count only the first time that (s,a) occurs
            find the return "r" following that occurrence
            append "r" to the Returns(s,a) list
            Q(s,a) = average[ Returns(s,a) ]
        end for

        // now use Q(s,a) to create a policy
        for each s in the episode       // only these would have changed
            a* = argmax_a Q(s,a)        // i.e., find the a at which Q(s,a)
                                        // becomes maximum
            distribute 90% probability to a*
            distribute 10% probability among everybody   // including a*
            // this could have been 80-20 or 95-5; just make sure all
            // actions have a non-zero probability of being taken
            update pi
        end for

        go back to L1
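
- A runnable Python sketch of the pseudocode above
    - illustration only, not the course's reference implementation: the lecture
      does not specify the three-state simulator's rewards, terminal state, or
      discount factor, so those details (reward 1.0 for reaching s3, s3 terminal,
      gamma = 0.9) are assumptions made just so the code runs end to end

    import random
    from collections import defaultdict

    STATES = ["s1", "s2", "s3"]
    ACTIONS = ["1l", "2l", "1r", "2r"]          # move 1 or 2 cells left/right
    EPSILON = 0.1                               # the 90/10 split from the notes
    GAMMA = 0.9                                 # assumed discount factor

    def simulate(state, action):
        # Black-box simulator: (state, action) in, (new state, reward) out.
        # The dynamics and reward here are assumptions, not from the lecture.
        step = {"1l": -1, "2l": -2, "1r": +1, "2r": +2}[action]
        idx = max(0, min(len(STATES) - 1, STATES.index(state) + step))
        reward = 1.0 if STATES[idx] == "s3" else 0.0
        return STATES[idx], reward

    def generate_episode(policy, max_steps=20):
        # Roll out one episode under the current epsilon-soft policy pi.
        state = random.choice(["s1", "s2"])
        episode = []
        for _ in range(max_steps):
            actions, probs = zip(*policy[state].items())
            action = random.choices(actions, weights=probs)[0]
            next_state, reward = simulate(state, action)
            episode.append((state, action, reward))
            if next_state == "s3":              # assumed terminal state
                break
            state = next_state
        return episode

    # Initialize Q(s,a), Returns(s,a), and an arbitrary epsilon-soft policy (uniform)
    Q = defaultdict(float)
    Returns = defaultdict(list)
    pi = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in STATES}

    for _ in range(5000):                       # "go back to L1"
        episode = generate_episode(pi)

        # Evaluation: count only the first time (s,a) occurs in the episode
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            g = sum(GAMMA ** k * r for k, (_, _, r) in enumerate(episode[t:]))
            Returns[(s, a)].append(g)
            Q[(s, a)] = sum(Returns[(s, a)]) / len(Returns[(s, a)])

        # Improvement: epsilon-soft greedy update for the states seen in the episode
        for s in {s for s, _, _ in episode}:
            a_star = max(ACTIONS, key=lambda a: Q[(s, a)])
            for a in ACTIONS:
                pi[s][a] = EPSILON / len(ACTIONS) + (1.0 - EPSILON if a == a_star else 0.0)

    print({s: max(pi[s], key=pi[s].get) for s in ["s1", "s2"]})   # greedy action per state

    - note how the epsilon-soft update keeps every action's probability at least
      EPSILON / len(ACTIONS), so the rollouts keep trying "left" moves occasionally;
      that is exactly the exploration needed for the improvement step to discover
      actions the initial policy ignored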