Nov 21, 2003
-------------

- Doing RL without P and R matrices
    - think of a "simulator"
        - takes a state and an action as input
        - produces a new state and a reward as output
        => you do not have access to its internal workings!

- Revisiting the simple example of a three-state problem
    - s1, s2, s3 placed horizontally
    - actions: 1l, 2l, 1r, 2r
    - fix the initial policy to be: 1r in every state
    - generate "episodes" using this policy
    - assess the Q values
    - improve the policy using the Q values

- How do you model a two-player game?
    - the simulator "hides" the second player
    - "fake" the other player and factor him into the stochastics of the environment

- A serious problem
    - what if your original policy did not have the "good" actions in it?
    - how can improvement ever come up with the best policy?
    - Ans: it cannot!
    - this is the basic conflict in RL: "exploration vs. exploitation"

- Solution: use epsilon-soft policies only!
    - make your policy always explore a little
    - outline of an epsilon-soft algorithm
        - do the best thing 90% of the time
        - pick a random action 10% of the time (this facilitates exploration)
    - this is what you need for the final project

- Pseudocode for epsilon-soft policies (a runnable sketch follows below)

    Initialize all Q(s,a) to some number, or zero
    Initialize Returns(s,a) to an empty list
    Initialize pi to be an arbitrary epsilon-soft policy

    L1: Generate an episode using pi

        for each (s,a) occurring in the episode
            // count only the first time that (s,a) occurs
            find the return "r" following that occurrence
            append "r" to the Returns(s,a) list
            Q(s,a) = average[ Returns(s,a) ]
        end for

        // now use Q(s,a) to create a policy
        for each s in the episode       // only these would have changed
            a* = argmax_a Q(s,a)        // i.e., find the a at which Q(s,a)
                                        // becomes maximum
            distribute 90% probability to a*
            distribute 10% probability among everybody   // including a*
            // this could have been 80-20 or 95-5; just make sure all
            // actions have a non-zero probability of being taken
            update pi
        end for

        go back to L1
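
- A runnable Python sketch of the pseudocode above
    - illustration only, not the course's reference implementation: the lecture
      does not specify the three-state simulator's rewards, terminal state, or
      discount factor, so those details (reward 1.0 for reaching s3, s3 terminal,
      gamma = 0.9) are assumptions made just so the code runs end to end

    import random
    from collections import defaultdict

    STATES = ["s1", "s2", "s3"]
    ACTIONS = ["1l", "2l", "1r", "2r"]          # move 1 or 2 cells left/right
    EPSILON = 0.1                               # the 90/10 split from the notes
    GAMMA = 0.9                                 # assumed discount factor

    def simulate(state, action):
        # Black-box simulator: (state, action) in, (new state, reward) out.
        # The dynamics and reward here are assumptions, not from the lecture.
        step = {"1l": -1, "2l": -2, "1r": +1, "2r": +2}[action]
        idx = max(0, min(len(STATES) - 1, STATES.index(state) + step))
        reward = 1.0 if STATES[idx] == "s3" else 0.0
        return STATES[idx], reward

    def generate_episode(policy, max_steps=20):
        # Roll out one episode under the current epsilon-soft policy pi.
        state = random.choice(["s1", "s2"])
        episode = []
        for _ in range(max_steps):
            actions, probs = zip(*policy[state].items())
            action = random.choices(actions, weights=probs)[0]
            next_state, reward = simulate(state, action)
            episode.append((state, action, reward))
            if next_state == "s3":              # assumed terminal state
                break
            state = next_state
        return episode

    # Initialize Q(s,a), Returns(s,a), and an arbitrary epsilon-soft policy (uniform)
    Q = defaultdict(float)
    Returns = defaultdict(list)
    pi = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in STATES}

    for _ in range(5000):                       # "go back to L1"
        episode = generate_episode(pi)

        # Evaluation: count only the first time (s,a) occurs in the episode
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            g = sum(GAMMA ** k * r for k, (_, _, r) in enumerate(episode[t:]))
            Returns[(s, a)].append(g)
            Q[(s, a)] = sum(Returns[(s, a)]) / len(Returns[(s, a)])

        # Improvement: epsilon-soft greedy update for the states seen in the episode
        for s in {s for s, _, _ in episode}:
            a_star = max(ACTIONS, key=lambda a: Q[(s, a)])
            for a in ACTIONS:
                pi[s][a] = EPSILON / len(ACTIONS) + (1.0 - EPSILON if a == a_star else 0.0)

    print({s: max(pi[s], key=pi[s].get) for s in ["s1", "s2"]})   # greedy action per state

    - note how the epsilon-soft update keeps every action's probability at least
      EPSILON / len(ACTIONS), so the rollouts keep trying "left" moves occasionally;
      that is exactly the exploration needed for the improvement step to discover
      actions the initial policy ignored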