Final Project

Date Assigned: November 28, 2007
Date Due: December 7, 2007, 11:59:59pm, by email to the instructor and the GTA
  1. (50 points) Due to the interest in autonomous vehicles on campus, for this year we will develop a reinforcement learning agent that learns how to drive a car out of a steep mountainous region, as shown in the figure below:

    Since we are not mechanical engineers, we will "simulate" the car from within our program according to some simple laws of physics. First, observe that we are only interested in the "x" position of the car. The goal is at the far right of the x-axis. We will assume that the range of the x-axis is from [-1.2, 0.5] so that the goal is at coordinate "0.5". The car might start at any x-value in this range and the goal is to steer it to reach "0.5".

    The problem is that gravity is stronger than the car's engine and even at full throttle the car cannot accelerate up the steep slope when starting (at zero velocity) say at the bottom. The only solution is to first move away from the goal and up the opposite slope on the left. This is a simple example of a continuous control task where things have to get worse in a sense (farther from the goal) before they can get better.

    The actions available to the car are only of three possibilities: full throttle forward (+1), full throttle reverse (-1) and zero throttle (0). Initially, the car is at some x-coordinate in the above range and is at rest (i.e., its velocity is zero). Based on what action we take, the velocities and x-coordinate are updated according to the two equations:

    • v[t+1] = bound ( v[t] + 0.001*a[t] - 0.0025*cos(3*x[t]) )
    • x[t+1] = bound ( x[t] + v[t+1] )

    Here the bound operation enforces the constraints

    • -1.2 <= x[t+1] <= 0.5 and
    • -0.07 <= v[t+1] <= 0.07.

    When x[t+1] reaches the left bound (i.e., -1.2) this means the car has crashed into the wall and its velocity v[t+1] is reset to 0. Similarly, when x[t+1] reaches the right boundary it has reached the goal and the episode is terminated. You can try out different values for the acceleration at at full throttle and full throttle reverse. a[t] is the acceleration at time "t", and depends on the action being taken, i.e., full throttle forward, full throttle reverse, or zero throttle. For the case of zero throttle, a[t] would of course be zero. Notice that according to the equation above, the vehicle might still be moving even at zero throttle, because of what happened in the previous time step. For the case of full throttle forward, you can set a[t] to be constant at some value of your choice (depends on whether you are driving a rugged Subaru or a delicate electric car). Similarly for the case of full throttle reverse, you can set a[t] to be some value of your choice. You should design your agent so that the initial position can be anywhere in the range [-1.2, 0.5] (with velocity of course at zero). First, decide how to represent the states, actions, rewards, and discount factor. You can "discretize" the velocity ("v") and position ("x") values in this part of the assignment so that the states can be represented by discrete indices. What this means is that although the equations are calculated with continuous-values of "v" and "x", when it comes time to storing values for specific states, these continuous values are bucketed into discrete array indices. You must design the rewards yourself. You should decide what reward you will give for reaching the goal (far right), what reward for hitting the wall (far left), and what reward if any for languishing in the middle. Also decide on a discount factor.

    Design a Monte Carlo approach to estimating the values for all states. Then use this computed value function to derive a policy "pi" for getting the car to the destination. Inspect your policy and figure out what it is saying: what is the threshold position at which it says to "go forward", instead of "reverse first and then go forward"? How does this threshold position depend on the acceleration parameters? Also experiment with different discretizations and see how that affects the results.

  2. (35 points) Same as above but use the temporal difference algorithm to estimate values. How many iterations does this algorithm take to find the right policy versus the Monte Carlo approach?

  3. (15 points) With the massive explosion in online content, a key problem that AI researchers are looking at is "machine reading", i.e., the automatic understanding of text. For instance, researchers would like an AI system to read in text files of published research papers in, say, cancer journals, and be able to form an internal representation of the research, in order to answer questions such as "what is known about the effect of the p53 tumor suppressor gene in kindey cells?" Observe that this question might not be studied in any particular sentence but information might need to be distilled from multiple sentences and put together to provide an answer.

    Assume that your first job after you graduate is to solve this problem. We don?t expect that you know how to solve this problem. What strategies could you use to find out more about the problem and how to make progress on it? What strategies would you use to evaluate your progress towards solution?

  4. (50 points: extra credit) Back to the mountain car problem. Do not discretize the positions and velocities. Instead use a neural network in place of a lookup table. How exactly to do this is left to your creativity. You are also allowed to use either algorithm: MC or TD, or both. For full credit, explain how you adapted the neural network learning code to use with MC or TD and performance results after learning.

Return Home