Final Project
Date Assigned: November 28, 2007
Date Due: December 7, 2007, 11:59:59pm, by email to the instructor and the GTA
- (50 points) Due to the interest in autonomous vehicles on campus, for this year we will develop a reinforcement
learning agent that learns how to drive a car out of a steep mountainous
region, as shown in the figure below:
Since we are not mechanical engineers, we will "simulate" the car from within our
program according to some simple laws of physics.
First, observe that we are only interested in the "x" position of the car.
The goal is at the far right of the x-axis. We will assume that the range of
the x-axis is from [-1.2, 0.5] so that the goal is at coordinate "0.5". The car
might start at any x-value in this range and the goal is to steer it to reach "0.5".
The problem is that gravity is stronger than the car's engine and even at full throttle the car cannot accelerate up the steep slope when starting (at zero
velocity) say at the bottom. The only solution is to first move away from
the goal and up the opposite slope on the left. This is a simple example
of a continuous control task where things have to get worse in a sense
(farther from the goal) before they can get better.
The actions available to the car are only of three possibilities:
full throttle forward (+1), full throttle
reverse (-1) and zero throttle (0). Initially, the car is at some x-coordinate
in the above range and is at rest (i.e., its velocity is zero). Based on what
action we take, the velocities and x-coordinate are updated according to the
two equations:
- v[t+1] = bound ( v[t] + 0.001*a[t] - 0.0025*cos(3*x[t]) )
- x[t+1] = bound ( x[t] + v[t+1] )
Here the bound operation enforces the constraints
- -1.2 <= x[t+1] <= 0.5 and
- -0.07 <= v[t+1] <= 0.07.
When x[t+1] reaches the left bound (i.e., -1.2) this means the car has crashed
into the wall and its
velocity v[t+1] is reset to 0. Similarly, when x[t+1] reaches the right boundary
it has reached the goal and the episode is terminated.
You can try out different values for the acceleration at at full throttle and full throttle reverse. a[t] is the acceleration at time "t", and depends on the action
being taken, i.e., full throttle forward, full throttle reverse, or zero
throttle. For the case of zero throttle, a[t] would of course be zero. Notice
that according to the equation above, the vehicle might still be moving even
at zero throttle, because of what happened in the previous
time step. For the case of full throttle forward, you can set a[t] to be constant
at some value of your choice (depends on whether you are driving a rugged
Subaru or a delicate electric car). Similarly for the case of
full throttle reverse, you can set a[t] to be some value of your choice.
You should design your agent so that the initial position can be anywhere in the
range [-1.2, 0.5] (with velocity of course at zero). First, decide how to represent
the states, actions, rewards, and discount factor. You can "discretize" the
velocity ("v") and position ("x") values in this part of the assignment so that
the states can be represented by discrete indices. What this means is that
although the equations are calculated with continuous-values of "v" and "x", when
it comes time to storing values for specific states, these continuous values
are bucketed into discrete array indices. You must design the rewards yourself.
You should decide what reward you will give for reaching the goal (far right),
what reward for hitting the wall (far left), and what reward if any for languishing
in the middle. Also decide on a discount factor.
Design a Monte Carlo approach to estimating the values
for all states. Then use this computed value function to derive
a policy "pi" for getting the car to the destination. Inspect your policy
and figure out what it is saying: what is the threshold position at which it
says to "go forward", instead of "reverse first and then go forward"? How does
this threshold position depend on the acceleration parameters? Also experiment
with different discretizations and see how that affects the results.
- (35 points) Same as above but use the temporal difference algorithm to
estimate values. How many iterations does this algorithm take to find the right
policy versus the Monte Carlo approach?
- (15 points)
With the massive explosion in online content, a key problem that AI researchers
are looking at is "machine reading", i.e., the automatic understanding of text.
For instance, researchers would like an AI system to read in text files of
published research papers in, say, cancer journals, and be able to form an internal
representation of the research, in order to answer questions such as "what is
known about the effect of the p53 tumor suppressor gene in kindey cells?"
Observe that this question might not be studied in any particular sentence but
information might need to be distilled from multiple sentences and put together
to provide an answer.
Assume that your first job after you graduate is to solve this problem. We
don?t expect that you know how to solve this problem. What strategies could
you use to find out more about the problem and how to make progress on it? What
strategies would you use to evaluate your progress towards solution?
- (50 points: extra credit) Back to the mountain car problem. Do not discretize the positions and velocities.
Instead use a neural network in place of a lookup table. How exactly to do this
is left to your creativity. You are also allowed to use either algorithm:
MC or TD, or both. For full credit, explain how you adapted the neural network
learning code to use with MC or TD and performance results after learning.