Paper
What's the problem?
We are given a single demonstration of a path, consisting of observations and actions. Our goal is to re-execute this path either forwards (i.e., following it) or backwards (i.e., homing behavior). The difficulty is that as we execute the path, our actions are noisy and the world may change in the meantime. Both mean that blind replay of the actions will not succeed: if we simply replay them, we may end up somewhere else (try going from your bed to your office while blindfolded), or we may bump into things (the pedestrians you walked around yesterday are not in the same place today). The paper presents a method, described below, that aims to solve this task and compares it with alternative approaches on two environments.
Results
We present a video visualization for the path-following experiments. On the left
is the overhead view (which the agent does not have access to). In the middle
is the demonstration, which the agent has to follow (either forwards or backwards).
On the right is the agent's execution.
How does it work?
The method has two components: networks φ and Ψ that abstract the image observations seen while the path is demonstrated into a sequence of vectors that are easily digested by a learning method, and a recurrent network π that attends to this sequence, looks through the camera, and chooses an action.

The job of φ and Ψ is to convert the image and action observations from the path demonstration into something a learning system can handle: a path abstraction. This consists of a CNN φ that is applied to each image, as well as a two-layer fully-connected network Ψ that blends together the image and action observations. The result is a matrix with as many rows as there were steps in the demonstration.

The job of π is to use this path abstraction to re-execute the path under noisy actuation and a changing world; it is implemented as a GRU. As input, π takes the current image, its previous state, and the path abstraction. Rather than look at all steps of the path at once, π maintains a pointer η into the abstraction that it uses to softly attend to the steps of the demonstration. At each step, π predicts an update to where η points.

The entire system is end-to-end trainable, and we train it with imitation learning on 120K episodes. Each episode consists of a 30-step path which the agent is given 40 steps to execute (each step is 40 cm forward or a 30° rotation in expectation).
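The pointer mechanism can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the path abstraction is stood in for by a random matrix, the Gaussian-style softmax weighting and the `temperature` parameter are assumptions about the attention's form, and the per-step pointer increment is hard-coded where the real system would have π predict it.

```python
import numpy as np

def soft_attention(abstraction, eta, temperature=4.0):
    """Softly read the path abstraction around pointer position eta.

    abstraction: (T, d) matrix, one row per demonstration step.
    eta: scalar pointer into [0, T-1].
    Returns a (d,) context vector: a convex combination of the rows,
    weighted by proximity to eta.
    """
    T = abstraction.shape[0]
    # Assumed weighting: softmax over negative squared distance to eta.
    # The paper's exact attention parameterization may differ.
    scores = -((np.arange(T) - eta) ** 2) / temperature
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ abstraction

# Toy rollout: the policy pi would predict a pointer update each step;
# here we hard-code an increment of +1 purely for illustration.
rng = np.random.default_rng(0)
abstraction = rng.standard_normal((30, 8))  # 30-step demo, 8-d features
eta = 0.0
for _ in range(3):
    context = soft_attention(abstraction, eta)  # would be fed to the GRU
    eta = np.clip(eta + 1.0, 0.0, abstraction.shape[0] - 1)
```

Because the attention is a smooth function of η, gradients flow through the pointer updates, which is what lets the whole system be trained end to end.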
Acknowledgments