Visual Memory for Robust Path Following

Ashish Kumar*
Saurabh Gupta*
David Fouhey
Sergey Levine
Jitendra Malik

University of California at Berkeley

Problem Setup: Given a set of reference images and poses, the starting image and pose, and a goal pose, we want a policy π that can drive the robot from its current pose to the target pose using first-person RGB image observations under noisy actuation.

Humans routinely retrace paths in a novel environment both forwards and backwards despite uncertainty in their motion. This paper presents an approach for doing so. Given a demonstration of a path, a first network generates a path abstraction. Equipped with this abstraction, a second network observes the world and decides how to act in order to retrace the path under noisy actuation and a changing environment. The two networks are optimized end-to-end at training time. We evaluate the method in two realistic simulators, performing path following and homing under actuation noise and environmental changes. Our experiments show that our approach outperforms classical approaches as well as other learning-based baselines.


Paper

Visual Memory for Robust Path Following
Ashish Kumar*, Saurabh Gupta*, David Fouhey, Sergey Levine, Jitendra Malik
To appear at NeurIPS 2018 (Oral)

bibtex / pdf


What's the problem?



The setup is as follows: we are given a single demonstration of a path, consisting of observations and actions. Our goal is to re-execute this path either forwards (i.e., following it) or backwards (i.e., homing behavior). The difficulty is that as we execute this path, our actions are noisy and the world may change in the meantime. Both mean that blind replay of actions will not succeed -- if we simply replay the actions, we may end up somewhere else (try going from your bed to your office while blindfolded) or we may end up bumping into things (the pedestrians you walked around yesterday are not in the same place today).

The paper presents a method, described below, that aims to solve this task and compares it with alternate approaches on two environments.


Results

We present a video visualization of the path-following experiments. On the left is the overhead view (which the agent does not have access to). In the middle is the demonstration, which the agent has to follow (either forwards or backwards). On the right is the agent's execution.


First, we show following a path forwards.


Next, we show following a path backwards. Note that there are two doors. To successfully retrace the path backwards, the agent must identify which door it came out of.


Finally, we show executing a path with the world changing between demonstration and execution (note that the bed is missing). Here we show both open-loop replay and our proposed method.


How does it work?



The method contains two components: networks φ and Ψ that abstract the observations seen while the path is demonstrated into a sequence of vectors that is easily digested by a learning method; and a recurrent network π that attends to this sequence of vectors, looks through the camera, and chooses an action.

The job of φ and Ψ is to convert the image and action observations from the path demonstration into something a learning system can handle: a path abstraction. The abstraction is built by a CNN φ that is applied to each image, together with a two-layer fully-connected network Ψ that blends together the image and action observations. This produces a matrix with as many rows as there were steps in the demonstration.
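To make this concrete, here is a minimal PyTorch sketch of the abstraction step. The tiny CNN, the feature sizes, and the one-hot action encoding are our assumptions for illustration, not the exact architecture from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PathAbstraction(nn.Module):
    # phi: CNN applied to each demonstration image.
    # psi: two-layer network blending image features with actions.
    # All sizes here are illustrative assumptions.
    def __init__(self, num_actions=4, feat_dim=128):
        super().__init__()
        self.num_actions = num_actions
        self.phi = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.psi = nn.Sequential(
            nn.Linear(64 + num_actions, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, images, actions):
        # images: (T, 3, H, W) floats; actions: (T,) integer action ids.
        img_feats = self.phi(images)                              # (T, 64)
        act_feats = F.one_hot(actions, self.num_actions).float()  # (T, A)
        # One row per demonstration step, as described above.
        return self.psi(torch.cat([img_feats, act_feats], dim=1))  # (T, D)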

The job of π is to use the generated path abstraction to re-execute the path under noisy actuation and a changing world; it is implemented as a GRU. As input, π receives the current image, its previous state, and the path abstraction. Rather than looking at all of the steps of the path at once, π maintains a pointer η into the abstraction that it uses to softly attend to the steps of the demonstration. At each step, π predicts an update to where this pointer η is pointing.
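Below is a matching sketch of π. We realize the soft pointer η as Gaussian-shaped attention weights over the rows of the abstraction and predict a scalar increment to η at each step; this parameterization is our assumption and may differ from the paper's.

import torch
import torch.nn as nn

class AttentionPolicy(nn.Module):
    # GRU policy pi with a soft pointer eta into the path abstraction.
    def __init__(self, feat_dim=128, hidden=256, num_actions=4):
        super().__init__()
        self.gru = nn.GRUCell(64 + feat_dim, hidden)
        self.act_head = nn.Linear(hidden, num_actions)  # action logits
        self.ptr_head = nn.Linear(hidden, 1)            # update to eta

    def forward(self, img_feat, abstraction, h, eta):
        # img_feat: (64,) feature of the current image (e.g., from phi).
        # abstraction: (T, D) path abstraction; h: (1, hidden); eta: scalar.
        T = abstraction.shape[0]
        idx = torch.arange(T, dtype=torch.float32)
        w = torch.softmax(-(idx - eta) ** 2, dim=0)          # attend near eta
        context = (w.unsqueeze(1) * abstraction).sum(dim=0)  # (D,)
        h = self.gru(torch.cat([img_feat, context]).unsqueeze(0), h)
        eta = eta + self.ptr_head(h).squeeze()               # move the pointer
        return self.act_head(h), h, eta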

The entire system is end-to-end trainable from data. We train it using imitation learning on 120K episodes. Each episode consists of a 30-step path that the agent is given 40 steps to execute (each step is, in expectation, 40 cm forward or a 30° rotation).
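For concreteness, here is a minimal behavior-cloning style loop over the pieces sketched above. sample_episode is a hypothetical helper standing in for the simulator and the expert; the episode count and step budgets are from the paper, while the optimizer, learning rate, and loss are our assumptions.

import torch
import torch.nn.functional as F

abstractor = PathAbstraction()
policy = AttentionPolicy()
params = list(abstractor.parameters()) + list(policy.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)  # lr is an assumption

for episode in range(120_000):
    # Hypothetical helper: returns the demonstration (images + actions)
    # and an execution rollout with an expert action label at each step.
    demo_images, demo_actions, exec_images, expert_actions = sample_episode()
    abstraction = abstractor(demo_images, demo_actions)  # (30, D)
    h, eta, loss = torch.zeros(1, 256), torch.tensor(0.0), 0.0
    for t in range(40):  # 40 steps to execute a 30-step path
        img_feat = abstractor.phi(exec_images[t].unsqueeze(0)).squeeze(0)
        logits, h, eta = policy(img_feat, abstraction, h, eta)
        loss = loss + F.cross_entropy(logits, expert_actions[t].unsqueeze(0))
    optimizer.zero_grad(); loss.backward(); optimizer.step()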


Acknowledgments

This work was supported in part by Intel/NSF VEC award IIS-1539099 and a Google Fellowship to SG. This webpage template was borrowed from some colorful folks.