Hierarchies are an effective way to boost sample efficiency in reinforcement learning and computational efficiency in classical planning. However, acquiring hierarchies via hand-design (as in classical planning) is suboptimal, while acquiring them via end-to-end reward-based training (as in reinforcement learning) is unstable and still prohibitively expensive. In this paper, we pursue an alternate paradigm for acquiring such hierarchical abstractions (or visuo-motor subroutines): the use of passive first-person observation data. We use an inverse model, trained on small amounts of interaction data, to pseudo-label the passive first-person videos with agent actions. Visuo-motor subroutines are acquired from these pseudo-labeled videos by learning a latent intent-conditioned policy that predicts the inferred pseudo-actions from the corresponding image observations. We demonstrate our proposed approach in the context of navigation, and show that we can successfully learn consistent and diverse visuo-motor subroutines from passive first-person videos. We demonstrate the utility of the acquired visuo-motor subroutines by using them as is for exploration, and as sub-policies in a hierarchical RL framework for reaching point goals and semantic goals. We also demonstrate the behavior of our subroutines in the real world, by deploying them on a real robotic platform.
What are we Learning?
Given an input image as shown above, we want to be able to execute
subroutines (short-horizon policies that exhibit a coherent behavior, such as going left into a room),
and to know the affordances, i.e. which subroutines can be invoked where. In our experiments, we learn 4
subroutines from passive first-person navigation videos. These subroutines show consistent and diverse
behaviors, as shown below (top views of the subroutines unrolled
from different starting locations):
We also deploy the learned subroutines as is in the real world on a real robot:
Why Subroutines?
Learned subroutines and affordance models can be transferred
to downstream navigation tasks. Our subroutines and
the affordance model can be used as is, in conjunction with
each other, to tackle tasks like exploration of novel environments:
we simply compose the subroutines via the affordance
model to generate exploration behavior. Our method outperforms several handcrafted and learning-based baselines.
The coverage of the overall space after sampling 20 roll-outs from 11 different
locations is visualized on the right.
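As a concrete illustration, here is a minimal sketch of this composition, assuming a gym-style environment and hypothetical affordance_model and subroutine_policies objects standing in for the learned models (placeholder names, not the released interface): the affordance model picks which subroutine to invoke from the current image, and the chosen subroutine is unrolled as is for a short horizon.

    import torch

    NUM_MACRO_STEPS = 20    # number of subroutine invocations per roll-out
    SUBROUTINE_HORIZON = 8  # low-level steps each subroutine is unrolled for

    def explore(env, affordance_model, subroutine_policies):
        """Compose learned subroutines via the affordance model to explore."""
        obs = env.reset()
        for _ in range(NUM_MACRO_STEPS):
            # The affordance model scores which subroutines can be invoked here.
            logits = affordance_model(obs)                      # shape: [num_subroutines]
            k = torch.distributions.Categorical(logits=logits).sample().item()
            # Unroll the chosen subroutine as is for a short horizon.
            for _ in range(SUBROUTINE_HORIZON):
                action = subroutine_policies[k](obs)
                obs, _, done, _ = env.step(action)
                if done:
                    return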
     
Furthermore,
this decomposition into subroutines and affordance
models fits very naturally into hierarchical reinforcement
learning frameworks. Our subroutines are analogous to
sub-policies, while the affordance model is analogous to the
meta-controller.
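One plausible way to wire this up is sketched below, with hypothetical meta_controller, sub_policies, and gym-style env objects (placeholder names, not the paper's released code): the meta-controller picks a subroutine index, the corresponding sub-policy acts for a fixed number of low-level steps, and the collected (log-probability, reward) pairs can then be used to train the meta-controller with a policy-gradient method.

    import torch

    def rollout(env, meta_controller, sub_policies, horizon=8, max_macro_steps=50):
        """Collect (log-prob, reward) pairs for policy-gradient training of the meta-controller."""
        obs = env.reset()
        trajectory, done = [], False
        for _ in range(max_macro_steps):
            dist = torch.distributions.Categorical(logits=meta_controller(obs))
            k = dist.sample()                 # high-level choice: which subroutine to run
            macro_reward = 0.0
            for _ in range(horizon):          # commit to sub-policy k for `horizon` steps
                obs, reward, done, _ = env.step(sub_policies[k.item()](obs))
                macro_reward += reward
                if done:
                    break
            trajectory.append((dist.log_prob(k), macro_reward))
            if done:
                break
        return trajectory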
Initializing from VMSR leads to large gains in sample complexity
for downstream navigation tasks. The first two images show results for
PointGoal (go to an (x, y) coordinate); the middle two show results for
AreaGoal (go to the washroom). We see improvements across these
tasks for both sparse and dense reward scenarios, with larger gains
in the harder case of sparser rewards. VMSR is up to 4× more
sample efficient than the next best hierarchical method.
Initializing a flat RL policy with
VMSR (with only 1 SubR) leads to improved sample complexity
for the downstream AreaGoal navigation task (go to the washroom), as
compared to alternative initialization schemes: random
initialization, initialization from ImageNet features, and initialization
from skills obtained via curiosity (shown in the last two images).
How do we learn Subroutines and the Affordance model?
We start with first-person navigation videos from agents
R1 ... Rn, without corresponding action labels. People constantly
upload these kinds of videos online, making them freely
available. Given these videos, we want our robot S to learn
subroutines from them. This happens in two phases.

In the first phase, we generate
pseudo-action labels for these videos by running an inverse
model on every consecutive pair of images. This inverse
model is learned by the agent S using self-supervision on
random exploration data. An interesting thing to note is that
the action space of R1 ... Rn might be different from the action
space of our agent S. Hence, these pseudo-action labels are
not the actual actions taken, but the actions, in S's own action
space, that S imagines would most closely reproduce the
transitions between consecutive observations in the reference video.
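The sketch below illustrates this first phase in PyTorch, under assumed shapes and names (none of the class or variable names come from the released code): an inverse model is trained on S's own random-exploration transitions (o_t, a_t, o_{t+1}) with a cross-entropy loss, and is then used to pseudo-label every consecutive pair of video frames with an action from S's action space.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class InverseModel(nn.Module):
        """Predicts the action a_t that connects observations o_t and o_{t+1}."""
        def __init__(self, num_actions):
            super().__init__()
            self.encoder = nn.Sequential(                  # shared image encoder
                nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Flatten())
            self.head = nn.LazyLinear(num_actions)         # classifier over S's action space

        def forward(self, obs_t, obs_tp1):
            feats = torch.cat([self.encoder(obs_t), self.encoder(obs_tp1)], dim=-1)
            return self.head(feats)

    # Training on S's self-supervised random-exploration data:
    #   loss = F.cross_entropy(inverse_model(obs_t, obs_tp1), actions_taken)

    def pseudo_label(inverse_model, video_frames):
        """Label each consecutive pair of frames [T, 3, H, W] with an imagined action of S."""
        with torch.no_grad():
            logits = inverse_model(video_frames[:-1], video_frames[1:])
            return logits.argmax(dim=-1)                   # one pseudo-action per transition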
In the second phase, we start from these pseudo-labeled videos and train a forward prediction
model that takes an image as input and predicts the corresponding
pseudo-action taken in the reference video. However, this
is a fundamentally ambiguous task: for example, an agent
Rk in the reference video that is facing a T-junction could
have gone either left or right. To disambiguate, we allow
another network to look at the entire sequence of actions
of Rk in the reference video and encode the behavior
as a one-hot latent intent vector that is additionally used
to make the forward prediction. We additionally train an affordance model that predicts,
for a given input image, which of the learned subroutines in
our repertoire can be invoked. We do this by predicting, from the
first image alone, the one-hot encoding inferred for the trajectory.
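A minimal sketch of this second phase follows, again under assumed names and shapes and with a Gumbel-softmax bottleneck (one common way to keep a one-hot code differentiable; the paper's exact discretization is not reproduced here): a trajectory encoder maps Rk's pseudo-action sequence to a one-hot intent, a subroutine policy predicts each pseudo-action from the image conditioned on that intent, and the affordance model is a classifier from the first image to the same intent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_SUBROUTINES, NUM_ACTIONS = 4, 3    # assumed sizes, for illustration only

    class IntentEncoder(nn.Module):
        """Encodes the whole pseudo-action sequence into a one-hot latent intent."""
        def __init__(self):
            super().__init__()
            self.rnn = nn.GRU(NUM_ACTIONS, 64, batch_first=True)
            self.head = nn.Linear(64, NUM_SUBROUTINES)

        def forward(self, action_seq_onehot):               # [B, T, NUM_ACTIONS]
            _, h = self.rnn(action_seq_onehot)
            return F.gumbel_softmax(self.head(h[-1]), hard=True)   # [B, NUM_SUBROUTINES]

    class SubroutinePolicy(nn.Module):
        """Predicts the pseudo-action from the image and the latent intent."""
        def __init__(self, image_encoder):
            super().__init__()
            self.image_encoder = image_encoder               # e.g. a conv net as in the inverse model
            self.head = nn.LazyLinear(NUM_ACTIONS)

        def forward(self, obs, intent):                      # obs: [B, 3, H, W]
            return self.head(torch.cat([self.image_encoder(obs), intent], dim=-1))

    # The affordance model is a classifier from the first image to the inferred intent:
    #   loss = F.cross_entropy(policy(obs_t, intent), pseudo_action_t) \
    #        + F.cross_entropy(affordance_model(obs_first), intent.argmax(dim=-1))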
Acknowledgments
This webpage template was borrowed from some colorful folks.