Learning Navigation Subroutines from Egocentric Videos

Ashish Kumar1
Saurabh Gupta2,3
Jitendra Malik1,2

1 University of California at Berkeley, 2 Facebook AI Research, 3 UIUC


Paper             Code


Hierarchies are an effective way to boost sample efficiency in reinforcement learning and computational efficiency in classical planning. However, acquiring hierarchies via hand-design (as in classical planning) is suboptimal, while acquiring them via end-to-end reward-based training (as in reinforcement learning) is unstable and still prohibitively expensive. In this paper, we pursue an alternate paradigm for acquiring such hierarchical abstractions (or visuo-motor subroutines): the use of passive first-person observation data. We use an inverse model, trained on small amounts of interaction data, to pseudo-label passive first-person videos with agent actions. Visuo-motor subroutines are acquired from these pseudo-labeled videos by learning a latent intent-conditioned policy that predicts the inferred pseudo-actions from the corresponding image observations. We demonstrate our proposed approach in the context of navigation, and show that we can successfully learn consistent and diverse visuo-motor subroutines from passive first-person videos. We demonstrate the utility of the acquired visuo-motor subroutines by using them as is for exploration, and as sub-policies in a hierarchical RL framework for reaching point goals and semantic goals. We also demonstrate the behavior of our subroutines in the real world, by deploying them on a real robotic platform.




What are we Learning?



Given an input image as shown above, we want to be able to execute subroutines (short-horizon policies that exhibit a coherent behavior, such as going left into a room) and to predict affordances, which tell us which subroutines can be invoked where. In our experiments, we learn 4 subroutines from passive first-person navigation videos. These subroutines show consistent and diverse behaviors, as seen in the top views below of subroutines unrolled from different starting locations.
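Concretely, each subroutine is a short-horizon policy conditioned on the current image and a one-hot subroutine id. Below is a minimal sketch of what executing one looks like at test time; the network architecture, the 3-way discrete action space, and the env interface are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SUBROUTINES = 4  # we learn 4 subroutines in our experiments
NUM_ACTIONS = 3      # assumed discrete action space, e.g. {forward, turn-left, turn-right}

class SubroutinePolicy(nn.Module):
    """Intent-conditioned policy: image features + one-hot subroutine id -> action logits."""
    def __init__(self, feat_dim=128):
        super().__init__()
        # Small stand-in image encoder; the actual model uses a ConvNet over first-person RGB.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim + NUM_SUBROUTINES, NUM_ACTIONS)

    def forward(self, image, intent):
        return self.head(torch.cat([self.encoder(image), intent], dim=1))

def unroll_subroutine(policy, env, subroutine_id, horizon=20):
    """Execute one subroutine (a short, coherent behavior) from the current state."""
    intent = F.one_hot(torch.tensor([subroutine_id]), NUM_SUBROUTINES).float()
    image = env.current_image()              # hypothetical env API returning a (3, H, W) tensor
    for _ in range(horizon):
        logits = policy(image.unsqueeze(0), intent)
        action = logits.argmax(dim=1).item()  # greedy action under the chosen subroutine
        image = env.step(action)              # hypothetical env API
```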



We also deploy the learned subroutines as is in the real world on a real robot:


Why Subroutines?

Learned subroutines and affordance models can be transferred to downstream navigation tasks. Our subroutines and affordance model can be used together, as is, to tackle tasks like exploration of novel environments: we simply compose subroutines via the affordance model to generate exploration behavior. This outperforms several hand-crafted and learning-based baselines. The coverage of the overall space after sampling 20 roll-outs from 11 different locations is visualized on the right.
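A minimal sketch of this composition is below, assuming a subroutine policy with the image-plus-one-hot-intent interface sketched above and a hypothetical env API: at every decision point the affordance model proposes which subroutines are sensible from the current image, one is sampled and unrolled for a fixed horizon, and the process repeats.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SUBROUTINES = 4

class AffordanceModel(nn.Module):
    """Predicts, from a single image, a distribution over the subroutines that can be invoked."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, NUM_SUBROUTINES))

    def forward(self, image):
        return self.net(image)  # logits over the learned subroutines

def explore(affordance, subroutine_policy, env, num_invocations=20, horizon=20):
    """Exploration = repeatedly sample an afforded subroutine and unroll it."""
    for _ in range(num_invocations):
        image = env.current_image()                                  # hypothetical env API
        probs = torch.softmax(affordance(image.unsqueeze(0)), dim=1)
        k = torch.multinomial(probs, 1).item()                       # sample a subroutine id
        intent = F.one_hot(torch.tensor([k]), NUM_SUBROUTINES).float()
        for _ in range(horizon):                                     # unroll the chosen subroutine
            action = subroutine_policy(image.unsqueeze(0), intent).argmax(dim=1).item()
            image = env.step(action)                                 # hypothetical env API
```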

     

Furthermore, this decomposition into subroutines and an affordance model fits very naturally into hierarchical reinforcement learning frameworks: our subroutines are analogous to sub-policies, while the affordance model is analogous to the meta-controller.
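One way to picture this wiring is sketched below; the module names, the switching interval K, and the greedy selection are our assumptions for illustration. In the downstream tasks, the meta-controller is initialized from the affordance model and the sub-policies from the learned subroutine policy, and everything is then fine-tuned with task reward.

```python
import torch.nn as nn
import torch.nn.functional as F

NUM_SUBROUTINES, K = 4, 20   # assumed: K low-level steps per meta-controller decision

class HierarchicalAgent(nn.Module):
    """Meta-controller (analogous to our affordance model) picks a subroutine every K steps;
    the chosen sub-policy (analogous to our subroutines) acts in between."""
    def __init__(self, meta_controller, sub_policy):
        super().__init__()
        self.meta_controller = meta_controller   # image -> logits over subroutines
        self.sub_policy = sub_policy             # (image, one-hot intent) -> action logits
        self.steps_since_switch = K
        self.current_intent = None

    def act(self, image):                        # image: (3, H, W)
        if self.steps_since_switch >= K:         # time to pick a new subroutine
            k = self.meta_controller(image.unsqueeze(0)).argmax(dim=1)
            self.current_intent = F.one_hot(k, NUM_SUBROUTINES).float()
            self.steps_since_switch = 0
        self.steps_since_switch += 1
        return self.sub_policy(image.unsqueeze(0), self.current_intent).argmax(dim=1).item()
```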



Initializing from VMSR leads to large gains in sample complexity for downstream navigation tasks. The first two plots show results for PointGoal (go to an (x, y) coordinate), and the middle two show results for AreaGoal (go to the washroom). We see improvements across these tasks for both sparse and dense reward scenarios, with larger gains in the harder case of sparser rewards. VMSR is up to 4x more sample efficient than the next best hierarchical method. Initializing a flat RL policy with VMSR (with only 1 SubR) also leads to improved sample complexity for the downstream AreaGoal task (go to the washroom), as compared to alternate initialization schemes: random initialization, initialization from ImageNet features, and initialization from skills obtained via curiosity (shown in the last two plots).


How do we learn Subroutines and the Affordance model?



We start with first-person navigation videos from agents R1 ... Rn, without corresponding action labels. People constantly upload such videos online, making this kind of data freely available. Given these videos, we want our robot S to learn subroutines from them. This happens in two phases.


In the first phase, we generate pseudo-action labels for these videos by running an inverse model on every consecutive pair of images. This inverse model is learned by the agent S using self-supervision on random exploration data. An interesting thing to note is that the action space of R1 ... Rn might be different from the action space of our agent S. Hence, these pseudo-action labels are not the actual actions taken, but the actions that agent S imagines it would need to take, in its own action space, to reproduce the transitions between consecutive observations in the reference video as closely as possible.

In the second phase, we start with these pseudo-labeled videos and train a forward prediction model that takes an image as input and predicts the corresponding action taken in the reference video. However, this is a fundamentally ambiguous task: for example, an agent Rk in the reference video that is facing a T-junction could have gone either left or right. To disambiguate this, we allow another network to look at the entire sequence of actions of Rk in the reference video and encode the behavior as a one-hot latent intent vector that is additionally used to make the forward prediction. We also train an affordance model that predicts, for a given input image, which subroutines from our repertoire of learned subroutines can be invoked. We do this by predicting the inferred one-hot encoding of the trajectory from the first image alone. Code sketches of both phases follow.
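The sketch below illustrates the first phase; the architecture, the 3-way action space of S, and the training details are assumptions for illustration, not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ACTIONS = 3  # assumed discrete action space of our agent S

class InverseModel(nn.Module):
    """Predicts the action of S that best explains the transition I_t -> I_{t+1}."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU())
        self.head = nn.Linear(2 * feat_dim, NUM_ACTIONS)

    def forward(self, img_t, img_t1):
        return self.head(torch.cat([self.encoder(img_t), self.encoder(img_t1)], dim=1))

def train_inverse_model(model, optimizer, interaction_batches):
    """Self-supervised training on S's own random-exploration data, where actions are known."""
    for img_t, img_t1, action in interaction_batches:   # (B,3,H,W), (B,3,H,W), (B,)
        loss = F.cross_entropy(model(img_t, img_t1), action)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def pseudo_label(model, frames):
    """Pseudo-label a passive video with no action labels. frames: (T, 3, H, W)."""
    return model(frames[:-1], frames[1:]).argmax(dim=1)  # imagined actions in S's action space
```

And a sketch of the second phase, continuing from the imports and NUM_ACTIONS above; the Gumbel-softmax used to keep the one-hot latent intent differentiable, the shared encoder design, and the joint loss are assumptions that stand in for the exact training details.

```python
NUM_SUBROUTINES = 4   # number of latent intents, i.e. learned subroutines

def make_encoder(feat_dim=128):
    return nn.Sequential(
        nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
        nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, feat_dim), nn.ReLU())

class IntentEncoder(nn.Module):
    """Summarizes a whole reference trajectory as a one-hot latent intent."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder, self.head = make_encoder(feat_dim), nn.Linear(feat_dim, NUM_SUBROUTINES)

    def forward(self, frames):                       # frames: (T, 3, H, W)
        logits = self.head(self.encoder(frames).mean(dim=0, keepdim=True))
        return F.gumbel_softmax(logits, hard=True)   # (1, NUM_SUBROUTINES) one-hot (assumed trick)

class IntentConditionedPolicy(nn.Module):
    """Predicts the pseudo-action at each frame from the image and the latent intent."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = make_encoder(feat_dim)
        self.head = nn.Linear(feat_dim + NUM_SUBROUTINES, NUM_ACTIONS)

    def forward(self, frames, intent):               # frames: (T-1, 3, H, W), intent: (1, K)
        feats = self.encoder(frames)
        return self.head(torch.cat([feats, intent.expand(feats.size(0), -1)], dim=1))

class Affordance(nn.Module):
    """Predicts, from the first image alone, which intent (subroutine) the trajectory follows."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder, self.head = make_encoder(feat_dim), nn.Linear(feat_dim, NUM_SUBROUTINES)

    def forward(self, first_frame):                  # (1, 3, H, W)
        return self.head(self.encoder(first_frame))

def phase2_loss(intent_enc, policy, affordance, frames, pseudo_actions):
    """frames: (T,3,H,W) clip of a reference video; pseudo_actions: (T-1,) from the inverse model."""
    z = intent_enc(frames)                                            # inferred one-hot intent
    action_loss = F.cross_entropy(policy(frames[:-1], z), pseudo_actions)
    afford_loss = F.cross_entropy(affordance(frames[:1]), z.argmax(dim=1))
    return action_loss + afford_loss
```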


Acknowledgments

This webpage template was borrowed from some colorful folks.