Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory
Arun Balajee Vasudevan1 · Dengxin Dai1 · Luc Van Gool1,2

1 ETH Zurich, Zurich, Switzerland
2 KU Leuven, Leuven, Belgium

✉ Corresponding author: Arun Balajee Vasudevan, [email protected]; Dengxin Dai, [email protected]; Luc Van Gool, [email protected]

Received: 13 August 2019 / Accepted: 19 August 2020 © The Author(s) 2020
Abstract

The role of robots in society keeps expanding, bringing with it the necessity of interacting and communicating with humans. To keep such interaction intuitive, we provide automatic wayfinding based on verbal navigational instructions. Our first contribution is the creation of a large-scale dataset of verbal navigation instructions. To this end, we developed an interactive visual navigation environment based on Google Street View; we further designed an annotation method that highlights mined anchor landmarks and the local directions between them, helping annotators formulate typical, human-like references to them. The annotation task was crowdsourced on the AMT platform to construct the new Talk2Nav dataset with 10,714 routes. Our second contribution is a new learning method. Inspired by spatial cognition research on the mental conceptualization of navigational instructions, we introduce a soft dual attention mechanism defined over the segmented language instructions to jointly extract two partial instructions: one for matching the next upcoming visual landmark, and the other for matching the local directions to that landmark. Along similar lines, we also introduce a spatial memory scheme to encode the local directional transitions. Our work takes advantage of advances in two lines of research: the mental formalization of verbal navigational instructions, and the training of neural network agents for automatic wayfinding. Extensive experiments show that our method significantly outperforms previous navigation methods. For the demo video, dataset and code, please refer to our project page.

Keywords Vision-and-language navigation · Long-range navigation · Spatial memory · Dual attention
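To make the dual attention idea concrete, below is a minimal sketch, not the paper's actual implementation: it assumes PyTorch, pre-computed encodings of the instruction segments, an agent state vector, and two learned task-specific query projections; all module names and dimensions are hypothetical.

```python
# Minimal sketch of a soft dual attention over segmented instruction
# encodings. Illustrative assumption only, not the paper's exact model:
# the two learned queries (one per sub-task) and all shapes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftDualAttention(nn.Module):
    def __init__(self, seg_dim: int, hidden_dim: int):
        super().__init__()
        # One query projection per sub-task:
        # landmark matching and local-direction matching.
        self.landmark_query = nn.Linear(hidden_dim, seg_dim)
        self.direction_query = nn.Linear(hidden_dim, seg_dim)

    def forward(self, segments: torch.Tensor, state: torch.Tensor):
        """
        segments: (B, S, seg_dim) encodings of the S instruction segments
        state:    (B, hidden_dim) current agent state
        Returns two soft "partial instructions", one per sub-task.
        """
        q_lm = self.landmark_query(state)     # (B, seg_dim)
        q_dir = self.direction_query(state)   # (B, seg_dim)
        scale = segments.size(-1) ** 0.5
        # Scaled dot-product attention weights over the segments.
        a_lm = F.softmax(segments @ q_lm.unsqueeze(-1) / scale, dim=1)    # (B, S, 1)
        a_dir = F.softmax(segments @ q_dir.unsqueeze(-1) / scale, dim=1)  # (B, S, 1)
        # Attention-weighted mixtures of segment encodings.
        landmark_instr = (a_lm * segments).sum(dim=1)    # (B, seg_dim)
        direction_instr = (a_dir * segments).sum(dim=1)  # (B, seg_dim)
        return landmark_instr, direction_instr
```

In such a setup, the agent would match landmark_instr against visual features of the scene and direction_instr against its recent motion; the soft weights let both sub-tasks focus on different parts of the same instruction.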
1 Introduction

Consider that you are traveling as a tourist in a new city and are looking for a destination you would like to visit. You ask the locals and get a directional description: "go ahead for about 200 m until you hit a small intersection, then turn left and continue along the street until you see a yellow building on your right". People give indications that are not purely directional, let alone metric. They mix in references to landmarks that you will find along your route. This may seem like a trivial ability, as humans do it routinely. Yet,
this is a complex cognitive task that relies on the development of an internal, spatial representation that includes visual landmarks (e.g. "the yellow building") and possible local directions (e.g. "going forward for about 200 m"). Such a representation supports continuous self-localization as well as conveying a sense of direction towards the goal. Just a