Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory
Arun Balajee Vasudevan1 · Dengxin Dai1 · Luc Van Gool1,2

1 ETH Zurich, Zurich, Switzerland
2 KU Leuven, Leuven, Belgium

✉ Corresponding author: Arun Balajee Vasudevan, [email protected]; Dengxin Dai, [email protected]; Luc Van Gool, [email protected]

Received: 13 August 2019 / Accepted: 19 August 2020 © The Author(s) 2020
Abstract

The role of robots in society keeps expanding, bringing with it the necessity of interacting and communicating with humans. To keep such interaction intuitive, we provide automatic wayfinding based on verbal navigational instructions. Our first contribution is the creation of a large-scale dataset of verbal navigation instructions. To this end, we developed an interactive visual navigation environment based on Google Street View; we further designed an annotation method that highlights mined anchor landmarks and the local directions between them, helping annotators formulate typical, human-like references to them. The annotation task was crowdsourced on the AMT platform to construct the new Talk2Nav dataset with 10,714 routes. Our second contribution is a new learning method. Inspired by spatial cognition research on the mental conceptualization of navigational instructions, we introduce a soft dual attention mechanism defined over the segmented language instructions to jointly extract two partial instructions: one for matching the next upcoming visual landmark, and the other for matching the local directions to that landmark. Along similar lines, we also introduce a spatial memory scheme to encode the local directional transitions. Our work takes advantage of advances in two lines of research: the mental formalization of verbal navigational instructions, and the training of neural network agents for automatic wayfinding. Extensive experiments show that our method significantly outperforms previous navigation methods. For the demo video, dataset and code, please refer to our project page.

Keywords Vision-and-language navigation · Long-range navigation · Spatial memory · Dual attention
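To make the dual attention idea concrete, below is a minimal sketch, not the paper's actual implementation: it assumes PyTorch, pre-computed encodings of the instruction segments, an agent state vector, and two learned task-specific query projections; all module names and dimensions are hypothetical.

```python
# Minimal sketch of a soft dual attention over segmented instruction
# encodings. Illustrative assumption only, not the paper's exact model:
# the two learned queries (one per sub-task) and all shapes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftDualAttention(nn.Module):
    def __init__(self, seg_dim: int, hidden_dim: int):
        super().__init__()
        # One query projection per sub-task:
        # landmark matching and local-direction matching.
        self.landmark_query = nn.Linear(hidden_dim, seg_dim)
        self.direction_query = nn.Linear(hidden_dim, seg_dim)

    def forward(self, segments: torch.Tensor, state: torch.Tensor):
        """
        segments: (B, S, seg_dim) encodings of the S instruction segments
        state:    (B, hidden_dim) current agent state
        Returns two soft "partial instructions", one per sub-task.
        """
        q_lm = self.landmark_query(state)     # (B, seg_dim)
        q_dir = self.direction_query(state)   # (B, seg_dim)
        scale = segments.size(-1) ** 0.5
        # Scaled dot-product attention weights over the segments.
        a_lm = F.softmax(segments @ q_lm.unsqueeze(-1) / scale, dim=1)    # (B, S, 1)
        a_dir = F.softmax(segments @ q_dir.unsqueeze(-1) / scale, dim=1)  # (B, S, 1)
        # Attention-weighted mixtures of segment encodings.
        landmark_instr = (a_lm * segments).sum(dim=1)    # (B, seg_dim)
        direction_instr = (a_dir * segments).sum(dim=1)  # (B, seg_dim)
        return landmark_instr, direction_instr
```

In such a setup, the agent would match landmark_instr against visual features of the scene and direction_instr against its recent motion; the soft weights let both sub-tasks focus on different parts of the same instruction.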
1 Introduction

Consider that you are traveling as a tourist in a new city and are looking for a destination you would like to visit. You ask the locals and get a directional description: "go ahead for about 200 m until you hit a small intersection, then turn left and continue along the street until you see a yellow building on your right". People give indications that are not purely directional, let alone metric. They mix in references to landmarks that you will find along your route. This may seem like a trivial ability, as humans do it routinely. Yet,
this is a complex cognitive task that relies on the development of an internal, spatial representation that includes visual landmarks (e.g. "the yellow building") and possible local directions (e.g. "going forward for about 200 m"). Such a representation supports continuous self-localization as well as conveying a sense of direction towards the goal. Just a