Extending Long Short-Term Memory for Multi-View Structured Learning

1 Vision and Sensing, Human-Centred Technology Research Centre, University of Canberra, Canberra, Australia
[email protected], [email protected]
2 Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
{morency,tbaltrus}@cs.cmu.edu

Abstract. Long Short-Term Memory (LSTM) networks have been successfully applied to a number of sequence learning problems but they lack the design flexibility to model multiple view interactions, limiting their ability to exploit multi-view relationships. In this paper, we propose a Multi-View LSTM (MV-LSTM), which explicitly models the view-specific and cross-view interactions over time or structured outputs. We evaluate the MV-LSTM model on four publicly available datasets spanning two very different structured learning problems: multimodal behaviour recognition and image captioning. The experimental results show competitive performance on all four datasets when compared with state-of-the-art models.

Keywords: Long Short-Term Memory · Multi-View Learning · Behaviour recognition · Image Captioning

1 Introduction

There is a need for computational approaches that can model multimodal structured and sequential data. This is important for modelling human actions, caption generation and other sequence analysis problems. The integration of multimodal or multi-view data can occur at different stages. We use a general definition of a view as "a particular way of observing a phenomenon". For example, in image captioning, the views are the image and its text caption. For child engagement level prediction from videos, the views are defined by three visual descriptors: head pose, HOG and HOF. Two common ways of fusing multi-view data are early and late fusion techniques [19]. However, these techniques do not take advantage of the complex view relationships that may exist in the input data. Structured multi-view learning aims to capture view interactions, thereby exploiting these relationships for effective learning.

The key challenge in multi-view structured learning is to model both the view-specific and cross-view dynamics. The view-specific dynamics capture the interactions between hidden outputs from the same view, while the cross-view dynamics capture the interactions between hidden outputs of different views. These dynamics enable learning of subtle view relationships for better representation learning. The notion of capturing view-specific and cross-view dynamics is application specific and, hence, design flexibility is needed to model such dynamics. We propose the Multi-View LSTM (MV-LSTM), an extension of LSTM designed to model both view-specific and cross-view dynamics by partitioning the internal representations to mirror the multiple input views (see Fig. 1). We define a new family of activation functions (shown as M
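To make the partitioning idea concrete, the following is a minimal sketch in PyTorch, not the authors' implementation or exact formulation, of an LSTM-style cell whose hidden and memory states are split into per-view partitions: each partition's gates read the view's own input, its own previous hidden partition (view-specific dynamics) and the remaining views' previous hidden partitions (cross-view dynamics). The class name, dimensions and the two-view usage example are illustrative assumptions, not taken from the paper.

# Minimal, illustrative sketch of a partitioned multi-view LSTM cell.
# It is NOT the paper's MV-LSTM equations; it only illustrates splitting
# the hidden/memory state into per-view partitions, where each partition
# is updated from (a) its own input, (b) its own previous hidden
# partition (view-specific) and (c) the other views' previous hidden
# partitions (cross-view).
import torch
import torch.nn as nn


class PartitionedMultiViewLSTMCell(nn.Module):
    def __init__(self, input_sizes, hidden_size_per_view):
        super().__init__()
        self.num_views = len(input_sizes)
        self.h_size = hidden_size_per_view
        # One gate block per view: input gate, forget gate, output gate
        # and candidate, all produced by a single linear layer per view.
        cross_size = (self.num_views - 1) * self.h_size
        self.gates = nn.ModuleList(
            nn.Linear(d_in + self.h_size + cross_size, 4 * self.h_size)
            for d_in in input_sizes
        )

    def forward(self, xs, state=None):
        # xs: list of per-view tensors, each of shape (batch, input_sizes[k])
        batch = xs[0].size(0)
        if state is None:
            h = [xs[0].new_zeros(batch, self.h_size) for _ in range(self.num_views)]
            c = [xs[0].new_zeros(batch, self.h_size) for _ in range(self.num_views)]
        else:
            h, c = state
        new_h, new_c = [], []
        for k in range(self.num_views):
            # Cross-view context: the other views' previous hidden partitions.
            cross = torch.cat([h[j] for j in range(self.num_views) if j != k], dim=1)
            z = self.gates[k](torch.cat([xs[k], h[k], cross], dim=1))
            i, f, o, g = z.chunk(4, dim=1)
            c_k = torch.sigmoid(f) * c[k] + torch.sigmoid(i) * torch.tanh(g)
            h_k = torch.sigmoid(o) * torch.tanh(c_k)
            new_h.append(h_k)
            new_c.append(c_k)
        return new_h, (new_h, new_c)


# Hypothetical usage with two views (e.g. head pose and HOG features).
if __name__ == "__main__":
    cell = PartitionedMultiViewLSTMCell(input_sizes=[6, 10], hidden_size_per_view=8)
    x_pose, x_hog = torch.randn(4, 6), torch.randn(4, 10)
    hs, state = cell([x_pose, x_hog])
    print([h.shape for h in hs])  # [(4, 8), (4, 8)]

In this sketch the amount of cross-view information each partition receives is fixed; the appeal of a design such as MV-LSTM is that how much of the other views' state feeds into each partition can be configured to suit the application.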