Temporal capsule networks for video motion estimation and error concealment



ORIGINAL PAPER

Temporal capsule networks for video motion estimation and error concealment

Arun Sankisa1 · Arjun Punjabi1 · Aggelos K. Katsaggelos1

1 Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA

Received: 9 August 2019 / Revised: 6 January 2020 / Accepted: 6 March 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

In this paper, we present a temporal capsule network architecture that encodes motion in videos as an instantiation parameter. The extracted motion is used to perform motion-compensated error concealment. We modify the original capsule architecture and use a carefully curated dataset to enable training the capsules both spatially and temporally. First, we add the temporal dimension by taking co-located "patches" from three consecutive frames of standard video sequences to form input data "cubes." Second, the network is designed with an initial feature extraction layer that operates on all three dimensions to generate spatiotemporal features. Additionally, we implement the PrimaryCaps module with a recurrent layer, instead of a conventional convolutional layer, to extract short-term motion-related temporal dependencies and encode them as activation vectors in the capsule output. Finally, the capsule output is combined with the most recent past frame and passed through a fully connected reconstruction network to perform motion-compensated error concealment. We study the effectiveness of temporal capsules by comparing the proposed model with architectures that do not include capsules. Although the quality of the reconstruction shows room for improvement, we successfully demonstrate that capsule-based architectures can be designed to operate in the temporal dimension to encode motion-related attributes as instantiation parameters. The accuracy of motion estimation is evaluated by comparing both the reconstructed frame outputs and the corresponding optical flow estimates with ground truth data.

Keywords Capsule networks · Conv3D · ConvLSTM · Error concealment · Motion estimation
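The pipeline described in the abstract (three-frame input cube, Conv3D feature extraction, a recurrent PrimaryCaps stage, and a fully connected reconstruction network) can be summarized in a short sketch. The following is a minimal, illustrative Keras/TensorFlow version; the patch size (32 x 32), single-channel input, capsule layout (16 capsules of dimension 8), layer widths, and the squash helper are assumptions chosen for illustration, not the paper's exact hyperparameters.

    # Minimal sketch of the described architecture (Keras / TensorFlow 2.x).
    # All sizes below are illustrative assumptions, not the paper's settings.
    import tensorflow as tf
    from tensorflow.keras import layers, Model

    T, H, W, C = 3, 32, 32, 1          # three co-located frame patches form a data "cube"
    NUM_CAPS, CAPS_DIM = 16, 8         # assumed capsule layout

    def squash(v, axis=-1):
        # Standard capsule squashing nonlinearity (Sabour et al., 2017).
        s2 = tf.reduce_sum(tf.square(v), axis=axis, keepdims=True)
        return (s2 / (1.0 + s2)) * v / tf.sqrt(s2 + 1e-8)

    inp = layers.Input(shape=(T, H, W, C))            # spatiotemporal input cube

    # Initial feature extraction operating on all three dimensions.
    x = layers.Conv3D(64, kernel_size=(3, 3, 3), padding="same",
                      activation="relu")(inp)

    # Recurrent PrimaryCaps: a ConvLSTM in place of the usual convolution,
    # so short-term motion dependencies end up in the capsule activations.
    x = layers.ConvLSTM2D(NUM_CAPS * CAPS_DIM, kernel_size=3, padding="same",
                          return_sequences=False)(x)  # -> (H, W, NUM_CAPS*CAPS_DIM)
    caps = layers.Reshape((-1, CAPS_DIM))(x)
    caps = layers.Lambda(squash)(caps)                # capsule instantiation vectors

    # Combine the capsule output with the most recent past frame,
    # then reconstruct via a fully connected network.
    last_frame = layers.Lambda(lambda t: t[:, -1])(inp)
    feat = layers.Concatenate()([layers.Flatten()(caps),
                                 layers.Flatten()(last_frame)])
    out = layers.Dense(512, activation="relu")(feat)
    out = layers.Dense(H * W * C, activation="sigmoid")(out)
    out = layers.Reshape((H, W, C))(out)              # concealed/estimated patch

    model = Model(inp, out)
    model.compile(optimizer="adam", loss="mse")

In this sketch the dense decoder stands in for the paper's fully connected reconstruction network, and the ConvLSTM output plays the role of the PrimaryCaps activations whose vectors are intended to carry motion as an instantiation parameter.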

1 Introduction

Since the introduction of convolutional neural networks (CNN or ConvNet) [1, 2] and deep CNN architectures [3], numerous works have highlighted their effectiveness in processing natural signals, particularly their ability to learn hierarchical relationships of objects in images, i.e., low-level features such as edges that progressively build up to more complex, composite structures such as motifs and objects. This ability has been utilized in training networks to perform a wide variety of tasks such as classification [5, 6], object recognition [7, 8], or inpainting [9, 10], and, more generally, in extracting the spatial correlations typical of natural images. Recent models such as recurrent neural networks (RNNs), in particular long short-term memory (LSTM) modules, have gained popularity in solving problems that require networks to understand