REGULAR PAPER

MRECN: mixed representation enhanced (de)compositional network for caption generation from visual features, modeling as pseudo tensor product representation

Chiranjib Sur1

Received: 7 April 2020 / Revised: 21 September 2020 / Accepted: 5 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract
Semantic feature composition from image features has a drawback: it cannot capture the content of the captions and fails to evolve into longer, meaningful captions. In this paper, we propose improvements on semantic features that can generate and evolve captions through a new approach called mixed fusion of representations and decomposition. Semantic captioning works on the principle of using CNN visual features to generate a context-word distribution, which a language decoder then uses to generate captions. The generated semantics are used for captioning but have limitations. We introduce a better and newer approach with an enhanced representation-based network, known as the mixed representation enhanced (de)compositional network (MRECN), which helps produce better and more varied content for captions. As the results show (0.351 BLEU-4), it outperforms most of the state of the art. We define a better feature decoding scheme using learned networks, which establishes the coherence of related words in captions. From our research, we draw some important conclusions regarding mixed representation strategies, as they emerge as the most viable and promising way of representing the relationships of sophisticated features for decision making and for complex applications such as image-to-natural-language generation.

Keywords Language modeling · Representation learning · Mixed representation · Image description · Sequence generation · Image understanding · Automated textual feature extraction
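To make the baseline concrete, the following is a minimal sketch, assuming a PyTorch implementation, of the semantic-composition pipeline summarized above: pooled CNN features are mapped to a context-word (semantic) distribution that conditions a language decoder at every step. This is not the MRECN architecture itself; the class name SemanticCaptioner, all dimensions, and the choice of an LSTM decoder are illustrative assumptions.

# Minimal sketch (assumption, not the paper's MRECN implementation) of the
# baseline semantic-composition pipeline: pooled CNN visual features ->
# context-word (semantic) distribution -> language decoder.
import torch
import torch.nn as nn


class SemanticCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Multi-label head: a relevance score for each vocabulary word given the image.
        self.semantic_head = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
            nn.Sigmoid(),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Decoder step conditioned on the previous word and the semantic distribution.
        self.decoder = nn.LSTMCell(embed_dim + vocab_size, hidden_dim)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_feats, captions):
        # cnn_feats: (B, feat_dim) pooled CNN features; captions: (B, T) token ids.
        semantics = self.semantic_head(cnn_feats)              # (B, vocab_size)
        h = cnn_feats.new_zeros(cnn_feats.size(0), self.hidden_dim)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            step_in = torch.cat([self.embed(captions[:, t]), semantics], dim=-1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.word_out(h))
        return torch.stack(logits, dim=1)                      # (B, T, vocab_size)


# Usage (random tensors, illustration only):
# model = SemanticCaptioner()
# out = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))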

1 Introduction

Image captioning [1] has gained much attention from industry due to the enormous volume of images and videos created each day at an uncontrollably increasing rate. The lack of proper tagging has crippled search engines, and images are unable to get into the retrievable nodes of the internet. To overcome this problem, automatic caption generation for scene understanding and summarization of the events in visual features has become inevitable. Previous works in image captioning relied on object and attribute detectors to describe images [2-4]. Later works mostly concentrated on attention [5-11], compositional characteristics [12], and top-down compositions [13], while there has been limited work on feature refinement and redefining the existing subspace

Chiranjib Sur
[email protected]

1 Computer and Information Science and Engineering Department, University of Florida, Gainesville, FL, USA

with more effective and efficient ones. While the structural variation of image features creates many challenges for transforming visual features directly into sentences, identification of objects has never been a problem. Hence, there is an effort to create a representa