A deep multimodal generative and fusion framework for class-imbalanced multimodal data

Qing Li · Guanyuan Yu · Jun Wang · Yuehao Liu
Fintech Innovation Center and School of Economic Information Engineering, Southwestern University of Finance and Economics, Chengdu, China
Received: 27 May 2019 / Revised: 12 June 2020 / Accepted: 15 June 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
The purpose of multimodal classification is to integrate features from diverse information sources to make decisions. The interactions between different modalities are crucial to this task. However, common strategies in previous studies have been either to concatenate features from various sources into a single compound vector or to feed them separately into several different classifiers that are then assembled into a single robust classifier to generate the final prediction. Both approaches weaken or even ignore the interactions among different feature modalities. In addition, multimodal classification becomes troublesome when the data are class-imbalanced. In this study, we propose a deep multimodal generative and fusion framework for multimodal classification with class-imbalanced data. The framework consists of two modules: a deep multimodal generative adversarial network (DMGAN) and a deep multimodal hybrid fusion network (DMHFN). The DMGAN handles the class imbalance problem, while the DMHFN identifies fine-grained interactions and integrates different information sources for multimodal classification. Experiments on a faculty homepage dataset show the superiority of our framework compared to several state-of-the-art methods.

Keywords Multimodal classification · Class-imbalanced data · Deep multimodal generative adversarial network · Deep multimodal hybrid fusion network
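To make the contrast drawn in the abstract concrete, the following is a minimal PyTorch sketch of the two baseline fusion strategies it criticizes: early fusion by concatenating features into a single compound vector, and late fusion by ensembling per-modality classifiers. The module names, layer sizes, and averaging scheme are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """Baseline 1: concatenate all modality features into one compound vector."""

    def __init__(self, text_dim: int, image_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # Cross-modal interactions are only modelled implicitly by the shared layers.
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))


class LateFusion(nn.Module):
    """Baseline 2: one classifier per modality, with predictions averaged at the end."""

    def __init__(self, text_dim: int, image_dim: int, num_classes: int):
        super().__init__()
        self.text_clf = nn.Linear(text_dim, num_classes)
        self.image_clf = nn.Linear(image_dim, num_classes)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # Each modality is scored in isolation, so cross-modal interactions are never modelled.
        return 0.5 * (self.text_clf(text_feat) + self.image_clf(image_feat))
```

Neither baseline explicitly captures fine-grained cross-modal interactions, and neither addresses class imbalance; these are the two gaps the DMGAN and DMHFN modules are intended to fill.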

1 Introduction

Multimodal data consist of several feature modalities, where each modality is represented by a group of similar data sharing the same attributes. The aim of multimodal classification is to process and integrate information from multiple modalities to make decisions. In the era of big data, many applications of interest involve multimodal classification problems, including audio-visual speech recognition (AVSR) [40], affective computing [39], human emotion recognition [32], medical image analysis [22], user profiling [13], and stock
movement prediction [29]. However, two challenging problems usually arise when fusing information from multiple interactive modalities for multimodal classification. The first major challenge is multimodal representation. The heterogeneity in the statistical properties of multimodal data makes it more difficult to learn a joint representation using information from multiple sources [3, 17, 24]. A good example is the joint processing of images (which are real-valued and dense) and texts (which are discrete and sparse), which typically have different dimensions and structures [52]. In