LSTM and multiple CNNs based event image classification


Peian Li 1,2 & Huadong Tang 2 & Jing Yu 2 & Wei Song 1,3

Received: 24 March 2020 / Revised: 26 September 2020 / Accepted: 10 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Previous studies have demonstrated that the complexity and variation of event images are the major challenges in event classification. We approach the problem with an integrated method that uses a Long Short-Term Memory (LSTM) network to fuse multiple Convolutional Neural Networks (CNNs). To address complexity, we use three dedicated CNNs to extract scene, object and human visual cues, respectively. To reduce the semantic gap and exploit the complementarity of features at different levels, we choose AlexNet and VGG-16 as the base architectures and concatenate the outputs of their first and second fully-connected layers. To account for the contextual correlations between visual cues, we arrange the concatenated features of the three CNNs in the order scene, object, human and feed them as a whole into the LSTM network. For context in particular, we crop each image into five blocks as input, so that an individual image is supplemented with contextual features through the temporal characteristics of the LSTM. We evaluate our method on the Web Image Dataset for Event Recognition (WIDER), and the results demonstrate the effectiveness of each of these design choices. Compared with state-of-the-art methods, the proposed method offers a promising way to improve event classification performance.

Keywords Event classification · Convolutional neural networks · Long short-term memory · Feature combination · Context information
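The input assembly described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the crop layout (four corners plus centre), the 256 and 224 pixel sizes, the function names and the toy feature values are all hypothetical, and the CNN feature extractors and the LSTM itself are left out. The abstract does not say how the five crops are combined with the three-cue sequence, so the two steps are kept separate here.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)
CUE_ORDER = ("scene", "object", "human")  # order stated in the abstract


def five_crop_boxes(width: int, height: int, crop_w: int, crop_h: int) -> List[Box]:
    """Return five crop boxes: four corners plus the centre (assumed layout)."""
    if crop_w > width or crop_h > height:
        raise ValueError("crop must fit inside the image")
    right, bottom = width - crop_w, height - crop_h
    origins = [(0, 0), (right, 0), (0, bottom), (right, bottom),
               (right // 2, bottom // 2)]
    return [(x, y, x + crop_w, y + crop_h) for x, y in origins]


def cue_vector(fc1: Sequence[float], fc2: Sequence[float]) -> List[float]:
    """Concatenate the first and second fully-connected layer outputs."""
    return list(fc1) + list(fc2)


def build_sequence(
    features: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> List[List[float]]:
    """Arrange per-cue concatenations as LSTM time steps: scene, object, human."""
    return [cue_vector(*features[cue]) for cue in CUE_ORDER]


# Toy usage: one-element "features" stand in for real layer outputs.
boxes = five_crop_boxes(256, 256, 224, 224)
toy_features = {
    "human": ([5.0], [6.0]),
    "scene": ([1.0], [2.0]),
    "object": ([3.0], [4.0]),
}
sequence = build_sequence(toy_features)
```

In a real pipeline each entry of `sequence` would be a much longer vector taken from a trained network, and the resulting three-step sequence would be passed to an LSTM classifier.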

* Wei Song [email protected]

1 School of Information Engineering, Minzu University of China, Beijing, China

2 School of Electronic Information and Engineering, Beijing Jiaotong University, Beijing, China

3 National Language Resource Monitoring and Research Center of Minority Languages, Minzu University of China, Beijing, China

Multimedia Tools and Applications

1 Introduction

Image classification has long been one of the central topics in computer vision and has attracted extensive attention from researchers. Event categorization in still images is a particularly challenging problem because events involve multiple interacting characteristics, and the description of an event is both complicated and variable [2]. In general, the concept of an event is highly correlated with many other high-level visual cues: the content of an event image involves various kinds of visual information, such as humans, objects and scenes. To extract features from images for classification, early work relied on manually designed descriptors such as the Scale Invariant Feature Transform (SIFT) [24], the Histogram of Oriented Gradient (HOG) [23] and other algorithms [16, 32, 40]. However, these methods, which are based on hand-engineered features, generalize poorly. Until 2012, Alex krizh