Static video summarization using multi-CNN with sparse autoencoder and random forest classifier



ORIGINAL PAPER

Madhu S. Nair¹ · Jesna Mohan²

¹ Artificial Intelligence and Computer Vision Lab, Department of Computer Science, Cochin University of Science and Technology, Kochi, Kerala 682022, India

² Department of Computer Science and Engineering, Mar Baselios College of Engineering and Technology, Nalanchira, Thiruvananthapuram, Kerala 695015, India

Received: 18 May 2020 / Revised: 17 August 2020 / Accepted: 23 September 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract
A summarization system detects the parts of the input video that contain an essential message, aiming to generate a very compact and meaningful representation of the original video. A novel method to detect key-frames for static summarization is presented in this paper. The method detects key-frames based on feature vectors extracted from multiple pre-trained Convolutional Neural Network (Multi-CNN) models. The feature vectors extracted by four pre-trained CNN models are fed to a Sparse Autoencoder, which outputs a combined representation of the input feature vectors. The key-frames of the input video are then extracted from the combined feature vectors using a Random Forest classifier. The method is evaluated on two datasets, VSUMM and OVP, against the user summaries present in the ground truth, achieving average F-scores of 0.83 on the VSUMM dataset and 0.82 on the OVP dataset. The method attained promising results compared to other state-of-the-art methods in the literature, and the Multi-CNN model was able to generate high-quality summaries consistently from videos of all categories. Further experiments show that the Multi-CNN model in combination with the Random Forest classifier performs better than the other classifiers considered in the study.

Keywords Video summarization · Key-frames · Sparse Autoencoders (SAE) · Convolutional Neural Network (CNN) · Random forest classifier · Multi-CNN
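As a rough illustration of the pipeline described in the abstract, the sketch below chains pre-trained CNN feature extraction, a single-hidden-layer sparse autoencoder for feature fusion, and a Random Forest classifier for key-frame selection. The specific backbones (VGG16, ResNet50, InceptionV3, Xception), the code dimension, the hyperparameters, and the loader load_video_frames_and_labels are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of a Multi-CNN + Sparse Autoencoder + Random Forest pipeline.
# Backbones, dimensions, and hyperparameters are assumptions for illustration.
import numpy as np
from tensorflow.keras.applications import VGG16, ResNet50, InceptionV3, Xception
from tensorflow.keras import layers, models, regularizers
from sklearn.ensemble import RandomForestClassifier


def extract_multi_cnn_features(frames):
    """Concatenate global-average-pooled features from several pre-trained CNNs.

    frames: float array of shape (n_frames, 224, 224, 3); per-model input
    preprocessing (each backbone's preprocess_input) is omitted for brevity.
    """
    backbones = [
        VGG16(weights="imagenet", include_top=False, pooling="avg"),
        ResNet50(weights="imagenet", include_top=False, pooling="avg"),
        InceptionV3(weights="imagenet", include_top=False, pooling="avg"),
        Xception(weights="imagenet", include_top=False, pooling="avg"),
    ]
    feats = [m.predict(frames, verbose=0) for m in backbones]
    return np.concatenate(feats, axis=1)  # (n_frames, combined_dim)


def build_sparse_autoencoder(input_dim, code_dim=256):
    """Autoencoder whose hidden code is sparsified with an L1 activity penalty."""
    inp = layers.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="relu",
                        activity_regularizer=regularizers.l1(1e-5))(inp)
    out = layers.Dense(input_dim, activation="linear")(code)
    autoencoder = models.Model(inp, out)
    encoder = models.Model(inp, code)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder


# Usage sketch (hypothetical loader supplying frames and key-frame labels):
# frames, labels = load_video_frames_and_labels(...)
# X = extract_multi_cnn_features(frames)
# ae, enc = build_sparse_autoencoder(X.shape[1])
# ae.fit(X, X, epochs=50, batch_size=32, verbose=0)   # unsupervised fusion
# codes = enc.predict(X)                               # combined feature vectors
# clf = RandomForestClassifier(n_estimators=100).fit(codes, labels)
# key_frame_mask = clf.predict(codes).astype(bool)     # True => key-frame
```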

1 Introduction

In the modern era, there has been abundant growth in the amount of multimedia data on the Internet, driven by the large-scale availability of electronic gadgets at low cost. The increased use of large videos poses a great challenge to network infrastructure, and researchers are now trying to find solutions to the challenges posed by this massive amount of video data. Long videos can be stored and retrieved efficiently if they are represented by a shortened video that preserves the essential content of the original video. Video summarization is the process by which a compact representation containing the important parts of the original video is generated [1]. Static [2] and dynamic [3] video summarization methods exist in the literature. The static summarization methods present the input video usi