Scene2Wav: a deep convolutional sequence-to-conditional SampleRNN for emotional scene musicalization


Gwenaelle Cunha Sergio¹ · Minho Lee¹

Received: 30 September 2019 / Revised: 4 July 2020 / Accepted: 13 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

This paper presents Scene2Wav, a novel deep convolutional model for generating music from emotionally annotated video. The task matters because video paired with appropriate audio has a stronger emotional effect on viewers. The challenge lies in transforming from the video domain to the audio domain and generating music. The Scene2Wav encoder is a convolutional sequence encoder that embeds dynamic emotional visual features from low-level features in the colour space, namely Hue, Saturation and Value. The Scene2Wav decoder is a conditional SampleRNN that uses this emotional visual feature embedding as its condition to generate novel emotional music. The entire model is fine-tuned end-to-end to generate a music signal that evokes the intended emotional response from the listener. By addressing both the emotional and the generative aspects of the task, this work is a significant contribution to the field of Human-Computer Interaction. It is also a stepping stone towards an AI movie and/or drama director that can automatically generate appropriate music for trailers and movies. Experimental results show that users prefer the music generated by this model to that of the baseline model, and that the generated music evokes the intended emotions.

Keywords Sequence-to-conditional SampleRNN · Convolutional neural network · Deep recurrent neural network · Domain transformation · Emotional music generation
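To make the low-level input named in the abstract concrete, the following is a minimal sketch of converting an RGB video frame into Hue, Saturation and Value channels using Python's standard `colorsys` module. It is an illustration only, not the authors' preprocessing code: the function name `frame_to_hsv` and the nested-list frame layout are assumptions made for this example.

```python
import colorsys


def frame_to_hsv(frame):
    """Convert one RGB frame to HSV.

    `frame` is assumed to be a list of rows, each row a list of
    (r, g, b) tuples with integer components in 0..255.
    Returns the same layout with (h, s, v) floats in 0..1.
    """
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for (r, g, b) in row]
        for row in frame
    ]


# A hypothetical 1x3 frame: pure red, pure green, white.
example_frame = [[(255, 0, 0), (0, 255, 0), (255, 255, 255)]]
hsv_frame = frame_to_hsv(example_frame)
```

Applying such a conversion to every frame of a scene would yield the kind of HSV sequence that a convolutional sequence encoder could take as input.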

Multimedia Tools and Applications

Minho Lee · Gwenaelle Cunha Sergio
¹ School of Electronics Engineering, Kyungpook National University, 80 Daehakro, Bukgu, Daegu 41566, South Korea

1 Introduction

Art may very well be one of the defining characteristics of the human species, and is key to effective human interaction. Its many forms are practiced by almost all human cultures, and in all modern societies visual arts and music are intimately intertwined [17]. A deeper investigation into the relationship between these two artistic modalities [36] opens up a whole new range of possibilities. For instance, it gives visually and/or hearing impaired people an opportunity to appreciate an art form they are unable to perceive. A second application is an automatic movie and/or drama director that generates appropriate music for muted videos. In both applications, the addition of another modality evokes stronger emotions in users. Understanding the appeal of emotionally aware systems, researchers in the Affective Computing community have worked to estimate the emotion induced by watching videos for various applications [7, 26, 29]. Visual