Audio Classification in Speech and Music: A Comparison between a Statistical and a Neural Approach
Alessandro Bugatti
Department of Electronics for Automation, University of Brescia, Via Branze 38, 25123 Brescia, Italy
Email: [email protected]

Alessandra Flammini
Department of Electronics for Automation, University of Brescia, Via Branze 38, 25123 Brescia, Italy
Email: [email protected]

Pierangelo Migliorati
Department of Electronics for Automation, University of Brescia, Via Branze 38, 25123 Brescia, Italy
Email: [email protected]

Received 27 July 2001 and in revised form 8 January 2002

We address the problem of classifying audio into speech and music for multimedia applications. In particular, we present a comparison between two different techniques for speech/music discrimination. The first method is based on zero crossing rate and Bayesian classification. It is computationally very simple and gives good results for pure music or pure speech. However, simulation results show that performance degrades when a music segment also contains superimposed speech or strong rhythmic components. To overcome these problems, we propose a second method that uses more features and is based on neural networks (specifically, a multilayer perceptron). In this case we obtain better performance, at the cost of a limited increase in computational complexity. In practice, the proposed neural network is simple to implement if a suitable polynomial is used as the activation function, and a real-time implementation is possible even on low-cost embedded systems.

Keywords and phrases: speech/music discrimination, indexing of audio-visual documents, neural networks, multimedia applications.
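As a rough illustration of the zero crossing rate feature named in the abstract, the following Python sketch computes the fraction of adjacent sample pairs whose signs differ, per frame. It is not the authors' implementation: the function names, the non-overlapping framing, the 8 kHz sampling rate, and the synthetic test tones are all assumptions made for this example, and the Bayesian classification step is not shown.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def frame_zcr(signal, frame_len):
    """ZCR of each non-overlapping frame of frame_len samples (assumed framing)."""
    return [
        zero_crossing_rate(signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


# Synthetic one-second tones at an assumed 8 kHz sampling rate.
rate = 8000
low = [math.sin(2 * math.pi * 100 * n / rate) for n in range(rate)]
high = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(rate)]

low_zcr = frame_zcr(low, 400)
high_zcr = frame_zcr(high, 400)
```

On these tones, the 1000 Hz signal yields a higher per-frame rate than the 100 Hz signal. A classifier would operate on such per-frame values, or on statistics derived from them, rather than on the raw samples.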
1. INTRODUCTION

Effective navigation through multimedia documents is necessary to enable widespread use of, and access to, richer and novel information sources. The design of efficient indexing techniques to retrieve relevant information is another important requirement. Developing automatic procedures to semantically index audio-video material is therefore a very important challenge. Such methods should be designed to create indices of the audio-visual material that characterize the temporal structure of a multimedia document from a semantic point of view. In October 1996, the International Organization for Standardization (ISO) started a standardization process for the description of the content of multimedia documents, namely MPEG-7: the “Multimedia Content Description Interface” [1, 2]. However, the standard specifications do not indicate methods for the automatic selection of indices. One possible approach is to identify series of consecutive segments that exhibit a certain coherence according to some property of the audio-visual material. By organizing the degree of coherence according to more abstract criteria, it is possible to construct a hierarchical representation of information, so as to create a table-of-contents description of the document. Such descr