Online Speech/Music Segmentation Based on the Variance Mean of Filter Bank Energy


Research Article

Marko Kos, Matej Grašič, and Zdravko Kačič
Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ul. 17, 2000 Maribor, Slovenia

Correspondence should be addressed to Marko Kos, [email protected]

Received 6 March 2009; Revised 4 June 2009; Accepted 2 September 2009

Recommended by Aggelos Pikrakis

This paper presents a novel feature for online speech/music segmentation: the variance mean of filter bank energy (VMFBE). The feature is motivated by the behavior of energy in a narrow frequency sub-band: the energy varies more rapidly, and to a greater extent, for speech than for music, so the energy variance in such a sub-band is greater for speech. A radio broadcast database and the BNSI broadcast news database were used to evaluate the feature's discrimination and segmentation ability. The VMFBE calculation procedure shares 4 of its 6 steps with the MFCC feature calculation procedure. It is therefore a very convenient speech/music discriminator for real-time automatic speech recognition systems based on MFCC features, because valuable processing time can be saved and the computational load increases only slightly. Analysis of the feature's speech/music discriminative ability shows an average error rate below 10% on radio broadcast material, outperforming the other features used for comparison by more than 8%. As a standalone speech/music discriminator in a segmentation system, the proposed feature achieves an overall accuracy of over 94% on radio broadcast material.

Copyright © 2009 Marko Kos et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
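The abstract does not restate the paper's exact filter bank configuration, so the sketch below is only an illustration of the VMFBE idea under common assumptions: log mel filter bank energies are computed per frame (a step shared with MFCC extraction), the variance of each sub-band's log energy is taken over an analysis window, and those variances are averaged across bands. All function names and parameter values (24 filters, 512-point FFT, 16 kHz sampling) are illustrative choices, not the authors' specification.

```python
import numpy as np

def filterbank_energies(frame, sample_rate=16000, n_filters=24, n_fft=512):
    """Log mel filter bank energies of one frame (also an MFCC sub-step)."""
    # Power spectrum of the Hamming-windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # Standard triangular mel filter bank construction
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    energies = np.zeros(n_filters)
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising slope of triangle
            if center > left:
                energies[i] += spec[k] * (k - left) / (center - left)
        for k in range(center, right):         # falling slope of triangle
            if right > center:
                energies[i] += spec[k] * (right - k) / (right - center)
    return np.log(energies + 1e-10)            # floor avoids log(0)

def vmfbe(frames):
    """VMFBE-style score: mean over sub-bands of the per-band energy
    variance across the frames of one analysis window."""
    fbe = np.array([filterbank_energies(f) for f in frames])  # (n_frames, n_filters)
    return np.mean(np.var(fbe, axis=0))
```

On a steady tone the per-band log energies are nearly constant from frame to frame, so the score stays close to zero, while amplitude-modulated wideband content (a crude stand-in for speech) yields a much larger score; this is the discrimination property the abstract describes.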

1. Introduction

Segmentation of audio data has become a very important procedure in audio processing systems. It is especially significant in applications such as automatic speech recognition (ASR), where only the speech segments of an input audio stream are passed to the system's input, and nonspeech segments are discarded [1, 2]. In this way, the speed and accuracy of an ASR system can be improved, and the computational load is also reduced. Prior segmentation of audio data is also very important for applications such as broadcast news transcription [3], where speech is typically interspersed with music and background noise. With the development of the internet, content-based indexing [4–6] has emerged, because a lot of audio data is not indexed by web search engines. In such systems, audio segmentation is part of the indexing task. Segmentation is also used in systems for audio and speaker diarization [7–9], retrieval of audio-visual data [10, 11], and so forth. One of the most frequently used acoustic segmentation types is speech/music segmentation. This is not surprising, because s