Dereverberation by Using Time-Variant Nature of Speech Production System


Research Article

Takuya Yoshioka, Takafumi Hikichi, and Masato Miyoshi

NTT Communication Science Laboratories, NTT Corporation, 2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan

Received 25 August 2006; Revised 7 February 2007; Accepted 21 June 2007

Recommended by Hugo Van hamme

This paper addresses the problem of blind speech dereverberation by inverse filtering of a room acoustic system. Since a speech signal can be modeled as being generated by a speech production system driven by an innovations process, a reverberant signal is the output of a composite system consisting of the speech production and room acoustic systems. Therefore, we need to extract only the part corresponding to the room acoustic system (or its inverse filter) from the composite system (or its inverse filter). The time-variant nature of the speech production system can be exploited for this purpose. In order to realize the time-variance-based inverse filter estimation, we introduce a joint estimation of the inverse filters of both the time-invariant room acoustic and the time-variant speech production systems, and present two estimation algorithms with distinct properties.

Copyright © 2007 Takuya Yoshioka et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
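The composite system described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' setup: the function name `synth_reverberant`, the second-order AR resonator whose pole is redrawn every frame, and the exponentially decaying random impulse response are all hypothetical choices made for illustration. It shows an innovations process driving a time-variant AR system (a stand-in for speech production), whose output then passes through a single time-invariant channel (a stand-in for the room).

```python
import numpy as np

def synth_reverberant(n_frames=20, frame_len=160, rir_len=400, seed=0):
    """Toy composite system: innovations -> time-variant AR -> fixed room filter.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = n_frames * frame_len

    # Innovations process: white Gaussian noise.
    e = rng.standard_normal(n)

    # Time-variant AR(2) system: coefficients are redrawn for every frame.
    s = np.zeros(n)
    for f in range(n_frames):
        r = rng.uniform(0.5, 0.95)             # pole radius (< 1 keeps each frame stable)
        theta = rng.uniform(0.1, np.pi - 0.1)  # pole angle
        a1, a2 = 2.0 * r * np.cos(theta), -r * r
        for t in range(f * frame_len, (f + 1) * frame_len):
            s1 = s[t - 1] if t >= 1 else 0.0
            s2 = s[t - 2] if t >= 2 else 0.0
            s[t] = a1 * s1 + a2 * s2 + e[t]

    # Time-invariant room channel: decaying random impulse response (assumed shape).
    h = rng.standard_normal(rir_len) * np.exp(-np.arange(rir_len) / 80.0)
    h[0] = 1.0  # direct path

    # Observed (reverberant) signal, truncated to the source length.
    x = np.convolve(s, h)[:n]
    return e, s, h, x

e, s, h, x = synth_reverberant()
```

In this toy setting only `h` stays fixed across frames while the AR coefficients change, which is the property the paper proposes to exploit when separating the room's contribution from that of the speech production system.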

1. INTRODUCTION

Room reverberation degrades speech intelligibility or corrupts the characteristics inherent in speech. Hence, dereverberation, which recovers a clean speech signal from its reverberant version, is indispensable for a variety of speech processing applications. In many practical situations, only the reverberant speech signal is accessible. Therefore, the dereverberation must be accomplished with blind processing.

Let an unknown signal transmission channel from a source to possibly multiple microphones in a room be modeled by a linear time-invariant system. (To provide a unified description independent of the number of microphones, we refer to the set of signal transmission channels from a source to possibly multiple microphones as a signal transmission channel. The channel from the source to each of the microphones is called a subchannel. The set of signals observed by the microphones is referred to as an observed signal. We also refer to an inverse filter set, which is composed of the filters applied to the signal observed by each microphone, as an inverse filter.) The observed signal (reverberant signal) is then the output of the system driven by the source signal (clean speech signal). On the other hand, the source signal is modeled as being generated by a time-variant autoregressive (AR) system corresponding to an articulatory filter driven by an innovations process [1]. In what follows, for the sake of definiteness, the AR system corresponding to the articulatory filter and the system corresponding to the room’s signal tran