A Discriminative Model for Polyphonic Piano Transcription

Research Article

Graham E. Poliner and Daniel P. W. Ellis
Laboratory for Recognition and Organization of Speech and Audio, Department of Electrical Engineering, Columbia University, New York, NY 10027, USA

Received 6 December 2005; Revised 17 June 2006; Accepted 29 June 2006
Recommended by Masataka Goto

We present a discriminative model for polyphonic piano transcription. Support vector machines trained on spectral features are used to classify frame-level note instances. The classifier outputs are temporally constrained via hidden Markov models, and the proposed system is used to transcribe both synthesized and real piano recordings. A frame-level transcription accuracy of 68% was achieved on a newly generated test set, and direct comparisons to previous approaches are provided.

Copyright © 2007 Hindawi Publishing Corporation. All rights reserved.
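
As a rough illustration of the pipeline summarized in the abstract, the sketch below trains one binary support vector machine per piano note on labelled spectral frames and then smooths each note's frame-level posteriors with a two-state on/off HMM. It is a schematic outline under assumed details, not the authors' actual configuration: the feature choice, the SVM kernel and parameters, the fixed self-transition probability p_stay, and the helper names viterbi_smooth and transcribe are illustrative assumptions.

# Illustrative sketch (assumed details): one binary SVM per note over spectral
# frames, followed by two-state on/off HMM (Viterbi) smoothing of its outputs.
import numpy as np
from sklearn.svm import SVC

def viterbi_smooth(posteriors, p_stay=0.9):
    """Smooth one note's per-frame 'on' posteriors with a two-state HMM."""
    T = len(posteriors)
    trans = np.log(np.array([[p_stay, 1.0 - p_stay],
                             [1.0 - p_stay, p_stay]]))
    emit = np.log(np.stack([1.0 - posteriors, posteriors], axis=1) + 1e-12)
    delta = np.zeros((T, 2))           # best log-probability ending in each state
    psi = np.zeros((T, 2), dtype=int)  # backpointers
    delta[0] = emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + emit[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path                        # 1 = note judged active in that frame

def transcribe(X_train, Y_train, X_test):
    """X_*: frames x spectral features; Y_train: frames x notes (0/1 labels)."""
    piano_roll = []
    for n in range(Y_train.shape[1]):
        clf = SVC(probability=True).fit(X_train, Y_train[:, n])
        post = clf.predict_proba(X_test)[:, 1]   # frame-level note posteriors
        piano_roll.append(viterbi_smooth(post))
    return np.array(piano_roll)                  # notes x frames

In this toy form, the HMM stage merely discourages spurious single-frame note flickers; in a real system the transition probabilities would typically be estimated from training data rather than fixed by hand, and the per-note classifiers would share a common feature normalization.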

1. INTRODUCTION

Music transcription is the process of creating a musical score (i.e., a symbolic representation) from an audio recording. Although expert musicians are capable of transcribing polyphonic pieces of music, the process is often arduous for complex recordings. As such, the ability to automatically generate transcriptions has numerous practical implications in musicological analysis and may potentially aid in content-based music retrieval tasks. The transcription problem may be viewed as identifying the notes that have been played in a given time period (i.e., detecting the onsets of each note). Unfortunately, the harmonic series interaction that occurs in polyphonic music significantly obfuscates automated transcription. Moorer [1] first presented a limited system for duet transcription. Since then, a number of acoustical models for polyphonic transcription have been presented in both the frequency domain, Rossi et al. [2], Sterian [3], Dixon [4], and the time domain, Bello et al. [5]. These methods, however, rely on a core analysis that assumes a specific audio structure, namely, that musical pitch is produced by periodicity at a particular fundamental frequency in the audio signal. For instance, the system of Klapuri [6] estimates multiple fundamental frequencies from spectral peaks using a computational model of the human auditory periphery. Then, discrete hidden Markov models (HMMs) are iteratively applied to extract melody lines from the fundamental frequency estimations, Ryynänen and Klapuri [7].
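
To make the contrast concrete, the periodicity assumption behind these acoustic models can be caricatured as a harmonic-summation salience function that scores candidate fundamental frequencies by the spectral magnitude at their harmonics. The sketch below is a deliberately simplified illustration of that idea, not Klapuri's auditory-periphery model; the function name, the number of harmonics, and the linear magnitude summation are assumptions made only for this example.

# Toy harmonic-summation salience (illustrative assumption only): score each
# candidate fundamental frequency by the magnitude at its first few harmonics.
import numpy as np

def harmonic_salience(mag_spectrum, sr, n_fft, f0_candidates, n_harmonics=5):
    salience = np.zeros(len(f0_candidates))
    for i, f0 in enumerate(f0_candidates):
        for h in range(1, n_harmonics + 1):
            k = int(round(h * f0 * n_fft / sr))  # FFT bin of the h-th harmonic
            if k < len(mag_spectrum):
                salience[i] += mag_spectrum[k]
    return salience  # peaks indicate plausible fundamental frequencies

A full multiple-F0 system would refine such estimates iteratively and handle overlapping harmonics between concurrent notes, which is exactly the signal-model machinery the discriminative approach described in this paper sidesteps.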

The assumption that pitch arises from harmonic components is strongly grounded in musical acoustics, but it is not necessary for transcription. In many fields (such as automatic speech recognition), classifiers for particular events are built using minimal prior knowledge of how those events are represented in the features. Marolt [8] presented such a classification-based approach to transcription using neural networks, but a filterbank of adaptive oscillators was required in order to reduce erroneous note insertions. Bayesian models have also been proposed for music transcription, Godsill and Davy [9