Detection of interactive voice response (IVR) in phone call records
Andrei Kopylov1 · Oleg Seredin1 · Andrei Filin1,2 · Boris Tyshkevich2
Received: 8 January 2020 / Accepted: 11 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Separation of pre-recorded messages (Interactive Voice Response, IVR) from live speech fragments in real time plays a significant role in speech emotion recognition (SER) systems, unwanted call filtering, automatic detection of answering machine responses, reduction of stored record sizes, voice mail spam filtering, etc. The problem is difficult because, unlike the silence, music, and noise fragments handled by conventional voice activity detection (VAD), IVR usually contains speech. Three classifiers for detecting live speech fragments in phone call records are considered: one based on the support vector machine (SVM), one on gradient boosting (XGBoost), and one on a convolutional neural network (CNN). Audio fragments were represented by the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for XGBoost and SVM, and by log-spectrograms and gammatonegrams for CNN. Experiments with a dataset of phone calls demonstrate comparable quality of the considered algorithms (around 0.96 according to the F1-averaged measure), with CNN having an advantage (0.98).
Keywords IVR · SVM · Gradient boosting · CNN · Speech analysis · GeMAPS · Log-spectrogram · Gammatonegram
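The "F1-averaged measure" quoted above is not defined in this excerpt. The following is a minimal sketch under the assumption that it denotes the unweighted mean of the per-class F1 scores (macro-averaging) over the two classes. The function name `macro_f1` and the labels `"live"` and `"ivr"` are hypothetical and used only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed reading of 'F1-averaged')."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled fragment is required")

    # Per-class counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); equivalent to the harmonic mean
        # of precision and recall.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for six audio fragments.
y_true = ["live", "live", "ivr", "ivr", "ivr", "live"]
y_pred = ["live", "ivr", "ivr", "ivr", "live", "live"]
score = macro_f1(y_true, y_pred)
```

Averaging per class rather than over all fragments keeps the score from being dominated by whichever class is more frequent in the call records, which may matter if live speech and IVR fragments are unevenly represented.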
* Andrei Filin [email protected]
Andrei Kopylov [email protected]
Oleg Seredin [email protected]
Boris Tyshkevich [email protected]
1 Tula State University, Tula, Russia
2 ITooLabs, Tula, Russia

1 Introduction
The development of computer telephony drives the growing popularity of virtual call centers. They are often considered a required component of IT infrastructure and of today’s economy. Such centers generate huge amounts of audio data, so intelligent algorithms should be applied to analyze it. In particular, automated speech emotion recognition (SER) in dialogues makes it possible to enhance the commonly used call center key performance indicators (KPIs) and to introduce new KPIs based on round-the-clock monitoring. A required step in an SER system is separating the spontaneous speech fragments to be analyzed, referred to as “live speech” hereinafter, from recorded speech fragments, which may also contain noise, music, or silence. Most of today’s call centers use pre-recorded messages (IVR) for automated interaction with clients (e.g. routing calls, interactive queues, so-called “cold calling”, etc.). IVR detection is also required for unwanted call filtering, automatic detection of answering machine responses, reduction of stored record sizes, voice mail spam filtering, etc., and it goes beyond the traditional detection of speech fragments (VAD) in audio streams. In most cases, a person can tell whether a sentence is pre-recorded or live, which suggests that the identification problem can be solved with advanced machine learning methods. As far as we know, IVR detect