Automatic speech recognition: a survey
Mishaim Malik 1 & Muhammad Kamran Malik 2 & Khawar Mehmood 3 & Imran Makhdoom 4
Received: 31 May 2020 / Revised: 4 September 2020 / Accepted: 13 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Recently, great strides have been made in the field of automatic speech recognition (ASR) through the use of various deep learning techniques. In this study, we present a thorough comparison of the cutting-edge techniques currently used in this area, with a special focus on deep learning methods. The study explores different feature extraction methods and state-of-the-art classification models, along with their impact on the performance of an ASR. Because deep learning techniques are highly data-dependent, the speech datasets available online are also discussed in detail. Finally, the online toolkits, resources, and language models that can help in building an ASR are presented. In this study, we cover every aspect that can affect the performance of an ASR. Hence, we believe that this work is a good starting point for academics interested in ASR research.
Keywords Speech recognition · ASR · Automatic speech recognition · Feature extraction · Classification models · Language models
* Mishaim Malik [email protected]
Muhammad Kamran Malik [email protected]
Khawar Mehmood [email protected]
Imran Makhdoom [email protected]
1 Punjab University College of Information Technology (PUCIT), Lahore, Pakistan
2 Faculty of Punjab University College of Information Technology (PUCIT), Lahore, Pakistan
3 School of Engineering and Information Technology, University of New South Wales (UNSW) Canberra at ADFA, Canberra, Australia
4 Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
Multimedia Tools and Applications
1 Introduction
Speech is the most natural, efficient, and preferred mode of communication between humans. It can therefore be assumed that people are more comfortable using speech as a mode of input for machines than more primitive input modes such as keypads and keyboards. An automatic speech recognition (ASR) system helps us achieve this goal. Such a system allows a computer to take an audio file or direct speech from a microphone as input and convert it into text, preferably in the script of the spoken language. An ideal ASR should be able to “perceive” the given input, “recognize” the spoken words, and then use the recognized words as input to another machine so that some “action” can be performed on them [42, 126, 160]. We consider ASRs to be the future means of communication between humans and machines. Human speech and accents vary enormously, and this variation in speech patterns is one of the biggest obstacles to creating an autonomous speech recognition system. Bilingual or multilingual people tend to show more of these variations in their speech patterns than people who s