Keyword Spotting Out of Continuous Speech
1.1 Introduction
Successful Automatic Speech Recognition (ASR) technology has been a research aspiration for the past five decades. Ideally, computers would be able to transform any type of human speech into an accurate textual transcription. Today's ASR technology generates fairly good results on structured speech with relatively low Signal-to-Noise Ratios (SNR), but performance degrades on spontaneous speech in real-life noisy environments (Murveit et al. 1992; Young 1996; Furui 2003; Deng and Huang 2004). Performance that is acceptable for commercial applications can be achieved using large training corpora of speech and text. However, there are still problems that need to be resolved. One of the main problems is the mismatch between training and testing (real-life) conditions (Young 1996; Baker et al. 2009; Tsao et al. 2009; Furui et al. 2012; Saon and Chien 2012). Types of mismatch include background noise, channel distortion, Out-of-Vocabulary (OOV) words (when speakers use words not in the recognition vocabulary), foreign-accented speech, etc. Various methods and algorithms for minimizing this mismatch between training and testing have been suggested and implemented (Mammone et al. 1996; Sankar and Lee 1996; Huo et al. 1997; Matrouf and Gauvain 1997; Viikki and Laurila 1998; Hirsch and Pearce 2000; Barras et al. 2002; Parada et al. 2010; Kai et al. 2012), while in parallel, larger amounts of representative speech (usually from live deployments) have been injected into the training process using automatic procedures that do not require manual transcription of the data (Kamm and Meyer 2002; Evermann et al. 2005; Heigold et al. 2012).

The leading approach in ASR today is to search for the most probable sequence of words that describes the input speech. The search uses: (1) acoustic models representing the phonemes of the target language; (2) a lexicon representing each word of the recognition vocabulary as a sequence of phonemes; and (3) a Language Model (LM) specifying the word transition probabilities. ASR is performed by inputting a sequence of vectors estimated from the input speech signal to the engine, and then using the combined information from the knowledge sources (the acoustic models, the lexicon, and the LM) to search for the most probable sequence of words. A high-level description of a speech recognition engine is illustrated in Fig. 1.

Fig. 1 Speech recognition engine: input speech passes through front-end processing to produce acoustic feature vectors; the decoder combines these with the knowledge sources (acoustic models, lexicon, and language model) to output the most probable word sequence.

The search for the most probable sequence of words can be represented using the following notation: $O = \{o_1, \ldots, o_T\}$ – a sequence of vectors representing the speech signal (the output of the front-end processing).
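This notation usually leads to the standard maximum a posteriori (Bayes) formulation of the search. The form below is the textbook statement rather than a quotation from this chapter, and the symbol $W$, denoting a candidate word sequence, is introduced here for illustration:

$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \dfrac{P(O \mid W)\, P(W)}{P(O)} = \arg\max_{W} P(O \mid W)\, P(W)$

where $P(O \mid W)$ is obtained from the acoustic models through the lexicon's phonetic expansion of $W$, $P(W)$ is given by the LM, and $P(O)$ does not depend on $W$ and can therefore be dropped from the maximization.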
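For readers who prefer code, the following is a deliberately tiny sketch of the structure in Fig. 1: a stand-in front-end that produces feature vectors, and a brute-force decoder that scores candidate word sequences with an acoustic model, a lexicon, and a bigram LM. All names, the scoring scheme, and the exhaustive search are illustrative assumptions, not the engine described in this book; practical systems use HMM/DNN acoustic models and Viterbi or beam search over weighted graphs.

    # Illustrative sketch only: a toy version of the Fig. 1 pipeline.
    from itertools import product
    from math import log

    def front_end(signal, frame_size=3):
        """Stand-in for feature extraction (real front-ends compute e.g. MFCC vectors)."""
        return [tuple(signal[i:i + frame_size]) for i in range(0, len(signal), frame_size)]

    def decode(features, acoustic_score, lexicon, language_model, max_words=2):
        """Brute-force search for the word sequence maximizing acoustic + LM log-scores.

        acoustic_score(phoneme, frame) -> log-likelihood of the frame given the phoneme
        lexicon: word -> list of phonemes
        language_model: (previous_word, word) -> log transition probability
        """
        best_words, best_score = None, float("-inf")
        for n in range(1, max_words + 1):
            for words in product(lexicon, repeat=n):
                # Expand the candidate word sequence into phonemes via the lexicon.
                phones = [p for w in words for p in lexicon[w]]
                if len(phones) != len(features):   # toy one-frame-per-phoneme alignment
                    continue
                score = sum(acoustic_score(p, f) for p, f in zip(phones, features))
                score += sum(language_model.get(bigram, log(1e-9))
                             for bigram in zip(("<s>",) + words, words))
                if score > best_score:
                    best_words, best_score = words, score
        return best_words

    # Example usage with made-up knowledge sources:
    lexicon = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
    lm = {("<s>", "yes"): log(0.5), ("<s>", "no"): log(0.5)}
    acoustic = lambda phoneme, frame: 0.0           # dummy: every phoneme fits every frame
    features = front_end([0.1] * 9)                 # three 3-sample frames
    print(decode(features, acoustic, lexicon, lm))  # ('yes',) - the only length-3 expansion

The brute-force enumeration is only there to make the roles of the three knowledge sources visible in a few lines; it scales exponentially with the number of words and would never be used in a real decoder.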