Keyword Spotting Out of Continuous Speech
1.1 Introduction
Successful Automatic Speech Recognition (ASR) technology has been a research aspiration for the past five decades. Ideally, computers would be able to transform any type of human speech into an accurate textual transcription. Today's ASR technology generates fairly good results on structured speech with relatively low Signal-to-Noise Ratios (SNR), but performance degrades on spontaneous speech in real-life noisy environments (Murveit et al. 1992; Young 1996; Furui 2003; Deng and Huang 2004). Performance that is acceptable for commercial applications can be achieved using large training corpora of speech and text. However, there are still problems that need to be resolved. One of the main problems is the mismatch between training and testing (real-life) conditions (Young 1996; Baker et al. 2009; Tsao et al. 2009; Furui et al. 2012; Saon and Chien 2012). Types of mismatch include background noise, channel distortion, Out-of-Vocabulary (OOV) words (when speakers use words not in the recognition vocabulary), foreign-accented speech, etc. Various methods and algorithms for minimizing this mismatch between training and testing have been suggested and implemented (Mammone et al. 1996; Sankar and Lee 1996; Huo et al. 1997; Matrouf and Gauvain 1997; Viikki and Laurila 1998; Hirsch and Pearce 2000; Barras et al. 2002; Parada et al. 2010; Kai et al. 2012), while in parallel, larger amounts of representative speech (usually from live deployments) have been injected into the training process using automatic procedures that do not require manual transcription of the data (Kamm and Meyer 2002; Evermann et al. 2005; Heigold et al. 2012).

The leading approach in ASR today is to search for the most probable sequence of words that describes the input speech. The search uses: (1) acoustic models representing the phonemes of the target language; (2) a lexicon representing each word of the recognition vocabulary as a sequence of phonemes; and (3) a Language Model (LM) specifying the word transition probabilities. ASR is performed by inputting a sequence of vectors estimated from the input speech signal to the engine, and then using the combined information from the knowledge sources (the acoustic models, the lexicon, and the LM) to search for the most probable sequence of words. A high-level description of a speech recognition engine is illustrated in Fig. 1.

Fig. 1 Speech recognition engine: input speech passes through front-end processing to produce acoustic feature vectors; the decoder combines these with the knowledge sources (acoustic models, lexicon, and language model) to output the most probable word sequence.

The search for the most probable sequence of words can be represented using the following notation: $O = \{o_1, \ldots, o_T\}$ – a sequence of vectors representing the speech signal (the output of the front-end processing).
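This notation usually leads to the standard maximum a posteriori (Bayes) formulation of the search. The form below is the textbook statement rather than a quotation from this chapter, and the symbol $W$, denoting a candidate word sequence, is introduced here for illustration:

$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \dfrac{P(O \mid W)\, P(W)}{P(O)} = \arg\max_{W} P(O \mid W)\, P(W)$

where $P(O \mid W)$ is obtained from the acoustic models through the lexicon's phonetic expansion of $W$, $P(W)$ is given by the LM, and $P(O)$ does not depend on $W$ and can therefore be dropped from the maximization.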
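For readers who prefer code, the following is a deliberately tiny sketch of the structure in Fig. 1: a stand-in front-end that produces feature vectors, and a brute-force decoder that scores candidate word sequences with an acoustic model, a lexicon, and a bigram LM. All names, the scoring scheme, and the exhaustive search are illustrative assumptions, not the engine described in this book; practical systems use HMM/DNN acoustic models and Viterbi or beam search over weighted graphs.

    # Illustrative sketch only: a toy version of the Fig. 1 pipeline.
    from itertools import product
    from math import log

    def front_end(signal, frame_size=3):
        """Stand-in for feature extraction (real front-ends compute e.g. MFCC vectors)."""
        return [tuple(signal[i:i + frame_size]) for i in range(0, len(signal), frame_size)]

    def decode(features, acoustic_score, lexicon, language_model, max_words=2):
        """Brute-force search for the word sequence maximizing acoustic + LM log-scores.

        acoustic_score(phoneme, frame) -> log-likelihood of the frame given the phoneme
        lexicon: word -> list of phonemes
        language_model: (previous_word, word) -> log transition probability
        """
        best_words, best_score = None, float("-inf")
        for n in range(1, max_words + 1):
            for words in product(lexicon, repeat=n):
                # Expand the candidate word sequence into phonemes via the lexicon.
                phones = [p for w in words for p in lexicon[w]]
                if len(phones) != len(features):   # toy one-frame-per-phoneme alignment
                    continue
                score = sum(acoustic_score(p, f) for p, f in zip(phones, features))
                score += sum(language_model.get(bigram, log(1e-9))
                             for bigram in zip(("<s>",) + words, words))
                if score > best_score:
                    best_words, best_score = words, score
        return best_words

    # Example usage with made-up knowledge sources:
    lexicon = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
    lm = {("<s>", "yes"): log(0.5), ("<s>", "no"): log(0.5)}
    acoustic = lambda phoneme, frame: 0.0           # dummy: every phoneme fits every frame
    features = front_end([0.1] * 9)                 # three 3-sample frames
    print(decode(features, acoustic, lexicon, lm))  # ('yes',) - the only length-3 expansion

The brute-force enumeration is only there to make the roles of the three knowledge sources visible in a few lines; it scales exponentially with the number of words and would never be used in a real decoder.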