Using Spasmodic Closure Patterns to Simplify Visual Voice Activity Detection



ORIGINAL RESEARCH

Ananth Goyal

Received: 16 June 2020 / Accepted: 9 November 2020
© Springer Nature Singapore Pte Ltd 2020

Abstract

While speaking, humans exhibit a number of recognizable patterns, most notably the repetitive movement of the mouth from closed to open. This paper presents a novel method to computationally determine when video data contains a person speaking, based on recognizing and tallying lip closures within a given interval. A combination of Haar feature detection and eigenvectors is used to recognize when a target individual is present; then, by detecting and quantifying spasmodic lip movements and comparing their counts to the ranges observed in true positives, we can predict when true speech occurs without the need for complex facial mappings. Although the results fall within a reasonable accuracy range compared to current methods, the comprehensibility and simplicity of the approach can reduce the complexity of existing techniques and, if paired with synchronous audio recognition methods, can streamline the future of voice activity detection as a whole.

Keywords: Voice Activity Detection · Computer Vision · Lip Detection

* Ananth Goyal, [email protected]

Introduction

Current voice activity detection (VAD) relies heavily on audio cues. This straightforward approach of using auditory benchmarks to predict when speech is occurring has made its way into several recent studies [1−3]. However, audio-only VAD struggles to detect true speech when multiple speakers are involved or strong background noise is present [4], because the overlapping audio signals cannot be effectively computed and isolated from one another. Visual voice activity detection (VVAD), a subset of VAD, can be used in tandem with auditory techniques or as a standalone method. Given the current progress with intelligent systems, face detection software and its contemporary subset, facial recognition, have practically become a standard in modern technology [5]. In the past decade alone, several new methods to detect faces and their individual components have surfaced [6, 7]. Automated tracking algorithms, such as active lip shape models [8] and variance-based techniques [9], have made it easy to detect when an individual is speaking.

Applications of VVAD range from automated video extraction and anti-cheating software to speaker recognition in audio-intense situations and the generation of training data for complex facial tracking. The following paper presents a comprehensible approach to detecting when speech is occurring, along with detecting when a specific individual is speaking among many (conference calls, video chats, ceremonies, lectures, etc.). Although the proposed method, Lip Closure Quantification (LCQ), is independent, it requires two separate algorithms to run prior to its activation. The first is face detection and recognition, to ensure that the algorithm will only run when the target spe
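The core idea described above (tallying lip closures over an interval and comparing the closure rate to the range seen in true positives) can be sketched in a few lines. The sketch below is illustrative, not the paper's implementation: it assumes an upstream detector (e.g. a Haar-cascade face and mouth detector) has already produced a per-frame boolean sequence marking whether the mouth is closed, and the `rate_range` bounds are hypothetical placeholders, not values reported by the author.

```python
def count_closures(closed_flags):
    """Tally lip-closure events: each closed->open transition counts
    as one closure cycle completing."""
    closures = 0
    for prev, cur in zip(closed_flags, closed_flags[1:]):
        if prev and not cur:  # mouth reopens after being closed
            closures += 1
    return closures

def is_speech(closed_flags, fps, rate_range=(1.0, 8.0)):
    """Classify an interval as speech if the closure rate (closures per
    second) falls inside the range observed in true positives.
    rate_range is a hypothetical placeholder for those empirical bounds."""
    duration = len(closed_flags) / fps  # interval length in seconds
    rate = count_closures(closed_flags) / duration
    return rate_range[0] <= rate <= rate_range[1]

# Example: a 1-second interval at 6 fps with two closure cycles.
flags = [False, True, False, True, False, False]
print(count_closures(flags))      # -> 2
print(is_speech(flags, fps=6))    # -> True (2 closures/s is in range)
```

The appeal of this formulation, as the paper argues, is that it needs no dense facial landmarking: a single binary signal per frame and a simple threshold range suffice to separate speech from non-speech intervals.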