All Your Voices are Belong to Us: Stealing Voices to Fool Humans and Machines
Abstract. In this paper, we study voice impersonation attacks to defeat humans and machines. Equipped with the current advancement in automated speech synthesis, our attacker can build a very close model of a victim’s voice after learning only a very limited number of samples in the victim’s voice (e.g., mined through the Internet, or recorded via physical proximity). Specifically, the attacker uses voice morphing techniques to transform its voice – speaking any arbitrary message – into the victim’s voice. We examine the consequences of such a voice impersonation capability against two important applications and contexts: (1) impersonating the victim in a voice-based user authentication system, and (2) mimicking the victim in arbitrary speech contexts (e.g., posting fake samples on the Internet or leaving fake voice messages). We develop our voice impersonation attacks using an off-the-shelf voice morphing tool, and evaluate their feasibility against state-of-the-art automated speaker verification algorithms (application 1) as well as human verification (application 2). Our results show that the automated systems are largely ineffective against our attacks: the average rates for rejecting fake voices were under 10–20% for most victims. Even human verification is vulnerable to our attacks. Based on two online studies with about 100 users, we found that participants rejected the morphed voice samples of two celebrities, as well as of briefly familiar users, only about 50% of the time on average.
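To make concrete how an automated speaker verification system "accepts" or "rejects" a test utterance (and hence what the rejection rates above measure), the following minimal sketch scores a test clip against an enrolled speaker model using averaged MFCC features and cosine similarity. This is an illustration only, not the verification algorithms evaluated in the paper; the libraries (numpy, librosa), file names, and threshold are assumptions chosen for the example.

import numpy as np
import librosa  # assumed available; any MFCC extractor would work here

def voice_embedding(wav_path, sr=16000, n_mfcc=20):
    # Summarize an utterance as the mean of its MFCC frames.
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def verify(enrolled_emb, test_wav, threshold=0.9):
    # Accept if the test embedding is close (by cosine similarity) to the
    # enrolled speaker model; the threshold trades off false accepts
    # against false rejects.
    test_emb = voice_embedding(test_wav)
    cos = float(np.dot(enrolled_emb, test_emb) /
                (np.linalg.norm(enrolled_emb) * np.linalg.norm(test_emb)))
    return cos >= threshold

# Hypothetical usage: enroll on a genuine sample, then test a morphed clip.
enrolled = voice_embedding("victim_enroll.wav")
print("accepted" if verify(enrolled, "morphed_attack.wav") else "rejected")

In this framing, an impersonation attack succeeds whenever the morphed clip clears the same acceptance threshold a genuine clip would; the paper's rejection rates report how often the (far more sophisticated) systems under test turned the impostor away.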
1 Introduction
A person’s voice is one of the most fundamental attributes that enables communication with others in physical proximity, at remote locations using phones or radios, or over the Internet using digital media. However, unbeknownst to them, people often leave traces of their voices in many different scenarios and contexts. To name a few, people talk out loud while socializing in cafés or restaurants, teaching, giving public presentations or interviews, making or receiving known and, sometimes, unknown phone calls, posting their voice snippets or audio(visual) clips on social networking sites like Facebook or YouTube, sending voice cards to their loved ones [11], or even donating their voices to help those with vocal impairments [14]. In other words, it is relatively easy for someone, potentially
with malicious intentions, to “record” a person’s voice by being in close physical proximity to the speaker (using, for example, a mobile phone), by social engineering tricks such as making a spam call, by searching and mining for audiovisual clips online, or even by compromising servers in the cloud that store such audio information. The more popular a person is (e.g., a celebrity or a famous academician), the easier it is to obtain his/her voice samples. We study the implications of such a commonplace