The B-Subtle framework: tailoring subtitles to your needs


Miguel Ventura¹ · Jessica Veiga¹ · Luisa Coheur¹ · Sandra Gama¹

Accepted: 20 September 2020 / Published online: 11 October 2020 © Springer Nature B.V. 2020

Abstract Large amounts of subtitles from movies and TV shows can easily be found on the web, for free, in almost every language. Several corpora built from subtitles, with different annotations and purposes, are currently available. Considering that new sets of subtitles are constantly being released, we propose B-Subtle, an open source framework that allows the automatic creation of corpora consisting of sequential pairs of dialogue turns gathered from subtitles. With the help of a configuration file, the B-Subtle framework makes it possible to enrich subtitles and dialogue turns with extra information (such as the movie genre or the polarity of an utterance); in addition, it allows different types of filtering to be applied to both subtitle files and dialogue turns. With B-Subtle, users can therefore create their own corpus, tailored to their needs. Moreover, to replicate the process in a future experiment, the user only needs to save the configuration file. In this paper, we describe B-Subtle and demonstrate how to build different corpora with it.

Keywords Subtitles · Corpora · Framework to build corpora · Enrich and filter data
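As an illustration of what "sequential pairs of dialogue turns" with filtering can look like, the following minimal Python sketch pairs each subtitle line with the line that follows it and then discards pairs containing overly long turns. It is not B-Subtle's implementation: the function names `consecutive_turn_pairs` and `filter_pairs`, the `max_words` parameter, and the sample dialogue are all hypothetical and chosen only for this example.

```python
from typing import Iterable, List, Tuple

TurnPair = Tuple[str, str]


def consecutive_turn_pairs(lines: Iterable[str]) -> List[TurnPair]:
    """Pair every non-empty subtitle line with the line that follows it.

    Hypothetical helper: each pair is treated as a potential interaction.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    return list(zip(cleaned, cleaned[1:]))


def filter_pairs(pairs: List[TurnPair], max_words: int = 20) -> List[TurnPair]:
    """Keep only pairs in which both turns have at most `max_words` words.

    Hypothetical filter; the threshold is an assumption for this sketch.
    """
    return [
        (first, second)
        for first, second in pairs
        if len(first.split()) <= max_words and len(second.split()) <= max_words
    ]


# Invented sample dialogue, including an empty line that is skipped.
dialogue = ["Where are you going?", "To the station.", "", "Can I come with you?"]
pairs = consecutive_turn_pairs(dialogue)
short_pairs = filter_pairs(pairs, max_words=4)
```

In a configurable framework, the threshold and the choice of filters would presumably come from a configuration file rather than from function arguments, which is what makes a run reproducible by saving that file.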

Corresponding author: Luisa Coheur

¹ Instituto Superior Técnico, Universidade de Lisboa/INESC-ID, Lisbon, Portugal


1 Introduction

Subtitles can be found in many languages, without charge and in large quantities (for instance, the OpenSubtitles 2018 corpus aggregates more than 3.7 million subtitles), covering multiple types of discourse, such as dialects and slang. Several corpora extracted from subtitles or movie scripts are currently being used by Natural Language Processing (NLP) researchers in tasks such as statistical analysis (Banchs 2012; Paetzold and Specia 2016), machine translation (Lison et al. 2018), creation of knowledge bases (Tandon et al. 2015), movie summarization (Gorinski and Lapata 2018), violence prediction (Martinez et al. 2019), and the building of conversational agents such as those described in Banchs and Li (2012) and Ameixa et al. (2014).

Examples of corpora based on subtitles are Subtle (Ameixa et al. 2014), Movie-DiC (Banchs 2012), and the Cornell Movie-Dialogs Corpus.¹ Subtle results from mapping a subset of OpenSubtitles² from 2014 into a set of turns (every two consecutive lines within a dialogue are a potential interaction; see Fig. 1). Subtle consists of two datasets, one with more than 5500K pairs of dialogue turns in English and the other with more than 3300K pairs of dialogue turns in Portuguese, built from around 6K English and 4K Portuguese subtitle files, respectively (Magarreiro et al. 2014). Subtle is available upon request and has been used in differen