Mapping languages: the Corpus of Global Language Use


Jonathan Dunn

© Springer Nature B.V. 2020

Abstract

This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports more local languages with smaller sample sizes than alternative off-the-shelf models. Improved language identification is essential for moving beyond majority languages. Given the focus on language mapping, the paper analyzes how well this digital language data represents actual populations by (i) systematically comparing the corpus with demographic ground-truth data and (ii) triangulating the corpus with an alternate Twitter-based dataset. In total, the corpus contains 423 billion words representing 148 languages (with over 1 million words from each language) and 158 countries (again with over 1 million words from each country), all distilled from Common Crawl web data. The main contribution of this paper, in addition to describing this publicly available corpus, is to provide a comprehensive analysis of the relationship between two sources of digital data (the web and Twitter) as well as their connection to underlying populations.

Keywords Language mapping · Geo-referenced corpus · Geographic corpus · Language identification · Demographic mapping · Register similarity

Jonathan Dunn, University of Canterbury, Christchurch, New Zealand


1 Gathering global language data

This paper describes a corpus of global language use that is drawn from web-crawled data and systematically compared with both Twitter data and census-based demographic data. The purpose is both (i) to represent regional varieties of languages using a consistent collection methodology and (ii) to provide a data-driven resource for understanding what languages are used where. As shown by the web-as-corpus paradigm (Baroni et al. 2009; Majliš and Žabokrtský 2012; Goldhahn et al. 2012; Benko 2014), raw web data contains observations of language use that can be leveraged to create linguistic corpora. Further, these web-based corpora have been shown to represent local language use (Davies and Fuchs 2015; Cook and Brinton 2017) and can be compared with Twitter-based corpora, which have themselves been shown to represent local language use (Grieve et al. 2019).

The Corpus of Global Language Use (CGLU: now at version 4.2) sifts through data from 147 billion web pages in order to distill a corpus of approximately 423 billion words representing 148 languages and 158 countries with at least 1 million words each. This includes 1916 language-country sub-corpora with at least 1 million words and 68 sub-corpora with at least 1 billion words. While previous iterations of this corpus have been used in existing work