The Natural Stories corpus: a reading-time corpus of English texts containing rare syntactic constructions


Richard Futrell¹ • Edward Gibson² • Harry J. Tily³ • Idan Blank⁴ • Anastasia Vishnevetsky² • Steven T. Piantadosi⁵ • Evelina Fedorenko²



© The Author(s) 2020

Abstract It is now common practice to compare models of human language processing by how well they predict behavioral and neural measures of processing difficulty, such as reading times, on corpora of rich naturalistic linguistic materials. However, many of these corpora, which are based on naturally occurring text, do not contain many of the low-frequency syntactic constructions that are often required to distinguish between processing theories. Here we describe a new corpus consisting of English texts edited to contain many low-frequency syntactic constructions while still sounding fluent to native speakers. The corpus is annotated with hand-corrected Penn Treebank-style parse trees and includes self-paced reading time data and aligned audio recordings. We give an overview of the content of the corpus, review recent work using the corpus, and release the data.

Keywords Cognitive modeling · Reading time · Psycholinguistics

Correspondence: Richard Futrell, [email protected]

1 University of California, Irvine, USA
2 Massachusetts Institute of Technology, Cambridge, USA
3 Viome, Inc., Seattle, USA
4 University of California, Los Angeles, USA
5 University of California, Berkeley, USA

1 Introduction

It is becoming standard practice to evaluate theories of human language processing by comparing their ability to predict behavioral and neural reactions to fixed standardized corpora of naturalistic text. This method has been used to study several dependent variables that are believed to be indicative of human language processing difficulty, including word fixation time in eyetracking (Kennedy et al. 2013), word reaction time in self-paced reading (Roark et al. 2009; Frank et al. 2013), BOLD signal in fMRI data (Bachrach et al. 2009), and event-related potentials (Dambacher et al. 2006; Frank et al. 2015).

The more traditional approach to evaluating psycholinguistic models has been to collect psychometric measures on hand-crafted experimental stimuli designed to tease apart detailed model predictions. While this approach makes it easy to compare models on their accuracy for specific constructions and phenomena, it makes it hard to judge how models compare in their coverage of a broad range of phenomena. Comparing model predictions over standardized texts makes it easier to evaluate coverage.

Although the corpus approach has these advantages, the existing corpora are based on naturally occurring text, which is unlikely to include the kinds of sentences that can crucially distinguish between theories. Many of the most puzzling phenomena in psycholinguistics, and the phenomena which have been used to test models, have only been observed in extremely rare constructions, such as multiply nested object-extracted