Treebanks: Building and Using Parsed Corpora

Linguists and engineers in Natural Language Processing increasingly rely on electronic corpora. Most research has long been limited to raw (unannotated) texts or to tagged texts (annotated with parts of speech only), but these approaches suffer from



Text, Speech and Language Technology
VOLUME 20

Series Editors
Nancy Ide, Vassar College, New York
Jean Véronis, Université de Provence and CNRS, France

Editorial Board
Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands
Kenneth W. Church, AT&T Bell Labs, New Jersey, USA
Judith Klavans, Columbia University, New York, USA
David T. Barnard, University of Regina, Canada
Dan Tufis, Romanian Academy of Sciences, Romania
Joaquim Llisterri, Universitat Autònoma de Barcelona, Spain
Stig Johansson, University of Oslo, Norway
Joseph Mariani, LIMSI-CNRS, France

The titles published in this series are listed at the end of this volume.

Treebanks: Building and Using Parsed Corpora

Edited by
Anne Abeillé
Université Paris 7, Paris, France

Springer Science+Business Media, LLC

A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN 978-1-4020-1335-5 ISBN 978-94-010-0201-1 (eBook) DOI 10.1007/978-94-010-0201-1

Printed on acid-free paper

All Rights Reserved
© 2003 Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers 2003
Softcover reprint of the hardcover 1st edition 2003

No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Contents

Preface

Introduction
Anne Abeillé
1 Building treebanks
2 Using treebanks

Part I Building treebanks

ENGLISH TREEBANKS

Chapter 1 THE PENN TREEBANK: AN OVERVIEW
Ann Taylor, Mitchell Marcus, Beatrice Santorini
1 The annotation schemes
2 Methodology
3 Conclusions

Chapter 2 THOUGHTS ON TWO DECADES OF DRAWING TREES
Geoffrey Sampson
1 Historical background
2 Building treebanks
3 Exploiting the SUSANNE Treebank
4 Small is beautiful
5 Annotating a spoken corpus
6 Using the CHRISTINE Corpus
7 Conclusion

Chapter 3 BANK OF ENGLISH AND BEYOND
Timo Järvinen
1 Introduction
2 Annotating 200 million words
3 ENGCG Syntax
4 FDG parser
5 Conclusion


Chapter 4 COMPLETING PARSED CORPORA: FROM CORRECTION TO EVOLUTION
Sean Wallis
1 Introduction
2 Conventional post-correction
3 A paradigm shift: transverse correction
4 Critique

GERMAN TREEBANKS

Chapter 5 SYNTACTIC ANNOTATION OF A GERMAN NEWSPAPER CORPUS
Thorsten Brants, Wojciech Skut, Hans Uszkoreit
1 Introduction
2 Treebank development
3 Corpus annotation
4 Applications
5 Conclusions
Appendix: Tagsets

Chapter 6 ANNOTATION OF ERROR TYPES FOR A GERMAN NEWSGROUP CORPUS
Markus Becker, Andrew Bredenkamp, Berthold Crysmann, Judith Klein
1 Introduction
2 Corpus Description
3 Annotation Strategy
4 Annotation Tools
5 Evaluation
6 First Results
7 Conclusion