Treebanks: Building and Using Parsed Corpora
Linguists and engineers in Natural Language Processing increasingly rely on electronic corpora. Most research has long been limited to raw (unannotated) texts or to tagged texts (annotated with parts of speech only), but these approaches suffer from
Text, Speech and Language Technology, Volume 20
Series Editors
Nancy Ide, Vassar College, New York
Jean Véronis, Université de Provence and CNRS, France
Editorial Board
Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands
Kenneth W. Church, AT&T Bell Labs, New Jersey, USA
Judith Klavans, Columbia University, New York, USA
David T. Barnard, University of Regina, Canada
Dan Tufiş, Romanian Academy of Sciences, Romania
Joaquim Llisterri, Universitat Autònoma de Barcelona, Spain
Stig Johansson, University of Oslo, Norway
Joseph Mariani, LIMSI-CNRS, France
The titles published in this series are listed at the end of this volume.
Treebanks: Building and Using Parsed Corpora
Edited by Anne Abeillé
Université Paris 7, Paris, France
Springer Science+Business Media, LLC
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 978-1-4020-1335-5 ISBN 978-94-010-0201-1 (eBook) DOI 10.1007/978-94-010-0201-1
Printed on acid-free paper
All Rights Reserved
© 2003 Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers 2003
Softcover reprint of the hardcover 1st edition 2003
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Contents
Preface
Introduction
Anne Abeillé
1 Building Treebanks
2 Using treebanks
Part I
Building treebanks
ENGLISH TREEBANKS
Chapter 1
THE PENN TREEBANK: AN OVERVIEW
Ann Taylor, Mitchell Marcus, Beatrice Santorini
1 The annotation schemes
2 Methodology
3 Conclusions
Chapter 2
THOUGHTS ON TWO DECADES OF DRAWING TREES
Geoffrey Sampson
1 Historical background
2 Building treebanks
3 Exploiting the SUSANNE Treebank
4 Small is beautiful
5 Annotating a spoken corpus
6 Using the CHRISTINE Corpus
7 Conclusion
Chapter 3
BANK OF ENGLISH AND BEYOND
Timo Järvinen
1 Introduction
2 Annotating 200 million words
3 ENGCG Syntax
4 FDG parser
5 Conclusion
Chapter 4
COMPLETING PARSED CORPORA: FROM CORRECTION TO EVOLUTION
Sean Wallis
1 Introduction
2 Conventional post-correction
3 A paradigm shift: transverse correction
4 Critique
GERMAN TREEBANKS
Chapter 5
SYNTACTIC ANNOTATION OF A GERMAN NEWSPAPER CORPUS
Thorsten Brants, Wojciech Skut, Hans Uszkoreit
1 Introduction
2 Treebank development
3 Corpus annotation
4 Applications
5 Conclusions
Appendix: Tagsets
Chapter 6
ANNOTATION OF ERROR TYPES FOR A GERMAN NEWSGROUP CORPUS
Markus Becker, Andrew Bredenkamp, Berthold Crysmann, Judith Klein
1 Introduction
2 Corpus Description
3 Annotation Strategy
4 Annotation Tools
5 Evaluation
6 First Results
7 Conclusion