Natural Language Information Retrieval
The last decade has been one of dramatic progress in the field of Natural Language Processing (NLP). This hitherto largely academic discipline has found itself at the center of an information revolution ushered in by the Internet age, as demand for human-
Text, Speech and Language Technology VOLUME 7
Series Editors: Nancy Ide, Vassar College, New York; Jean Véronis, Université de Provence and CNRS, France
Editorial Board: Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands; Kenneth W. Church, AT&T Bell Labs, New Jersey, USA; Judith Klavans, Columbia University, New York, USA; David T. Barnard, University of Regina, Canada; Dan Tufis, Romanian Academy of Sciences, Romania; Joaquim Llisterri, Universitat Autonoma de Barcelona, Spain; Stig Johansson, University of Oslo, Norway; Joseph Mariani, LIMSI-CNRS, France
The titles published in this series are listed at the end of this volume.
Natural Language Information Retrieval Edited by
Tomek Strzalkowski General Electric, Research & Development
SPRINGER-SCIENCE+BUSINESS MEDIA, B.V.
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 978-94-017-2388-6 (eBook) ISBN 978-90-481-5209-4 DOI 10.1007/978-94-017-2388-6
Printed on acid-free paper
All Rights Reserved
©1999 Springer Science+Business Media Dordrecht
Originally published by Kluwer Academic Publishers in 1999
No part of the material protected by this
1 were given to the text categorization algorithms for training.
EXTRACTION-BASED TEXT CATEGORIZATION
1. exploded
2. murder of
3. assassination of
4. was killed
5. was kidnapped
6. attack on
7. was injured
8. exploded in
9. death of
10. took_place
11. caused
12. claimed
13. was wounded
14. occurred
15. was located
16. took_place on
17. responsibility for
18. occurred on
19. was wounded in
20. destroyed
21. was murdered
22. one of
23. kidnapped
24. exploded on
25. died
Figure 5. The top 25 extraction patterns
AutoSlog-TS is the first system that can generate extraction patterns using only raw text as input. AutoSlog-TS needs both relevant and irrelevant sample texts to decide which patterns are most strongly associated with the domain. Not coincidentally, the preclassified corpus needed for AutoSlog-TS is exactly the same input that is required for the text categorization algorithms. We exploit the preclassified texts by processing them twice: once to generate extraction patterns and once to apply the extraction patterns to the texts. The extracted information, in the form of signatures and role fillers, is then analyzed statistically to identify classification terms that are highly correlated with a category.
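The statistical step described above can be illustrated with a minimal sketch. It assumes the extraction patterns have already been applied to each preclassified text, and it scores each pattern by the fraction of its occurrences that fall in relevant texts. That score, the function name rank_patterns, the min_freq cutoff, and the sample data are all illustrative assumptions; this is not the ranking function used by AutoSlog-TS.

```python
from collections import Counter

# Hypothetical sample: (patterns found in a text, whether the text is relevant).
SAMPLE_DOCS = [
    ({"exploded", "was killed"}, True),
    ({"exploded", "claimed"}, True),
    ({"claimed", "one of"}, False),
    ({"one of"}, False),
    ({"was killed", "one of"}, True),
]


def rank_patterns(docs, min_freq=2):
    """Rank patterns by the share of their texts that are relevant.

    Assumption: a simple conditional-probability estimate,
    P(relevant | pattern), stands in for the real scoring function.
    """
    total = Counter()     # number of texts in which each pattern fires
    relevant = Counter()  # number of relevant texts in which it fires
    for patterns, is_relevant in docs:
        for pattern in set(patterns):
            total[pattern] += 1
            if is_relevant:
                relevant[pattern] += 1

    # Drop rare patterns, whose estimated rates are unreliable.
    scored = [
        (pattern, relevant[pattern] / total[pattern], total[pattern])
        for pattern in total
        if total[pattern] >= min_freq
    ]
    # Highest relevance rate first; ties broken by frequency, then name.
    scored.sort(key=lambda item: (-item[1], -item[2], item[0]))
    return scored


if __name__ == "__main__":
    for pattern, rate, freq in rank_patterns(SAMPLE_DOCS):
        print(f"{pattern}: rate={rate:.2f} freq={freq}")
```

In this toy data, patterns that occur only in relevant texts rise to the top, while a generic pattern such as "one of" sinks because it fires mostly in irrelevant texts.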
4. Word-augmented relevancy signatures
Augmenting relevancy signatures with semantic features produced much better results than relevancy signatures alone in the MUC-4 terrorism domain (Riloff and Lehnert, 1994). But there was a price to pay. Augmented relevancy signatures need a semantic feature hierarchy and a dictionary of words tagged with semantic features. Consequently, using augmented relevancy signatures in a new domain requires an initial time investment that might not be acceptable for many applications. To eliminate the need for semantic features, we investigated whether the role fillers could be represented using lexical item