Wikipedia Mining for an Association Web Thesaurus Construction

Abstract. Wikipedia has become a huge phenomenon on the WWW. As a corpus for knowledge extraction, it has various impressive characteristics such as a huge number of articles, live updates, a dense link structure, brief link texts and URL identification for concepts. In this paper, we propose an efficient link mining method, pfibf (Path Frequency - Inversed Backward link Frequency), and an extension method, "forward / backward link weighting (FB weighting)", in order to construct a huge-scale association thesaurus. We demonstrate the effectiveness of our proposed methods compared with conventional methods such as co-occurrence analysis and TF-IDF.

1 Introduction

A thesaurus is a kind of dictionary that defines semantic relatedness among words. Although its effectiveness has been widely demonstrated in research areas such as natural language processing (NLP) and information retrieval (IR), automated construction of thesaurus dictionaries (especially machine-understandable ones) remains one of the most difficult issues. Of course, the simplest way to construct a thesaurus is by human effort. Thousands of contributors have spent much time constructing high-quality thesaurus dictionaries in the past. However, since it is difficult to maintain such huge-scale thesauri, they do not support new concepts in most cases. Therefore, a large number of studies have been made on automated thesaurus construction based on NLP. However, issues due to the complexity of natural language, for instance the problems of ambiguous terms and synonyms, still remain in NLP. We still need an effective method to construct a high-quality thesaurus automatically while avoiding these problems.

We noticed that Wikipedia, a collaborative wiki-based encyclopedia, is a promising corpus for thesaurus construction. According to statistics published in Nature, Wikipedia is about as accurate in covering scientific topics as the Encyclopedia Britannica [1]. It covers concepts of various fields such as Arts, Geography, History, Science, Sports and Games. It contains more than 1.3 million articles (Sept. 2006) and is growing day by day. Because of its huge-scale concept network with wide-ranging topic coverage, it is natural to think that Wikipedia can be used as a knowledge extraction corpus. In fact, we have already shown that it can be used for accurate association thesaurus construction [2]. Further, several studies have already demonstrated the importance and effectiveness of Wikipedia Mining [3,4,5,6].

However, what seems lacking in these methods is deep consideration of how to improve accuracy and scalability. After a number of continuous experiments, we realized that there is room to improve accuracy, because accuracy changes depending on the particular situation. Further, no previous research has focused on scalability. WikiRelate [4], for instance, measures the relatedness between two given terms by analyzing (searchi

B. Benatallah et al. (Eds.): WISE 2007, LNCS 4831, pp. 322–334, 2007. © Springer-Verlag Berlin Heidelberg 2007
