Wikipedia Mining for an Association Web Thesaurus Construction
Abstract. Wikipedia has become a huge phenomenon on the WWW. As a corpus for knowledge extraction, it has several impressive characteristics, such as a huge number of articles, live updates, a dense link structure, brief link texts, and URL identification for concepts. In this paper, we propose an efficient link mining method, pfibf (Path Frequency - Inversed Backward link Frequency), and an extension method, “forward / backward link weighting (FB weighting)”, in order to construct a huge-scale association thesaurus. We demonstrated the effectiveness of our proposed methods by comparing them with conventional methods such as cooccurrence analysis and TF-IDF.
1 Introduction
A thesaurus is a kind of dictionary that defines semantic relatedness among words. Although its effectiveness has been widely demonstrated in research areas such as natural language processing (NLP) and information retrieval (IR), automated construction of thesauri (especially machine-understandable ones) remains one of the most difficult issues. Of course, the simplest way to construct a thesaurus is by human effort. Thousands of contributors have spent much time constructing high-quality thesaurus dictionaries in the past. However, since it is difficult to maintain such huge-scale thesauri, they do not support new concepts in most cases. Therefore, a large number of studies have been made on automated thesaurus construction based on NLP. However, issues due to the complexity of natural language, for instance the problems of ambiguous terms and synonyms, still remain in NLP. We still need an effective method to construct a high-quality thesaurus automatically while avoiding these problems.
We noticed that Wikipedia, a collaborative wiki-based encyclopedia, is a promising corpus for thesaurus construction. According to statistics published in Nature, Wikipedia is about as accurate in covering scientific topics as the Encyclopedia Britannica [1]. It covers concepts in various fields such as Arts, Geography, History, Science, Sports, and Games. It contains more than 1.3 million articles (Sept. 2006) and it is becoming larger day by day. Because of its huge-scale concept network with wide-ranging topic coverage, it is natural to think that Wikipedia can be used as a knowledge extraction corpus. In fact, we already showed that it can be used for accurate association thesaurus construction [2]. Further, several studies have already shown the importance and effectiveness of Wikipedia Mining [3,4,5,6].
However, what seems lacking in these methods is deep consideration of how to improve accuracy and scalability. After a number of continuous experiments, we realized that there is room to improve the accuracy, because the accuracy changes depending on particular situations. Further, no previous research has focused on scalability. WikiRelate [4], for instance, measures the relatedness between two given terms by analyzing (searchi
B. Benatallah et al. (Eds.): WISE 2007, LNCS 4831, pp. 322–334, 2007. © Springer-Verlag Berlin Heidelberg 2007