Construction of a Probabilistic Hierarchical Structure Based on a Japanese Corpus and a Japanese Thesaurus




Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo, Japan
Nissay Information Technology Co. Ltd., 5-37-1, Kamata, Tokyo, Japan

Abstract. The purpose of this study is to construct a probabilistic hierarchical structure of categories based on a statistical analysis of Japanese corpus data and to verify the validity of the structure by conducting a psychological experiment. First, the co-occurrence frequencies of adjectives and nouns within modification relations were extracted from a Japanese corpus. Second, a probabilistic hierarchical structure was constructed based on the probability P(category|noun), which represents the category membership of the nouns, using the categorization information in a thesaurus and a soft clustering method (Rose’s method [1]) with the co-occurrence frequencies as initial values. This method makes it possible to identify the constructed hierarchical structure. To examine the validity of the constructed hierarchy, a psychological experiment was conducted. The results of the experiment verified the psychological validity of the hierarchical structure.
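
The soft category membership P(category|noun) described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the co-occurrence counts, the adjective columns, the flat two-category thesaurus, the squared Euclidean distance, the fixed value of beta, and all function names are assumptions made for illustration. Rose's method gradually raises the inverse temperature beta (deterministic annealing); the sketch keeps beta fixed for brevity and initializes each category from the nouns the hypothetical thesaurus lists under it.

```python
import math

# Hypothetical adjective-noun co-occurrence counts (columns: small, red, heavy).
counts = {
    "sparrow": [9, 1, 0],
    "robin":   [8, 2, 0],
    "penguin": [5, 1, 4],
    "truck":   [0, 1, 9],
    "bus":     [1, 1, 8],
}
# Hypothetical thesaurus: member nouns per category, with no degree of association.
thesaurus = {"bird": ["sparrow", "robin", "penguin"], "vehicle": ["truck", "bus"]}
categories = list(thesaurus)


def normalize(row):
    """Turn raw counts into a probability distribution over adjectives."""
    total = sum(row)
    return [c / total for c in row]


def sq_dist(p, q):
    """Squared Euclidean distance (an assumed choice of distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def memberships(vecs, centroids, beta):
    """P(category | noun) as a Gibbs distribution over distances to centroids."""
    probs = {}
    for noun, v in vecs.items():
        weights = [math.exp(-beta * sq_dist(v, c)) for c in centroids]
        z = sum(weights)
        probs[noun] = [w / z for w in weights]
    return probs


def update_centroids(vecs, probs, k):
    """Each centroid becomes the membership-weighted mean of all noun vectors."""
    nouns = list(vecs)
    dim = len(vecs[nouns[0]])
    centroids = []
    for j in range(k):
        mass = sum(probs[n][j] for n in nouns)
        centroids.append(
            [sum(probs[n][j] * vecs[n][d] for n in nouns) / mass for d in range(dim)]
        )
    return centroids


vecs = {noun: normalize(row) for noun, row in counts.items()}
# Initialize each category centroid from its thesaurus members (assumption).
centroids = [mean([vecs[n] for n in thesaurus[c]]) for c in categories]
beta = 10.0  # fixed inverse temperature; an annealing schedule would increase it

for _ in range(5):
    probs = memberships(vecs, centroids, beta)
    centroids = update_centroids(vecs, probs, len(categories))
probs = memberships(vecs, centroids, beta)

bird = categories.index("bird")
for noun in vecs:
    print(f"P(bird | {noun}) = {probs[noun][bird]:.3f}")
```

On this toy data, "sparrow" receives a higher "bird" membership than "penguin", and "truck" receives a much lower one. The thesaurus alone lists all three birds as equal members, so the graded values come entirely from the co-occurrence profiles.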

T. Tokunaga and A. Ortega (Eds.): LKR 2008, LNAI 4938, pp. 132–147, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

There are many kinds of thesauri. Japanese examples include the EDR concept classification dictionary [2], Goitaikei (A Japanese Lexicon) [3], and BUNRUI-GOI-HYO [4]. Generally, people consult thesauri when seeking more appropriate words or expressions for their writing. Because of their comprehensiveness, thesauri have also come to be used in many fields; natural language processing is one field that actively uses this kind of resource. However, thesauri do not necessarily reflect the knowledge structure of human beings. They merely indicate which semantic category a given noun belongs to. Human beings, in contrast, can generally distinguish nouns that are strongly associated with a category, and thus representative of it, from nouns that are only weakly associated. For example, people tend to regard both “sparrow” and “robin” as strongly associated with the “bird” category, whereas “penguin” is regarded as only weakly associated with it. A thesaurus would simply list “sparrow”, “robin” and “penguin” as belonging to the “bird” category, without indicating the degree of association for each noun. In other words, thesauri do not contain probabilistic information indicating how representative a noun is of a category, that is to say, the association probability of a noun belonging to a category. Such probabilistic information is important for constructing more human-like knowledge structures, such as a hierarchical structure of concepts that contains probabilistic information. Accordingly, this study constructs a probabilistic hierarchical structure of nouns that realize