Integrating heterogeneous thesauruses for Chinese synonyms
Jianbing ZHANG, Peng WU, Yingjie ZHANG, Shujian HUANG, Xinyu DAI, Jiajun CHEN
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
© Higher Education Press 2020
This section compares CCD [4] with Cilin [5] in terms of vocabulary, taxonomy, and organization.

Vocabulary. CCD [4] is a Chinese thesaurus built on the basis of WordNet [1]. It collects 125,929 words (or phrases) into 99,642 semantic classes (called CSynsets). The CSynsets in CCD capture the concepts common to Chinese and English, but miss a large number of Chinese-specific words. Cilin [5], by contrast, is a thesaurus that covers a wide range of Chinese-specific phenomena and collects both content and function words. There are 77,457 words in Cilin, only about half of which are covered by CCD. However, Cilin may not list every sense of a word, because its vocabulary is focused on synonyms. Because CCD and Cilin serve different purposes, a word form that is "monosemous" in each of them may nonetheless convey a totally different meaning in one than in the other.

Taxonomy. CCD follows WordNet and divides all CSynsets into 45 categories. The CSynsets in the same category share the same part of speech (POS). In Cilin, all concepts are separated into 12 classes, 94 medium classes (MC), and 1,425 subdivisions.

Organization. CCD follows WordNet, and its basic unit is the CSynset. Words in a CSynset are considered synonyms. CSynsets are divided into 45 categories and linked by semantic relations, some of which carry hierarchical information. If we treat CSynsets as nodes and hierarchical relations as edges, CCD can be represented as a forest-like graph. Cilin is closer to a traditional thesaurus than CCD is. Linguists predefined a five-level hierarchical organization for it; from top to bottom, the levels are Class, MC, Subdivision, Word Group (WG), and Word Set (WS). The first word in a WG is called the title word (TW), which is the most representative one in the WS. The meanings of the other words in the same WG are basically the same as that of the TW. The differences that remain are represented by the division into word sets, so words in the same WS have a very high degree of semantic similarity.
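The two organizations described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names `CSynset` and `CilinPath`, the helper functions, and all labels and toy entries are hypothetical and are not taken from CCD, Cilin, or the paper.

```python
from dataclasses import dataclass, field


@dataclass
class CSynset:
    """A CCD-style node: synonymous words plus links to parent nodes."""
    synset_id: str
    words: set
    category: str  # stands in for one of the 45 categories (toy label)
    hypernyms: list = field(default_factory=list)  # ids of parent CSynsets


@dataclass(frozen=True)
class CilinPath:
    """Position of a word set in a Cilin-style five-level hierarchy."""
    cls: str
    medium_class: str
    subdivision: str
    word_group: str
    word_set: str

    def levels(self):
        # Ordered from top (Class) to bottom (Word Set).
        return (self.cls, self.medium_class, self.subdivision,
                self.word_group, self.word_set)


def forest_roots(synsets):
    """Return ids of CSynsets without a parent, i.e. the roots of the forest."""
    return {sid for sid, s in synsets.items() if not s.hypernyms}


def shared_levels(a, b):
    """Count how many levels two word sets share, starting from Class."""
    count = 0
    for x, y in zip(a.levels(), b.levels()):
        if x != y:
            break
        count += 1
    return count


# Toy data, invented for illustration only.
synsets = {
    "animal": CSynset("animal", {"动物"}, "noun"),
    "dog": CSynset("dog", {"狗", "犬"}, "noun", ["animal"]),
    "act": CSynset("act", {"行为"}, "noun"),
}
ws_a = CilinPath("A", "a", "01", "A", "01")
ws_b = CilinPath("A", "a", "01", "A", "02")
ws_c = CilinPath("A", "a", "01", "B", "01")
```

Here `forest_roots(synsets)` finds the CSynsets that start a tree, while `shared_levels` shows why two word sets in the same WG (`ws_a`, `ws_b`) are closer than two that only share a subdivision (`ws_a`, `ws_c`).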
CCD shows clear and rich semantic relations but misses many Chinese-specific words, while Cilin covers a wide range of Chinese words but lacks clear relations between them. Thus, we could build a better Chinese synonym resource by integrating the structure of CCD with the Chinese-specific words of Cilin. To this end, we propose two methods. First, we ignore the inner structure of CCD and Cilin and integrate them by direct mapping. Second, we take their structure into account and integrate them by hierarchical mapping.
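To make the idea of a structure-agnostic mapping concrete, the following is a minimal sketch, not the authors' procedure, which is not reproduced in this excerpt. It assumes that a Cilin word set is mapped to the CSynset with which it shares the most words; the function names `direct_map` and `integrate`, the `min_shared` threshold, and the toy entries are all hypothetical.

```python
from typing import Dict, Optional, Set


def direct_map(cilin_sets: Dict[str, Set[str]],
               csynsets: Dict[str, Set[str]],
               min_shared: int = 1) -> Dict[str, Optional[str]]:
    """Map each Cilin word set to the CSynset sharing the most words.

    Assumption: shared-word count is the only evidence used; the
    hierarchies of both resources are ignored entirely.
    """
    mapping: Dict[str, Optional[str]] = {}
    for ws_id, ws_words in cilin_sets.items():
        best_id, best_shared = None, 0
        for cs_id in sorted(csynsets):  # sorted for deterministic ties
            shared = len(ws_words & csynsets[cs_id])
            if shared > best_shared:
                best_id, best_shared = cs_id, shared
        mapping[ws_id] = best_id if best_shared >= min_shared else None
    return mapping


def integrate(cilin_sets: Dict[str, Set[str]],
              csynsets: Dict[str, Set[str]],
              mapping: Dict[str, Optional[str]]) -> Dict[str, Set[str]]:
    """Add the words of each mapped Cilin word set to its target CSynset."""
    merged = {cs_id: set(words) for cs_id, words in csynsets.items()}
    for ws_id, cs_id in mapping.items():
        if cs_id is not None:
            merged[cs_id] |= cilin_sets[ws_id]
    return merged


# Toy data, invented for illustration only.
cilin = {"ws1": {"高兴", "快乐", "开心"}, "ws2": {"桌子"}}
ccd = {"c1": {"高兴", "快乐"}, "c2": {"悲伤"}}

mapping = direct_map(cilin, ccd)
merged = integrate(cilin, ccd, mapping)
```

In this toy run, `ws1` is attached to `c1` and contributes the word that CCD lacked, while `ws2` stays unmapped. A pure overlap criterion like this can be misled when the same word form carries different meanings in the two resources, which is the risk noted in the vocabulary comparison.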
Received July 12, 2019; accepted December 26, 2019
3
E-mail: [email protected]
We first present the direct mapping procedure, which is a
1 Introduction
Lexical semantic resources play an important role in natural language processing. So far, many lexical semantic