Parallel Dictionary Compression Using Grid Technologies
Abstract. This paper introduces a novel algorithm that approaches dictionary compression without preliminary knowledge of grammatical rules. The solution can process any type of language effectively, except for incorporating ones. The algorithm cuts words derived from the same stem into base word, prefix and suffix groups, from which a hierarchical dictionary is constructed that allows spell checking, possible stem determination, and efficient distributed parallel pattern matching. By eliminating the severe redundancy in the words' simple tree representation, the compression ratio can be significantly better than with conventional techniques.
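To make the idea of base word, prefix and suffix groups more concrete, the following is a minimal Python sketch, not the paper's actual algorithm. It assumes a hypothetical list of candidate stems is already available (the paper extracts this information from the input; that step is not shown here), and the function names `split_word`, `build_dictionary` and `contains` are illustrative only. It shows how stems that take the same affixes can share a single stored affix group, which is one simple way to remove the kind of redundancy mentioned above.

```python
from collections import defaultdict


def split_word(word, stems):
    """Split word into (prefix, stem, suffix) using the longest stem it contains.

    Returns None if no candidate stem occurs in the word.
    """
    for stem in sorted(stems, key=len, reverse=True):
        pos = word.find(stem)
        if pos != -1:
            return word[:pos], stem, word[pos + len(stem):]
    return None


def build_dictionary(words, stems):
    """Build a two-level dictionary: stem -> group id, group id -> affix pairs."""
    affixes = defaultdict(set)
    for word in words:
        parts = split_word(word, stems)
        if parts is not None:
            prefix, stem, suffix = parts
            affixes[stem].add((prefix, suffix))

    group_ids = {}      # affix set -> index into groups
    groups = []         # each distinct affix set is stored only once
    stem_to_group = {}  # stem -> index of its affix set
    for stem, pairs in affixes.items():
        key = frozenset(pairs)
        if key not in group_ids:
            group_ids[key] = len(groups)
            groups.append(key)
        stem_to_group[stem] = group_ids[key]
    return stem_to_group, groups


def contains(word, stems, stem_to_group, groups):
    """Spell-check style lookup: is this (prefix, stem, suffix) combination stored?"""
    parts = split_word(word, stems)
    if parts is None:
        return False
    prefix, stem, suffix = parts
    group_id = stem_to_group.get(stem)
    return group_id is not None and (prefix, suffix) in groups[group_id]


# Hypothetical toy input: eight words built from two stems.
stems = ["play", "load"]
words = ["play", "plays", "played", "replay",
         "load", "loads", "loaded", "reload"]
stem_to_group, groups = build_dictionary(words, stems)
```

In this toy input both stems take the same four affix combinations, so the eight words are stored as two stems pointing at one shared group. A real implementation would also have to discover the stems and handle words with several possible splits, which this sketch deliberately leaves out.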
1 Introduction
With the spread of embedded systems, the need for efficient, transparent dictionary compression is becoming more pressing. Input methods are evolving from unfamiliar formal commands toward natural human language, mainly because the amount of digitally exchanged information is growing at a tremendous rate. This information consists mainly of three parts: audio, video and text. In most cases the problems of audio and video compression have been extensively analyzed and partially solved by the industry in response to public demand. Since the demand for natural language support in electronic equipment is also increasing, it is indispensable to develop an effective method to compress and store languages.
Describing languages involves several difficulties [2,3]. First, every language has its own peculiarities: the structure of languages is diverse, and their grammar varies widely too. Second, the words derived from a stem cannot be determined from the grammatical rules and the grammatical category of the stem alone; the meaning has to be taken into account as well [4]. This renders a generative algorithm based on grammatical rules nearly useless. Third, the size of the uncompressed dictionary is extremely large (5–40 GB), and it would be desirable to use the dictionary in an environment where resources are limited. This means that the dictionary has to be compact enough to fit into the device and has to be accessible through low-cost methods, since the available computational resources are limited too. These contradictory requirements have to be met simultaneously in order to create a viable and widely usable system.
I. Lirkov, S. Margenov, and J. Waśniewski (Eds.): LSSC 2007, LNCS 4818, pp. 492–499, 2008. © Springer-Verlag Berlin Heidelberg 2008
The main aim of this paper is to solve the problems of currently used dictionary compression methods and to provide a distributed multi-language dictionary which, as a first step, facilitates word-level storage but can be extended to support sentence-level rules. The algorithm is able to extract the grammatical or meaning-based rules from the input. This eliminates the dictionary's direct dependency on generative grammatical rules. These extracted rules do not