Comparison of Different Similarity Functions on Hindi QA System
Abstract This paper presents a comparative analysis of different similarity measures for a Hindi question answering system using a machine learning approach, from information retrieval and classification perspectives. Many machine learning tasks require similarity functions that evaluate the likeness between observations. Similarity computations are particularly important for clustering, which depends on a precise estimate of the distance between data points. The framework is considered for data matching of multiphrase words and misspelled words.
Keywords Hindi question answering system · Machine learning · Data mining · Similarity functions · Text similarity measure · N-gram approach · Jaccard coefficient similarity · Euclidean similarity measure · Jaro–Winkler
B. Sneha (✉), Department of Computer Science and Engineering, Banasthali Vidyapith, Banasthali, India; e-mail: [email protected]
D. Mohit, V. Zorawar Singh, Department of Computer Engineering, National Institute of Technology, Kurukshetra, India; e-mail: [email protected]
V. Zorawar Singh; e-mail: [email protected]
© Springer Science+Business Media Singapore 2016. S.C. Satapathy et al. (eds.), Proceedings of International Conference on ICT for Sustainable Development, Advances in Intelligent Systems and Computing 408, DOI 10.1007/978-981-10-0129-1_68
1 Introduction A question answering system includes a data matching process that aims to determine whether two data occurrences represent the same entity. This approximate data matching process relies on similarity functions [1]. Similarity measures have become an extremely popular tool in machine learning. One of the problems that arises in a QA system using machine learning is data mining. Data is an essential entity or fact of concern, but we should know how to retrieve or extract the useful entity from large volumes of raw data. Data mining techniques help us accomplish this [1]. Data mining depends on distance estimates between observations. The concept of similarity can differ depending on the particular domain, task, or dataset available. It is desirable to learn similarity functions from training data to capture the correct notion of distance for a particular task in a given domain. Another key application that can benefit from learnable similarity functions is clustering [2].
2 Different Similarity Functions A text document can be modeled in many ways, with “bag-of-words” being the most prominent representation [3] in IR and data mining. A count of each word is maintained in the bag, and each word corresponds to a dimension in the resulting data space. Consequently, a word that appears in the document with high frequency contributes a high weight. This weight can rise if stemming is applied, as the counts of the N variants of a base word add up. Accurate clustering requires a precise definition of the closeness between a pair of items, expressed through a pairwise comparison. In our work, we first use the N-gram approach on the dataset. In [4], A.K. Patid
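The bag-of-words and N-gram ideas described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`bag_of_words`, `char_ngrams`, `jaccard`), the whitespace tokenization, the choice of character bigrams, and the use of the Jaccard coefficient (named in the keywords) to compare N-gram sets are all assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each whitespace-separated token occurs (assumed tokenization)."""
    return Counter(text.lower().split())


def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, without padding."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


# A frequent word receives a higher weight in the bag.
counts = bag_of_words("the cat saw the dog")

# A misspelled word still shares most of its bigrams with the correct form.
score = jaccard(char_ngrams("similarity"), char_ngrams("similarty"))
```

Here `counts["the"]` is 2, and `score` is 0.7 because the two spellings share 7 of 10 distinct bigrams. Python strings are sequences of Unicode code points, so for Devanagari text this sketch would form N-grams over code points (including combining vowel signs) rather than over visual syllables; whether that is the appropriate unit for Hindi is left open.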