Tamil Paraphrase Detection Using Encoder-Decoder Neural Networks

Abstract. Detecting paraphrases in Indian languages requires critical analysis of lexical, syntactic and semantic features. Since the structure of Indian languages differs from that of languages such as English, the usage of lexico-syntactic features varies among the Indian languages and plays a critical role in determining the performance of the system. Instead of using various lexico-syntactic similarity features, we aim to apply a complete end-to-end system using deep learning networks with no lexico-syntactic features. In this paper, we exploit the encoder-decoder model of deep neural networks to analyze and classify paraphrase sentences in the Tamil language. In this encoder-decoder model, LSTM and GRU units and gNMT are used as layers along with an attention mechanism. Using this end-to-end model, the f1-measure for subtask-1 increases by 0.5% compared to the state-of-the-art systems. The system was trained and evaluated on the DPIL@FIRE2016 Shared Task dataset. To our knowledge, ours is the first deep learning model that validates the training instances of both the subtask-1 and subtask-2 datasets of the DPIL shared task.

Keywords: Tamil paraphrase detection · Deep learning · Encoder-decoder model · Sequence-to-sequence (Seq-2-Seq)
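The abstract reports its improvement in terms of f1-measure. As a reminder of how that metric is typically computed from gold and predicted labels, the following is a minimal Python sketch. The function name `f1_scores`, the labels "P" and "NP", and the toy data are illustrative assumptions; they are not taken from the paper or from the DPIL evaluation script, and the macro-averaging shown here may differ from how the shared task aggregates scores.

```python
from collections import Counter


def f1_scores(gold, predicted):
    """Return (per-class F1 dict, macro-averaged F1) for two label lists."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # class p was predicted wrongly
            fn[g] += 1   # class g was missed

    per_class = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        per_class[label] = 2 * precision * recall / total if total else 0.0

    macro = sum(per_class.values()) / len(per_class)
    return per_class, macro


# Hypothetical toy labels: "P" = paraphrase, "NP" = not a paraphrase.
gold = ["P", "P", "NP", "NP", "P", "NP"]
predicted = ["P", "NP", "NP", "P", "P", "NP"]
per_class, macro = f1_scores(gold, predicted)
print(per_class, macro)
```

The same function works unchanged for a labeling with more than two classes, since it iterates over whatever labels appear in the inputs.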

1 Introduction

© IFIP International Federation for Information Processing 2020. Published by Springer Nature Switzerland AG 2020. A. Chandrabose et al. (Eds.): ICCIDS 2020, IFIP AICT 578, pp. 30–42, 2020. https://doi.org/10.1007/978-3-030-63467-4_3

Recent advances in deep neural network architectures have motivated many researchers in natural language processing to apply them to different tasks such as PoS tagging, language modeling, NER, relation extraction, paraphrase detection and semantic role labeling. Paraphrase detection is a primary and important task for many NLP applications. Two sentences that convey the same meaning in a language are said to be semantically equivalent, or paraphrases. Paraphrases can be detected, extracted and generated. Paraphrase detection is an important task in paraphrase generation and extraction systems. In paraphrase generation, paraphrase detection improves quality by picking the best semantically equivalent paraphrase from the list of generated sentences. In paraphrase extraction, paraphrase detection plays a vital role in validating the extracted sentences. Apart from this, paraphrase detection is also used in document summarization, plagiarism detection, question-answering systems, etc.

Earlier systems detected paraphrases in sentences using different traditional machine learning techniques. Recently, paraphrase detection has been explored in some regional Indian languages. One shared task, the first of its kind for Indian languages, Detecting Paraphrases in Indian Languages (DPIL@FIRE2016) [9,10], highlighted the performance of different systems using traditional machine learning techniques. The issues associated with these systems are: