Paraphrase detection using LSTM networks and handcrafted features
Hassan Shahmohammadi¹ · MirHossein Dezfoulian¹ · Muharram Mansoorizadeh¹
Received: 14 March 2020 / Revised: 23 August 2020 / Accepted: 29 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Paraphrase detection is one of the fundamental tasks in natural language processing. Paraphrases are sentences or phrases that convey the same meaning but use different wording. Paraphrase detection has many applications, such as machine translation, text summarization, QA systems, and plagiarism detection. In this research, we propose a new deep-learning-based model that generalizes well despite the lack of training data for deep models. After preprocessing, our model can be divided into two separate modules. In the first one, we train a single Bi-LSTM neural network to encode the whole input by leveraging its pretrained GloVe word vectors. In the second module, three sets of handcrafted features, some of which are introduced in this research for the first time, are used to measure the similarity between each pair of sentences. Our final model is formed by combining the handcrafted features with the output of the Bi-LSTM network. Evaluation results on the MSRP and Quora datasets show that it outperforms almost all previous works in terms of F-measure and accuracy on MSRP and achieves comparable results on Quora. In the Quora question-pair competition launched by Kaggle, our model ranked among the top 24% of solutions out of more than 3,000 teams.
Keywords Paraphrase detection · Short text similarity · Deep learning · Feature engineering · Information fusion
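The fusion step described in the abstract, in which handcrafted similarity features are combined with the output of the Bi-LSTM network, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the two features (Jaccard token overlap and length ratio) are generic stand-ins and not the feature sets proposed in the paper, the function names `tokenize`, `handcrafted_features`, and `fuse` are hypothetical, and `pair_encoding` is a hard-coded placeholder for a learned Bi-LSTM encoding rather than the output of a trained network. Concatenation is assumed as the combination operator; the abstract does not state how the two parts are joined.

```python
import re


def tokenize(sentence):
    """Lowercase and split into alphanumeric tokens (a simple stand-in for preprocessing)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def handcrafted_features(s1, s2):
    """Return two illustrative similarity features for a sentence pair.

    These features are assumptions chosen for illustration only;
    they are not the feature sets used in the paper.
    """
    tokens1, tokens2 = tokenize(s1), tokenize(s2)
    set1, set2 = set(tokens1), set(tokens2)
    union = set1 | set2
    # Jaccard overlap of the two token sets (0.0 when both are empty).
    jaccard = len(set1 & set2) / len(union) if union else 0.0
    # Ratio of the shorter token count to the longer one (0.0 when both are empty).
    longest = max(len(tokens1), len(tokens2))
    length_ratio = min(len(tokens1), len(tokens2)) / longest if longest else 0.0
    return [jaccard, length_ratio]


def fuse(encoder_output, features):
    """Concatenate a learned pair encoding with the handcrafted features (assumed fusion)."""
    return list(encoder_output) + list(features)


# Hypothetical placeholder for the Bi-LSTM output of a sentence pair.
pair_encoding = [0.12, -0.40, 0.88]
features = handcrafted_features("How do I learn Python?", "How can I learn Python?")
fused = fuse(pair_encoding, features)
```

In a full model, a vector such as `fused` would be passed to a classifier that predicts whether the two sentences are paraphrases; that classifier is outside the scope of this sketch.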
¹ Bu-Ali Sina University, Hamedan, Iran
Multimedia Tools and Applications
1 Introduction
With the ever-increasing volume of textual data on social media platforms such as Twitter and Facebook, measuring the semantic similarity of short texts is becoming more important, and hence related NLP tasks have been gaining a lot of attention. One such task is paraphrase detection, which tries to measure the semantic equivalence of two pieces of text. It is a critical task in many NLP applications such as machine translation, text summarization, QA systems, and plagiarism detection. In this research, we propose a new model that achieves decent performance despite the lack of sufficient training data for deep-learning-based models. Our model can be divided into three parts. The first part is a preprocessing step that prepares the sentences for the next step. In the second part, terms are mapped to their numerical representations using GloVe word embeddings [31]. The output of the embedding layer is then fed into a Bi-LSTM neural network [16], which encodes the whole sentence from its word vectors. In the third part, three sets of fine-grained handcrafted features are provided to measure the similarity between each pair of s