Topics extraction in incremental short texts based on LSTM


ORIGINAL ARTICLE

Xubo Zhang1 · Li Zhang1

Received: 8 April 2020 / Revised: 4 August 2020 / Accepted: 25 September 2020
© Springer-Verlag GmbH Austria, part of Springer Nature 2020

Abstract

With the development of online social media, topic extraction from short texts has become an important research field. How to extract topics, especially new topics that have not yet been recognized, from a growing and continually updated stream of short texts has attracted the attention of scholars. This paper focuses on constructing a system based on the long short-term memory (LSTM) model from deep learning. First, each short text is converted to a word vector matrix by the word2vec model. Then, two LSTM-based models are designed. One recognizes whether a text belongs to an existing topic or a new one. The other identifies whether two text samples belong to the same topic. Finally, a hierarchical clustering model uses the outputs of the two LSTM models to find the number of new topics. The experimental results show that the system constructed in this paper identifies new text topics well and achieves good algorithm performance.

Keywords  New topic · Existing topic · Topic extraction · LSTM
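The final stage outlined in the abstract, grouping texts flagged as new by how likely each pair is to share a topic, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cluster_new_texts`, the average-linkage rule, the 0.5 stopping threshold, and the hand-written similarity matrix (standing in for the output of the pairwise LSTM model) are all assumptions made for illustration.

```python
from itertools import combinations


def cluster_new_texts(same_topic_prob, threshold=0.5):
    """Average-linkage agglomerative clustering over pairwise similarities.

    same_topic_prob[i][j] is a hypothetical score in [0, 1] standing in for
    a pairwise model's estimate that texts i and j share a topic.
    Merging stops once no two clusters have mean similarity >= threshold
    (the threshold value is an assumption, not taken from the paper).
    """
    # Start with every text in its own cluster.
    clusters = [[i] for i in range(len(same_topic_prob))]

    def linkage(a, b):
        # Mean pairwise similarity between the members of two clusters.
        total = sum(same_topic_prob[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of clusters (x < y by construction).
        (x, y), best = max(
            (
                ((p, q), linkage(clusters[p], clusters[q]))
                for p, q in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Four texts assumed to have been flagged as "new topic" by the first model.
scores = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
groups = cluster_new_texts(scores)
print(len(groups), groups)
```

In this toy input, texts 0 and 1 score high with each other, as do texts 2 and 3, so the number of remaining clusters serves as the estimated number of new topics. A real system would replace `scores` with model outputs and would likely tune the linkage rule and threshold on held-out data.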

* Xubo Zhang · Li Zhang
1 University of International Business and Economics, Beijing, China

1 Introduction

With the development of online social media (OSM) platforms, an increasing number of short texts are uploaded to a variety of media, such as Twitter and product review communities. Because these texts express users’ interests, the analysis and utilization of short texts have aroused the interest of academia and industry (Kušen et al. 2019; Grégoire et al. 2014). The text topic is a highly abstract summary of the text. Once the text topic has been understood, these discrete and disordered text data can be grasped and utilized efficiently. This can help companies by providing more information about their users and their interests. Therefore, an increasing number of scholars are studying text topic extraction, which has become one of the most essential and fundamental technologies in natural language processing (NLP) and is widely used in emergency handling (Kejriwal and Zhou 2020; Interdonato et al. 2019), news (de Souza et al. 2020; Park et al. 2020), product review analysis (Santos et al. 2020), and other areas. Text classification methods based on machine learning have gradually developed and become the mainstream methods of text topic extraction. Generally, text topic classification means assigning a text to one or several categories from a pre-given set of subject category labels according to the content of the text, which is usually expressed by the text topics. At present, supervised, unsupervised, and semi-supervised learning have all been applied to text classification. Many text classifiers use supervised techniques, like