A novel ensemble statistical topic extraction method for scientific publications based on optimization clustering
Ammar Kamal Abasi¹ · Ahamad Tajudin Khader¹ · Mohammed Azmi Al-Betar²,³ · Syibrah Naim⁴ · Sharif Naser Makhadmeh¹ · Zaid Abdi Alkareem Alyasseri¹,⁵
Received: 13 October 2019 / Revised: 26 July 2020 / Accepted: 29 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract Automatic topic extraction (TE) from scientific publications provides a compact summary of each cluster's contents, which helps in locating information and in defining the boundaries of scientific fields. Text document clustering (TDC) is generally the first step of topic identification: it groups the documents that address a related subject matter. Metaheuristics are typically used as efficient approaches for TDC. The multi-verse optimizer (MVO) is a recently proposed stochastic population-based algorithm that has been successfully applied to many hard optimization problems. Each statistical TE method focuses on different aspects of the language feature space. The aim of this paper is to design a novel ensemble method for automatic TE from a collection of scientific publications, with MVO as the clustering algorithm. The automatic TE methods used in our approach are term frequency-inverse document frequency (TF-IDF), most-frequent-based keyword extraction (TF), co-occurrence statistical information-based keyword extraction (CSI), TextRank (TR), and mutual information (MI). Each automatic TE method provides a group of candidate topics to the proposed ensemble method. The ensemble approach then prunes the candidate topic set by applying a specific filtering heuristic, recalculates the candidates' scores based on the prescribed metrics, and applies dynamic threshold functions to select a set of topics for the given scientific publications. The findings emphasize the efficiency and effectiveness of the refined candidate set. The results also show that the new topics improve the system's quality. The proposed method achieved better precision and recall on a similar dataset than state-of-the-art TE methods.
Keywords Topic extraction · Ensemble methods · Multi-Verse optimizer · Scientific text clustering · Metaheuristic algorithm
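To make the pipeline outlined in the abstract concrete, the following is a minimal sketch of two of the named extractors (TF and TF-IDF) feeding a toy ensemble step. The abstract does not specify the filtering heuristic, the rescoring metrics, or the dynamic threshold functions, so the choices here are assumptions made purely for illustration: pruning by a minimum vote count (`min_votes`), rescoring by summed reciprocal rank, and thresholding at the mean score. The function names are hypothetical, the idf form `log(N / df)` is only one common variant, and the MVO clustering step is omitted.

```python
import math
from collections import Counter


def tf_scores(doc_tokens):
    """Relative frequency of each term within a single document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: n / total for term, n in counts.items()}


def tfidf_scores(doc_tokens, corpus):
    """TF-IDF using idf = log(N / df); the document is assumed to be in the corpus."""
    doc_sets = [set(d) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, freq in tf_scores(doc_tokens).items():
        df = sum(1 for s in doc_sets if term in s)
        scores[term] = freq * math.log(n_docs / max(df, 1))
    return scores


def top_k(scores, k):
    """Candidate list of one extractor: the k best terms, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def ensemble_topics(candidate_lists, min_votes=2):
    """Illustrative ensemble (assumed rules, not the paper's): prune, rescore, threshold."""
    votes = Counter()
    rank_score = Counter()
    for candidates in candidate_lists:
        for rank, term in enumerate(candidates):
            votes[term] += 1
            rank_score[term] += 1.0 / (rank + 1)  # reciprocal-rank rescoring
    # Assumed filtering heuristic: keep terms proposed by at least min_votes extractors.
    kept = {t: rank_score[t] for t in votes if votes[t] >= min_votes}
    if not kept:
        return []
    # Assumed dynamic threshold: the mean score of the surviving candidates.
    threshold = sum(kept.values()) / len(kept)
    selected = [t for t, s in kept.items() if s >= threshold]
    return sorted(selected, key=lambda t: (-kept[t], t))
```

In this sketch each extractor contributes only a ranked list, so extractors with incomparable score scales (for example TextRank and mutual information) could be added as further lists without changing the ensemble step.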
Ammar Kamal Abasi
ammar [email protected]
Extended author information available on the last page of the article.
Multimedia Tools and Applications
1 Introduction Scholarly publications are generally reliable sources of data. They make it possible to investigate the growth of knowledge and science, and they are a principal channel of scientific communication. Written documents therefore carry the scientific explanations and knowledge that constitute the existing scientific literature [14]. In the current digital era, gigantic online scientific publications in text d