Enhance AdaBoost Algorithm by Integrating LDA Topic Model



1 School of Computer, National University of Defense Technology, Changsha, China
[email protected], [email protected]
2 Network Service Center, Beijing Institute of Technology, Beijing, China
{Lizq,guohongchen}@bit.edu.cn

Abstract. AdaBoost is an ensemble method considered to be one of the most influential algorithms for multi-label classification. It has been successfully applied in diverse domains thanks to its great simplicity and accurate prediction. To choose its weak hypotheses, AdaBoost has to examine every feature individually, which dramatically increases the computational cost of classification, especially for large-scale datasets. To tackle this problem, we introduce the Latent Dirichlet Allocation (LDA) model to improve the efficiency and effectiveness of AdaBoost by mapping the word matrix into a topic matrix. In this paper, we propose a framework integrating LDA and AdaBoost, and we test it on two Chinese-language corpora. Experiments show that our method outperforms traditional AdaBoost using the BOW model.

Keywords: AdaBoost · Ensemble method · Text categorization

1 Introduction

AdaBoost is an adaptive boosting algorithm [6] with accurate prediction and great simplicity, and it has become one of the most influential ensemble methods for classification tasks. The core idea of AdaBoost is to generate a committee of weak hypotheses and combine them with weights. In each iteration, AdaBoost improves performance depending on the accuracy of the previous classifiers. Ferreira and Figueiredo [4] review the AdaBoost algorithm in detail, and its variants have been exploited in diverse domains such as text categorization, face detection, remote sensing image detection, barcode recognition, and banknote number recognition [15]. AdaBoost was designed only for binary classification, while AdaBoost.MH [11] extends it to multi-class multi-label classification. In each iteration, AdaBoost.MH selects a "pivot term" that previous classifiers have found hard to classify. An improved version of AdaBoost.MH, called MP-Boost, was proposed by Esuli et al. [3]. In each iteration of the boosting process, MP-Boost selects several "pivot terms", one for each category, instead of the single term shared by all categories in AdaBoost.MH. This mechanism improves both effectiveness and efficiency.

© Springer International Publishing Switzerland 2016
Y. Tan and Y. Shi (Eds.): DMBD 2016, LNCS 9714, pp. 27–37, 2016.
DOI: 10.1007/978-3-319-40973-3_3
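To make the iterative scheme above concrete, the following is a minimal sketch of generic binary AdaBoost with one-feature threshold stumps as weak hypotheses. It is an illustration of the weighted-committee idea only, not the multi-label AdaBoost.MH variant discussed in this paper; NumPy and all function names here are our own assumptions.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=10):
    """Binary AdaBoost sketch. Labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)      # example weights, updated each round
    ensemble = []                # list of (alpha, feature, threshold, polarity)
    for _ in range(n_rounds):
        best, best_err = None, np.inf
        # The weak learner scans every feature/threshold pair; this
        # exhaustive scan over the feature space is exactly the cost
        # that motivates reducing dimensionality before boosting.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] >= thr, 1, -1)
                    err = w[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (j, thr, pol)
        eps = max(best_err, 1e-10)           # avoid division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # weight of this weak hypothesis
        j, thr, pol = best
        pred = pol * np.where(X[:, j] >= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)       # upweight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(ensemble, X):
    """Weighted vote of all weak hypotheses in the committee."""
    score = np.zeros(len(X))
    for alpha, j, thr, pol in ensemble:
        score += alpha * pol * np.where(X[:, j] >= thr, 1, -1)
    return np.sign(score)
```

The reweighting step is where the "pivot" intuition enters: examples that earlier stumps misclassify gain weight, so later rounds concentrate on the hard cases, just as AdaBoost.MH concentrates on terms that earlier classifiers found hard.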

F. Gai et al.

Both methods mentioned above have to scan the whole feature space to select the pivot term or terms, which makes them sensitive to the number of features. For Text Categorization (TC), traditional methods use the Vector Space Model (VSM) with a Bag-of-Words (BOW) representation of the original corpus, forming a high-dimensional sparse matrix. This makes pivot selection a time-consuming task for large-scale datasets. To accelerate the boosting process, it is therefore necessary to use feature selection or feature extraction techniques,