Enhance AdaBoost Algorithm by Integrating LDA Topic Model



1 School of Computer, National University of Defense Technology, Changsha, China
[email protected], [email protected]
2 Network Service Center, Beijing Institute of Technology, Beijing, China
{Lizq,guohongchen}@bit.edu.cn

Abstract. AdaBoost is an ensemble method considered to be one of the most influential algorithms for multi-label classification. It has been successfully applied in diverse domains thanks to its great simplicity and accurate prediction. To choose its weak hypotheses, AdaBoost has to examine every feature individually, which dramatically increases the computational cost of classification, especially for large-scale datasets. To tackle this problem, we introduce the Latent Dirichlet Allocation (LDA) model to improve the efficiency and effectiveness of AdaBoost by mapping the word matrix into a topic matrix. In this paper, we propose a framework integrating LDA and AdaBoost, and we test it on two Chinese-language corpora. Experiments show that our method outperforms traditional AdaBoost using the BOW model.

Keywords: AdaBoost · Ensemble method · Text categorization

1 Introduction

AdaBoost is an adaptive boosting algorithm [6] with accurate prediction and great simplicity, and it has become one of the most influential ensemble methods for classification tasks. The core idea of AdaBoost is to generate a committee of weak hypotheses and combine them with weights. In each iteration, AdaBoost improves performance depending on the accuracy of the previous classifiers. Ferreira and Figueiredo [4] review the AdaBoost algorithm in detail, and its variants have been exploited in diverse domains such as text categorization, face detection, remote sensing image detection, barcode recognition, and banknote number recognition [15]. AdaBoost was designed only for binary classification, while AdaBoost.MH [11] extends it to multi-class multi-label classification. In each iteration, AdaBoost.MH selects a "pivot term" that previous classifiers have found hard to classify. An improved version of AdaBoost.MH, called MP-Boost, was proposed by Esuli et al. [3]. In each iteration of the boosting process, MP-Boost selects several "pivot terms", one for each category, instead of the single term shared by all categories in AdaBoost.MH. This mechanism improves both effectiveness and efficiency.

© Springer International Publishing Switzerland 2016
Y. Tan and Y. Shi (Eds.): DMBD 2016, LNCS 9714, pp. 27–37, 2016.
DOI: 10.1007/978-3-319-40973-3_3
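To make the iterative scheme above concrete, the following is a minimal sketch of generic binary AdaBoost with one-feature threshold stumps as weak hypotheses. It is an illustration of the weighted-committee idea only, not the multi-label AdaBoost.MH variant discussed in this paper; NumPy and all function names here are our own assumptions.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=10):
    """Binary AdaBoost sketch. Labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)      # example weights, updated each round
    ensemble = []                # list of (alpha, feature, threshold, polarity)
    for _ in range(n_rounds):
        best, best_err = None, np.inf
        # The weak learner scans every feature/threshold pair; this
        # exhaustive scan over the feature space is exactly the cost
        # that motivates reducing dimensionality before boosting.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] >= thr, 1, -1)
                    err = w[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (j, thr, pol)
        eps = max(best_err, 1e-10)           # avoid division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # weight of this weak hypothesis
        j, thr, pol = best
        pred = pol * np.where(X[:, j] >= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)       # upweight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(ensemble, X):
    """Weighted vote of all weak hypotheses in the committee."""
    score = np.zeros(len(X))
    for alpha, j, thr, pol in ensemble:
        score += alpha * pol * np.where(X[:, j] >= thr, 1, -1)
    return np.sign(score)
```

The reweighting step is where the "pivot" intuition enters: examples that earlier stumps misclassify gain weight, so later rounds concentrate on the hard cases, just as AdaBoost.MH concentrates on terms that earlier classifiers found hard.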

F. Gai et al.

Both methods mentioned above have to scan the whole feature space to select the pivot term or terms, which makes them sensitive to the number of features. For Text Categorization (TC), traditional methods use the Vector Space Model (VSM) with a Bag-of-Words (BOW) representation of the original corpus, forming a high-dimensional sparse matrix. This makes pivot selection a time-consuming task for large-scale datasets. To accelerate the boosting process, it is therefore necessary to use feature selection or feature extraction techniques,