Classification Using Various Machine Learning Methods and Combinations of Key-Phrases and Visual Features



1 Department of Computer Science, Jerusalem College of Technology - Lev Academic Center, 9116001 Jerusalem, Israel [email protected], [email protected]
2 Centre for Research and Technology Hellas, Information Technologies Institute, Thermi, Thessaloniki, Greece {dliparas,moumtzid,stefanos,ikom}@iti.gr

Abstract. In this paper, we present a comparative study of news document classification using various supervised machine learning methods and different combinations of key-phrases (word N-grams extracted from text) and visual features (extracted from a representative image from each document). The application domain is news documents written in English that belong to four categories: Health, Lifestyle-Leisure, Nature-Environment and Politics. Using the N-gram textual feature set alone yielded an accuracy of 81.0 %, which is much better than the accuracy (58.4 %) obtained with the visual feature set alone. A comparison of three classification methods, combined with a feature selection method and parameter tuning, improved the accuracy to 86.7 %, achieved by the Random Forests method.

Keywords: Document classification · Feature selection · Key-phrases · N-gram features · Supervised learning · Visual features



1 Introduction

In recent years, news agencies and newspapers have faced the challenge of automatically classifying news documents into a set of categories. This challenge becomes even more attractive when the documents contain not only text but also images. One such typical news document is depicted in Fig. 1. Moreover, in light of the explosion in the number of available news documents, fast and error-free classification of such documents is becoming more critical.

Classification using supervised learning is a task guided by a set of examples with class assignments; the goal is to assign documents to one or more predefined categories [1]. Many supervised machine learning (ML) methods have been applied to document classification. The classification models are built automatically from annotated corpora. Comprehensive overviews of classification are given in [2–4].

© Springer International Publishing Switzerland 2015 J. Cardoso et al. (Eds.): KEYWORD 2015, LNCS 9398, pp. 64–75, 2015. DOI: 10.1007/978-3-319-27932-9_6


Although many news documents include images in addition to text, most classification approaches use only textual data to build their models. It is therefore interesting to perform a comparative study of news document classification using different ML methods and different combinations of textual and visual feature sets, in order to see whether adding the visual features can improve classification performance.
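To make the textual side of this setup concrete, the following is a minimal sketch of supervised classification over word N-gram features. It is an illustration under stated assumptions, not the pipeline used in this study: the function names (word_ngrams, train, classify), the toy training sentences and the choice of a multinomial Naive Bayes classifier are all assumptions made here for brevity, whereas the abstract reports its best result with Random Forests. Visual features are not modelled.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, max_n=2):
    """Return all word N-grams of length 1..max_n from a lower-cased text."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


def train(labelled_docs, max_n=2):
    """Count N-gram occurrences per category from (text, category) pairs."""
    gram_counts = defaultdict(Counter)   # category -> N-gram -> count
    doc_counts = Counter()               # category -> number of documents
    vocabulary = set()
    for text, category in labelled_docs:
        grams = word_ngrams(text, max_n)
        gram_counts[category].update(grams)
        doc_counts[category] += 1
        vocabulary.update(grams)
    return gram_counts, doc_counts, vocabulary


def classify(text, model, max_n=2):
    """Pick the category with the highest Laplace-smoothed log-probability."""
    gram_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        total_grams = sum(gram_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for gram in word_ngrams(text, max_n):
            if gram not in vocabulary:
                continue  # ignore N-grams never seen in training
            count = gram_counts[category][gram]
            score += math.log((count + 1) / (total_grams + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy, invented training examples (assumption: not from the paper's corpus).
TRAINING = [
    ("new vaccine trial shows promising results for patients", "Health"),
    ("doctors warn about rising flu cases this winter", "Health"),
    ("parliament votes on the new budget proposal", "Politics"),
    ("the prime minister faces a vote in parliament", "Politics"),
]

model = train(TRAINING)
prediction = classify("flu vaccine results for patients", model)
print(prediction)
```

On this toy data the sketch assigns the query to Health, because all of its known unigrams and bigrams occur only in the Health examples. A real comparison would replace the toy sentences with an annotated corpus and evaluate accuracy on held-out documents.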

Fig. 1. Web-based news document from The Guardian entitled "The man in the digital mask" (http://www.theguardian.com/technology/2015/sep/10/the-man-in-the-digital-mask-bill-shannon)

In this paper, we explore domain-based classification of news