Classification Using Various Machine Learning Methods and Combinations of Key-Phrases and Visual Features

1 Department of Computer Science, Jerusalem College of Technology - Lev Academic Center, 9116001 Jerusalem, Israel [email protected], [email protected]
2 Centre for Research and Technology Hellas, Information Technologies Institute, Thermi, Thessaloniki, Greece {dliparas,moumtzid,stefanos,ikom}@iti.gr

Abstract. In this paper, we present a comparative study of news document classification using various supervised machine learning methods and different combinations of key-phrases (word N-grams extracted from text) and visual features (extracted from a representative image from each document). The application domain is news documents written in English that belong to four categories: Health, Lifestyle-Leisure, Nature-Environment and Politics. Using the N-gram textual feature set alone led to an accuracy of 81.0 %, which is much better than the accuracy (58.4 %) obtained using the visual feature set alone. A competition between three classification methods, a feature selection method, and parameter tuning led to improved accuracy (86.7 %), achieved by the Random Forests method.

Keywords: Document classification · N-gram features · Supervised learning · Feature selection · Visual features · Key-phrases
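To make the notion of key-phrases as word N-grams concrete, the following minimal sketch counts word N-grams in a short text. The tokenization rule (lowercasing, alphabetic tokens only), the helper name `word_ngrams`, and the range N = 1 to 2 are illustrative assumptions and are not taken from the study.

```python
import re
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Count word N-grams (n_min <= N <= n_max) in a text.

    Hypothetical helper: the tokenization rule is an assumption,
    not the procedure used in the study.
    """
    # Lowercase the text and keep runs of alphabetic characters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n consecutive tokens over the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample sentence, for illustration only.
    sample = "Climate change threatens forests. Climate change affects health."
    for phrase, freq in word_ngrams(sample).most_common(3):
        print(phrase, freq)
```

Note that this simple tokenizer ignores sentence boundaries, so an N-gram may span two sentences; a real pipeline might handle this differently.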

1 Introduction

In recent years, news agencies and newspapers have faced the challenge of automatically classifying news documents into a set of categories. This challenge becomes even more attractive when the documents contain not only text but also images. One such typical news document is depicted in Fig. 1. Moreover, in light of the explosion in the number of available news documents, fast and error-free classification of such documents is becoming more critical.

Classification using supervised learning is a task guided by a set of examples with class assignments; the goal is to assign documents to one or more predefined categories [1]. Many supervised machine learning (ML) methods have been applied to document classification. The classification models are built automatically from annotated corpora. Comprehensive overviews of classification are given by [2–4].

© Springer International Publishing Switzerland 2015 J. Cardoso et al. (Eds.): KEYWORD 2015, LNCS 9398, pp. 64–75, 2015. DOI: 10.1007/978-3-319-27932-9_6

Although many news documents include images in addition to text, most classification approaches use only textual data to build their models. It is therefore worthwhile to compare news document classification across different ML methods and different combinations of textual and visual feature sets, in order to see whether adding visual features improves classification performance.
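As an illustration of what a combination of feature sets can mean in practice, the sketch below enumerates every non-empty combination of named feature sets and builds one combined vector per document by concatenation (often called early fusion). This passage does not state how the feature sets are merged, so concatenation is an assumption here, and the function names and numeric values are hypothetical.

```python
from itertools import combinations


def combine_features(feature_sets, selected):
    """Concatenate the chosen feature vectors of one document.

    Concatenation (early fusion) is an assumed merging strategy.
    """
    combined = []
    for name in selected:
        combined.extend(feature_sets[name])
    return combined


def all_combinations(names):
    """Enumerate every non-empty combination of feature-set names."""
    result = []
    for size in range(1, len(names) + 1):
        result.extend(combinations(names, size))
    return result


if __name__ == "__main__":
    # Hypothetical document: the values are made up for illustration only.
    doc = {
        "textual": [2.0, 0.0, 1.0],  # e.g. N-gram frequencies
        "visual": [0.4, 0.9],        # e.g. image descriptor values
    }
    for combo in all_combinations(sorted(doc)):
        print(combo, combine_features(doc, combo))
```

Each combined vector could then be passed to any supervised classifier, which allows the textual-only, visual-only, and joint settings to be compared on equal terms.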

Fig. 1. Web-based news document from The Guardian entitled "The man in the digital mask" (http://www.theguardian.com/technology/2015/sep/10/the-man-in-the-digital-mask-bill-shannon)

In this paper, we explore domain-based classification of news
