Supervised Web Document Classification Using Discrete Transforms, Active Hypercontours and Expert Knowledge

1 Institute of Computer Science, Technical University of Lodz, Wolczanska 215, 93-005 Lodz, Poland
2 Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland

Abstract. In this paper, a new method of supervised classification of documents is proposed. It utilizes discrete transforms to extract features from classified objects and adopts adaptive potential active hypercontours (APAH) for document classification. The idea of APAH generalizes classic contour methods of image segmentation. It has two main advantages: it can use almost any knowledge during the search for an optimal classification function, and it can operate in a feature space where only a metric is defined. Here, both advantages are exploited: the first by using expert knowledge about the significance of documents from the training set, and the second by inducing new metrics in feature spaces. The method has been evaluated on a subset of the Open Directory Project (ODP) database and compared with k-NN, a well-known classification technique.

1 Introduction

The rapid development of Web Intelligence (WI) [1,2,3,4,5,6,7,8,9] technologies increases the amount of reliable knowledge available for improving the efficiency of many standard tasks in artificial intelligence, from which WI can benefit in turn. This makes it necessary either to create new methods that can effectively adapt knowledge coming from different sources or to modify existing techniques so that they satisfy Web Intelligence requirements. The presented approach combines experience gained in domains that have so far been considered separately, yielding mechanisms capable of utilizing external knowledge in an efficient and flexible way.

The paper is organized as follows: Section 2 states the problem of document classification, Section 3 describes integral spatial transformations using kernel methods for feature extraction, and Section 4 presents the adaptive potential active hypercontour algorithm used to construct an optimal classifier. The next two sections present the data used in the experiments and discuss the obtained results, respectively. The paper concludes with a summary of the proposed method.

N. Zhong et al. (Eds.): WImBI 2006, LNAI 4845, pp. 305–323, 2007. © Springer-Verlag Berlin Heidelberg 2007

P.S. Szczepaniak, A. Tomczyk, and M. Pryczek

2 Supervised and Unsupervised Document Classification

2.1 Classification

The classification problem can be formulated as the task of assigning a proper label l from a finite set of labels L (e.g. L = {1, ..., L}, where the upper bound L is the number of classes) to each object o from a given set of objects O. Such an assignment can be described formally as a classification function (classifier) k : O → L, under which each object o ∈ O receives a unique label l ∈ L. Because there are many functions k ∈ K that map O into L (where K denotes the set of all possible classifiers in a given problem) the problem of construction