Multilabel graph-based classification for missing labels


Yasunobu Sumikawa¹ · Tatsurou Miyazaki²

Received: 5 March 2019 / Revised: 17 August 2020 / Accepted: 23 September 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

Assigning multiple labels to digital data is becoming easier because Internet users can do it collaboratively. However, the process remains challenging, especially when several labels are assigned to each datum, because some suitable labels may be missed. Missing labels lead to inaccurate classification. In this study, we propose a novel graph-based multi-label classifier that stably achieves high accuracy even when labels are missing from the training data. The core of our algorithm smooths the label values of each training datum using its top-k most similar data: their label values are propagated and averaged to generate values for the missing labels. In the experimental evaluation, we used multi-labeled document and image datasets and measured micro-averaged F-scores for eight classifiers. Even as we incrementally removed correct labels from the two datasets, the proposed algorithm tended to maintain its F-scores, whereas the scores of the other classifiers decreased. In addition, we evaluated the algorithm on Wikipedia, a real dataset that includes missing labels, to determine how well it predicts the correct labels and how useful its predictions are as initial decisions for manual annotation. We confirmed that the proposed algorithm, LPAC, is useful not only for automatic annotation but also for facilitating decision making in the initial manual category assignment.

Keywords Multi-label classification · Label propagation · Digital document classification · Digital image classification
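The smoothing step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' LPAC implementation: the use of cosine similarity, the unweighted average over a datum and its neighbors, and the names `cosine` and `smooth_labels` are all assumptions made for illustration, and the toy data are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smooth_labels(features, labels, k=2):
    """Average each datum's label vector with those of its top-k most similar data.

    Hypothetical helper: a plain unweighted mean stands in for whatever
    propagation weights the actual algorithm uses.
    """
    n = len(features)
    smoothed = []
    for i in range(n):
        # Rank every other datum by similarity to datum i and keep the top k.
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(features[i], features[j]),
            reverse=True,
        )[:k]
        group = [labels[i]] + [labels[j] for j in neighbors]
        # Column-wise mean: one smoothed value per label.
        smoothed.append([sum(col) / len(group) for col in zip(*group)])
    return smoothed


# Toy data (invented): two clusters of three items, three labels.
# Item 1 is missing its second label, which its neighbors 0 and 2 carry.
X = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
     [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
Y = [[1, 1, 0], [1, 0, 0], [1, 1, 0],
     [0, 0, 1], [0, 0, 1], [0, 0, 1]]

smoothed = smooth_labels(X, Y, k=2)
print(smoothed[1])  # the second value rises from 0 to about 0.67
```

In this toy setting, thresholding the smoothed value at 0.5 would restore the missing label for item 1 while leaving the second cluster unchanged. The threshold is likewise an assumption, not a value taken from the paper.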

1 Introduction

Thanks to the growth of the Web and of digital archiving technology, we can now access numerous digital documents, images, and other types of data. This enhances our experience of using the Web; for example, it is easy to study the history of any country or to get an overall picture of the relationships between people. On the other hand, organizing digital data so that they can be accessed quickly is becoming increasingly demanding. Defining categories and dividing digital data into them play key roles in digital data organization. For example, categorizing digital documents is useful for constructing thematic timelines or event lists.

Yasunobu Sumikawa [email protected]
Tatsurou Miyazaki [email protected]

1 University Education Center, Tokyo Metropolitan University, Hachioji, Tokyo, Japan
2 Department of Information Sciences, Tokyo University of Science, Noda, Chiba, Japan

As the amount of data increases, the categorization schemes dynamically change due to the revision of the hierarchical structure and the definition of new categories. When these categorization schemes chan