STED-Net: Self-taught encoder-decoder network for unsupervised feature representation
Songlin Du 1,2 · Takeshi Ikenaga 3
Received: 9 December 2019 / Revised: 22 July 2020 / Accepted: 26 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Compared with the great success of supervised learning, e.g., convolutional neural networks (CNNs), unsupervised feature learning remains highly challenging because no training labels are available. Without labels for reference, the central difficulty is to reduce the gap between features and image semantics blindly. This paper proposes a Self-Taught Encoder-Decoder Network (STED-Net) for unsupervised feature learning, which consists of a representation sub-network and a classification sub-network. The representation sub-network maps images to feature representations. The classification sub-network takes these features and simultaneously maps them to a class representation and estimates pseudo labels by clustering them. By minimizing the distance between the class representation and the estimated pseudo labels, STED-Net teaches the features to represent class information. Through this self-taught feature representation, the gap between features and image semantics is reduced, and the features become increasingly "class-aware". The whole learning process of STED-Net does not refer to any ground-truth class labels. Experimental results on widely used image classification datasets show that STED-Net achieves state-of-the-art classification performance compared with existing supervised and unsupervised feature learning models.
Keywords Feature representation · Unsupervised learning · Self-taught learning · Autoencoder
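The self-taught loop described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-layer sub-networks, the plain k-means clustering, the use of cross-entropy as the "distance" between class representation and pseudo labels, and all names (`self_taught_step`, `W_rep`, `W_cls`, `kmeans_pseudo_labels`) are assumptions made only for illustration.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=10, seed=0):
    """Plain k-means (assumed clustering method); returns integer pseudo labels."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every center: shape (n, k).
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def self_taught_step(images, W_rep, W_cls, k, lr=0.1):
    """One hypothetical training step; no ground-truth labels are used."""
    # Representation sub-network (assumed: one tanh layer): images -> features.
    features = np.tanh(images @ W_rep)
    # Classification sub-network, part 1: pseudo labels from clustering the features.
    pseudo = kmeans_pseudo_labels(features, k)
    targets = np.eye(k)[pseudo]  # one-hot pseudo labels
    # Classification sub-network, part 2: features -> class representation.
    probs = softmax(features @ W_cls)
    # Assumed "distance": cross-entropy between class representation and pseudo labels.
    loss = -np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1))
    # Backpropagate through both sub-networks, treating pseudo labels as constants.
    dlogits = (probs - targets) / len(images)
    grad_cls = features.T @ dlogits
    dfeat = dlogits @ W_cls.T
    grad_rep = images.T @ (dfeat * (1.0 - features ** 2))
    return W_rep - lr * grad_rep, W_cls - lr * grad_cls, loss, pseudo
```

Calling `self_taught_step` repeatedly on flattened image batches re-estimates the pseudo labels from the current features at every step, which is the sense in which the features "teach themselves". A practical version would also need to handle the fact that cluster indices can permute between steps, which this sketch ignores.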
Songlin Du
[email protected]
1 School of Automation, Southeast University, Nanjing, 210096, China
2 Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, Nanjing, 210096, China
3 Graduate School of Information, Production and Systems, Waseda University, Kitakyushu, 808-0135, Japan
Multimedia Tools and Applications
1 Introduction
Effective feature representation, which aims to represent observation data in a lower-dimensional feature space, is one of the main reasons for the great successes achieved in machine learning and computer vision [3]. Feature representation is a fundamental technique for building intelligent systems, with applications such as document recognition [26], person re-identification [8], and salient object detection [42]. The past several decades have witnessed the rapid development of feature representation methods. Early feature representation methods [1, 2, 4, 15, 27] were manually designed to extract information from low-level image textures. For example, local binary pattern (LBP) [1] and scale-invariant feature transform (SIFT) [27] extract local features from images by calculating binary intensity differenc