Audiovisual cross-modal material surface retrieval



Zhuokun Liu1 • Huaping Liu2 • Wenmei Huang1 • Bowen Wang1 • Fuchun Sun2

Received: 25 November 2018 / Accepted: 29 August 2019
© Springer-Verlag London Ltd., part of Springer Nature 2019

Abstract Cross-modal retrieval has developed rapidly because it can process data across different modalities. Text and images alone sometimes cannot support a true and accurate analysis of a material, so we propose a system for audiovisual cross-modal retrieval on material surfaces. First, we use a local receptive fields-based extreme learning machine to extract sound and image features. The sound and image features are then mapped to a shared subspace using canonical correlation analysis, and retrieval is performed by Euclidean distance. Finally, the system implements the complete audiovisual cross-modal retrieval process. The experimental results show that the proposed system works well on wood. The designed system provides a new idea for research in the field of material identification.

Keywords Cross-modal retrieval · Local receptive fields-based extreme learning machine · Canonical correlation analysis · Material analysis
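The subspace mapping and retrieval steps named in the abstract can be illustrated with a minimal sketch. It is not the system described in this paper: the feature extractor is omitted, and synthetic vectors generated from a shared latent variable stand in for the sound and image features. The function names `fit_cca` and `retrieve`, the regularization constant `reg`, and all dimensions are assumptions chosen for illustration only. The sketch fits a linear canonical correlation analysis with NumPy, projects both modalities into the shared subspace, and ranks image candidates for each sound query by Euclidean distance.

```python
import numpy as np


def _inv_sqrt(c):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(c)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T


def fit_cca(x, y, n_components, reg=1e-6):
    """Fit linear CCA (hypothetical helper).

    Returns the two feature means, the two projection matrices,
    and the leading canonical correlations.
    """
    mx, my = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mx, y - my
    n = x.shape[0]
    # Regularized covariance of each modality and the cross-covariance.
    cxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    cyy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    cxy = xc.T @ yc / (n - 1)
    # Whiten both modalities, then take the SVD of the whitened cross-covariance.
    kx, ky = _inv_sqrt(cxx), _inv_sqrt(cyy)
    u, s, vt = np.linalg.svd(kx @ cxy @ ky)
    wx = kx @ u[:, :n_components]
    wy = ky @ vt.T[:, :n_components]
    return mx, my, wx, wy, s[:n_components]


def retrieve(query, gallery, top_k=1):
    """Indices of the top_k gallery rows nearest to each query row (Euclidean)."""
    dist = np.linalg.norm(query[:, None, :] - gallery[None, :, :], axis=2)
    return np.argsort(dist, axis=1)[:, :top_k]


# Synthetic stand-ins for extracted features: both modalities are noisy
# linear views of the same latent "material" variable z (an assumption).
rng = np.random.default_rng(0)
n, k = 100, 3
z = rng.normal(size=(n, k))
sound = z @ rng.normal(size=(k, 6)) + 0.01 * rng.normal(size=(n, 6))
image = z @ rng.normal(size=(k, 5)) + 0.01 * rng.normal(size=(n, 5))

mx, my, wx, wy, corr = fit_cca(sound, image, n_components=k)
sound_sub = (sound - mx) @ wx  # sound features in the shared subspace
image_sub = (image - my) @ wy  # image features in the shared subspace

# Sound-to-image retrieval: sample i should retrieve image i.
top1 = retrieve(sound_sub, image_sub)[:, 0]
accuracy = float(np.mean(top1 == np.arange(n)))
print("canonical correlations:", corr)
print("top-1 retrieval accuracy:", accuracy)
```

Because the two synthetic modalities share one latent variable, their projections nearly coincide in the subspace and nearest-neighbor search pairs them correctly. Real sound and image features are far noisier, so this toy accuracy says nothing about the performance of the proposed system.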

1 Introduction

With the growing complexity and diversity of multimedia information, cross-modal retrieval has become an important research topic worldwide because of its ability to process data in different modalities [1]. The relatively mature application areas of cross-modal retrieval are mainly computer vision, pattern recognition, and text–image retrieval. Most early research focuses on the two modalities of image and text. However, the color and texture features reflected by an image and the textual description of an object sometimes do not convey enough information. For example, when shopping online, we sometimes cannot form a complete impression of a product from the text and pictures available on the Internet, which can lead to purchasing goods that do not match our needs. In the field of seabed and

✉ Huaping Liu [email protected]
Zhuokun Liu [email protected]

1 State Key Laboratory of Reliability and Intelligence of Electrical Equipment, Tianjin 300130, China
2 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

space exploration, people may be unable to determine the material of an unknown object from the video and images returned by a camera alone, because these are easily affected by environmental factors. Adding the sound modality can sometimes compensate where text and image information is insufficient. In recent years, many methods have been used for extracting sound features, such as mel-frequency cepstral coefficients (MFCC) [2], linear predictive coding (LPC) [3], and hidden Markov models (HMM) [4]. Most of these feature extraction methods are hand-designed and do not have the