Context Change Detection for an Ultra-Low Power Low-Resolution Ego-Vision Imager

1 Università di Bologna, Bologna, Italy
{f.paci,l.benini}@unibo.it
2 Università di Modena e Reggio Emilia, Modena, Italy
{lorenzo.baraldi,giuseppe.serra,rita.cucchiara}@unimore.it
3 ETH Zürich, Zürich, Switzerland
[email protected]

Abstract. With the increasing popularity of wearable cameras, such as GoPro or Narrative Clip, research on continuous activity monitoring from egocentric cameras has received a lot of attention. Research in hardware and software is devoted to finding new efficient, stable and long-running solutions; however, devices are too power-hungry for truly always-on operation, and are aggressively duty-cycled to achieve acceptable lifetimes. In this paper we present a wearable system for context change detection based on an egocentric camera with ultra-low power consumption that can collect data 24/7. Although the resolution of the captured images is low, experimental results in real scenarios demonstrate how our approach, based on Siamese Neural Networks, can achieve visual context awareness. In particular, we compare our solution with hand-crafted features and with a state-of-the-art technique, and propose a novel and challenging dataset composed of roughly 30000 low-resolution images.

Keywords: Egocentric vision · ULP camera · Low-resolution · Deep learning

1 Introduction and Related Works

Understanding everyday life activities is gaining more and more attention in the research community. This has triggered a number of interesting applications, ranging from health monitoring, memory rehabilitation and lifestyle analysis to security and entertainment [13,14,31,32]. These are mainly based on two sources of data: sensor data and visual data. Sensor data, such as GPS, light, temperature and acceleration, have been extensively used for activity monitoring [15,22,25]: among others, Kwapisz et al. [17] describe how a smartphone can be used to perform activity recognition simply by keeping it in the pocket, and Guan et al. [11] present a semi-supervised learning algorithm for action understanding based on 40 accelerometers strapped loosely to common trousers.
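These sensor-based approaches typically share a simple structure: the raw signal is split into fixed-length windows, a few statistics are extracted per window, and a standard classifier maps each window to an activity label. The following Python sketch only illustrates that generic pattern under assumed choices (window length, feature set, random-forest classifier); it is not the method of Kwapisz et al. or Guan et al.

```python
# Illustrative sketch only, not the pipeline of any of the cited works: a generic
# accelerometer-based activity classifier that windows a 3-axis signal, extracts
# simple per-axis statistics and trains a standard classifier. Window length,
# feature set and classifier choice are assumptions made for this example.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(accel: np.ndarray, window: int = 128) -> np.ndarray:
    """Split a (T, 3) accelerometer trace into fixed-length windows and compute
    mean, standard deviation and peak magnitude for each axis."""
    n_windows = accel.shape[0] // window
    feats = []
    for i in range(n_windows):
        seg = accel[i * window:(i + 1) * window]
        feats.append(np.concatenate([seg.mean(axis=0),
                                     seg.std(axis=0),
                                     np.abs(seg).max(axis=0)]))
    return np.stack(feats)

# Usage with synthetic data standing in for a real recording:
rng = np.random.default_rng(0)
trace = rng.normal(size=(128 * 100, 3))      # fake 3-axis accelerometer stream
labels = rng.integers(0, 4, size=100)        # fake activity label per window
clf = RandomForestClassifier(n_estimators=50).fit(window_features(trace), labels)
predictions = clf.predict(window_features(trace))
```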

Although sensor data can be easily collected for days thanks to low energy consumption, its ability to capture complex activities and the context around the user is limited. On the other hand, computer vision can capture much richer contextual information, which has been successfully used to recognize more complex activities [1,18,29]. Recently, several works that consider vision tasks from the egocentric perspective have been presented. Poleg et al. [26] propose a temporal segmentation that identifies 12 different activities (e.g. head motion, sitting, walking). Castro et al. [5] present an approach based on the combination of a Convolutional Neural Network and a Random Decision Forest; this approach is able to recognize images automatically
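The approach summarized in the abstract instead targets context change detection with a Siamese Neural Network that compares pairs of low-resolution frames. The sketch below is only a rough illustration of that general idea under assumed choices (PyTorch, 64x64 grayscale inputs, layer sizes, a contrastive loss and a fixed decision threshold); it is not the authors' architecture or training setup.

```python
# Illustrative sketch only, not the authors' implementation: a small Siamese
# network that embeds two low-resolution grayscale frames with a shared encoder
# and signals a context change when the embedding distance is large. Input
# resolution, layer sizes, loss and threshold are assumptions for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseBranch(nn.Module):
    """Shared convolutional encoder applied to each frame of a pair."""
    def __init__(self, embedding_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(32 * 13 * 13, embedding_dim)  # sized for 64x64 inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.features(x).flatten(start_dim=1))

class SiameseContextDetector(nn.Module):
    """Embeds both frames with the same branch and compares the embeddings."""
    def __init__(self):
        super().__init__()
        self.branch = SiameseBranch()

    def forward(self, frame_a, frame_b):
        return F.pairwise_distance(self.branch(frame_a), self.branch(frame_b))

def contrastive_loss(distance, same_context, margin=1.0):
    """Pull same-context pairs together, push different-context pairs apart."""
    pos = same_context * distance.pow(2)
    neg = (1 - same_context) * F.relu(margin - distance).pow(2)
    return (pos + neg).mean()

# Usage on a batch of 64x64 grayscale frame pairs:
model = SiameseContextDetector()
frame_a, frame_b = torch.rand(8, 1, 64, 64), torch.rand(8, 1, 64, 64)
distance = model(frame_a, frame_b)
change_detected = distance > 0.5  # assumed decision threshold
```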