Perfect Accuracy with Human-in-the-Loop Object Detection

Abstract. Modern state-of-the-art computer vision systems still perform imperfectly on many benchmark object recognition tasks. This hinders their application to real-time tasks, where even a low but non-zero probability of error in analyzing every frame from a camera quickly accumulates to unacceptable performance for end users. Here we consider a visual aid that guides blind or visually-impaired persons in finding items in grocery stores using a head-mounted camera. The system uses a human-in-the-decision-loop approach: when an object is detected with low confidence, it instructs the user how to turn or move so as to improve the object's view captured by the camera, until the computer vision confidence exceeds the highest mistaken confidence observed during algorithm training. In experiments with 42 blindfolded participants reaching for 25 different objects randomly arranged on shelves 15 times, our system achieved 100% accuracy, with all participants selecting the goal object in all trials.

Keywords: Scene understanding · Quality of life technologies · Sensory substitution · Mobile and wearable systems · Applications for the visually impaired · Egocentric and first-person vision · Computer vision · Object detection
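The decision loop summarized in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the instruction vocabulary ("turn", "grasp"), and the per-frame detection format are all assumptions.

```python
# Sketch of the human-in-the-decision-loop guidance: a detection is
# trusted only when its confidence exceeds a safe threshold (the
# highest confidence the detector ever assigned to a mistake during
# training). All names here are illustrative assumptions.

def guide_user(frames, goal, threshold):
    """Issue 'turn' instructions until the goal object is detected with
    confidence above the safe threshold, then instruct the grasp."""
    instructions = []
    for label, confidence in frames:  # simulated per-frame detections
        if label == goal and confidence > threshold:
            instructions.append("grasp")  # trusted detection: guide the hand
            break
        instructions.append("turn")       # low confidence: request a better view
    return instructions

# Simulated detections as the user turns toward the shelf.
frames = [("cereal", 0.55), ("cereal", 0.78), ("cereal", 0.93)]
print(guide_user(frames, "cereal", threshold=0.90))  # ['turn', 'turn', 'grasp']
```

Because the loop only terminates on a detection above the threshold, a frame that never clears the bar simply produces another movement instruction, which matches the paper's claim that errors cannot silently accumulate.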

1 Introduction and Background

People who are blind have more difficulty navigating the world than those with sight, even in places they have visited before [8,23]; blindness affects 39 million people worldwide [32]. As technology has advanced, much progress has been made in developing electronic travel aids to assist them. One approach converts images to soundscapes, which some subjects can learn to interpret well enough to differentiate places and to identify and locate some objects [27]. Others localize the user in an environment using stereo cameras, accelerometers, and even WiFi access points [6,13]. Traditional aids have also been augmented: canes have electronic replacements that use, e.g., sonar to increase their warning range or provide the same feedback without a physical cane [20,31], and guide dogs have been replaced with robots [16]. Among these devices, many use computer vision to help with navigation, text reading, and object recognition [1,18–20,29]. Despite many advances in computer vision, even state-of-the-art algorithms have not yet achieved perfect accuracy on standard datasets [7,12,28].

© Springer International Publishing Switzerland 2016. G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part II, LNCS 9914, pp. 360–374, 2016. DOI: 10.1007/978-3-319-48881-3_25

Our algorithm's success is founded in the areas of dynamic thresholding and active vision [2]. Active vision is the process of changing views to better identify what is being looked at, either by changing the pose of the camera or by choosing a region of interest within a larger field of view and then attempting identification on a zoomed-in image of that region [3,5,11,30]. Dynamic thresholding is any recognitio
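The threshold criterion the abstract describes — accept a detection only when its confidence exceeds the highest confidence ever assigned to a mistaken detection during training — can be computed with a one-pass scan. The tuple layout `(predicted, true, confidence)` below is an assumption for illustration, not the authors' data format.

```python
# A minimal sketch of deriving the acceptance threshold: the bar is the
# largest confidence the detector assigned to any *mistaken* detection
# on the training set. The record layout is an illustrative assumption.

def mistake_threshold(training_detections):
    """Return the largest confidence among incorrect predictions,
    given records of (predicted_label, true_label, confidence)."""
    mistaken = [conf for pred, true, conf in training_detections
                if pred != true]
    return max(mistaken, default=0.0)

# Toy training run: one mistake at confidence 0.81 sets the bar.
dets = [("soup", "soup", 0.97), ("soup", "rice", 0.81), ("rice", "rice", 0.88)]
print(mistake_threshold(dets))  # 0.81
```

Setting the bar this way trades recall for precision: some correct detections below the threshold are rejected (prompting another movement instruction), but no detection weaker than the worst observed training mistake is ever acted upon.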