Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
1 Carnegie Mellon University, Pittsburgh, USA ([email protected])
2 Inria, Rocquencourt, France
3 University of Washington, Seattle, USA
4 The Allen Institute for AI, Seattle, USA
http://allenai.org/plato/charades/
Abstract. Computer vision has great potential to help in our daily lives: searching for lost keys, watering flowers, or reminding us to take a pill. To succeed at such tasks, computer vision methods need to be trained on real and diverse examples of our daily dynamic scenes. Because most of these scenes are not particularly exciting, they typically do not appear on YouTube, in movies, or in TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation, from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents. Each video is annotated by multiple free-text descriptions, action labels, action intervals, and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes, and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for the computer vision community.
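For concreteness, the sketch below shows one plausible way to represent and query per-video annotations of the kinds listed in the abstract (free-text descriptions, temporally localized action intervals, and interacted-object labels). The record layout, field names, class identifiers, and the actions_active_at helper are illustrative assumptions for this sketch, not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionInterval:
    # Hypothetical identifier for one of the 157 action classes.
    action_class: str
    start_sec: float
    end_sec: float

@dataclass
class VideoAnnotation:
    video_id: str
    descriptions: List[str] = field(default_factory=list)   # free-text
    actions: List[ActionInterval] = field(default_factory=list)
    objects: List[str] = field(default_factory=list)         # 46 object classes

    def actions_active_at(self, t: float) -> List[str]:
        """Return the action classes whose interval covers time t (seconds)."""
        return [a.action_class for a in self.actions
                if a.start_sec <= t <= a.end_sec]

# Example: a roughly 30-second video with two overlapping annotated actions.
ann = VideoAnnotation(
    video_id="ABC123",
    descriptions=["A person opens a fridge and takes out a bottle."],
    actions=[ActionInterval("open_fridge", 2.0, 7.5),
             ActionInterval("take_bottle", 5.0, 11.0)],
    objects=["fridge", "bottle"],
)
print(ann.actions_active_at(6.0))  # -> ['open_fridge', 'take_bottle']
```

Overlapping intervals in the example reflect a property the abstract implies: with 66,500 intervals over 9,848 videos, a single video typically contains multiple, possibly concurrent, annotated actions.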
1 Introduction
Large-scale visual learning fueled by huge datasets has changed the computer vision landscape [1,2]. Given the source of this data, it is not surprising that most of our current success is biased towards static scenes and objects in Internet images. As we move forward into the era of AI and robotics, however, new questions arise. How do we learn about different states of objects (e.g., cut vs. whole)? How do common activities affect changes of object states? In fact, it is not yet even clear whether the success of Internet-pretrained recognition models will transfer to real-world settings, where robots equipped with our computer vision models must operate. Shifting the bias from Internet images to real scenes will most likely require the collection of new large-scale datasets representing activities of our boring everyday life: getting up, getting dressed, putting groceries in the fridge, cutting vegetables, and so on. Such datasets will allow us to develop new representations and to learn