Tarasha Khurana¹, Achal Dave¹, Deva Ramanan¹,²
Monocular object detection and tracking have improved drastically in recent years, but rely on a key assumption: that objects are visible to the camera. Many offline tracking approaches reason about occluded objects post-hoc, by linking together tracklets after the object re-appears, making use of reidentification (ReID). However, online tracking in embodied robotic agents (such as a self-driving vehicle) fundamentally requires object permanence, which is the ability to reason about occluded objects before they re-appear. In this work, we re-purpose tracking benchmarks and propose new metrics for the task of detecting invisible objects, focusing on the illustrative case of people. We demonstrate that current detection and tracking systems perform dramatically worse on this task. We introduce two key innovations to recover much of this performance drop. First, we treat occluded object detection in temporal sequences as a short-term forecasting challenge, bringing to bear tools from dynamic sequence prediction. Second, we build dynamic models that explicitly reason in 3D, making use of observations produced by state-of-the-art monocular depth estimation networks. To our knowledge, ours is the first work to demonstrate the effectiveness of monocular depth estimation for the task of tracking and detecting occluded objects. In F1 score, our approach improves by 11.4% over the baseline in ablations and by 5.0% over the state of the art.
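The two ideas in the abstract, short-term forecasting and reasoning in 3D with estimated depth, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Track3D` state, the constant-velocity motion model, the `margin` threshold, and the toy depth map are all assumptions made for illustration. The sketch extrapolates a person's position and depth forward in time, then flags the person as occluded when the depth observed at the forecast pixel is clearly closer to the camera than the forecast depth.

```python
from dataclasses import dataclass


@dataclass
class Track3D:
    """Hypothetical track state: pixel position (x, y), depth z, and per-frame velocities."""
    x: float
    y: float
    z: float
    vx: float = 0.0
    vy: float = 0.0
    vz: float = 0.0


def forecast(track: Track3D, steps: int = 1) -> Track3D:
    """Extrapolate the state `steps` frames ahead under a constant-velocity assumption."""
    return Track3D(
        x=track.x + steps * track.vx,
        y=track.y + steps * track.vy,
        z=track.z + steps * track.vz,
        vx=track.vx,
        vy=track.vy,
        vz=track.vz,
    )


def is_occluded(track: Track3D, depth_map, margin: float = 0.5) -> bool:
    """Return True if something nearer than the forecast depth is observed at the track's pixel.

    `depth_map` is a row-major grid (depth_map[row][col]) of estimated depths.
    `margin` is an assumed tolerance for depth-estimation noise.
    No bounds checking is done in this sketch.
    """
    row, col = int(round(track.y)), int(round(track.x))
    observed_depth = depth_map[row][col]
    return observed_depth < track.z - margin


# Toy example: a person moving right and toward the camera.
track = Track3D(x=2.0, y=1.0, z=10.0, vx=1.0, vy=0.0, vz=-0.5)
predicted = forecast(track, steps=2)

# 3x6 depth map of a far background, with a near occluder at the forecast pixel.
depth_map = [[20.0] * 6 for _ in range(3)]
depth_map[1][4] = 5.0

occluded = is_occluded(predicted, depth_map)
```

In this toy setup the forecast lands at pixel (4, 1) with depth 9.0, while the depth map reports 5.0 there, so the person is treated as hidden behind a nearer object rather than as having left the scene. A practical system would replace the constant-velocity step with a proper filter and handle image bounds and depth-scale ambiguity.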
We have also publicly released our 5-min talk at ICCV 2021 and the accompanying supplementary video for qualitative analysis of the proposed method.
We thank Gengshan Yang for his help with generating 3D visuals, Patrick Dendorfer for incorporating our metrics with the MOT challenge server, and Xueyang Wang for sharing the low-resolution version of the PANDA dataset. We thank Laura Leal-Taixé and Simon Lucey for insightful discussions, the participants of the human vision experiment (Adithya Murali, Jason Zhang, Jessica Lee, Kushagra Mahajan, Mehar Khurana, Radhika Kannan, Rashmi Salamani, Steve Yadlowsky, Vaishaal Shankar, and Vidhi Jain), and the internal reviewers at CMU (Alireza Golestaneh, David Held, Jack Li, Kangle Deng, and Yi-ting Chen) for reviewing early drafts. This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research, the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0051, and the National Science Foundation (NSF) under grant number IIS-1618903.