Predicting Long-horizon Futures by Conditioning on Geometry and Time

Carnegie Mellon University

TL;DR

We learn to forecast diverse and long-horizon future geometries.
We do this with 2D diffusion models trained on only 1000 pseudo-depth sequences of in-the-wild dynamic scenes.


Given three past pseudo-depth frames, our method can generate an arbitrary future frame, queried by its timestamp. Here, we show long-horizon forecasting predictions from our method for up to 10 s into the future (10 frames at 1 FPS) on the TAO dataset.
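As a rough sketch of how such timestamp conditioning can be wired up, the toy denoiser below concatenates the noisy future depth, the three past pseudo-depth frames, and the query timestamp broadcast as an extra image channel. The architecture and names are illustrative assumptions, not the released model.

import torch
import torch.nn as nn

class TimestampConditionedDenoiser(nn.Module):
    """Toy denoiser conditioned on past pseudo-depth frames and a continuous
    future timestamp. Illustrative only; not the paper's architecture."""
    def __init__(self, ch=32):
        super().__init__()
        # channels: 1 noisy future depth + 3 past depths + 1 timestamp map
        self.net = nn.Sequential(
            nn.Conv2d(5, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, noisy_future, past, t_future):
        b, _, h, w = noisy_future.shape
        # broadcast the scalar query timestamp into an image-sized channel
        t_map = t_future.view(b, 1, 1, 1).expand(b, 1, h, w)
        return self.net(torch.cat([noisy_future, past, t_map], dim=1))

denoiser = TimestampConditionedDenoiser()
past = torch.randn(1, 3, 64, 64)                  # three past pseudo-depth frames
noisy = torch.randn(1, 1, 64, 64)                 # current diffusion latent
eps = denoiser(noisy, past, torch.tensor([7.0]))  # query the frame at t = 7 s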


Abstract

Our work explores the task of generating future sensor observations conditioned on the past. We are motivated by "predictive coding" concepts from neuroscience as well as robotic applications such as self-driving vehicles. Predictive video modeling is challenging because the future may be multi-modal and learning at scale remains computationally expensive for video processing. To address both challenges, our key insight is to leverage the large-scale pretraining of image diffusion models, which can handle multi-modality. We repurpose image models for video prediction by conditioning on new frame timestamps. Such models can be trained with videos of both static and dynamic scenes. To allow them to be trained with modestly-sized datasets, we introduce invariances by factoring out illumination and texture, forcing the model to predict (pseudo) depth, readily obtained for in-the-wild videos via off-the-shelf monocular depth networks. In fact, we show that simply modifying networks to predict grayscale pixels already improves the accuracy of video prediction. Given the extra controllability with timestamp conditioning, we propose sampling schedules that work better than the traditional autoregressive and hierarchical sampling strategies. Motivated by probabilistic metrics from the object forecasting literature, we create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes and a large vocabulary of objects. Our experiments illustrate the effectiveness of learning to condition on timestamps, and show the importance of predicting the future with invariant modalities.
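The pseudo-depth targets can be extracted per frame with any off-the-shelf monocular depth network. Below is a minimal sketch using MiDaS via torch.hub; MiDaS and the file name frame_0001.jpg are assumptions for illustration, not necessarily the network or data used in the paper.

import cv2
import torch

# Off-the-shelf monocular depth via torch.hub (MiDaS as one example).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("frame_0001.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical frame
with torch.no_grad():
    pred = midas(transforms.small_transform(img))       # (1, H', W') relative depth
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()                                          # pseudo-depth at input resolution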


Video



Diverse futures

Since our diffusion model learns the multi-modal distribution of possible futures, we can sample multiple futures given the same input. Here, we show 5 different futures for the same past frames. Gray: past input, Blue: future predictions.
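The diversity comes from the stochasticity of the reverse diffusion: re-running it from independent Gaussian noise yields a different plausible future each time. Below is a bare-bones DDPM-style sampler over the toy denoiser sketched earlier; the noise schedule and step count are arbitrary choices, not the paper's settings.

import torch

@torch.no_grad()
def sample_future(denoiser, past, t_future, steps=50):
    # plain DDPM reverse process; schedule values are arbitrary
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn(past.shape[0], 1, *past.shape[-2:])      # start from pure noise
    for i in reversed(range(steps)):
        eps = denoiser(x, past, t_future)                     # predicted noise
        x = (x - betas[i] / (1 - alphas_bar[i]).sqrt() * eps) / (1 - betas[i]).sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)     # injected noise -> diverse samples
    return x

past = torch.randn(1, 3, 64, 64)
futures = [sample_future(denoiser, past, torch.tensor([5.0])) for _ in range(5)]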






Comparison between sampling schedules

Since our diffusion model conditions on a continuous-valued timestamp, we can generate sequences of future depth in multiple ways. Traditionally, autoregressive and hierarchical sampling have been the most widely used, but we introduce two more ways to sample the future: direct and mixed sampling. Mixed sampling performs the best. From left to right, we show autoregressive, hierarchical, direct, and mixed sampling. Gray: past input, Blue: future predictions.
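One way to see the difference between the schedules is as generation plans: which future timestamps are generated in which order, and which frames each one conditions on. The sketch below writes each plan as a list of (conditioning frames, frame to generate) pairs; the conditioning windows and the particular mixed split are illustrative assumptions, not the paper's exact recipe.

def autoregressive(past, horizon):
    # always predict the next timestamp, conditioning on the most recent frames
    plan, ctx = [], list(past)
    for t in range(1, horizon + 1):
        plan.append((ctx[-3:], t))
        ctx.append(t)
    return plan

def hierarchical(past, horizon):
    # coarse-to-fine: distant keyframes first, then infill between them
    keyframes = [horizon, horizon // 2]
    plan = [(list(past), k) for k in keyframes]
    plan += [(list(past) + sorted(keyframes), t)
             for t in range(1, horizon + 1) if t not in keyframes]
    return plan

def direct(past, horizon):
    # predict every future timestamp independently from the past alone
    return [(list(past), t) for t in range(1, horizon + 1)]

def mixed(past, horizon):
    # illustrative blend: direct sampling for the near half, then autoregressive
    half = horizon // 2
    plan, ctx = direct(past, half), list(past) + list(range(1, half + 1))
    for t in range(half + 1, horizon + 1):
        plan.append((ctx[-3:], t))
        ctx.append(t)
    return plan

print(autoregressive([-2, -1, 0], 5))   # e.g. [([-2, -1, 0], 1), ([-1, 0, 1], 2), ...]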



Acknowledgements

We thank Nupur Kumari and Jason Zhang for insightful comments on this work. Tarasha Khurana is supported by funding from Bosch Research.

BibTeX

@misc{khurana2024predicting,
    title={{Predicting Long-horizon Futures by Conditioning on Geometry and Time}},
    author={Tarasha Khurana and Deva Ramanan},
    year={2024},
    eprint={2404.11554},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}