Zhoutong Zhang Thesis Defense: Pursuing Mid-Level Vision from Casual Videos


Zhoutong Zhang


Zoom link: https://mit.zoom.us/j/93539750472

Given the growing abundance and availability of casually shot videos, we ask whether this type of data is helpful for solving mid-level vision problems. In this defense, I'll introduce our recent progress on this question.

We start with the sub-problem of reconstructing consistent and geometrically correct depth maps from monocular videos with arbitrary object motion. This is challenging for both traditional and deep-learning-based methods: traditional methods usually struggle to recover depth for the dynamic parts of the scene, while deep-learning-based methods, though able to give plausible depth for every pixel in each frame, suffer from temporal inconsistencies. Our solution predicts consistent depth maps by explicitly modeling motion, without relying on motion segmentation as input.

We then proceed to the problem of recovering camera poses, which are required as input to our previous method. State-of-the-art methods often fail on casually shot videos, mostly due to the challenges of handling moving objects and limited camera baselines. I'll present our solution to these challenges: a method that jointly recovers camera poses, focal lengths, and plausible depth maps using only video frames as input.

We end the talk by showing how these two methods can help solve mid-level vision problems. We first introduce the notion of video canonicalization, a framework that is helpful both for analyzing previous work and for constructing new methods. In particular, we show a method, designed under this framework, that solves the video version of the checker shadow illusion, a typical mid-level vision problem. This method takes our inferred camera poses and depths as input and performs a self-supervised decomposition that separates the checker and the shadow from the board. Finally, we hypothesize that video canonicalization can be used as a large-scale pre-training task for mid-level vision. In ongoing work, we show positive signals that support this hypothesis.