CoNeRF is the first method that enables explicit control of generated images. We can easily synchronize metronomes beating at different tempos, stabilize the camera, change facial expressions, and much more!
We present generated video sequences from models trained on the datasets used in the paper. We directly compare visualizations generated by our method with the baselines: Ours-$\mathcal{M}$ and HyperNeRF+$\pi$. We generate each sequence by interpolating each attribute between its extreme values ($-1$ and $+1$) and then between randomly sampled values. At the same time, we freely orbit the camera around the central object. Our method generates the most realistic images while providing the expected controllability of the output.
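The sequence generation described above can be sketched as follows. This is a minimal illustration, not the code used to produce the videos: the function names (`attribute_schedule`, `orbit_camera`), the linear interpolation, the circular orbit, and all default parameters are assumptions made for the example. It only builds the per-frame attribute values and camera positions that a trained model would then be asked to render.

```python
import math
import random


def lerp(a, b, t):
    """Linearly interpolate between a and b for t in [0, 1]."""
    return (1.0 - t) * a + t * b


def attribute_schedule(num_random_targets=3, steps_per_segment=10, seed=0):
    """Per-frame values for one attribute (hypothetical helper).

    The attribute first sweeps between its extremes (-1 to +1) and then
    moves between randomly sampled values in [-1, 1].
    """
    rng = random.Random(seed)
    keyframes = [-1.0, 1.0]
    keyframes += [rng.uniform(-1.0, 1.0) for _ in range(num_random_targets)]

    values = []
    for start, end in zip(keyframes[:-1], keyframes[1:]):
        for i in range(steps_per_segment):
            values.append(lerp(start, end, i / steps_per_segment))
    values.append(keyframes[-1])  # close the last segment
    return values


def orbit_camera(num_frames, radius=1.0, height=0.0):
    """Camera positions on a circle around the origin (assumed orbit shape)."""
    positions = []
    for i in range(num_frames):
        angle = 2.0 * math.pi * i / num_frames
        positions.append(
            (radius * math.cos(angle), height, radius * math.sin(angle))
        )
    return positions


# Pair every frame's attribute value with a camera position.
attributes = attribute_schedule()
cameras = orbit_camera(len(attributes))
frames = list(zip(attributes, cameras))
```

Each entry of `frames` holds one attribute value and one camera position. A scene with several controllable attributes would need one schedule per attribute, for example by calling `attribute_schedule` with a different `seed` for each.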
Our approach also enables resynchronization of metronomes beating at different rates while keeping the original camera motion.
Training sequence
Validation sequence
We extend neural 3D representations to allow for intuitive and interpretable user control beyond novel view rendering (i.e., camera control). The user annotates which part of the scene to control with just a small number of mask annotations in the training images. Our key idea is to treat the attributes as latent variables that are regressed by the neural network given the scene encoding. This leads to a few-shot learning framework in which attributes are discovered automatically when annotations are not provided. We apply our method to various scenes with different types of controllable attributes (e.g., expression control on human faces, or state control in the movement of inanimate objects). Overall, we demonstrate, to the best of our knowledge for the first time, novel view and novel attribute re-rendering of scenes from a single video.
@inproceedings{kania2022conerf,
  title     = {{CoNeRF: Controllable Neural Radiance Fields}},
  author    = {Kania, Kacper and Yi, Kwang Moo and Kowalski, Marek and Trzci{\'n}ski, Tomasz and Tagliasacchi, Andrea},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year      = {2022}
}
We thank Thabo Beeler, JP Lewis, and Mark J. Matthews for their fruitful discussions, and Daniel Rebain for helping with processing the synthetic dataset. The work was partly supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), Compute Canada, and Microsoft Mixed Reality & AI Lab.