GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion

¹Technical University of Munich   ²Toyota Motor Europe NV/SA   ³Woven by Toyota

Given a short, monocular video captured by a commodity device such as a smartphone, GAF reconstructs a 3D Gaussian head avatar, which can be reanimated and rendered into photorealistic novel views. Our key idea is to distill the reconstruction constraints from a multi-view head diffusion model in order to extrapolate to unobserved views and expressions.

Abstract

We propose a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices like smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging due to limited observations, which leaves unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model, leveraging its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we use normal maps rendered from FLAME-based head reconstruction, which provides pixel-aligned inductive biases. We also condition the diffusion model on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. We evaluate our method on the NeRSemble dataset, showing that GAF outperforms previous state-of-the-art methods in novel view synthesis and novel expression animation. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.
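The abstract replaces score distillation with iteratively denoised pseudo-ground truths. A minimal sketch of that strategy follows; all objects and method names below are hypothetical stand-ins, not the authors' released API. The diffusion model is run for several denoising steps, the resulting latents are upsampled, and only then decoded into pseudo-ground-truth images:

```python
import torch

@torch.no_grad()
def pseudo_ground_truth(diffusion, upsampler, vae, cond, num_views=4, steps=25):
    """Sketch: iteratively denoise multi-view latents into pseudo-GT images."""
    latents = torch.randn(num_views, 4, 64, 64)          # one noisy latent per target view
    for t in diffusion.timesteps(steps):                 # e.g. a DDIM schedule
        eps = diffusion.predict_noise(latents, t, cond)  # conditioned noise prediction
        latents = diffusion.denoise_step(latents, eps, t)
    latents = upsampler(latents)                         # refine latents before decoding
    return vae.decode(latents)                           # pseudo-GT RGB images
```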

Video

Method Overview

Given a sequence of RGB images from a monocular camera \( \mathcal{I} = \{ \mathbf{I}_i \} \), our objective is to reconstruct dynamic head avatars by optimizing an animatable Gaussian splatting representation \( \mathcal{O} \), which is deformed to each frame as \( \mathcal{O}_i \) by the tracked FLAME mesh \( \mathcal{M}_i \) of \( \mathbf{I}_i \). We optimize \( \mathcal{O} \) by minimizing an input-view reconstruction loss \( \mathcal{L}_{rec} \) plus a view sampling loss \( \mathcal{L}_{view} \). \( \mathcal{L}_{view} \) compares novel-view renderings \( \mathbf{I}_i^{view} \) of \( \mathcal{O}_i \) from four random viewpoints with pseudo-ground truths \( \mathbf{\hat{I}}_i^{view} \) predicted by a multi-view head latent diffusion model. The \( \mathbf{\hat{I}}_i^{view} \) are generated by iteratively denoising 4-view latents, guided by the input image \( \mathbf{I}_i \) and normal maps \( \mathbf{N}_i \) rendered from \( \mathcal{M}_i \). A latent upsampler module enhances facial details before the denoised latents are decoded into RGB images.
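A minimal sketch of one optimization step as described above, with both losses simplified to L1; every function and attribute name here is an illustrative placeholder, not the actual implementation:

```python
import torch
import torch.nn.functional as F

def optimization_step(gaussians, flame_mesh, image, cameras, prior, optimizer):
    """One iteration: L_rec on the input view + L_view on four sampled views."""
    deformed = gaussians.deform(flame_mesh)               # O_i, deformed by tracked FLAME mesh M_i
    l_rec = F.l1_loss(deformed.render(cameras.input_view), image)
    views = cameras.sample_random(4)                      # four random novel viewpoints
    renders = torch.stack([deformed.render(v) for v in views])
    with torch.no_grad():                                 # pseudo-GTs are fixed targets
        pseudo_gt = prior.pseudo_ground_truth(image, flame_mesh, views)
    l_view = F.l1_loss(renders, pseudo_gt)
    loss = l_rec + l_view
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```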

Comparisons against Baselines

Compared to state-of-the-art methods, our approach reconstructs facial regions unseen in the input views, such as the sides of the head, and produces more faithful, view-consistent renderings from held-out views.

Result Gallery

Ablation Studies

Ablation studies on different types of diffusion priors. (a) Input; (b) ground truth; method variants: (c) no diffusion; (d) pretrained Stable Diffusion; (e) personalized Stable Diffusion; (f) pose-conditioned multi-view diffusion; (g) our multi-view diffusion with a Score Distillation Sampling (SDS) loss; (h) ours without the \( \times 2 \) latent upsampler; (i) ours. Our normal-map-conditioned multi-view diffusion priors enable more photorealistic, identity- and appearance-consistent novel views by constraining them with pseudo-image ground truths decoded from iteratively denoised and upsampled latents.
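For context on variant (g): the standard Score Distillation Sampling gradient (introduced in DreamFusion; notation adapted here) optimizes the avatar parameters \( \theta \) directly through the diffusion model's noise prediction,

\[
\nabla_{\theta} \mathcal{L}_{SDS} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_{\phi}(\mathbf{x}_t; t, c) - \epsilon\big)\,\frac{\partial \mathbf{x}}{\partial \theta} \right],
\]

where \( \mathbf{x} \) is a novel-view rendering, \( \mathbf{x}_t \) its noised version at timestep \( t \), \( c \) the conditioning, and \( w(t) \) a weighting function. Because this mode-seeking gradient is commonly associated with over-saturated colors, supervising renderings with iteratively denoised pseudo-ground truths instead helps avoid that failure mode.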

Results of Multi-view Head Diffusion

Given a single image as input, our Multi-view Head Latent Diffusion model generates identity-preserving, view-consistent multi-view portrait images.
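As a rough sketch of how the two conditioning signals named in the abstract might be assembled for such a generation (function and attribute names are hypothetical, not the released implementation):

```python
import torch

def build_conditioning(vae, renderer, input_image, flame_mesh, view_cams):
    """Sketch: assemble conditioning for the multi-view head diffusion model."""
    # Pixel-aligned viewpoint control: FLAME normal maps rendered per target view.
    normals = torch.stack([renderer.render_normals(flame_mesh, cam)
                           for cam in view_cams])
    # Identity and appearance preservation: VAE features of the input image.
    id_feats = vae.encode(input_image)
    return {"normals": normals, "identity": id_feats}
```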

BibTeX

@article{tang2024gaf,
  title={GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion},
  author={Tang, Jiapeng and Davoli, Davide and Kirschstein, Tobias and Schoneveld, Liam and Niessner, Matthias},
  journal={arXiv preprint},
  year={2024}
}