We propose GAF, a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices such as smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging because the limited observations leave unobserved regions under-constrained, which can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model and leverage its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we condition the diffusion model on normal maps rendered from a FLAME-based head reconstruction, which provide pixel-aligned inductive biases. We also condition it on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill the multi-view diffusion priors by using iteratively denoised images as pseudo-ground-truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. We evaluate our method on the NeRSemble dataset, showing that GAF outperforms previous state-of-the-art methods in novel view synthesis and novel expression animation. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.
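To make the conditioning concrete, the following is a minimal, illustrative sketch (not the actual implementation) of how the pixel-aligned inputs described above could be assembled for the multi-view diffusion model: FLAME normal maps and VAE features of the input image are brought to the latent resolution and concatenated channel-wise with the noisy 4-view latents. The tensor shapes, the stand-in encoders, and the concatenation strategy are assumptions made for illustration only.

```python
# Illustrative sketch (assumed, not the paper's code) of assembling pixel-aligned
# conditioning for a multi-view head latent diffusion model.
import torch
import torch.nn as nn

B, V, C, H, W = 1, 4, 3, 512, 512          # batch, number of views, RGB channels, resolution
latent_ch, latent_res = 4, H // 8           # Stable-Diffusion-style latent size (assumption)

# Placeholder inputs: the input-view image and per-view FLAME normal maps.
input_image = torch.rand(B, C, H, W)
normal_maps = torch.rand(B, V, C, H, W)

# Stand-in encoders producing latent-resolution features (the real model would
# use a pretrained VAE encoder and a dedicated normal-map encoder).
vae_encode = nn.Conv2d(C, latent_ch, kernel_size=8, stride=8)
normal_encode = nn.Conv2d(C, latent_ch, kernel_size=8, stride=8)

input_feat = vae_encode(input_image)                        # (B, 4, 64, 64) identity/appearance features
normal_feat = normal_encode(normal_maps.flatten(0, 1))      # (B*V, 4, 64, 64) pixel-aligned viewpoint cues

# Noisy 4-view latents to be denoised, conditioned on identity and viewpoint cues.
noisy_latents = torch.randn(B * V, latent_ch, latent_res, latent_res)
identity_feat = input_feat.repeat_interleave(V, dim=0)      # broadcast input-view features to all views

# Channel-wise concatenation is one simple way to inject pixel-aligned conditions.
denoiser_input = torch.cat([noisy_latents, normal_feat, identity_feat], dim=1)
print(denoiser_input.shape)  # torch.Size([4, 12, 64, 64])
```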
Given a sequence of RGB images \( \mathcal{I} = \{ \mathbf{I}_i \} \) from a monocular camera, our objective is to reconstruct a dynamic head avatar by optimizing an animatable Gaussian splatting representation \( \mathcal{O} \), which is deformed to each frame as \( \mathcal{O}_i \) by the tracked FLAME mesh \( \mathcal{M}_i \) of \( \mathbf{I}_i \). We optimize \( \mathcal{O} \) by minimizing an input-view reconstruction loss \( \mathcal{L}_{rec} \) together with a view sampling loss \( \mathcal{L}_{view} \). \( \mathcal{L}_{view} \) compares novel-view renderings \( \mathbf{I}_i^{view} \) of \( \mathcal{O}_i \) from four randomly sampled viewpoints against pseudo-ground-truths \( \mathbf{\hat{I}}_i^{view} \) predicted by a multi-view head latent diffusion model. The pseudo-ground-truths \( \mathbf{\hat{I}}_i^{view} \) are generated by iteratively denoising 4-view latents, guided by the input image \( \mathbf{I}_i \) and the normal maps \( \mathbf{N}_i \) rendered from \( \mathcal{M}_i \). A latent upsampler module enhances facial details before the denoised latents are decoded into RGB images. A sketch of this two-term optimization is given below.
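The following is a minimal sketch of the optimization described above, assuming stubbed implementations of the Gaussian renderer, the FLAME-driven deformation, and the diffusion-based pseudo-ground-truth generator; the function names, the L1 photometric terms, and the loss weighting are illustrative placeholders rather than the paper's exact formulation.

```python
# Sketch of optimizing the avatar O with L_rec (input view) + L_view (sampled views).
# All modules below are stubs standing in for the real renderer / diffusion model.
import torch

H = W = 512
gaussian_params = torch.randn(10_000, 14, requires_grad=True)   # canonical avatar O (stub parameterization)
optimizer = torch.optim.Adam([gaussian_params], lr=1e-3)

def deform(params, flame_mesh):
    """Deform canonical Gaussians O to frame i via the tracked FLAME mesh M_i (stub)."""
    return params

def render(params, viewpoint):
    """Differentiable Gaussian splatting rendering from a viewpoint (stub RGB image)."""
    return torch.sigmoid(params.mean()) * torch.ones(3, H, W)

def diffusion_pseudo_gt(input_image, normal_maps):
    """Iteratively denoise 4-view latents conditioned on I_i and N_i, then upsample and decode (stub)."""
    return torch.rand(4, 3, H, W)

# Placeholder per-frame inputs: image I_i, tracked mesh M_i, and normal maps N_i for 4 sampled views.
input_image = torch.rand(3, H, W)
flame_mesh = None
normal_maps = torch.rand(4, 3, H, W)
input_view, novel_views = 0, range(1, 5)

for step in range(100):
    frame_gaussians = deform(gaussian_params, flame_mesh)        # O_i

    # Input-view reconstruction loss L_rec.
    recon = render(frame_gaussians, input_view)
    loss_rec = (recon - input_image).abs().mean()

    # View sampling loss L_view against diffusion pseudo-ground-truths \hat{I}_i^{view}.
    pseudo_gt = diffusion_pseudo_gt(input_image, normal_maps)
    renders = torch.stack([render(frame_gaussians, v) for v in novel_views])
    loss_view = (renders - pseudo_gt).abs().mean()

    loss = loss_rec + 0.1 * loss_view   # the relative weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Treating the iteratively denoised images as fixed pseudo-ground-truths (rather than backpropagating through the diffusion model, as in score distillation) is what allows a plain photometric loss to be used on the sampled views.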