DPHMs: Diffusion Parametric Head Models for Depth-based Tracking

1Technical University of Munich 2Sony Semiconductor Solutions Europe

3Technical University of Darmstadt

We present DPHMs, diffusion parametric head models for robust head reconstruction and expression tracking from monocular depth sequences. Leveraging the DPHM diffusion prior, we effectively constrain the identity and expression codes to lie on the underlying latent manifold when fitting to the noisy and partial observations of commodity depth sensors.

Abstract

We introduce Diffusion Parametric Head Models (DPHMs), a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models, such as NPHMs, excel at representing high-fidelity head geometries, tracking and reconstructing heads from real-world single-view depth sequences remains very challenging, as fitting to partial and noisy observations is under-constrained. To tackle these challenges, we propose a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. This prior-based regularizer effectively constrains the identity and expression codes to lie on the underlying latent manifold that represents plausible head shapes. To evaluate the effectiveness of the diffusion-based prior, we collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions. We compare our method to state-of-the-art tracking methods and demonstrate improved head identity reconstruction as well as robust expression tracking.

Video

DPHM Kinect Dataset

We collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions.

Comparisons with Baselines

Our approach reconstructs realistic head avatars with hair and accurately captures intricate facial expressions such as extreme mouth and eyelid movements.

Result Gallery

Overview

Given a sequence of depth maps \( I \) of \( N \) frames, our objective is to reconstruct a full-head avatar \( O \) including its expression transitions. To achieve this, we optimize the parametric latents \( \mathcal{Z} = \{ \mathbf{z}^{id}, \mathbf{z}^{ex}_1, \ldots, \mathbf{z}^{ex}_N \} \) of NPHM, which are decoded into continuous signed distance fields \( O \) by identity and expression decoders. To align with the observations, we compute data terms \( L_{sdf} \) and \( L_{norm} \) between \( I \) and \( O \). However, the high noise level of commodity depth sensors still makes navigating the latent optimization extremely challenging. At the core of our method is an effective latent regularization using diffusion priors: we add Gaussian noise to \( \mathcal{Z} \) and pass the perturbed latents into identity and expression diffusion models to predict the injected noise \( \boldsymbol{\epsilon} \), which is used to update \( \mathcal{Z} \). The diffusion regularizer guides \( \mathbf{z}^{id} \) and \( \mathbf{z}^{ex}_i \) toward the manifolds of their respective distributions via \( \boldsymbol{\epsilon}^{id} \) and \( \boldsymbol{\epsilon}^{ex} \), ensuring plausible head geometry reconstruction and robust tracking. To enhance temporal coherence, \( L_{temp} \) penalizes inconsistency between the expression codes \( \mathbf{z}^{ex}_i \) of nearby frames.
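Putting these terms together, fitting can be viewed as minimizing a weighted objective over the latents. As a sketch only (the weights \( \lambda \) and the exact form of the diffusion regularizer \( L_{reg} \) are illustrative placeholders; see the paper for the precise formulation):

\[
\min_{\mathcal{Z}} \;\; \lambda_{sdf}\, L_{sdf} \;+\; \lambda_{norm}\, L_{norm} \;+\; \lambda_{temp}\, L_{temp} \;+\; \lambda_{reg}\, L_{reg},
\]

where the gradient of \( L_{reg} \) with respect to a latent code is taken to be the noise residual \( \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t) - \boldsymbol{\epsilon} \) predicted by the corresponding diffusion model, analogous to score-distillation-style guidance.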

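Below is a minimal PyTorch sketch of this fitting loop under the above assumptions. Here `decode_sdf`, `sample_depth_points`, `id_model`, and `ex_model` are hypothetical placeholders for the NPHM-style decoders, the depth back-projection, and the trained identity/expression denoisers; the weights and schedule are illustrative, not the paper's values.

```python
import torch

def diffusion_reg_grad(z, eps_model, alphas_cumprod):
    """Perturb a latent code to a random diffusion timestep and return the
    residual between predicted and injected noise; this residual points the
    code back toward the learned latent manifold."""
    with torch.no_grad():
        t = torch.randint(0, len(alphas_cumprod), (1,), device=z.device)
        a_t = alphas_cumprod[t]
        eps = torch.randn_like(z)
        z_t = a_t.sqrt() * z + (1.0 - a_t).sqrt() * eps  # forward diffusion q(z_t | z)
        return eps_model(z_t, t) - eps                   # manifold-seeking direction

def fit_sequence(frames, z_id, z_ex, id_model, ex_model, alphas_cumprod,
                 iters=500, w_temp=0.1, w_reg=0.01):
    """z_id: (d_id,) identity code; z_ex: (N, d_ex) per-frame expression codes."""
    opt = torch.optim.Adam([z_id, z_ex], lr=1e-3)
    for _ in range(iters):
        loss = 0.0
        for i, frame in enumerate(frames):
            pts, normals = sample_depth_points(frame)      # hypothetical back-projection
            sdf, n_pred = decode_sdf(z_id, z_ex[i], pts)   # hypothetical NPHM decoders
            loss = loss + sdf.abs().mean()                             # L_sdf
            loss = loss + (1.0 - (n_pred * normals).sum(-1)).mean()    # L_norm
        loss = loss + w_temp * (z_ex[1:] - z_ex[:-1]).pow(2).mean()    # L_temp
        opt.zero_grad()
        loss.backward()
        # Inject the diffusion-prior direction directly into the gradients.
        z_id.grad += w_reg * diffusion_reg_grad(z_id, id_model, alphas_cumprod)
        for i in range(len(frames)):
            z_ex.grad[i] += w_reg * diffusion_reg_grad(z_ex[i], ex_model, alphas_cumprod)
        opt.step()
```

Adding the prior term to the gradients, rather than backpropagating through the denoisers, keeps the regularization cheap; this mirrors how score-distillation-style guidance is commonly applied.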
BibTeX

@inproceedings{tang2024dphms,
  title={DPHMs: Diffusion Parametric Head Models for Depth-based Tracking},
  author={Tang, Jiapeng and Dai, Angela and Nie, Yinyu and Markhasin, Lev and Thies, Justus and Niessner, Matthias},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}