We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints.
We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and we efficiently inject them into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking.
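As a concrete illustration of the camera conditioning, the sketch below constructs a per-pixel Plücker ray map from pinhole intrinsics and world-to-camera extrinsics. This is a minimal example of the standard Plücker parameterization (ray direction plus moment); the function name, conventions, and shapes are our own assumptions, not the paper's released code.

```python
# Minimal sketch of per-pixel Plücker ray-map construction (illustrative only;
# names and conventions are assumptions, not the paper's code).
import numpy as np

def plucker_ray_map(K, R, t, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d) per pixel.

    K: (3, 3) pinhole intrinsics; R, t: world-to-camera rotation/translation.
    """
    # Camera center in world coordinates: o = -R^T t.
    o = -R.T @ t                                     # (3,)

    # Pixel grid in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)

    # Back-project pixels to world-space ray directions and normalize.
    dirs_cam = pix @ np.linalg.inv(K).T              # (H, W, 3)
    dirs_world = dirs_cam @ R                        # row-wise R^T d_cam
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Plücker coordinates: direction d and moment m = o x d.
    moments = np.cross(np.broadcast_to(o, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moments], axis=-1)  # (H, W, 6)
```

Encoding rays this way gives the diffusion model a dense, resolution-aligned description of the target viewpoint that can be concatenated with the other spatial conditions.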
To train our model, we curate a combination of real-world and synthetic datasets containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.
Pipeline Overview. Our method generates a video of the reference subject animated by the body pose and facial expressions from the driving images, while following the specified camera trajectory. The model consists of three main components: (1) a condition fusion layer that combines noise maps, the reference image, camera pose annotations, and body mesh tracking as input to the DiT; (2) an expression encoder that extracts and aggregates per-frame expression codes from the driving images; and (3) a video diffusion model based on Wan-DiT blocks with adaptive layer normalization (AdaLN), which applies scale and shift transformations conditioned on the per-frame expression codes and the frame-agnostic timestep embedding.
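To make the AdaLN conditioning concrete, the sketch below shows one way the per-frame expression codes and the shared timestep embedding could predict scale and shift for the video tokens inside a DiT block. Module names, dimensions, and the fusion-by-addition choice are illustrative assumptions, not the exact architecture.

```python
# Hedged sketch of AdaLN modulation in a DiT block: per-frame expression codes
# plus a frame-agnostic timestep embedding predict per-frame scale and shift.
# All names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    def __init__(self, hidden_dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Map the fused conditioning signal to a scale and a shift vector.
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(cond_dim, 2 * hidden_dim),
        )

    def forward(self, x, expr_codes, t_emb):
        # x:          (B, F, N, D) video tokens (F frames, N tokens per frame)
        # expr_codes: (B, F, C)    per-frame expression codes
        # t_emb:      (B, C)       frame-agnostic diffusion timestep embedding
        cond = expr_codes + t_emb.unsqueeze(1)                       # broadcast over frames
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)    # (B, F, D) each
        x = self.norm(x)
        return x * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)
```

Because the scale and shift vary per frame through the expression codes, the same block can modulate each frame's tokens toward its own expression while sharing the diffusion timestep across the clip.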
Condition Fusion Layer. We extract reference image latents, Plücker ray maps, and normal video latents rendered from the tracked template mesh, representing identity, viewpoint, and pose, respectively. These are concatenated with the noise latents as input to the diffusion model.
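A minimal sketch of this fusion step follows, assuming channel-wise concatenation of the spatially aligned conditions followed by a lightweight projection to the DiT width. Channel counts, the repetition of the reference latent across frames, and the 1x1x1 projection are assumptions for illustration.

```python
# Sketch of the condition fusion: concatenate noise latents with the identity,
# viewpoint, and pose conditions along channels, then project to the DiT width.
# Shapes and names are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    def __init__(self, noise_ch, ref_ch, ray_ch, normal_ch, hidden_dim):
        super().__init__()
        in_ch = noise_ch + ref_ch + ray_ch + normal_ch
        self.proj = nn.Conv3d(in_ch, hidden_dim, kernel_size=1)

    def forward(self, noise, ref_latent, ray_map, normal_latent):
        # noise, ray_map, normal_latent: (B, C_i, F, H, W)
        # ref_latent: (B, C_ref, 1, H, W), repeated along the frame axis so
        # every frame sees the identity condition.
        ref = ref_latent.expand(-1, -1, noise.shape[2], -1, -1)
        fused = torch.cat([noise, ref, ray_map, normal_latent], dim=1)
        return self.proj(fused)
```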
We evaluate the importance of our design choices by removing each of them from the pipeline:
Check out the following works, which also generate portrait animation from a single image:
HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation (video diffusion based talking face generation)
GAGAvatar: Generalizable and Animatable Gaussian Head Avatar (feed-forward Gaussian head avatar reconstruction)
CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models (morphable multi-view diffusion models for head avatars)
@article{factorportrait2025,
  author  = {Anonymous Authors},
  title   = {{FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint}},
  journal = {arXiv preprint},
  year    = {2025},
}