DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis

¹Technical University of Munich   ²Sony Semiconductor Solutions Europe   ³Technical University of Darmstadt

We present DiffuScene, a diffusion model for diverse and realistic indoor scene synthesis.

It facilitates various downstream applications: scene completion from partial scenes (left); scene arrangement of given objects (middle); and scene generation from a text prompt describing a partial scene configuration (right).

Unconditional Scene Synthesis

Abstract

We present DiffuScene for indoor 3D scene synthesis, based on a novel scene-configuration denoising diffusion model. It generates the properties of 3D instances stored in an unordered object set and retrieves the most similar geometry for each object; each object configuration is characterized as a concatenation of attributes, including location, size, orientation, semantic class, and geometry features. We introduce a diffusion network that synthesizes a collection of 3D indoor objects by denoising a set of unordered object attributes. The unordered parametrization simplifies and eases the approximation of the joint distribution, and diffusing shape features facilitates natural object placements, including symmetries. Our method enables many downstream applications, including scene completion, scene arrangement, and text-conditioned scene synthesis. Experiments on the 3D-FRONT dataset show that our method synthesizes more physically plausible and diverse indoor scenes than state-of-the-art methods. Extensive ablation studies verify the effectiveness of our design choices in the scene diffusion model.

Video

Overview

Given a 3D scene S, we obtain its fully-connected scene graph x_0 by parametrizing each object as a graph node storing all object attributes, i.e., location, size, orientation, class label, and latent shape code.
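As a concrete illustration, the per-object node feature can be built by concatenating the attributes into one vector and stacking all objects into a set tensor. The PyTorch sketch below is a minimal example; the attribute dimensionalities (3D location, 3D size, a 2D cos/sin orientation, a one-hot class over 20 categories, and a 32-dimensional shape code) are illustrative assumptions, not the paper's exact values.

    import torch

    NUM_CLASSES = 20   # assumed number of object categories
    SHAPE_DIM = 32     # assumed latent shape-code dimension

    def parametrize_object(location, size, orientation, class_label, shape_code):
        """Concatenate one object's attributes into a single node feature vector."""
        class_onehot = torch.zeros(NUM_CLASSES)
        class_onehot[class_label] = 1.0
        return torch.cat([location, size, orientation, class_onehot, shape_code])

    # A scene is the unordered set of its object vectors, stacked into a tensor x0.
    objects = [
        parametrize_object(torch.rand(3), torch.rand(3), torch.rand(2),
                           class_label=4, shape_code=torch.randn(SHAPE_DIM))
        for _ in range(8)
    ]
    x0 = torch.stack(objects)  # shape: (num_objects, 3 + 3 + 2 + NUM_CLASSES + SHAPE_DIM)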

Based on a set of all possible x_0, we propose DiffuScene, a denoising diffusion probabilistic model for 3D scene graph generation. In the forward process, we gradually add noise to x_0 until we obtain standard Gaussian noise x_T. In the reverse (generative) process, a denoising network iteratively cleans the noisy graph via ancestral sampling. Finally, we use the denoised object features to perform shape retrieval for realistic scene synthesis.
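To map this onto equations, the sketch below shows the standard DDPM forward noising q(x_t | x_0) and the ancestral sampling loop, applied to a set tensor like the x0 built above. The linear beta schedule, step count, and stand-in denoiser are assumptions for illustration; in the paper the denoiser is a learned permutation-invariant network over the object set.

    import torch

    T = 1000                               # assumed number of diffusion steps
    betas = torch.linspace(1e-4, 0.02, T)  # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    def q_sample(x0, t, noise):
        """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
        return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise

    @torch.no_grad()
    def ancestral_sample(denoiser, num_objects, feat_dim):
        """Reverse process: start from Gaussian noise x_T and denoise step by step."""
        x = torch.randn(num_objects, feat_dim)
        for t in reversed(range(T)):
            eps = denoiser(x, torch.tensor(t))  # predicted noise
            mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
            x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
        return x

    # Stand-in denoiser so the sketch runs end to end; in practice this is learned.
    dummy_denoiser = lambda x, t: torch.zeros_like(x)
    scene = ancestral_sample(dummy_denoiser, num_objects=8, feat_dim=60)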

Scene Completion

Scene completion from partial scenes, with only three objects given as input.

[Figure: qualitative comparison. Columns: Partial Scenes | ATISS | ATISS-2 | Ours-1 | Ours-2]
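One common way to make an unconditional scene diffusion model complete a partial scene is RePaint-style masking: at every reverse step, the given objects are overwritten with a re-noised copy of their ground-truth features, so the network only has to fill in the unknown objects. The sketch below illustrates this idea and is not necessarily the authors' exact conditioning scheme; it reuses the schedule tensors from the sampling sketch above.

    import torch

    @torch.no_grad()
    def complete_scene(denoiser, x_known, known_mask, T, betas, alphas, alpha_bars):
        """x_known: (N, D) partial scene; known_mask: (N, 1), 1 for given objects."""
        x = torch.randn_like(x_known)
        for t in reversed(range(T)):
            # Re-noise the known objects to the current timestep and clamp them in.
            noised_known = alpha_bars[t].sqrt() * x_known + \
                           (1.0 - alpha_bars[t]).sqrt() * torch.randn_like(x_known)
            x = known_mask * noised_known + (1.0 - known_mask) * x
            eps = denoiser(x, torch.tensor(t))
            mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
            x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
        return known_mask * x_known + (1.0 - known_mask) * x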

Scene Re-arrangement

Scene re-arrangement of collections of randomly placed objects.

[Figure: qualitative comparison. Columns: Noisy Scenes | ATISS | LEGO | Ours]
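Re-arrangement can be read as the same masking idea applied per feature dimension instead of per object: the class and shape attributes are given, and only the pose dimensions (location, size, orientation) are sampled. The sketch below assumes the pose occupies the first eight feature dimensions, which is an illustrative layout, not the paper's confirmed one.

    import torch

    POSE_DIMS = slice(0, 8)  # assumed layout: location(3) + size(3) + orientation(2)

    @torch.no_grad()
    def rearrange(denoiser, x_given, T, betas, alphas, alpha_bars):
        """x_given: (N, D) objects with valid identity attributes; poses are resampled."""
        pose_mask = torch.zeros_like(x_given)
        pose_mask[:, POSE_DIMS] = 1.0  # 1 where the pose lives, 0 elsewhere
        x = torch.randn_like(x_given)
        for t in reversed(range(T)):
            # Clamp identity dims to a re-noised copy of their given values.
            noised = alpha_bars[t].sqrt() * x_given + \
                     (1.0 - alpha_bars[t]).sqrt() * torch.randn_like(x_given)
            x = pose_mask * x + (1.0 - pose_mask) * noised
            eps = denoiser(x, torch.tensor(t))
            mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
            x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
        return pose_mask * x + (1.0 - pose_mask) * x_given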

Text-conditioned Scene Synthesis

Text-conditioned scene synthesis. The input text describes only a partial scene configuration.

[Figure: qualitative comparison. Columns: Text | Reference | ATISS | Ours]
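A standard way to add text conditioning is to embed the prompt with a pretrained language model and feed the token embeddings to the denoiser at every step, e.g. through cross-attention. The sketch below uses bert-base-uncased from Hugging Face Transformers as the encoder; the concrete encoder choice and fusion mechanism here are assumptions for illustration, not the paper's confirmed design.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    text_encoder = AutoModel.from_pretrained("bert-base-uncased")

    @torch.no_grad()
    def encode_prompt(prompt: str) -> torch.Tensor:
        """Return per-token embeddings of shape (1, seq_len, 768) for conditioning."""
        tokens = tokenizer(prompt, return_tensors="pt")
        return text_encoder(**tokens).last_hidden_state

    cond = encode_prompt("a bedroom with a double bed and two nightstands")
    # The denoiser then consumes the condition at every reverse step, e.g.
    # eps = denoiser(x_t, t, cond), fusing cond via cross-attention.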

BibTeX


    @inproceedings{tang2024diffuscene,
      title={DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis},
      author={Tang, Jiapeng and Nie, Yinyu and Markhasin, Lev and Dai, Angela and Thies, Justus and Nie{\ss}ner, Matthias},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year={2024}
    }