Intrinsic Image Diffusion for Indoor Single-view Material Estimation

CVPR 2024

Peter Kocsis
TU Munich
Vincent Sitzmann
MIT EECS
Matthias Nießner
TU Munich

Abstract

We present Intrinsic Image Diffusion, a generative model for appearance decomposition of indoor scenes. Given a single input view, we sample multiple possible material explanations represented as albedo, roughness, and metallic maps. Appearance decomposition poses a considerable challenge in computer vision due to the inherent ambiguity between lighting and material properties and the lack of real datasets. To address this issue, we advocate for a probabilistic formulation: instead of attempting to directly predict the true material properties, we employ a conditional generative model to sample from the solution space. Furthermore, we show that the strong learned prior of recent diffusion models trained on large-scale real-world images can be adapted to material estimation and greatly improves generalization to real images. Our method produces significantly sharper, more consistent, and more detailed materials, outperforming state-of-the-art methods by 1.5 dB in PSNR and achieving a 45% better FID score on albedo prediction. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.

Results


Applications

Material Editing

Lighting Editing

Method

Material Diffusion

We train a conditional diffusion model to predict albedo and BRDF properties (roughness and metallic) given a single input image. We adapt the learned prior of Stable Diffusion [28] by fine-tuning it on the synthetic InteriorVerse [40] dataset. (i) First, we separately encode the ground-truth (GT) albedo and BRDF properties with a fixed encoder to obtain the material feature maps. We also encode the conditioning image with a trainable encoder. (ii) We add noise to the material features and use our conditional diffusion model to predict the noise. (iii) The training is supervised with an L2 loss between the original and predicted noise. (iv) Using the predicted noise, the predicted material properties can be decoded separately.
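Steps (ii) to (iv) can be illustrated with a minimal NumPy sketch. It is not the released implementation: the linear noise schedule, the latent shapes, and the names `make_alpha_bar`, `add_noise`, `training_loss`, and `recover_latent` are assumptions made for illustration, and the `denoiser` argument stands in for the fine-tuned conditional network.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal fraction for a linear beta schedule (an assumption here)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, noise, t, alpha_bar):
    """Step (ii): blend the clean material features z0 with Gaussian noise at step t."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise

def training_loss(denoiser, material_latent, image_latent, t, alpha_bar, rng):
    """Step (iii): L2 loss between the sampled noise and the predicted noise."""
    noise = rng.standard_normal(material_latent.shape)
    noisy = add_noise(material_latent, noise, t, alpha_bar)
    pred = denoiser(noisy, image_latent, t)  # conditioned on the image features
    loss = float(np.mean((pred - noise) ** 2))
    return loss, noise, noisy

def recover_latent(noisy, pred_noise, t, alpha_bar):
    """Step (iv): invert the noising equation to estimate the clean features,
    which a decoder would then map back to albedo and BRDF maps."""
    a = alpha_bar[t]
    return (noisy - np.sqrt(1.0 - a) * pred_noise) / np.sqrt(a)

# Toy usage with a placeholder denoiser that always predicts zero noise.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
material_latent = rng.standard_normal((8, 16, 16))  # illustrative shape only
image_latent = rng.standard_normal((4, 16, 16))     # illustrative shape only
zero_denoiser = lambda z, cond, t: np.zeros_like(z)
loss, noise, noisy = training_loss(
    zero_denoiser, material_latent, image_latent, 500, alpha_bar, rng
)
```

With a perfect noise prediction, `recover_latent` returns the original material features exactly, which is why supervising the noise is enough to recover the materials.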

Lighting Optimization

Using our predicted material, we fit 48 point light sources and a global pre-integrated environment lighting to the scene using a reconstruction loss.
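The idea of fitting lights through a reconstruction loss can be sketched on a toy problem. This is a heavily simplified illustration under stated assumptions, not the paper's renderer: shading is single-channel and purely diffuse, roughness and metallic are ignored, light positions are held fixed, and the pre-integrated environment lighting is reduced to one ambient scalar. The names `point_light_basis`, `render`, and `fit_lighting` are hypothetical.

```python
import numpy as np

def point_light_basis(points, normals, light_pos):
    """Per-point, per-light Lambertian factor: max(0, n.l) / d^2."""
    to_light = light_pos[None, :, :] - points[:, None, :]    # (P, K, 3)
    dist = np.linalg.norm(to_light, axis=-1)                  # (P, K)
    cos = np.einsum("pkc,pc->pk", to_light, normals) / dist   # (P, K)
    return np.clip(cos, 0.0, None) / dist**2

def render(albedo, basis, intensities, ambient):
    """Diffuse shading: albedo * (sum of point lights + ambient term)."""
    return albedo * (basis @ intensities + ambient)

def fit_lighting(target, albedo, basis, steps=2000):
    """Projected gradient descent on the mean squared reconstruction error,
    keeping light intensities and the ambient term non-negative."""
    num_points = target.size
    # The model is linear in the lighting parameters: image = design @ weights.
    design = albedo[:, None] * np.hstack([basis, np.ones((num_points, 1))])
    # Step size 1/L, with L the Lipschitz constant of the gradient.
    step = num_points / (2.0 * np.linalg.norm(design, 2) ** 2)
    weights = np.zeros(design.shape[1])
    for _ in range(steps):
        residual = design @ weights - target
        grad = 2.0 * design.T @ residual / num_points
        weights = np.clip(weights - step * grad, 0.0, None)
    loss = float(np.mean((design @ weights - target) ** 2))
    return weights[:-1], float(weights[-1]), loss

# Toy scene: a floor plane lit by 48 point lights placed above it.
rng = np.random.default_rng(0)
points = np.column_stack([rng.uniform(-1, 1, (500, 2)), np.zeros(500)])
normals = np.tile([0.0, 0.0, 1.0], (500, 1))
light_pos = np.column_stack([rng.uniform(-1, 1, (48, 2)), np.full(48, 1.5)])
albedo = rng.uniform(0.2, 0.9, 500)  # stands in for the predicted albedo
basis = point_light_basis(points, normals, light_pos)
target = render(albedo, basis, rng.uniform(0.0, 2.0, 48), 0.2)
intensities, ambient, loss = fit_lighting(target, albedo, basis)
```

Because this toy renderer is linear in the lighting parameters, the fit is a non-negative least-squares problem. A full renderer with movable lights and specular terms would instead need a general gradient-based optimizer.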

Citation


@inproceedings{kocsis2024iid,
    author = {Kocsis, Peter and Sitzmann, Vincent and Nie{\ss}ner, Matthias},
    title = {Intrinsic Image Diffusion for Indoor Single-view Material Estimation},
    booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2024}
}