IntrinsiX: High-Quality PBR Generation using Image Priors

arXiv 2025

Peter Kocsis
TU Munich
Lukas Höllein
TU Munich
Matthias Nießner
TU Munich

Code

Abstract

We introduce IntrinsiX, a novel method that generates high-quality intrinsic images from a text description. In contrast to existing text-to-image models, whose outputs contain baked-in scene lighting, our approach predicts physically-based rendering (PBR) maps. This enables the generated outputs to be used in core graphics applications for content creation, facilitating re-lighting, editing, and texture generation tasks. To train our generator, we exploit strong image priors and pre-train separate models for each PBR material component (albedo, roughness, metallic, normals). We then align these models with a new cross-intrinsic attention formulation that concatenates key and value features in a consistent fashion. This allows us to exchange information between the output modalities and to obtain semantically coherent PBR predictions. To ground each intrinsic component, we propose a rendering loss that provides image-space signals to constrain the model, yielding sharp details in the output BRDF properties as well. Our results demonstrate detailed intrinsic generation with strong generalization capabilities, outperforming existing intrinsic image decomposition methods applied to generated images by a significant margin. Finally, we show a series of applications, including re-lighting, editing, and text-conditioned room-scale PBR texture generation.

Results

Our method generates intrinsic components, such as normal, albedo, roughness, and metallic maps, given a text prompt. The decomposed intrinsic maps can be used for various applications, such as re-lighting, material editing, and texture generation.

We can use physically-based rendering to render our generated scene under arbitrary lighting conditions.
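To make this re-rendering step concrete, here is a minimal single-pixel, single-light Cook-Torrance (GGX) shading sketch that consumes exactly the four map values our method produces. This is a simplified illustration of a standard PBR formulation in NumPy, not the paper's actual renderer, and the function name is our own:

```python
import numpy as np

def render_pixel(albedo, roughness, metallic, normal, light_dir, view_dir):
    """Shade one pixel with a simplified Cook-Torrance BRDF (GGX microfacets)."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    h = (l + v) / (np.linalg.norm(l + v) + 1e-8)  # half vector

    n_dot_l = max(np.dot(n, l), 0.0)
    n_dot_v = max(np.dot(n, v), 1e-4)
    n_dot_h = max(np.dot(n, h), 0.0)
    v_dot_h = max(np.dot(v, h), 0.0)

    # GGX normal distribution term
    a = roughness ** 2
    d = a**2 / (np.pi * (n_dot_h**2 * (a**2 - 1.0) + 1.0) ** 2 + 1e-8)

    # Schlick Fresnel; metals tint the base reflectance F0 by the albedo
    f0 = 0.04 * (1.0 - metallic) + albedo * metallic
    f = f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5

    # Smith geometry term (Schlick-GGX approximation)
    k = (roughness + 1.0) ** 2 / 8.0
    g = (n_dot_v / (n_dot_v * (1 - k) + k)) * (n_dot_l / (n_dot_l * (1 - k) + k))

    specular = d * f * g / (4.0 * n_dot_v * n_dot_l + 1e-8)
    diffuse = (1.0 - metallic) * albedo / np.pi  # metals have no diffuse lobe
    return (diffuse + specular) * n_dot_l
```

Changing `light_dir` per pixel while keeping the PBR maps fixed is what produces the arbitrary re-lighting results shown above.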

Applications

Editable Image Generation

Our generated PBR maps can be edited and used in standard physically-based rendering frameworks to produce RGB renderings. Here, we place a light source above the scene at constant elevation and rotate it around the vertical axis. From top to bottom, we show (1) RGB renderings with different light source positions; (2) a manual edit of the albedo (desaturating the moon color); (3) a lower roughness value (more specular reflections); (4) a higher metallic value (more glossy, mirror-like reflections).
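Edits like these reduce to simple per-map operations before re-rendering. A minimal sketch, with illustrative map values and a hypothetical `desaturate` helper (not from the paper):

```python
import numpy as np

def desaturate(albedo, amount=1.0):
    """Blend an albedo map toward its per-pixel luminance (amount=1 -> grayscale)."""
    lum = albedo @ np.array([0.2126, 0.7152, 0.0722])  # Rec. 709 luminance weights
    return (1.0 - amount) * albedo + amount * lum[..., None]

# Hypothetical 2x2 albedo map (H, W, 3) for illustration
albedo = np.array([[[0.9, 0.3, 0.1], [0.2, 0.6, 0.8]],
                   [[0.5, 0.5, 0.5], [1.0, 0.0, 0.0]]])
gray = desaturate(albedo)

# Roughness/metallic edits are per-pixel scalar maps clamped to [0, 1]
roughness = np.full((2, 2), 0.7)
shinier = np.clip(roughness * 0.5, 0.0, 1.0)   # lower roughness -> sharper highlights
```

Because the lighting is not baked into the maps, these edits stay consistent across all light positions when re-rendered.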

Scene Texturing

Given a scene geometry, we first condition our method on the rendered normal maps to produce the remaining PBR maps. Through iterative optimization, we obtain realistic PBR textures for the whole scene. Then, we similarly optimize for normal map textures, conditioned on the rendered material maps, to obtain fine geometric details. This showcases the potential of direct PBR map generation to democratize scene texturing from text alone.

Industrial...

Greek...

Tuscan...

Baroque...

Method

Material Diffusion: We generate the intrinsic properties of an image given text as input. Left: we train three different LoRAs for a pretrained latent text-to-image model, corresponding to the intrinsic properties (albedo, normal, and roughness + metallic), on curated synthetic datasets. We facilitate communication between all four modalities through cross-intrinsic attention to predict PBR maps corresponding to the same image. A novel rendering loss using importance-based light sampling ensures that we can render high-quality RGB images from physically realistic PBR maps. Right: after training, we jointly denoise and decode all four PBR maps and can prompt our model with diverse, out-of-distribution descriptions. Using our predicted materials, we fit 48 point light sources and a global pre-integrated environment lighting to the scene using a reconstruction loss.
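The cross-intrinsic attention step can be sketched as follows: each modality's queries attend over keys and values concatenated from all modalities, so every branch sees the others and the four predictions stay semantically aligned. This single-head NumPy version is a hypothetical illustration of the concatenation idea; the actual model operates on latent features inside the diffusion backbone:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_intrinsic_attention(features, w_q, w_k, w_v):
    """features: list of (tokens, dim) arrays, one per intrinsic modality.

    Keys and values are concatenated across ALL modalities along the token
    axis, so each modality's queries can attend to every branch's features.
    """
    keys = np.concatenate([f @ w_k for f in features], axis=0)
    values = np.concatenate([f @ w_v for f in features], axis=0)
    outs = []
    for f in features:
        q = f @ w_q
        attn = softmax(q @ keys.T / np.sqrt(q.shape[-1]))  # rows sum to 1
        outs.append(attn @ values)                          # same shape as f
    return outs
```

Each output keeps its modality's token count and dimension, so the block drops into the per-modality branches without changing their interfaces.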

Baseline Comparisons

We compare against two recent intrinsic image decomposition methods, IID and RGBX. We generate an image with FLUX-dev 1.0, then use the baseline methods to obtain the PBR components. Finally, we re-render the scenes under different lighting conditions.

We use a diverse set of text prompts to produce our PBR maps, as well as the input RGB images for the baseline methods. This highlights our model's capability to retain the generalized prior of the pretrained text-to-image model. Our method better captures the semantic meaning of the individual intrinsic properties. For example, there are no baked-in lighting effects in the albedo, and the metallic/roughness maps are sharper with more intricate details. This leads to more realistic renderings and lighting effects.

Citation


          @article{kocsis2025intrinsix,
              author  = {Kocsis, Peter and H\"{o}llein, Lukas and Nie{\ss}ner, Matthias},
              title   = {IntrinsiX: High-Quality PBR Generation using Image Priors},
              journal = {arXiv preprint},
              year    = {2025}}