Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers


NeurIPS 2025 (ratings: 5, 5, 5, 5)

Gangwei Xu1,2,*     Haotong Lin3,*     Hongcheng Luo2     Xianqi Wang1     Jingfeng Yao1     Lianghui Zhu1     Yuechuan Pu2     Cheng Chi2
Haiyang Sun2     Bing Wang2     Guang Chen2     Hangjun Ye2     Sida Peng3     Xin Yang1
1HUST     2Xiaomi EV     3Zhejiang University

* denotes co-first author




We present Pixel-Perfect Depth, a monocular depth estimation model built on pixel-space diffusion transformers. Unlike existing discriminative and generative models, its estimated depth maps yield high-quality, flying-pixel-free point clouds without any post-processing.


Abstract


This work presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces flying pixels at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) a Cascade DiT design that progressively increases the number of tokens to further improve efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.
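
To make the semantics-prompting idea concrete, below is a minimal PyTorch sketch, not the authors' implementation, of a DiT block whose pixel tokens are prompted by semantic tokens from a frozen vision foundation model. The fusion by addition, the class name SemanticsPromptedBlock, and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class SemanticsPromptedBlock(nn.Module):
    """One DiT block whose tokens are prompted by semantic features (illustrative)."""
    def __init__(self, dim: int, sem_dim: int, num_heads: int = 8):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)   # project VFM features to the DiT width
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor, sem_tokens: torch.Tensor) -> torch.Tensor:
        # Prompt the pixel tokens with semantic tokens (assumed resampled to the same length).
        tokens = tokens + self.sem_proj(sem_tokens)
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens

# Toy usage: 1024 pixel tokens of width 768, prompted by 1024-d semantic features.
block = SemanticsPromptedBlock(dim=768, sem_dim=1024)
out = block(torch.randn(1, 1024, 768), torch.randn(1, 1024, 1024))
print(out.shape)  # torch.Size([1, 1024, 768])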


Validation of flying pixels in different types of VAEs


Comparison with VAEs
We present qualitative comparisons showing that increasing the latent dimension in VAEs fails to eliminate flying pixels. VAE-d4 (SD2) denotes the reconstruction of ground truth depth maps using the VAE from Stable Diffusion 2, with a latent dimension of 4, which is also used in Marigold. VAE-d16 (SD3.5) uses the VAE from Stable Diffusion 3.5, which has a latent dimension of 16.
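
For readers who want to reproduce this check, the sketch below round-trips a ground-truth depth map through a Stable Diffusion VAE using the Hugging Face diffusers AutoencoderKL; flying pixels show up as large reconstruction errors at depth discontinuities. The model ID, the 3-channel replication, and the error measure are assumptions for illustration, not the paper's exact protocol.

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="vae")
vae.eval()

def vae_roundtrip_error(depth: torch.Tensor) -> torch.Tensor:
    """depth: (H, W) normalized to [0, 1]; returns per-pixel |reconstruction - input|."""
    x = depth[None, None].repeat(1, 3, 1, 1) * 2 - 1        # (1, 3, H, W) in [-1, 1]
    with torch.no_grad():
        latents = vae.encode(x).latent_dist.mode()          # latent dimension 4 for SD2
        recon = vae.decode(latents).sample
    recon_depth = (recon.mean(dim=1)[0] + 1) / 2
    return (recon_depth - depth).abs()

err = vae_roundtrip_error(torch.rand(512, 512))
print("max round-trip error:", err.max().item())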

Overview of Pixel-Perfect Depth


Network
Given an input image, we concatenate it with noise and feed it into the proposed Cascade DiT. Meanwhile, the image is also processed by a pretrained encoder from Vision Foundation Models to extract high-level semantics, forming our Semantics-Prompted DiT. We perform diffusion generation directly in pixel space without using any VAE.
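
A minimal sketch of this data flow is shown below, under several assumptions: the vision-foundation-model encoder and the Cascade DiT are replaced by stubs, and a simple Euler-style update stands in for the actual sampler. It only illustrates that generation happens directly in pixel space, conditioned on the image and on semantic tokens extracted once per image.

import torch

def extract_semantics(image: torch.Tensor) -> torch.Tensor:
    """Stand-in for a frozen VFM encoder (e.g., DINOv2): (B, 3, H, W) -> (B, N, C) tokens."""
    return torch.randn(image.shape[0], 256, 1024)

def cascade_dit(x_t, image, sem_tokens, t):
    """Stand-in for the Cascade DiT: predicts the denoising target in pixel space."""
    return torch.zeros_like(x_t)

@torch.no_grad()
def sample_depth(image: torch.Tensor, num_steps: int = 20) -> torch.Tensor:
    sem_tokens = extract_semantics(image)                    # semantics extracted once per image
    x = torch.randn(image.shape[0], 1, *image.shape[2:])     # pure noise in pixel space, no VAE
    for i in range(num_steps):
        t = 1.0 - i / num_steps
        v = cascade_dit(x, image, sem_tokens, t)             # image concatenated with noise inside
        x = x + v / num_steps                                # simple Euler-style update (assumed)
    return x                                                 # estimated depth map, (B, 1, H, W)

depth = sample_depth(torch.randn(1, 3, 480, 640))
print(depth.shape)  # torch.Size([1, 1, 480, 640])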

Qualitative Comparisons with Representative Depth Foundation Models


Comparisons with Representative Depth Foundation Models
Our model preserves more fine-grained details than Depth Anything v2 and MoGe 2, while demonstrating significantly higher robustness compared to Depth Pro.

Qualitative Point Cloud Results in Complex Scenes


Qualitative Point Cloud Results in Complex Scenes
Our model produces significantly fewer flying pixels than other depth estimation models; depth maps are overlaid on the point clouds for visualization.
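
For reference, the sketch below shows a standard pinhole unprojection from a depth map to a point cloud (the intrinsics are placeholders). Flying pixels arise when depth values blurred across object boundaries are back-projected to 3D points floating between the foreground and background surfaces.

import torch

def depth_to_points(depth: torch.Tensor, fx: float, fy: float, cx: float, cy: float) -> torch.Tensor:
    """depth: (H, W) metric depth -> (H*W, 3) points in camera coordinates."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                          torch.arange(w, dtype=depth.dtype), indexing="ij")
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return torch.stack((x, y, z), dim=-1).reshape(-1, 3)

pts = depth_to_points(torch.ones(480, 640), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(pts.shape)  # torch.Size([307200, 3])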

Qualitative Ablations for Semantics-Prompted DiT (SP-DiT)


Qualitative Ablations for Semantics-Prompted DiT (SP-DiT)
From top to bottom: input images from five benchmarks, results without SP-DiT, and results with SP-DiT. Without SP-DiT, the DiT model struggles with preserving global semantics and generating fine-grained visual details.

Qualitative comparisons with MoGe


Qualitative comparisons with MoGe
We provide qualitative comparisons with MoGe. Top: input images taken from four test sets: Hypersim, DIODE, ScanNet, and ETH3D. Middle: results of MoGe. Bottom: our results. Like previous discriminative models, MoGe also suffers from flying pixels at edges and fine details.

Citation


@inproceedings{xu2025ppd,
  title={Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers},
  author={Xu, Gangwei and Lin, Haotong and Luo, Hongcheng and Wang, Xianqi and Yao, Jingfeng and Zhu, Lianghui and Pu, Yuechuan and Chi, Cheng and Sun, Haiyang and Wang, Bing and Chen, Guang and Ye, Hangjun and Peng, Sida and Yang, Xin},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}