A UNet that improves the spatial understanding of the StableDiffusion 1.5 text-to-image
diffusion model. It substantially improves the generation of images that follow explicit
spatial relationships between objects stated in the prompt (e.g., "a photo of A to the right of B").
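To make the prompt pattern concrete, here is a minimal sketch of a helper that builds prompts of the form "a photo of A <relation> B". The function names and the set of relation phrases are assumptions for illustration; only "to the right of" appears in this card.

```python
from itertools import permutations

# Relation phrases are assumed for illustration; the model card itself only
# shows "to the right of" as an example.
RELATIONS = ("to the left of", "to the right of", "above", "below")


def spatial_prompt(obj_a: str, relation: str, obj_b: str) -> str:
    """Build a prompt following the 'a photo of A <relation> B' pattern."""
    if relation not in RELATIONS:
        raise ValueError(f"unsupported relation: {relation!r}")
    return f"a photo of {obj_a} {relation} {obj_b}"


def all_prompts(objects):
    """Enumerate prompts for every ordered pair of distinct objects."""
    return [
        spatial_prompt(a, rel, b)
        for a, b in permutations(objects, 2)
        for rel in RELATIONS
    ]


if __name__ == "__main__":
    for prompt in all_prompts(["a dog", "a bench"]):
        print(prompt)
```

Prompts built this way can be passed to any text-to-image pipeline that uses this UNet; the exact loading code depends on where the weights are hosted, which this card does not state.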
Training Details
Training Data
Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
~28,000 curated object pairs from COCO
Enforces criteria for:
Visual significance
Semantic distinction
Spatial clarity
Object relationships
Visual balance
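The five criteria above could be sketched as a filter over pairs of COCO-style bounding boxes (`[x, y, width, height]` in pixels). This is only an illustration of the idea, not the SCOP implementation: the interpretation of each criterion, the `Box` type, the `keep_pair` function, and every threshold below are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """COCO-style box: top-left corner plus width and height, in pixels."""
    category: str
    x: float
    y: float
    w: float
    h: float

    @property
    def area(self) -> float:
        return self.w * self.h

    @property
    def center(self) -> tuple:
        return (self.x + self.w / 2, self.y + self.h / 2)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection rectangle (0 if the boxes are disjoint)."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(dx, 0.0) * max(dy, 0.0)


def keep_pair(a: Box, b: Box, img_w: float, img_h: float,
              min_area_frac: float = 0.1,       # assumed threshold
              max_center_dist_frac: float = 0.6,  # assumed threshold
              max_overlap_frac: float = 0.25,   # assumed threshold
              max_size_ratio: float = 3.0) -> bool:  # assumed threshold
    small, large = sorted((a.area, b.area))
    if small <= 0:
        return False
    # Visual significance: the two boxes together cover enough of the image
    # (overlap is counted twice here, a deliberate simplification).
    if (a.area + b.area) / (img_w * img_h) < min_area_frac:
        return False
    # Semantic distinction: the objects belong to different categories.
    if a.category == b.category:
        return False
    # Spatial clarity: centers are not too far apart relative to the diagonal.
    (ax, ay), (bx, by) = a.center, b.center
    if math.hypot(ax - bx, ay - by) > max_center_dist_frac * math.hypot(img_w, img_h):
        return False
    # Object relationships: the boxes overlap only to a limited degree.
    if overlap_area(a, b) / small > max_overlap_frac:
        return False
    # Visual balance: the objects are of comparable size.
    return large / small <= max_size_ratio
```

For example, a dog box and a bench box of similar size placed side by side in a 640x480 image pass the filter, while a dog paired with a tiny cup is rejected by the size-ratio check.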
Training Process
Trained for 24,000 steps
Effective batch size of 4
Learning rate: 5e-6
Optimizer: AdamW with β₁=0.9, β₂=0.999
Weight decay: 1e-2
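The hyperparameters above can be collected into a small config object. The sketch below is dependency-free; the class and function names are hypothetical, and the split of the effective batch size into per-device batch size, device count, and gradient accumulation is an assumption, since the card only states the effective value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the "Training Process" list above.
    max_steps: int = 24_000
    effective_batch_size: int = 4
    learning_rate: float = 5e-6
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    weight_decay: float = 1e-2

    def adamw_kwargs(self) -> dict:
        """Keyword arguments in the form torch.optim.AdamW accepts."""
        return {
            "lr": self.learning_rate,
            "betas": (self.adam_beta1, self.adam_beta2),
            "weight_decay": self.weight_decay,
        }

    def samples_seen(self) -> int:
        """Total training samples processed over the whole run."""
        return self.max_steps * self.effective_batch_size


def grad_accum_steps(effective_batch_size: int,
                     per_device_batch_size: int,
                     num_devices: int) -> int:
    """Accumulation steps needed to reach the effective batch size.

    The per-device batch size and device count are assumptions; the card
    does not report them.
    """
    per_step = per_device_batch_size * num_devices
    if per_step <= 0 or effective_batch_size % per_step:
        raise ValueError("effective batch size must be a multiple of per-step batch")
    return effective_batch_size // per_step


if __name__ == "__main__":
    cfg = TrainConfig()
    print(cfg.adamw_kwargs())
    print(cfg.samples_seen())
```

With PyTorch installed, the optimizer could then be created as `torch.optim.AdamW(unet.parameters(), **cfg.adamw_kwargs())`, where `unet` stands for whichever module is being fine-tuned.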
Evaluation Results
| Metric | StableDiffusion 1.4 | +CoMPaSS |
| --- | --- | --- |
| VISOR uncond (⬆️) | 17.58% | 61.46% |
| T2I-CompBench Spatial (⬆️) | 0.08 | 0.35 |
| GenEval Position (⬆️) | 0.04 | 0.54 |
| FID (⬇️) | 12.82 | 10.89 |
| CMMD (⬇️) | 0.5548 | 0.3235 |
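As a reading aid, the results above can be turned into signed improvements, where a positive number always means "better" regardless of whether the metric is higher-is-better or lower-is-better. The dictionary layout and the `improvement` helper are illustrative assumptions; the numbers are copied from the results above.

```python
# (baseline, +CoMPaSS, higher_is_better), copied from the evaluation results.
RESULTS = {
    "VISOR uncond (%)": (17.58, 61.46, True),
    "T2I-CompBench Spatial": (0.08, 0.35, True),
    "GenEval Position": (0.04, 0.54, True),
    "FID": (12.82, 10.89, False),
    "CMMD": (0.5548, 0.3235, False),
}


def improvement(baseline: float, ours: float, higher_is_better: bool) -> float:
    """Absolute change, signed so that positive always means an improvement."""
    delta = ours - baseline
    return delta if higher_is_better else -delta


if __name__ == "__main__":
    for name, (base, ours, higher) in RESULTS.items():
        print(f"{name}: {improvement(base, ours, higher):+.4f}")
```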
Citation
If you use this model in your research, please cite:
@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}