Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Paper: [arXiv:2506.00633](https://arxiv.org/abs/2506.00633)
Use `diff_model_demo.py` from the code release for a one-off generation from text. Three checkpoints are provided:

- `autoencoder_epoch273.pt` — 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt` — diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt` — CLIP3D weights for encoding reports.

Download them with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

repo_id = "yourname/text2ct-weights"  # replace with the actual repo id
autoencoder_path = hf_hub_download(repo_id, "autoencoder_epoch273.pt")
unet_path = hf_hub_download(repo_id, "unet_rflow_200ep.pt")
clip_path = hf_hub_download(repo_id, "CLIP3D_Finding_Impression_30ep.pt")
```
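The demo script wires everything up for you, but for orientation, here is a minimal sketch of the inference flow these three checkpoints imply: CLIP3D encodes the report into a conditioning embedding, the UNet's rectified-flow velocity field is integrated from noise to a latent volume, and the 3D VAE decodes that latent into a CT volume. The wrapper objects (`clip_model`, `unet`, `vae`), their call signatures, the latent shape, and the step count are all assumptions, not the release's actual API:

```python
import torch

@torch.no_grad()
def sample_ct(report_text, clip_model, unet, vae,
              latent_shape=(1, 4, 32, 64, 64), steps=50):
    """Sketch of text-to-CT sampling via rectified flow.

    Rectified flow learns a velocity field v(z_t, t, cond) that moves
    Gaussian noise (t=0) along near-straight paths to data latents
    (t=1), so plain fixed-step Euler integration is a natural sampler.
    All model interfaces here are hypothetical stand-ins.
    """
    cond = clip_model.encode_text(report_text)         # report -> embedding (assumed API)
    z = torch.randn(latent_shape, device=cond.device)  # start from noise at t=0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt, device=cond.device)
        v = unet(z, t, cond)                           # predicted velocity (assumed signature)
        z = z + dt * v                                 # Euler step toward t=1
    return vae.decode(z)                               # latent -> CT volume (assumed API)
```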
Use the downloaded paths in the code release configs:

- `trained_autoencoder_path` -> `autoencoder_path`
- `existing_ckpt_filepath` / `model_filename` -> `unet_path`
- `clip_weights` (for report embeddings) -> `clip_path`
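If you drive the release's scripts through their configs, that mapping amounts to an override like the one below. The key names come from the release; the assumption that they are consumed as a flat dict is mine:

```python
# Hypothetical override dict; key names are from the code release configs,
# but how the scripts consume them is an assumption.
config_overrides = {
    "trained_autoencoder_path": autoencoder_path,
    "existing_ckpt_filepath": unet_path,  # some scripts name this "model_filename"
    "clip_weights": clip_path,            # used for report embeddings
}
```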
If you use these weights or code, please cite the paper:
```bibtex
@article{molino2025text,
  title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
  author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
  journal={arXiv preprint arXiv:2506.00633},
  year={2025}
}
```