Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Paper: [arXiv:2506.00633](https://arxiv.org/abs/2506.00633)
Use `diff_model_demo.py` from the code release for a one-off generation from text. Three checkpoints are provided:

- `autoencoder_epoch273.pt` — 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt` — diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt` — CLIP3D weights for encoding reports.

Download them with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

repo_id = "yourname/text2ct-weights"  # replace with the actual repo id
autoencoder_path = hf_hub_download(repo_id, "autoencoder_epoch273.pt")
unet_path = hf_hub_download(repo_id, "unet_rflow_200ep.pt")
clip_path = hf_hub_download(repo_id, "CLIP3D_Finding_Impression_30ep.pt")
```
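The demo script wires everything up for you, but for orientation, here is a minimal sketch of the inference flow these three checkpoints imply: CLIP3D encodes the report into a conditioning embedding, the UNet's rectified-flow velocity field is integrated from noise to a latent volume, and the 3D VAE decodes that latent into a CT volume. The wrapper objects (`clip_model`, `unet`, `vae`), their call signatures, the latent shape, and the step count are all assumptions, not the release's actual API:

```python
import torch

@torch.no_grad()
def sample_ct(report_text, clip_model, unet, vae,
              latent_shape=(1, 4, 32, 64, 64), steps=50):
    """Sketch of text-to-CT sampling via rectified flow.

    Rectified flow learns a velocity field v(z_t, t, cond) that moves
    Gaussian noise (t=0) along near-straight paths to data latents
    (t=1), so plain fixed-step Euler integration is a natural sampler.
    All model interfaces here are hypothetical stand-ins.
    """
    cond = clip_model.encode_text(report_text)         # report -> embedding (assumed API)
    z = torch.randn(latent_shape, device=cond.device)  # start from noise at t=0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt, device=cond.device)
        v = unet(z, t, cond)                           # predicted velocity (assumed signature)
        z = z + dt * v                                 # Euler step toward t=1
    return vae.decode(z)                               # latent -> CT volume (assumed API)
```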
Use the downloaded paths in the code release configs:

- `trained_autoencoder_path` -> `autoencoder_path`
- `existing_ckpt_filepath` / `model_filename` -> `unet_path`
- `clip_weights` (for report embeddings) -> `clip_path`
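If you drive the release's scripts through their configs, that mapping amounts to an override like the one below. The key names come from the release; the assumption that they are consumed as a flat dict is mine:

```python
# Hypothetical override dict; key names are from the code release configs,
# but how the scripts consume them is an assumption.
config_overrides = {
    "trained_autoencoder_path": autoencoder_path,
    "existing_ckpt_filepath": unet_path,  # some scripts name this "model_filename"
    "clip_weights": clip_path,            # used for report embeddings
}
```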
If you use these weights or code, please cite the paper:
```bibtex
@article{molino2025text,
  title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
  author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
  journal={arXiv preprint arXiv:2506.00633},
  year={2025}
}
```