Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

(TL;DR) We propose Condition Contrastive Alignment (CCA), a finetuning technique that lets autoregressive (AR) visual models generate high-quality images without classifier-free guidance (CFG), cutting sampling costs by half. CCA and CFG share the same theoretical foundations and therefore have similar features, though CCA is inspired by LLM alignment rather than guided sampling.
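To illustrate where the halved sampling cost comes from, here is a minimal sketch that counts forward passes per generated token. It assumes the common CFG formulation in which each step combines a conditional and an unconditional prediction. `ToyModel`, its `logits` method, and the greedy decoding loop are hypothetical stand-ins for illustration, not part of the CCA or LlamaGen code.

```python
import math


class ToyModel:
    """Hypothetical stand-in for an AR visual model; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def logits(self, tokens, cond):
        # Deterministic toy logits over a 4-token vocabulary.
        self.calls += 1
        shift = 0 if cond is None else cond
        return [math.sin(len(tokens) + shift + v) for v in range(4)]


def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])


def sample_cfg(model, cond, steps, scale):
    """Greedy decoding with CFG: two forward passes per token."""
    tokens = []
    for _ in range(steps):
        c = model.logits(tokens, cond)  # conditional pass
        u = model.logits(tokens, None)  # unconditional pass
        guided = [ui + scale * (ci - ui) for ci, ui in zip(c, u)]
        tokens.append(argmax(guided))
    return tokens


def sample_guidance_free(model, cond, steps):
    """Greedy decoding without guidance: one forward pass per token."""
    tokens = []
    for _ in range(steps):
        tokens.append(argmax(model.logits(tokens, cond)))
    return tokens


cfg_model, free_model = ToyModel(), ToyModel()
sample_cfg(cfg_model, cond=1, steps=8, scale=3.0)
sample_guidance_free(free_model, cond=1, steps=8)
print(cfg_model.calls, free_model.calls)  # 16 8
```

In this sketch the guided sampler needs twice as many model evaluations for the same number of tokens, which is the overhead a guidance-free model avoids.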

Features of CCA:

High performance. CCA substantially improves the guidance-free performance of all tested AR visual models, largely removing the need for CFG. (Figure below)
Convenient to deploy. CCA requires no datasets beyond the one used for pretraining.
Fast to train. CCA needs only 1 epoch of finetuning on a pretrained model to reach ideal performance (~1% of the pretraining compute).
Consistency with LLM alignment. CCA is theoretically grounded in existing LLM alignment methods and bridges the gap between visual-targeted guidance and language-targeted alignment, offering a unified framework for mixed-modal modeling.
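As a rough illustration of the alignment-style objective, the sketch below assumes the loss takes the contrastive form described in the CCA paper (arXiv 2410.09347): matched image-condition pairs from the pretraining set act as positives, and pairs with mismatched conditions act as negatives. The function names, the roll-by-one pairing, and the default `beta` and `lam` values are illustrative assumptions; consult the paper for the exact formulation and hyperparameters.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def mismatched_conditions(conds):
    # Assumed negative sampling: pair each image with another sample's
    # condition by rolling the batch by one. Duplicate labels in the batch
    # can still yield accidental matches.
    return conds[1:] + conds[:1]


def cca_loss(logp_theta_pos, logp_ref_pos, logp_theta_neg, logp_ref_neg,
             beta=1.0, lam=1.0):
    """Sketch of a CCA-style loss averaged over a batch.

    Per sample:
      -log sigmoid( beta * (log p_theta(x|c)     - log p_ref(x|c)))
      - lam * log sigmoid(-beta * (log p_theta(x|c_neg) - log p_ref(x|c_neg)))
    where p_ref is the frozen pretrained model and c_neg is a mismatched
    condition. beta and lam defaults are arbitrary illustrative values.
    """
    total = 0.0
    for tp, rp, tn, rn in zip(logp_theta_pos, logp_ref_pos,
                              logp_theta_neg, logp_ref_neg):
        pos_term = -log_sigmoid(beta * (tp - rp))
        neg_term = -lam * log_sigmoid(-beta * (tn - rn))
        total += pos_term + neg_term
    return total / len(logp_theta_pos)


# At initialization the finetuned model equals the reference model,
# so every log-ratio is zero and the loss is (1 + lam) * log(2).
ref_pos, ref_neg = [-5.0, -6.0], [-7.0, -8.0]
loss_init = cca_loss(ref_pos, ref_pos, ref_neg, ref_neg)

# Raising likelihood on matched pairs and lowering it on mismatched
# pairs reduces the loss.
loss_better = cca_loss([-4.0, -5.0], ref_pos, [-8.0, -9.0], ref_neg)
print(loss_init > loss_better)  # True
```

Because both positives and negatives are built from the pretraining data alone, a sketch like this needs no additional dataset, consistent with the deployment claim above.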

Model tree for ChenDRAG/CCA_LlamaGen

Base model: FoundationVision/LlamaGen
Finetuned: this model

Papers for ChenDRAG/CCA_LlamaGen

Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment. arXiv 2410.09347, published Oct 12, 2024.
Noise Contrastive Alignment of Language Models with Explicit Rewards. arXiv 2402.05369, published Feb 8, 2024.