arxiv:2512.24097

Factorized Learning for Temporally Grounded Video-Language Models

Published on Dec 30, 2025

Submitted by Wenzheng Zeng on Jan 1

National University of Singapore


Abstract

Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D^2VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.


Community

wenzhengzeng (paper submitter), about 14 hours ago

We tackle temporally grounded video-language understanding from a factorized perspective.

Some key takeaways:

[1] We emphasize the distinct yet causally dependent nature of temporal grounding and textual response.

[2] Our study highlights the importance of explicit event-level visual semantic capture in enhancing both grounding and textual response quality.

[3] We propose a new Factorized Preference Optimization (FPO) scheme that jointly optimizes temporal and textual factors, together with a factorized data synthesis approach to support it.
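To make the idea of jointly optimizing temporal and textual factors more concrete, here is a minimal sketch. It is not the paper's FPO objective, which this page does not spell out. It assumes a standard DPO-style preference loss in which each response's log-probability is split into a temporal grounding term and a textual response term conditioned on that grounding. The function names, the `beta` value, and the example numbers are all hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def factorized_logp(logp_grounding: float, logp_text: float) -> float:
    """Joint log-probability under the assumed factorization:
    log p(grounding, text) = log p(grounding) + log p(text | grounding).
    """
    return logp_grounding + logp_text


def dpo_style_loss(policy_chosen: float, policy_rejected: float,
                   ref_chosen: float, ref_rejected: float,
                   beta: float = 0.1) -> float:
    """Standard DPO-style loss applied to joint log-probabilities.

    Because each input already includes a grounding term, a preference
    pair that differs only in temporal grounding still yields a signal.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


# Hypothetical pair: same text likelihood, but the chosen response
# has the better temporal grounding likelihood under the policy.
chosen = factorized_logp(logp_grounding=-1.0, logp_text=-2.0)
rejected = factorized_logp(logp_grounding=-3.0, logp_text=-2.0)
reference = factorized_logp(logp_grounding=-2.0, logp_text=-2.0)

loss = dpo_style_loss(chosen, rejected, reference, reference)
print(f"loss = {loss:.4f}")
```

In this toy pair the two responses have identical text log-probabilities, so the whole preference margin comes from the grounding term. That is the property the sketch is meant to illustrate: once grounding is modeled probabilistically, it can take part in preference learning alongside the text.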

