Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Xiangyu Zeng*,1,2, Zhiqiu Zhang*,1,2, Yuhan Zhu*,1,2, Xinhao Li*,1, Zikang Wang*,3,2, Changlian Ma1,2, Qingyu Zhang1, Zizheng Huang1, Kun Ouyang4, Tianxiang Jiang5,2, Ziang Yan6,2, Yi Wang2, Hongjie Zhang2, Yali Wang7,2, Limin Wang†,1,2

1Nanjing University  2Shanghai AI Laboratory  3Shanghai Jiao Tong University  4Peking University
5University of Science and Technology of China  6Zhejiang University  7SIAT, Chinese Academy of Sciences
*Equal contribution  †Corresponding author

Abstract

Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3's strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.

Motivation: Thinking with Videos

Current Multimodal Large Language Models (MLLMs) struggle with long videos because they typically rely on uniform frame sampling and single-turn inference. This approach often dilutes critical visual evidence within redundant background content.

Video-o3 introduces a paradigm shift by mimicking human behavior. Instead of watching a video passively, it actively explores the content. The model iteratively discovers salient visual clues, inspects key segments with fine-grained detail, and adaptively terminates the search once sufficient evidence is acquired.

Overview of Video-o3

Overview of Video-o3. Guided by the user query, the model actively identifies and localizes critical visual clues using native interleaved tool invocation. It autonomously decides whether to continue searching or to conclude the reasoning process.

Key Features:

  • Goal-Driven Exploration: Unlike models that process the entire video at a uniform sampling rate, Video-o3 begins with a coarse scan and then iteratively zooms into the segments most informative for the query.
  • Native Interleaved Tool Use: The model performs "clue seeking" and "answer reasoning" within a single shared context, rather than as decoupled modules.

Method: Native Interleaved Tool Invocation

Video-o3 is designed to address two challenges in long-video processing: attention dispersion and contextual inefficiency. The framework operates in a Think-and-Tool cycle: the model generates structured directives containing temporal windows and visual token quotas, then dynamically invokes the VideoCrop tool to inspect the target segments at an adaptive spatiotemporal resolution.
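The Think-and-Tool cycle described above can be sketched as a simple loop over model-emitted directives. This is a minimal illustration, not the released implementation: the `video_crop` stand-in, the directive field names (`start`, `end`, `token_quota`), and the return format are all assumptions made here for clarity.

```python
def video_crop(start_s, end_s, token_quota):
    """Hypothetical stand-in for the VideoCrop tool. In the real system this
    would re-sample the segment [start_s, end_s] at a spatiotemporal
    resolution chosen to fit within `token_quota` visual tokens; here we
    just echo the request back as a mock observation."""
    return {"segment": (start_s, end_s), "tokens": token_quota}

def think_and_tool_cycle(directives):
    """Execute one Think-and-Tool pass: each directive pairs a temporal
    window with a visual token budget, and each tool call yields an
    observation that is appended to the shared context."""
    observations = []
    for d in directives:
        obs = video_crop(d["start"], d["end"], d["token_quota"])
        observations.append(obs)
    return observations

# Example: the model asks to inspect two candidate clue segments,
# spending more of its token budget on the first.
directives = [
    {"start": 12.0, "end": 18.0, "token_quota": 1024},
    {"start": 95.0, "end": 101.0, "token_quota": 512},
]
observations = think_and_tool_cycle(directives)
```

In the actual model these observations would be fed back into the context so the next "Think" step can decide whether to keep searching or answer.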

Architectural Overview

Architectural Overview. Video-o3 dynamically executes tool invocations based on previous reasoning to scrutinize specific clue segments. The Vision Encoder uses adaptive flexible sampling, while the LLM Decoder manages the interleaved "Think," "Tool," and "Answer" tokens.

Core Technical Innovations:

  • Task-Decoupled Attention Masking: To prevent attention dispersion, TDAM isolates per-step concentration. During clue seeking, the model attends only to the global context; during answering, it focuses on high-resolution tool observations.
  • Verifiable Trajectory-Guided Reward: To control context length and cost, we introduce a reward mechanism that balances exploration coverage with reasoning efficiency, encouraging the model to terminate precisely when evidence is sufficient.
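The masking idea behind TDAM can be illustrated with a toy boolean attention mask. The role labels (`G` for global video context, `O` for tool observations, `S` for clue-seeking tokens, `A` for answer-reasoning tokens) and the exact visibility rules below are assumptions for illustration; the paper's actual mask construction may differ.

```python
import numpy as np

# Illustrative interleaved sequence: global context, a seek step,
# a tool observation, then answer reasoning.
roles = np.array(list("GGGGSSOOAAA"))

def task_decoupled_mask(roles):
    """Build a boolean attention mask (True = may attend) that isolates
    per-step concentration while keeping all attention causal:
    seek tokens (S) attend to the shared global context (G) and earlier
    seek tokens; answer tokens (A) attend to high-resolution tool
    observations (O) and earlier answer tokens."""
    n = len(roles)
    causal = np.tril(np.ones((n, n), dtype=bool))
    allowed = np.zeros((n, n), dtype=bool)
    for i, r in enumerate(roles):
        if r == "S":
            allowed[i] = np.isin(roles, ["G", "S"])
        elif r == "A":
            allowed[i] = np.isin(roles, ["O", "A"])
        else:  # context and observation tokens keep plain causal attention
            allowed[i] = True
    return causal & allowed

mask = task_decoupled_mask(roles)
# An answer token (index 8) sees the observation (index 6)
# but is shielded from the earlier seek step (index 4).
assert mask[8, 6] and not mask[8, 4]
```

The point of the construction is that each step's query tokens only compete for attention with the tokens relevant to that step, which is one concrete way to realize "isolating per-step concentration while preserving shared global context".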
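One plausible shape for a verifiable trajectory-guided reward is a weighted sum of answer correctness, clue coverage, and a turn-efficiency term. The functional form and the weights `alpha` and `beta` below are hypothetical, chosen only to show how coverage and efficiency can be traded off.

```python
def trajectory_reward(correct, clue_hits, clue_total, num_turns, max_turns,
                      alpha=0.5, beta=0.2):
    """Hypothetical reward sketch: answer correctness (verifiable against
    ground truth) plus a bonus for covering the annotated clue segments,
    plus a bonus that shrinks as the trajectory uses more turns."""
    coverage = clue_hits / clue_total if clue_total else 0.0
    efficiency = 1.0 - num_turns / max_turns
    return float(correct) + alpha * coverage + beta * efficiency

# A correct answer that found both clues in 3 of 8 allowed turns:
r = trajectory_reward(True, clue_hits=2, clue_total=2, num_turns=3, max_turns=8)
```

Under a reward of this shape, the highest-return policy answers correctly, visits the evidence it needs, and stops as soon as that evidence suffices, which matches the stated goal of balancing exploration coverage with reasoning efficiency.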

Data Construction: Pipeline and Seeker-173K

Training a model to perform native interleaved tool invocation requires high-quality exploration trajectories, which are scarce in existing datasets. To address this, we developed a scalable automated data synthesis pipeline.

The Data Construction Pipeline

The Data Construction Pipeline. We transform "Video-Question-Answer" triplets into explicit tool exploration trajectories via a four-stage process: Clue Localization, Validity Verification, Trajectory Generation, and Logical Consistency Checks.
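The four stages can be read as a filter chain over Video-Question-Answer triplets, where each stage may discard a sample. The sketch below only mirrors that control flow; every stage body is a placeholder (the real pipeline presumably uses model-based localization and verification), and all field names are assumptions.

```python
def localize_clues(sample):
    """Stage 1 (placeholder): attach candidate clue segments to the triplet."""
    sample["clues"] = sample.get("clues", [])
    return sample

def verify_validity(sample):
    """Stage 2 (placeholder): keep the sample only if it has clue segments
    that actually support the answer."""
    return sample if sample["clues"] else None

def generate_trajectory(sample):
    """Stage 3 (placeholder): turn verified clues into an explicit tool
    exploration trajectory (think -> crop -> ... -> answer)."""
    sample["trajectory"] = ([("crop", c) for c in sample["clues"]]
                            + [("answer", sample["answer"])])
    return sample

def check_consistency(sample):
    """Stage 4 (placeholder): drop trajectories whose reasoning does not
    terminate in the ground-truth answer."""
    ok = sample["trajectory"][-1] == ("answer", sample["answer"])
    return sample if ok else None

def synthesize(triplets):
    """Run the four-stage pipeline; a sample survives only if no stage rejects it."""
    kept = []
    for s in triplets:
        for stage in (localize_clues, verify_validity,
                      generate_trajectory, check_consistency):
            s = stage(s)
            if s is None:
                break
        else:
            kept.append(s)
    return kept

triplets = [
    {"video": "v1.mp4", "question": "Who opens the door?",
     "answer": "the chef", "clues": [(31.0, 35.5)]},
    {"video": "v2.mp4", "question": "What color is the car?",
     "answer": "red", "clues": []},  # no supporting clue: filtered out
]
surviving = synthesize(triplets)
```

The filter-chain structure makes the quality guarantee explicit: an instance reaches the dataset only after passing every stage.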

About Seeker-173K:

  • Structure: The dataset is stratified into a four-quadrant taxonomy based on evidence cardinality and visual saliency, covering tasks from "Single-Clue Direct Answering" to complex "Multi-Clue Tool Invocation".
  • Quality: Quality is enforced through human verification of randomly sampled instances at every stage. The pipeline rigorously filters out flawed instances, preserving only those with sound logic grounded in factual visual evidence.

Performance

Methods  Size  VideoMME (Avg)  MLVU (M-Avg)  LVBench (Avg)  LongVideoBench (Avg)  VideoMMMU (Overall)  MMVU (M-Avg)  Video-Holmes (Avg)
Open-source Single-Turn Video MLLMs
Qwen2.5-VL 7B 65.1 70.2 45.3 56.0 47.4 61.3 34.7
LLaVA-Video 7B 63.3 70.8 - 58.2 - - -
Video-R1 7B 61.4 - - - 52.4 63.8 -
Rewatch-R1 7B 65.6 - 43.3 - 51.9 - 44.3
Video-Thinker 7B - - 37.0 - - - 43.2
Open-source Decoupled Iterative Reasoning Video MLLMs
Video-RTS 7B 63.0 - - 56.6 52.7 66.4 -
Video-MTR 7B 59.0 59.7 38.6 56.4 - - -
LOVE-R1 7B 66.2 67.4 48.2 60.1 - - -
Open-source Native Multi-turn Tool Invocation Video MLLMs
Conan 7B 60.5 63.4 39.2 56.6 - - 44.6
LongVT 7B 64.3 - 41.3 - 45.4 - -
Video-Zoomer 7B 65.2 - 41.5 57.7 52.2 - -
Video-o3 (RL) 7B 66.1 71.9 47.5 59.3 50.0 66.9 46.1
Video-o3 (SFT+RL) 7B 66.5 72.1 47.6 60.5 51.7 67.2 46.5

Comparison of our method with existing approaches on video question answering across various benchmarks. Video-o3 substantially outperforms previous methods on long-video understanding benchmarks while also performing strongly on video reasoning benchmarks.

Citation

@article{zeng2026video,
  title={Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning},
  author={Zeng, Xiangyu and Zhang, Zhiqiu and Zhu, Yuhan and Li, Xinhao and Wang, Zikang and Ma, Changlian and Zhang, Qingyu and Huang, Zizheng and Ouyang, Kun and Jiang, Tianxiang and others},
  journal={arXiv preprint arXiv:2601.23224},
  year={2026}
}