VideoKR-Qwen2.5-VL-7B

📄 ArXiv | 💻 Code | 🤗 Collection

About

This repository contains the VideoKR-Qwen2.5-VL-7B model presented in "VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding" (ICML 2026 Spotlight).

VideoKR-Qwen2.5-VL-7B is obtained through a standard SFT → GRPO pipeline on Qwen2.5-VL-7B-Instruct:

  1. Supervised fine-tuning on VideoKR-SFT-201K with CoT rationales → VideoKR-Qwen2.5-VL-7B-SFT
  2. GRPO reinforcement learning on VideoKR-RL-114K with verifiable rewards → this model
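The second stage relies on rewards that can be checked automatically. As a rough illustration only, the sketch below shows one common way such a signal is built: a rule-based reward that compares a tagged final answer against the gold answer, followed by the group-relative normalization that GRPO applies across several responses sampled for the same prompt. The `<answer>` tag format, the exact-match rule, the use of the population standard deviation, and both function names are assumptions made for this example. They are not taken from the VideoKR code, whose reward design may differ.

```python
import re
from statistics import mean, pstdev


def verifiable_reward(response: str, gold: str) -> float:
    """Return 1.0 if the tagged final answer matches the gold answer, else 0.0.

    Assumption: the model wraps its final answer in <answer>...</answer>.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0  # no parsable answer, so no reward
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than by a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one multiple-choice question whose answer is "B".
group = [
    "The graph peaks early. <answer>B</answer>",
    "<answer>A</answer>",
    "I am not sure.",
    "Reasoning... <answer> b </answer>",
]
rewards = [verifiable_reward(r, "B") for r in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the two correct responses receive positive advantages and the two incorrect ones receive negative advantages of the same magnitude. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.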

VideoKR is the first large-scale training corpus designed for knowledge- and reasoning-intensive video understanding, containing 315K video reasoning examples over 145K newly collected, CC-licensed expert-domain videos across 82 professional subjects.

Links

| Resource | Link |
|---|---|
| Training data | minuzero/VideoKR-Train |
| Evaluation data | minuzero/VideoKR-Eval |
| SFT checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B-SFT |
| SFT checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B-SFT |
| GRPO checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B |

Performance

Results with 128 input frames. Within each base-model group, bold = best, underline = second best.

| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 65.1 | 66.3 | 60.9 | 64.1 | 51.1 | 55.7 | 28.1 | 32.7 | 41.9 |
| VideoAuto-R1 | 66.8 | 70.2 | 59.7 | 65.6 | 52.1 | 55.7 | 32.7 | 36.5 | 44.3 |
| VideoKR (SFT + RL) | 66.4 | 68.9 | 61.3 | 65.5 | 52.2 | 60.5 | 32.5 | 41.2 | 46.6 |

VideoKR achieves the highest knowledge-intensive average (+4.7 over base, +2.3 over VideoAuto-R1) while remaining competitive on general video reasoning.

Results with 16 input frames (comparison with Video-R1 and VideoRFT):

| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 57.1 | 65.0 | 55.2 | 59.1 | 48.4 | 52.5 | 23.1 | 31.3 | 38.8 |
| Video-R1 | 59.7 | 65.5 | 55.3 | 60.2 | 51.1 | 53.3 | 26.6 | 28.9 | 40.0 |
| VideoRFT | 57.6 | 61.7 | 53.6 | 57.6 | 51.1 | 53.6 | 26.3 | 29.8 | 40.2 |
| VideoKR (SFT + RL) | 56.6 | 66.6 | 57.0 | 60.1 | 52.6 | 59.2 | 27.3 | 37.7 | 44.2 |

Under the 16-frame setting, VideoKR outperforms Video-R1 and VideoRFT by +4.2 and +4.0 on knowledge-intensive average, respectively.
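The knowledge-intensive averages and the gains quoted for both frame settings can be re-derived from the per-benchmark scores. The sketch below assumes that "Knowledge Avg" is the unweighted mean of VideoMMMU, MMVU, SciVidBench, and VideoKR-Eval, rounded half-up to one decimal. That assumption reproduces every value in the two tables, but the model card does not state it explicitly. The dictionary and function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores in table order: VideoMMMU, MMVU, SciVidBench, VideoKR-Eval.
# Strings keep the decimal values exact and avoid float rounding surprises.
KNOWLEDGE_128 = {
    "Qwen2.5-VL-7B-Instruct": ("51.1", "55.7", "28.1", "32.7"),
    "VideoAuto-R1": ("52.1", "55.7", "32.7", "36.5"),
    "VideoKR (SFT + RL)": ("52.2", "60.5", "32.5", "41.2"),
}
KNOWLEDGE_16 = {
    "Qwen2.5-VL-7B-Instruct": ("48.4", "52.5", "23.1", "31.3"),
    "Video-R1": ("51.1", "53.3", "26.6", "28.9"),
    "VideoRFT": ("51.1", "53.6", "26.3", "29.8"),
    "VideoKR (SFT + RL)": ("52.6", "59.2", "27.3", "37.7"),
}


def knowledge_avg(scores):
    """Unweighted mean of the benchmark scores, rounded half-up to 0.1."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def gains(table, ours="VideoKR (SFT + RL)"):
    """Difference between our rounded average and each other model's."""
    avgs = {name: knowledge_avg(scores) for name, scores in table.items()}
    return {name: avgs[ours] - avg for name, avg in avgs.items() if name != ours}


gains_128 = gains(KNOWLEDGE_128)
gains_16 = gains(KNOWLEDGE_16)
for setting, result in (("128 frames", gains_128), ("16 frames", gains_16)):
    for name, delta in result.items():
        print(f"{setting}: VideoKR vs {name}: +{delta}")
```

Note that the gains are computed from the rounded averages, which is how the figures in the text line up with the tables.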

Evaluation

# Run from the lmms_eval directory of the VideoKR code repository.
cd /path/to/VideoKR/lmms_eval
conda activate videokr_eval

# Settings read by the launch script.
export CUDA_VISIBLE_DEVICES=0
export VIDEOKR_MODEL=minuzero/VideoKR-Qwen2.5-VL-7B
export TASKS=videokr_eval
export BATCH_SIZE=1
export RUN_NAME=videokr_eval

bash examples/models/videokr_vllm.sh
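When evaluating several checkpoints or task lists, it can be convenient to drive the same script from Python. The following is a minimal sketch of that idea, not part of the VideoKR repository. The environment variable names and the script path are copied from the commands above, while `build_eval_env` and `run_eval` are hypothetical helpers. It assumes the `videokr_eval` conda environment is already active in the calling shell.

```python
import os
import subprocess


def build_eval_env(
    model="minuzero/VideoKR-Qwen2.5-VL-7B",
    tasks="videokr_eval",
    gpus="0",
    batch_size=1,
    run_name="videokr_eval",
):
    """Return a copy of the current environment with the evaluation settings."""
    env = dict(os.environ)
    env.update(
        {
            "CUDA_VISIBLE_DEVICES": gpus,
            "VIDEOKR_MODEL": model,
            "TASKS": tasks,
            "BATCH_SIZE": str(batch_size),  # environment values must be strings
            "RUN_NAME": run_name,
        }
    )
    return env


def run_eval(repo_dir, **settings):
    """Launch the evaluation script from inside the lmms_eval directory."""
    return subprocess.run(
        ["bash", "examples/models/videokr_vllm.sh"],
        cwd=repo_dir,
        env=build_eval_env(**settings),
        check=True,  # raise if the script exits with a non-zero status
    )
```

For example, `run_eval("/path/to/VideoKR/lmms_eval", model="minuzero/VideoKR-Qwen2.5-VL-7B-SFT", run_name="sft_only")` would evaluate the SFT checkpoint under a separate run name.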

Citation

If you find VideoKR useful in your research, please cite our paper:

@misc{fu2026videokrknowledgereasoningintensivevideo,
      title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding}, 
      author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
      year={2026},
      eprint={2606.05259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.05259}, 
}