How to use minuzero/VideoKR-Qwen2.5-VL-7B with Transformers:

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("minuzero/VideoKR-Qwen2.5-VL-7B")
model = AutoModelForImageTextToText.from_pretrained("minuzero/VideoKR-Qwen2.5-VL-7B")
```
VideoKR-Qwen2.5-VL-7B
📄 ArXiv | 💻 Code | 🤗 Collection
About
This repository contains the VideoKR-Qwen2.5-VL-7B model presented in VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (ICML 2026 Spotlight).
VideoKR-Qwen2.5-VL-7B is obtained through a standard SFT → GRPO pipeline on Qwen2.5-VL-7B-Instruct:
- Supervised fine-tuning on VideoKR-SFT-201K with CoT rationales → VideoKR-Qwen2.5-VL-7B-SFT
- GRPO reinforcement learning on VideoKR-RL-114K with verifiable rewards → this model
VideoKR is the first large-scale training corpus designed for knowledge- and reasoning-intensive video understanding, containing 315K video reasoning examples over 145K newly collected, CC-licensed expert-domain videos across 82 professional subjects.
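To make the second training stage more concrete, the sketch below illustrates the general idea behind GRPO with a verifiable reward: several responses are sampled for one question, each is scored by a rule, and each reward is normalized against the group's mean and standard deviation. It is not the VideoKR training code. The `<answer>...</answer>` tag format, the exact-match rule, the function names, and the use of the population standard deviation are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev


def verifiable_reward(response: str, gold: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the tagged answer matches gold, else 0.0."""
    # Assumed output format: the final answer is wrapped in <answer>...</answer>.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# One question with gold answer "B" and a group of four sampled responses.
responses = [
    "<think>...</think><answer>B</answer>",
    "<answer>C</answer>",
    "<answer> b </answer>",
    "no tagged answer",
]
rewards = [verifiable_reward(r, "B") for r in responses]
advantages = group_relative_advantages(rewards)
```

Correct responses receive a positive advantage and incorrect ones a negative advantage. When every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.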
Links
| Resource | Link |
|---|---|
| Training data | minuzero/VideoKR-Train |
| Evaluation data | minuzero/VideoKR-Eval |
| SFT checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B-SFT |
| SFT checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B-SFT |
| GRPO checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B |
Performance
Results with 128 input frames. Within each base-model group, bold = best, underline = second best.
| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 65.1 | 66.3 | 60.9 | 64.1 | 51.1 | 55.7 | 28.1 | 32.7 | 41.9 |
| VideoAuto-R1 | 66.8 | 70.2 | 59.7 | 65.6 | 52.1 | 55.7 | 32.7 | 36.5 | 44.3 |
| VideoKR (SFT + RL) | 66.4 | 68.9 | 61.3 | 65.5 | 52.2 | 60.5 | 32.5 | 41.2 | 46.6 |
VideoKR achieves the highest knowledge-intensive average (46.6; +4.7 over the base model, +2.3 over VideoAuto-R1) while remaining competitive on general video benchmarks (General Avg 65.5 vs. 65.6 for VideoAuto-R1).
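The two average columns are consistent with unweighted means of the per-benchmark scores: three general benchmarks and four knowledge-intensive ones. The card does not state the aggregation rule, so the sketch below is an inference from the 128-frame numbers, with half-up rounding to one decimal as a further assumption. Because it starts from the already-rounded table values, it is a consistency check rather than a reproduction of the original computation.

```python
from decimal import Decimal, ROUND_HALF_UP

GENERAL = ("Video-MME", "MVBench", "LongVBench")
KNOWLEDGE = ("VideoMMMU", "MMVU", "SciVidBench", "VideoKR-Eval")

# Per-benchmark scores copied from the 128-frame table (kept as strings for Decimal).
SCORES_128 = {
    "Qwen2.5-VL-7B-Instruct": ["65.1", "66.3", "60.9", "51.1", "55.7", "28.1", "32.7"],
    "VideoAuto-R1": ["66.8", "70.2", "59.7", "52.1", "55.7", "32.7", "36.5"],
    "VideoKR (SFT + RL)": ["66.4", "68.9", "61.3", "52.2", "60.5", "32.5", "41.2"],
}


def average(scores, benchmarks):
    """Unweighted mean over the given benchmarks, rounded half-up to one decimal."""
    total = sum(Decimal(scores[name]) for name in benchmarks)
    return (total / len(benchmarks)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = {}
for model, values in SCORES_128.items():
    by_benchmark = dict(zip(GENERAL + KNOWLEDGE, values))
    averages[model] = (average(by_benchmark, GENERAL), average(by_benchmark, KNOWLEDGE))

# Knowledge-average gains of VideoKR over the other two rows.
videokr_knowledge = averages["VideoKR (SFT + RL)"][1]
gain_over_base = videokr_knowledge - averages["Qwen2.5-VL-7B-Instruct"][1]
gain_over_videoauto = videokr_knowledge - averages["VideoAuto-R1"][1]

for model, (general_avg, knowledge_avg) in averages.items():
    print(f"{model}: General Avg {general_avg}, Knowledge Avg {knowledge_avg}")
```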
Results with 16 input frames (comparison with Video-R1 and VideoRFT):
| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 57.1 | 65.0 | 55.2 | 59.1 | 48.4 | 52.5 | 23.1 | 31.3 | 38.8 |
| Video-R1 | 59.7 | 65.5 | 55.3 | 60.2 | 51.1 | 53.3 | 26.6 | 28.9 | 40.0 |
| VideoRFT | 57.6 | 61.7 | 53.6 | 57.6 | 51.1 | 53.6 | 26.3 | 29.8 | 40.2 |
| VideoKR (SFT + RL) | 56.6 | 66.6 | 57.0 | 60.1 | 52.6 | 59.2 | 27.3 | 37.7 | 44.2 |
Under the 16-frame setting, VideoKR outperforms Video-R1 and VideoRFT on the knowledge-intensive average by +4.2 and +4.0, respectively.
Evaluation
```bash
cd /path/to/VideoKR/lmms_eval
conda activate videokr_eval

export CUDA_VISIBLE_DEVICES=0
export VIDEOKR_MODEL=minuzero/VideoKR-Qwen2.5-VL-7B
export TASKS=videokr_eval
export BATCH_SIZE=1
export RUN_NAME=videokr_eval

bash examples/models/videokr_vllm.sh
```
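If you want to evaluate more than one checkpoint, the same settings can be assembled programmatically. The sketch below is a hypothetical convenience wrapper, not part of the VideoKR repository. It assumes the launch script reads only the environment variables shown above and that it is run from inside the activated `videokr_eval` environment. The helper names and the `videokr_eval_sft` run name are made up for illustration.

```python
import os
import subprocess

EVAL_SCRIPT = ["bash", "examples/models/videokr_vllm.sh"]


def build_eval_env(model_id, tasks="videokr_eval", batch_size=1, run_name=None, gpus="0"):
    """Return a copy of the current environment with the evaluation variables set."""
    env = dict(os.environ)
    env.update(
        {
            "CUDA_VISIBLE_DEVICES": gpus,
            "VIDEOKR_MODEL": model_id,
            "TASKS": tasks,
            "BATCH_SIZE": str(batch_size),  # environment values must be strings
            "RUN_NAME": run_name or tasks,
        }
    )
    return env


def run_eval(model_id, repo_dir, dry_run=True, **settings):
    """Launch the evaluation script, or only return the command and env if dry_run."""
    env = build_eval_env(model_id, **settings)
    if not dry_run:
        subprocess.run(EVAL_SCRIPT, cwd=repo_dir, env=env, check=True)
    return EVAL_SCRIPT, env


# Dry run over the two Qwen2.5-VL checkpoints listed in the Links table.
checkpoints = {
    "videokr_eval": "minuzero/VideoKR-Qwen2.5-VL-7B",
    "videokr_eval_sft": "minuzero/VideoKR-Qwen2.5-VL-7B-SFT",  # hypothetical run name
}
planned = {
    run_name: run_eval(model_id, "/path/to/VideoKR/lmms_eval", run_name=run_name)
    for run_name, model_id in checkpoints.items()
}
```

With `dry_run=True` nothing is executed. Pass `dry_run=False` once the repository path points at a real checkout.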
Citation
If you find VideoKR useful in your research, please cite our paper:
```bibtex
@misc{fu2026videokrknowledgereasoningintensivevideo,
      title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding},
      author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
      year={2026},
      eprint={2606.05259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.05259},
}
```