Add pipeline tag and sample usage (#1)

- Add pipeline tag and sample usage (738047a9b13c2f3021f0367ebf95f02f1edcc8a3)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)

README.md +76 -1

@@ -1,7 +1,8 @@
 ---
-license: mit
 language:
 - en
+license: mit
+pipeline_tag: video-text-to-text
 ---
 # Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
@@ -11,6 +12,80 @@ This is the model checkpoint of the GeoVR, a paradigm to restructure MLLM’s in
 - **GitHub:** [https://github.com/WHB139426/GeoVR-MLLM](https://github.com/WHB139426/GeoVR-MLLM)
 - **Paper:** [https://arxiv.org/abs/2606.05833](https://arxiv.org/abs/2606.05833)
+## Sample Usage
+To use this model, first clone the [official repository](https://github.com/WHB139426/GeoVR-MLLM): the snippet imports the custom modeling code (`models.qwen3vl_geo`) and helpers (`utils.utils`) from it.
+```python
+import torch
+from utils.utils import *
+from transformers import AutoProcessor
+from models.qwen3vl_geo import Qwen3VLForConditionalGeneration
+device = 'cuda:0'
+model_id = "WHB139426/GeoVR-VGGT-Qwen3-VL-2B"
+model = Qwen3VLForConditionalGeneration.from_pretrained(
+    model_id,
+    geometry_encoder_path=None,
+    metric_model_path=None,
+    dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",
+    add_camera=False,
+    add_scale=False,
+    add_depth=False,
+    distill_geometry_feature=False,
+)
+model.load_geometric_weights(model_id)
+model.to(device)
+num_frames = 32
+processor = AutoProcessor.from_pretrained(model_id)
+processor.video_processor.size = {"longest_edge": 384*num_frames*32*32, "shortest_edge": 4*num_frames*32*32}
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "video", "video": './assets/scene0111_02.mp4',},
+            {"type": "text", "text": "Measuring distance from the nearest points, select the closest object (trash bin, door, table, refrigerator) to the tv. If multiple exist, use the nearest instance.\nOptions:\nA. trash bin\nB. door\nC. table\nD. refrigerator\nAnswer with the option's letter from the given choices directly."},
+        ],
+    }
+]
+generation_kwargs = {
+    'do_sample': True,
+    'top_p': 0.8,
+    'top_k': 20,
+    'temperature': 0.7,
+    'repetition_penalty': 1.0,
+    'max_new_tokens': 32*1024,
+}
+inputs = processor.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_dict=True,
+    return_tensors="pt",
+    num_frames=num_frames,
+    fps=None,
+    enable_thinking=False,
+).to(model.device)
+with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+    with torch.inference_mode():
+        generated_ids = model.generate(**inputs, **generation_kwargs)
+        output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0].strip()
+print(output_text)
+```
 ## Citation
 If you find this work useful, please consider citing:
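
Review note on the decoding step in the sample: `generate` on decoder-only Hugging Face models normally returns the prompt tokens followed by the new tokens, so decoding `generated_ids` directly may echo the question before the answer. Whether the custom `Qwen3VLForConditionalGeneration` in this repository behaves the same way is an assumption, not something the diff confirms. A minimal sketch of the usual trimming step, using plain Python lists in place of tensors and a hypothetical helper name `trim_prompt_tokens`:

```python
from typing import List, Sequence


def trim_prompt_tokens(
    input_ids: Sequence[Sequence[int]],
    generated_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Drop each row's prompt prefix, keeping only the newly generated ids.

    Hypothetical helper for illustration; assumes every generated row
    starts with the corresponding prompt row.
    """
    trimmed = []
    for prompt, full in zip(input_ids, generated_ids):
        # Everything past the prompt length is newly generated output.
        trimmed.append(list(full[len(prompt):]))
    return trimmed


# Toy ids standing in for the prompt batch and the generate() output.
prompt_batch = [[101, 7, 8, 9]]
full_batch = [[101, 7, 8, 9, 42, 43]]
new_tokens = trim_prompt_tokens(prompt_batch, full_batch)
```

In the sample, the same slicing would apply to `inputs["input_ids"]` and `generated_ids` before the call to `processor.batch_decode`, provided the assumption above holds.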