WHB139426 nielsr HF Staff committed on
Commit
d7da855
·
1 Parent(s): e38e85b

Add pipeline tag and sample usage (#1)

- Add pipeline tag and sample usage (738047a9b13c2f3021f0367ebf95f02f1edcc8a3)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +76 -1
README.md CHANGED
@@ -1,7 +1,8 @@
  ---
- license: mit
  language:
  - en
  ---

  # Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
@@ -11,6 +12,80 @@ This is the model checkpoint of the GeoVR, a paradigm to restructure MLLM’s in
  - **GitHub:** [https://github.com/WHB139426/GeoVR-MLLM](https://github.com/WHB139426/GeoVR-MLLM)
  - **Paper:** [https://arxiv.org/abs/2606.05833](https://arxiv.org/abs/2606.05833)

  ## Citation

  If you find this work useful, please consider citing:
 
  ---
  language:
  - en
+ license: mit
+ pipeline_tag: video-text-to-text
  ---

  # Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
@@ -11,6 +12,80 @@
  - **GitHub:** [https://github.com/WHB139426/GeoVR-MLLM](https://github.com/WHB139426/GeoVR-MLLM)
  - **Paper:** [https://arxiv.org/abs/2606.05833](https://arxiv.org/abs/2606.05833)

+ ## Sample Usage
+
+ To use this model, you need to clone the [official repository](https://github.com/WHB139426/GeoVR-MLLM) to access the custom modeling files.
+
+ ```python
+ import torch
+ from transformers import AutoProcessor
+
+ # Both imports below come from the cloned GeoVR-MLLM repository.
+ from utils.utils import *
+ from models.qwen3vl_geo import Qwen3VLForConditionalGeneration
+
+ device = "cuda:0"
+ model_id = "WHB139426/GeoVR-VGGT-Qwen3-VL-2B"
+
+ # Load the checkpoint; the geometric weights are loaded in a separate call below.
+ model = Qwen3VLForConditionalGeneration.from_pretrained(
+     model_id,
+     geometry_encoder_path=None,
+     metric_model_path=None,
+     dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+     add_camera=False,
+     add_scale=False,
+     add_depth=False,
+     distill_geometry_feature=False,
+ )
+ model.load_geometric_weights(model_id)
+ model.to(device)
+
+ # Sample 32 frames and scale the video processor's size limits with the frame count.
+ num_frames = 32
+ processor = AutoProcessor.from_pretrained(model_id)
+ processor.video_processor.size = {
+     "longest_edge": 384 * num_frames * 32 * 32,
+     "shortest_edge": 4 * num_frames * 32 * 32,
+ }
+
+ # The question is a single string; "\n" keeps each option on its own line.
+ question = (
+     "Measuring distance from the nearest points, select the closest object "
+     "(trash bin, door, table, refrigerator) to the tv. "
+     "If multiple exist, use the nearest instance.\n"
+     "Options:\n"
+     "A. trash bin\n"
+     "B. door\n"
+     "C. table\n"
+     "D. refrigerator\n"
+     "Answer with the option's letter from the given choices directly."
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "video", "video": "./assets/scene0111_02.mp4"},
+             {"type": "text", "text": question},
+         ],
+     }
+ ]
+
+ generation_kwargs = {
+     "do_sample": True,
+     "top_p": 0.8,
+     "top_k": 20,
+     "temperature": 0.7,
+     "repetition_penalty": 1.0,
+     "max_new_tokens": 32 * 1024,
+ }
+
+ # Build model inputs from the chat template, sampling num_frames frames from the video.
+ inputs = processor.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_dict=True,
+     return_tensors="pt",
+     num_frames=num_frames,
+     fps=None,
+     enable_thinking=False,
+ ).to(model.device)
+
+ # Generate without gradient tracking under bfloat16 autocast, then decode the first sequence.
+ with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+     generated_ids = model.generate(**inputs, **generation_kwargs)
+ output_text = processor.batch_decode(
+     generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )[0].strip()
+ print(output_text)
+ ```
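The sample's imports (`utils.utils`, `models.qwen3vl_geo`) resolve only when the cloned repository is on Python's import path, for example when the script runs from the repository root. As a minimal sketch, assuming the repository has been cloned locally and that `utils/` and `models/` sit at its top level (inferred from the import statements, not confirmed by the source), a hypothetical helper `add_repo_to_path` could make them importable from another working directory:

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Put a local GeoVR-MLLM clone at the front of sys.path (hypothetical helper)."""
    repo = Path(repo_dir).expanduser().resolve()
    # Assumption: the sample's imports imply top-level `utils/` and `models/` directories.
    missing = [name for name in ("utils", "models") if not (repo / name).is_dir()]
    if missing:
        raise FileNotFoundError(f"{repo} is missing expected directories: {missing}")
    # Insert only once so repeated calls do not grow sys.path.
    if str(repo) not in sys.path:
        sys.path.insert(0, str(repo))
    return repo
```

Calling `add_repo_to_path("path/to/GeoVR-MLLM")` before the sample's imports would then let them resolve; the path is a placeholder for wherever the clone lives.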
+
  ## Citation

  If you find this work useful, please consider citing: