gemma-4-31B-it-nvfp4

NVFP4 quantized version of google/gemma-4-31B-it (31B params, server model). Produced and maintained by vrfai.

Quantization Details

This model was quantized using NVIDIA ModelOpt with the following configurations:

Property Value
Base model google/gemma-4-31B-it
Quant method NVIDIA ModelOpt (NVFP4)
Weight scheme 4-bit float, block size 16
Input activation 4-bit float, block size 16
Calibration dataset CNN DailyMail (512 samples, max_seq_len 1024)
Size ~30 GB (vs ~58 GB BF16)

Excluded from Quantization

The following modules are kept in full precision (BF16) to preserve accuracy:

  • lm_head
  • model.embed_vision*
  • All self_attn layers (layers 0–59)

Usage

You can deploy this model using vLLM with the modelopt quantization backend. Please ensure you refer to the vLLM documentation for Gemma 4 for advanced serving options.

vllm serve vrfai/gemma-4-31B-it-nvfp4 \
  --quantization modelopt_fp4 \
  --max-model-len 32768 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --async-scheduling \
  --trust-remote-code

Quantization Script

The recipes and scripts used to quantize this model can be found in the following repository:

Downloads last month
47
Safetensors
Model size
21B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for vrfai/gemma-4-31B-it-nvfp4

Quantized
(222)
this model

Collection including vrfai/gemma-4-31B-it-nvfp4