---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- mistralai/Ministral-3-14B-Instruct-2512-BF16
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- FP4
---

# Ministral-3-14B-Instruct-2512-NVFP4

## Model Overview
- **Model Architecture:** MistralForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Intended Use Cases:**
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 05/21/2025
- **Version:** 1.0
- **Model Developers:** RedHat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights and activations of [mistralai/Ministral-3-14B-Instruct-2512-BF16](https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512-BF16) to the FP4 data type.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Ministral-3-14B-Instruct-2512-NVFP4"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.15, top_p=1.0, top_k=20, min_p=0, max_tokens=65536)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a single-turn conversation and render it with the model's chat template.
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from datasets import load_dataset
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "mistralai/Ministral-3-14B-Instruct-2512-BF16"

model = Mistral3ForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")
tokenizer = MistralCommonBackend.from_pretrained(MODEL_ID)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    weight_observer="mse",
    ignore=[
        "re:.*lm_head",
        "re:.*vision_tower.*",
        "re:.*multi_modal_projector.*",
        "re:.*self_attn",
    ],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
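
The `ignore` list in the recipe decides which `Linear` modules are left unquantized. The sketch below illustrates how such patterns might select modules. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries are exact names; the helper `is_ignored` and the module names are hypothetical and are not taken from llm-compressor or from the model.

```python
import re

# Ignore patterns copied from the recipe; the "re:" prefix marks a regex.
IGNORE = [
    "re:.*lm_head",
    "re:.*vision_tower.*",
    "re:.*multi_modal_projector.*",
    "re:.*self_attn",
]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if any ignore pattern matches the module name.

    Assumption: "re:" entries are regexes anchored at the start of the name,
    and all other entries must equal the name exactly.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.gate_proj",
    "model.vision_tower.transformer.layers.0.feed_forward.up_proj",
    "model.multi_modal_projector.linear_1",
    "lm_head",
]

for name in names:
    status = "skipped" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```

Under this reading, the attention projections, the vision tower, the multimodal projector, and the output head stay in their original precision, and only the remaining linear layers are quantized.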

## Evaluation

The model was evaluated on IFEval and MMMU using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using [lighteval](https://github.com/neuralmagic/lighteval/tree/reasoning).
[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.

**lm-evaluation-harness**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Ministral-3-14B-Instruct-2512-NVFP4",dtype=auto,gpu_memory_utilization=0.7,max_model_len=262144,enable_chunk_prefill=True,tensor_parallel_size=1 \
  --tasks ifeval,mmmu_val \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**lighteval**

litellm_config.yaml
```yaml
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Ministral-3-14B-Instruct-2512-NVFP4"
  base_url: "http://0.0.0.0:8000/v1"
  api_key: ""
  timeout: 1200
  concurrent_requests: 16
  generation_parameters:
    temperature: 0.15
    max_new_tokens: 65536
    top_p: 0.95
    seed: 0
```

```
lighteval endpoint litellm litellm_config.yaml "aime25"
```

```
lighteval endpoint litellm litellm_config.yaml "math_500"
```

```
lighteval endpoint litellm litellm_config.yaml "gpqa:diamond"
```

### Accuracy

| Category | Benchmark | Ministral-3-14B-Instruct-2512-BF16 | Ministral-3-14B-Instruct-2512-NVFP4 (this model) | Recovery |
|---|---|---|---|---|
| Vision | MMMU | 55.33 | 52.37 | 94.65% |
| OpenLLM v2 | IFEval | 77.34 | 63.55 | 82.17% |
| Reasoning (generation) | AIME 2025 | 36.67 | 32.5 | 88.63% |
| | GPQA diamond | 58.59 | 60.94 | 104.02% |
| | Math-lvl-5 | 88.6 | 85.80 | 96.84% |
| | Average | 61.29 | 59.75 | 97.49% |
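
The Recovery column can be checked with a few lines of arithmetic. The sketch below assumes that recovery is the quantized score expressed as a percentage of the BF16 baseline score, which agrees with the MMMU and IFEval rows; the function name `recovery` is illustrative and does not come from any evaluation tool.

```python
# Scores copied from the accuracy table: (BF16 baseline, NVFP4 quantized).
scores = {
    "MMMU": (55.33, 52.37),
    "IFEval": (77.34, 63.55),
    "AIME 2025": (36.67, 32.5),
    "GPQA diamond": (58.59, 60.94),
    "Math-lvl-5": (88.6, 85.80),
}


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized / baseline


for name, (bf16, nvfp4) in scores.items():
    print(f"{name}: {recovery(bf16, nvfp4):.2f}%")
```

A value above 100% means the quantized model scored higher than the baseline on that benchmark, which can happen within the run-to-run noise of generation-based evaluations.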