
🇯🇵 LFM2.5-1.2B-JP-202606-ONNX

ONNX export of LFM2.5-1.2B-JP-202606 for cross-platform deployment via ONNX Runtime, Transformers.js, and the WebGPU stack. Same weights, same chat template — just compiled into ONNX graphs at multiple precisions, including a WebGPU-friendly INT4 + FP16 mix.

LiquidAI/LFM2.5-1.2B-JP-202606 is our general-purpose Japanese chat model. This repository is its ONNX export and supports execution with ONNX Runtime, Transformers.js, and WebGPU. The weights and chat template are identical to the original.

📦 Files

| File | Format | Embedding | Weights | Cache / activations | Approx. size |
|---|---|---|---|---|---|
| onnx/model.onnx | FP32 | FP32 | FP32 | FP32 | 4.7 GB |
| onnx/model_fp16.onnx | FP16 | FP16 | FP16 | FP16 | 2.4 GB |
| onnx/model_q4.onnx | INT4 | INT4 (GatherBlockQuantized) | INT4 (MatMulNBits) | FP32 | 834 MB |
| onnx/model_q4f16.onnx | INT4 + FP16 | INT4 + FP16 scales | INT4 + FP16 scales | FP16 | 744 MB |
| onnx/model_q4f32.onnx | INT4 (MatMul-only) | FP32 (kept) | INT4 (MatMulNBits) | FP32 | 1.2 GB |
| onnx/model_q8.onnx | INT8 (MatMul-only) | FP32 (kept) | INT8 (MatMulNBits) | FP32 | 1.8 GB |

model_q4f16.onnx is the recommended variant for WebGPU: INT4 weights with FP16 scales, FP16 KV cache and conv state I/O, FP32 logits via an inserted Cast — the format Transformers.js targets for browser inference.
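
As a rough sanity check on the sizes listed above, file size should be close to the parameter count times the bits stored per weight. The sketch below uses the 1.17B parameter count given under Model Details. For the INT4 case it assumes a block size of 32 with one FP16 scale per block. The block size is an assumption, not something this card states, and graph overhead and unquantized tensors are ignored.

```python
PARAMS = 1.17e9  # parameter count from the Model Details section


def estimate_size_gb(params, bits_per_weight, block_size=None, scale_bits=0):
    """Rough on-disk size in GB (10^9 bytes), ignoring graph overhead."""
    bits = bits_per_weight
    if block_size:
        # One scale value is shared by every block of `block_size` weights.
        bits += scale_bits / block_size
    return params * bits / 8 / 1e9


fp32 = estimate_size_gb(PARAMS, 32)
fp16 = estimate_size_gb(PARAMS, 16)
# Assumed quantization layout: 4-bit weights, block size 32, FP16 scales.
q4f16 = estimate_size_gb(PARAMS, 4, block_size=32, scale_bits=16)

print(f"FP32 ~{fp32:.2f} GB, FP16 ~{fp16:.2f} GB, INT4 + FP16 scales ~{q4f16:.2f} GB")
```

The FP32 and FP16 estimates land close to the listed 4.7 GB and 2.4 GB. The INT4 files are somewhat larger than this bare estimate, presumably because of zero points, tensors left unquantized, and graph metadata.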

Each .onnx file ships its weights in one or more .onnx_data chunks (≤ 2 GB each, per the ONNX external-data convention).
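
Because the weights live outside the graph file, a partial download has to fetch the chunks together with the .onnx file. ONNX Runtime resolves external data relative to the directory of the graph file, so both must end up in the same folder. A minimal sketch of selecting the right files from a repository listing follows. The chunk naming pattern (`<file>.onnx_data`, `<file>.onnx_data_1`, ...) and the listing itself are assumptions for illustration.

```python
def files_for_variant(repo_files, model_file):
    """Return the graph file plus any external-data chunks that share its name."""
    # Assumption: chunks are named by appending "_data" (and optionally an
    # index) to the graph file name, e.g. "model_q4.onnx_data".
    return sorted(
        f for f in repo_files
        if f == model_file or f.startswith(model_file + "_data")
    )


# Hypothetical listing, for illustration only.
listing = [
    "onnx/model_q4.onnx",
    "onnx/model_q4.onnx_data",
    "onnx/model_q4f16.onnx",
    "onnx/model_q4f16.onnx_data",
    "tokenizer.json",
]

needed = files_for_variant(listing, "onnx/model_q4.onnx")
print(needed)
```

Matching on the full file name plus `_data` keeps `model_q4.onnx` from also pulling in the `model_q4f16.onnx` chunks.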

🏃 Inference

Transformers.js (browser / Node.js, WebGPU)

import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "LiquidAI/LFM2.5-1.2B-JP-202606-ONNX",
  { dtype: "q4f16", device: "webgpu" }
);

const messages = [
  { role: "system", content: "You are a helpful assistant trained by Liquid AI." },
  { role: "user", content: "日本の首都は?" },
];

const output = await generator(messages, {
  max_new_tokens: 256,
  do_sample: true,
  temperature: 0.1,
  top_k: 50,
  repetition_penalty: 1.05,
});
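// With chat-style input, generated_text holds the whole conversation;
// the last entry is the assistant's reply.
console.log(output[0].generated_text.at(-1).content);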

ONNX Runtime (Python)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

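REPO = "LiquidAI/LFM2.5-1.2B-JP-202606-ONNX"
tokenizer = AutoTokenizer.from_pretrained(REPO)
# The path below is local: download or clone the repo first so that
# onnx/model_q4.onnx and its .onnx_data chunks sit in the same directory.
session = ort.InferenceSession("onnx/model_q4.onnx", providers=["CPUExecutionProvider"])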

# Map ORT type names to numpy dtypes so fp16 / q4f16 variants work too.
ORT_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "日本の首都は?"}],
    tokenize=False,
    add_generation_prompt=True,
)
input_ids = np.array([tokenizer.encode(prompt, add_special_tokens=False)], dtype=np.int64)
seq_len = input_ids.shape[1]

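feed = {
    "input_ids": input_ids,
    "attention_mask": np.ones((1, seq_len), dtype=np.int64),
    "position_ids": np.arange(seq_len, dtype=np.int64).reshape(1, -1),
}
# Keep only the inputs this graph declares; ONNX Runtime rejects unknown
# input names, and position_ids may be absent from some exports.
declared = {inp.name for inp in session.get_inputs()}
feed = {name: value for name, value in feed.items() if name in declared}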
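# Zero-initialise the remaining inputs (KV cache and conv state) for the first step.
# Dynamic axes become 1, except past-sequence axes, which start empty (0) so the
# cache length stays consistent with attention_mask. Matching "sequence" in the
# axis name is an assumption about this export; inspect session.get_inputs().
for inp in session.get_inputs():
    if inp.name not in feed:
        shape = [
            d if isinstance(d, int) else (0 if "sequence" in str(d).lower() else 1)
            for d in inp.shape
        ]
        feed[inp.name] = np.zeros(shape, dtype=ORT_DTYPE[inp.type])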

logits = session.run(None, feed)[0]
next_id = int(np.argmax(logits[0, -1]))
print(tokenizer.decode([next_id]))

For full multi-turn generation with stateful KV cache feedback, see the LiquidONNX inference example (works against this repo unchanged).
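
To give a sense of what cache feedback involves, here is a minimal greedy-decoding sketch. It is not the LiquidONNX example itself. It assumes the first output is the logits, that state outputs are named `present_conv.N` and `present.N.key` / `present.N.value`, and that the matching inputs are named `past_conv.N` and `past_key_values.N.key` / `past_key_values.N.value`. Those names and the `run_step` callable are assumptions; check `session.get_inputs()` and `session.get_outputs()` for the real ones.

```python
import numpy as np


def feed_back_cache(cache, output_names, outputs):
    """Copy each present-state output into the matching past-state input slot."""
    for name, value in zip(output_names[1:], outputs[1:]):  # index 0 is the logits
        # Assumed naming convention (see the lead-in above).
        past_name = name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
        if past_name in cache:
            cache[past_name] = value
    return cache


def greedy_decode(run_step, prompt_ids, cache, output_names, eos_id, max_new_tokens=32):
    """Greedy loop: the full prompt on the first step, then one token per step."""
    generated = []
    step_ids = np.array([prompt_ids], dtype=np.int64)
    total_len = len(prompt_ids)
    for _ in range(max_new_tokens):
        feed = {
            "input_ids": step_ids,
            # The mask covers cached tokens plus the tokens fed in this step.
            "attention_mask": np.ones((1, total_len), dtype=np.int64),
            **cache,
        }
        # position_ids is omitted here; add it if the graph declares that input.
        outputs = run_step(feed)
        next_id = int(np.argmax(outputs[0][0, -1]))
        if next_id == eos_id:
            break
        generated.append(next_id)
        cache = feed_back_cache(cache, output_names, outputs)
        step_ids = np.array([[next_id]], dtype=np.int64)
        total_len += 1
    return generated
```

With ONNX Runtime, `run_step` could be `lambda feed: session.run(None, feed)` and `output_names` could be `[o.name for o in session.get_outputs()]`, with `cache` holding the zero-initialised state tensors from the example above.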

🗒️ Model Details

LFM2.5-1.2B-JP-202606 is a general-purpose Japanese-capable chat model:

  • Number of parameters: 1.17B
  • Number of layers: 16 (10 double-gated LIV convolution blocks + 6 GQA blocks)
  • Context length: 32,768 tokens
  • Vocabulary size: 65,536
  • Knowledge cutoff: Mid-2024
  • Languages: English, Japanese
  • Recommended generation parameters:
    • temperature: 0.1
    • top_k: 50
    • repetition_penalty: 1.05
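
To illustrate what the three recommended parameters do to the logits, here is a minimal NumPy sampler. It is a sketch of the usual definitions (penalise repeated tokens, divide by the temperature, keep the top k, then sample), not the exact code path of any particular runtime. The function name `sample_next` is hypothetical.

```python
import numpy as np


def sample_next(logits, generated_ids, temperature=0.1, top_k=50,
                repetition_penalty=1.05, rng=None):
    """Sample one token id from a 1-D vector of next-token logits."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty: make tokens that were already generated less likely.
    seen = np.unique(np.asarray(generated_ids, dtype=np.int64))
    if seen.size:
        vals = logits[seen]
        logits[seen] = np.where(vals > 0, vals / repetition_penalty, vals * repetition_penalty)

    # Temperature: values below 1 sharpen the distribution.
    logits = logits / temperature

    # Top-k: drop everything scoring below the k-th highest logit.
    k = min(top_k, logits.size)
    kth_value = np.sort(logits)[-k]
    logits[logits < kth_value] = -np.inf

    # Softmax over the surviving tokens, then draw one.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```

At a temperature of 0.1 the distribution is sharp enough that sampling behaves almost like greedy decoding, while the mild repetition penalty discourages loops.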

Refer to the base model card for benchmark scores, training details, and use-case recommendations.

| Model | Description |
|---|---|
| LFM2.5-1.2B-JP-202606 | Original checkpoint in native format. Best for fine-tuning or inference with Transformers and vLLM. |
| LFM2.5-1.2B-JP-202606-GGUF | Quantized format for llama.cpp and compatible tools. |
| LFM2.5-1.2B-JP-202606-ONNX | ONNX Runtime format for cross-platform deployment (ORT, Transformers.js, WebGPU). |
| LFM2.5-1.2B-JP-202606-MLX-8bit | MLX format for Apple Silicon. |

We recommend this model for agentic workflows, tool use, structured outputs, bilingual English–Japanese assistants, and on-device personal-assistant applications. It is not recommended for knowledge-intensive tasks. It performs best when given clear, explicit instructions that define the task, expected behavior, and output format.

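Recommended uses are agentic workflows, tool use, structured output, bilingual Japanese–English assistants, and on-device personal assistants. The model is not recommended for tasks that require detailed knowledge. It performs best when the task, the expected behavior, and the output format are specified clearly and concretely.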

Chat Template

LFM2.5 uses a ChatML-like format. See the Chat Template documentation for details.

<|startoftext|><|im_start|>system
You are a helpful assistant trained by Liquid AI.<|im_end|>
<|im_start|>user
日本の首都は?<|im_end|>
<|im_start|>assistant

Use tokenizer.apply_chat_template() to format messages automatically — the included tokenizer.json and chat_template.jinja work unchanged across Transformers, Transformers.js, and ORT.
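
For illustration, the layout shown above can be reproduced by hand for plain role/content messages. This is only a sketch of the visible format. The bundled chat_template.jinja remains the source of truth, and details such as the newline after the final `assistant` tag are assumptions read off the example. The helper name `render_chat` is hypothetical.

```python
def render_chat(messages, add_generation_prompt=True, bos_token="<|startoftext|>"):
    """Format simple role/content messages in the ChatML-like layout shown above."""
    text = bos_token
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        text += "<|im_start|>assistant\n"
    return text


prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant trained by Liquid AI."},
    {"role": "user", "content": "日本の首都は?"},
])
print(prompt)
```

In practice `tokenizer.apply_chat_template()` is preferable, since it also covers cases this sketch ignores, such as tool definitions.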

Tool Use

Tool use follows the same Pythonic function-call protocol as the base model: <|tool_call_start|>[fn(...)]<|tool_call_end|>. See the Tool Use documentation for the full guide.
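
A response in this format can be parsed without executing it by reading the payload as a Python expression. The sketch below assumes calls use keyword arguments with literal values. The function name `get_weather` and its arguments are hypothetical, and positional arguments are ignored for brevity.

```python
import ast
import re

TOOL_CALL = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL)


def parse_tool_calls(text):
    """Return (function_name, keyword_arguments) pairs found in a model response."""
    calls = []
    for payload in TOOL_CALL.findall(text):
        tree = ast.parse(payload.strip(), mode="eval")  # parsed only, never executed
        if not isinstance(tree.body, ast.List):
            continue
        for node in tree.body.elts:
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                # literal_eval accepts only literals, so arbitrary code is rejected.
                kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
                calls.append((node.func.id, kwargs))
    return calls


# Hypothetical model output, for illustration only.
reply = '<|tool_call_start|>[get_weather(city="Tokyo", unit="celsius")]<|tool_call_end|>'
print(parse_tool_calls(reply))
```

Using `ast` rather than `eval` means the model's output is only inspected, which matters when the text comes from an untrusted generation.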

🛠️ How this export was produced

These ONNX artifacts are produced by the Liquid4All/onnx-export toolchain:

uv run lfm2-export LiquidAI/LFM2.5-1.2B-JP-202606 --precision
# (plus a one-shot Q4 → Q4F16 conversion using lfm2_moe.export.convert_q4_to_fp16)

Each variant is verified against the PyTorch reference on a coherence-test prompt suite before publication.
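
The test suite itself is not described here. As an illustration of what comparing an export with a reference can look like, the sketch below computes two simple metrics on logits from the same prompt: how often both models choose the same next token, and the largest numeric gap. This is an assumed style of check, not the toolchain's actual procedure.

```python
import numpy as np


def top1_agreement(reference_logits, exported_logits):
    """Fraction of positions where both models pick the same next token."""
    ref = np.argmax(np.asarray(reference_logits), axis=-1)
    exp = np.argmax(np.asarray(exported_logits), axis=-1)
    return float(np.mean(ref == exp))


def max_abs_diff(reference_logits, exported_logits):
    """Largest element-wise gap between the two logit tensors."""
    ref = np.asarray(reference_logits, dtype=np.float64)
    exp = np.asarray(exported_logits, dtype=np.float64)
    return float(np.max(np.abs(ref - exp)))
```

Quantized variants are expected to show larger numeric gaps than FP32, so token-level agreement is usually the more informative of the two.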

📬 Contact

Citation

@article{liquidai2025lfm2,
  title={LFM2 Technical Report},
  author={Liquid AI},
  journal={arXiv preprint arXiv:2511.23404},
  year={2025}
}