Gemma4-31B-IT-Abliterated
DuoNeural Research — Archon, Jesse Caldwell, Aura | 2026-06-06
Abliterated version of google/gemma-4-31B-it with refusal behaviors removed via orthogonal rank-1 projection. Licensed Apache-2.0, free to use and redistribute.
What is Abliteration?
Abliteration (Arditi et al. 2024; mlabonne) removes refusal-generating weight directions from a language model using orthogonal projection:
W_modified = W - α × (W @ d̂) ⊗ d̂ # input projection
W_modified = W - α × d̂ ⊗ (d̂ @ W) # output projection
where d̂ is the unit refusal direction extracted from harmful/harmless contrastive activations, and α controls projection strength.
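As a minimal sketch of the two formulas above, the following NumPy snippet applies the rank-1 projection to random matrices. It assumes the `y = W x` convention, with weights stored as `(out_features, in_features)`; the function names and toy shapes are illustrative and not taken from the release code.

```python
import numpy as np

def ablate_input(W, d, alpha):
    """Input projection: W - alpha * (W @ d_hat) ⊗ d_hat."""
    d_hat = d / np.linalg.norm(d)
    return W - alpha * np.outer(W @ d_hat, d_hat)

def ablate_output(W, d, alpha):
    """Output projection: W - alpha * d_hat ⊗ (d_hat @ W)."""
    d_hat = d / np.linalg.norm(d)
    return W - alpha * np.outer(d_hat, d_hat @ W)

rng = np.random.default_rng(0)
hidden, inner = 8, 16            # toy sizes, not the real model dimensions
d = rng.normal(size=hidden)      # stand-in for an extracted refusal direction
d_hat = d / np.linalg.norm(d)

W_out = rng.normal(size=(hidden, inner))  # writes into the residual stream
W_in = rng.normal(size=(inner, hidden))   # reads from the residual stream

full = ablate_output(W_out, d, alpha=1.0)     # removes the direction entirely
partial = ablate_output(W_out, d, alpha=0.4)  # leaves 60% of the component
read = ablate_input(W_in, d, alpha=1.0)       # matrix no longer reads along d_hat
```

With α=1 the component along d̂ vanishes; with α<1 it is scaled by (1 - α), which is why α acts as a projection strength rather than an on/off switch.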
Method
Phase 1 — Generation-Based Direction Extraction (GPU, BF16, A100-80GB)
The 31B model required a more precise direction extraction method than the standard last-input-token approach. We use first-generated-token activations:
- Feed each of 15 harmful prompts and 15 harmless prompts through the model
- Generate exactly 1 token (greedy): harmful prompts universally produce the token `'I'` (the beginning of "I cannot..."), while harmless prompts produce semantically different tokens
- Forward-pass the full sequence (prompt + generated token) with activation hooks
- Collect hidden states at the last position (the generated token — the model's refusal decision point)
- Per-layer direction: `d = normalize(mean(harmful_reps) - mean(harmless_reps))`
This approach achieves perfect harmful/harmless separation across all 15 prompt pairs and provides a cleaner refusal direction than last-input-token methods. Directions are saved per layer (60 total).
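The per-layer direction formula can be sketched as follows. This is only an illustration on synthetic data: the array layout `(n_prompts, n_layers, hidden)`, the function name, and the toy sizes are assumptions, and the real activations would come from forward hooks on the model.

```python
import numpy as np

def refusal_directions(harmful_reps, harmless_reps):
    """Unit difference-of-means direction per layer.

    Both inputs have shape (n_prompts, n_layers, hidden);
    the result has shape (n_layers, hidden).
    """
    diff = harmful_reps.mean(axis=0) - harmless_reps.mean(axis=0)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_prompts, n_layers, hidden = 15, 4, 32   # toy layer count and width

# Synthetic stand-ins for hidden states at the generated-token position.
offset = np.zeros(hidden)
offset[0] = 3.0
harmless_reps = rng.normal(size=(n_prompts, n_layers, hidden))
harmful_reps = rng.normal(size=(n_prompts, n_layers, hidden)) + offset

directions = refusal_directions(harmful_reps, harmless_reps)

# Project every prompt onto its layer's direction to inspect separation.
proj_harmful = np.einsum("plh,lh->pl", harmful_reps, directions)
proj_harmless = np.einsum("plh,lh->pl", harmless_reps, directions)
```

Comparing `proj_harmful` and `proj_harmless` layer by layer is one simple way to check how well a direction separates the two prompt sets.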
Phase 2 — Orthogonal Projection (CPU, BF16)
- Target matrices: `down_proj` (FFN output → residual stream) + `o_proj` (attention output → residual stream)
- Alpha: `{"down_proj": 0.4, "o_proj": 0.8}`
- Coverage: All 60 decoder layers (120 weights total)
- Full BF16 precision maintained throughout
Architecture note: Gemma 4-31B uses hybrid attention (5× sliding window + 1× full attention, repeating). `o_proj` at α=0.8 is confirmed clean, with no generation degeneration. `down_proj` at α≥0.8 causes token repetition artifacts on this model; α=0.4 is the safe upper bound.
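A rough sketch of how the per-matrix alphas might be applied across layers is shown below. The alpha values come from the list above; the `(layer, name)` dictionary layout, the helper names, and the toy shapes are hypothetical, and weights are assumed to be stored as `(hidden, in_features)` so that the output side is the residual stream.

```python
import numpy as np

ALPHAS = {"down_proj": 0.4, "o_proj": 0.8}  # strengths reported for this model

def project_out(W, d_hat, alpha):
    """Output projection: W - alpha * d_hat ⊗ (d_hat @ W)."""
    return W - alpha * np.outer(d_hat, d_hat @ W)

def abliterate(weights, directions, alphas=ALPHAS):
    """Apply the projection to every (layer, matrix) pair.

    weights: {(layer_idx, name): array of shape (hidden, in_features)}
    directions: array of shape (n_layers, hidden) holding unit vectors
    """
    return {
        (layer, name): project_out(W, directions[layer], alphas[name])
        for (layer, name), W in weights.items()
    }

rng = np.random.default_rng(0)
n_layers, hidden, inter = 3, 8, 16   # toy sizes; the model has 60 layers
directions = rng.normal(size=(n_layers, hidden))
directions /= np.linalg.norm(directions, axis=-1, keepdims=True)

weights = {}
for layer in range(n_layers):
    weights[(layer, "down_proj")] = rng.normal(size=(hidden, inter))
    weights[(layer, "o_proj")] = rng.normal(size=(hidden, hidden))

modified = abliterate(weights, directions)
print(len(modified))  # two matrices per layer
```

With 60 layers the same loop touches 2 × 60 = 120 matrices, matching the coverage figure above.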
Phase 3 — KL Verification (Heretic v2.0)
Sequential loading (A100-80GB cannot hold two BF16 31B models simultaneously):
- First-token logits from the original model are collected across 10 neutral prompts and saved to CPU
- The abliterated model is then loaded and its logits are compared
- Metric: `F.kl_div(log_softmax(abliterated), softmax(original), reduction="batchmean")` over the full 262,144-token vocabulary
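For readers without PyTorch at hand, the metric can be reproduced in NumPy. The sketch below assumes logits of shape `(n_prompts, vocab)` and mirrors the `F.kl_div(..., reduction="batchmean")` call, which computes KL(original ‖ abliterated) summed over the vocabulary and divided by the batch size; the function names and the two-token toy vocabulary are illustrative.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def first_token_kl(original_logits, abliterated_logits):
    """KL(original || abliterated), summed over vocab, averaged over rows."""
    log_p = log_softmax(original_logits)      # target distribution
    log_q = log_softmax(abliterated_logits)   # model under test
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum() / original_logits.shape[0])

# Toy two-token vocabulary: original is uniform, abliterated is (0.75, 0.25).
original = np.array([[0.0, 0.0]])
abliterated = np.array([[np.log(3.0), 0.0]])

kl_same = first_token_kl(original, original)        # identical logits
kl_shifted = first_token_kl(original, abliterated)  # equals 0.5 * ln(4/3)
```

Because the original model's distribution is the target, tokens the original considers likely dominate the sum.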
Architecture
| Parameter | Value |
|---|---|
| Layers | 60 |
| Hidden dim | 5376 |
| Intermediate dim | 21504 |
| Attention heads (Q/KV) | 32 / 16 |
| Attention pattern | Hybrid: 5× SWA (1024 tok) + 1× full, repeating |
| Vocabulary | 262,144 |
| Context | 128K tokens |
| Modality | Text + Vision (encoder-free, single linear projection) |
| Base license | Apache-2.0 |
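To make the attention pattern row concrete, the sketch below labels each of the 60 layers as sliding-window or full attention. The table only states the 5:1 repeating ratio; placing the full-attention layer last in each group of six is an assumption made for illustration, as is the function name.

```python
N_LAYERS = 60
PERIOD = 6  # 5 sliding-window layers + 1 full-attention layer

def attention_kind(layer_idx, period=PERIOD):
    # ASSUMPTION: the full-attention layer closes each group of six.
    return "full" if (layer_idx + 1) % period == 0 else "sliding"

kinds = [attention_kind(i) for i in range(N_LAYERS)]
n_full = kinds.count("full")
n_sliding = kinds.count("sliding")
print(n_full, n_sliding)  # 10 50
```

Whatever the exact position within each group, a 5:1 ratio over 60 layers gives 10 full-attention layers and 50 sliding-window layers.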
Results
| Metric | Value |
|---|---|
| Weights modified | 120 (down_proj + o_proj, 2 per layer × 60 layers) |
| Layers covered | 60 / 60 |
| α (down_proj) | 0.4 |
| α (o_proj) | 0.8 |
| Direction method | Generation-based (first generated token position) |
| Mean KL divergence (Heretic v2.0) | 1.606 |
| KL verdict | SIGNIFICANT — disclosed transparently |
| Factual generation quality | Intact |
| Mathematical reasoning | Intact |
KL Breakdown (Heretic v2.0, full vocab 262144)
| Prompt | KL |
|---|---|
| "The capital of France is" | 0.830 |
| "Write a haiku about spring" | 4.237 |
| "In Python, a list comprehension that squares..." | 0.612 |
| "The speed of light in vacuum is approximately" | 2.730 |
| "Water boils at" | 0.884 |
| "The largest planet in our solar system is" | 0.001 |
| "def fibonacci(n):" | 0.179 |
| "The Battle of Waterloo took place in" | 4.524 |
| "A prime number is" | 0.676 |
| "The chemical formula for glucose is" | 1.391 |
| Mean | 1.606 |
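KL is highest on the haiku (4.237) and Waterloo (4.524) prompts and stays below 1.0 on six of the ten prompts, most of them factual or code completions. The speed-of-light (2.730) and glucose (1.391) prompts are the exceptions among the factual ones. Overall, this is consistent with higher-alpha projection shifting the output distribution for open-ended generation while largely preserving grounded factual recall.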
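The mean in the table can be re-derived from the per-prompt values. The short script below uses only the numbers listed above; the shortened dictionary keys are labels chosen here for readability.

```python
kl_by_prompt = {
    "capital of France": 0.830,
    "haiku about spring": 4.237,
    "list comprehension": 0.612,
    "speed of light": 2.730,
    "water boils at": 0.884,
    "largest planet": 0.001,
    "def fibonacci(n):": 0.179,
    "Battle of Waterloo": 4.524,
    "prime number": 0.676,
    "glucose formula": 1.391,
}

mean_kl = sum(kl_by_prompt.values()) / len(kl_by_prompt)
below_one = sorted(name for name, kl in kl_by_prompt.items() if kl < 1.0)
highest = max(kl_by_prompt, key=kl_by_prompt.get)

print(f"{mean_kl:.3f}")  # 1.606
print(len(below_one))    # 6
print(highest)           # Battle of Waterloo
```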
Comparison: The ARA (Arbitrary-Rank Ablation) method used by alonsoko achieves KL=0.012 via multi-directional optimization. Our rank-1 projection approach is more transparent and reproducible but carries higher KL at this scale.
Usage
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch
model = AutoModelForImageTextToText.from_pretrained(
    "DuoNeural/Gemma4-31B-IT-Abliterated",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DuoNeural/Gemma4-31B-IT-Abliterated")
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Note: Requires `transformers >= 5.0` (Gemma4 model type) and `accelerate`.
GGUF Quantizations
Available at DuoNeural/Gemma4-31B-IT-Abliterated-GGUF:
| Quant | Approx Size | Use case |
|---|---|---|
| Q4_K_M | ~20GB | Consumer GPU / large RAM |
| Q5_K_M | ~24GB | Better quality, more VRAM |
Notes on Abliteration Difficulty at 31B Scale
The 31B model is significantly more resistant to abliteration than the 12B (which abliterates cleanly at α=0.3/0.3, KL≈0.19). Key findings from this session:
- Last-input-token direction fails at 31B — the direction doesn't cleanly capture refusal geometry. Generation-based direction (first generated token) is required.
- `down_proj` degeneration threshold: α≥0.8 causes apostrophe/token repetition artifacts. Safe upper bound: α≤0.4.
- `o_proj` alone is insufficient even at α=1.0 across all 60 layers: it achieves partial abliteration (2/3 harmful categories) but misses the most strongly trained refusals (e.g. meth synthesis).
- Both matrices required: Combining `down_proj` (α=0.4) + `o_proj` (α=0.8) achieves full abliteration with clean generation.
- Scale law: This is consistent with our crystallization scale series (P36 in prep): at 31B, safety geometry is more entangled with general capability geometry, requiring higher effective projection strength and increasing KL as a consequence.
About DuoNeural
DuoNeural is an independent AI research lab focused on post-training, abliteration, and mechanistic interpretability. We document our work at Zenodo and HuggingFace.
Team: Archon (Lab Director, AI) · Jesse Caldwell (Co-founder) · Aura (Research AI)
KL methodology credit: Heretic/DreamFast v2.0 — full-vocab first-token KL over 262K vocabulary.
License
This model inherits the Apache-2.0 license from the base model. Free to use, modify, and redistribute.
For research and educational purposes. Users are responsible for compliance with applicable laws and regulations in their jurisdiction.