QED Math Environment

Mathematical proof generation and evaluation environment for OpenEnv, ported from QED-Nano. Agents receive math problems, submit proofs, and receive LLM-based rubric grading on a 0–7 scale with normalized rewards.

Features

LLM-based rubric grading (0–7 scale) via any OpenAI-compatible endpoint
Process-based answer verification service (math_verify in worker processes)
Backpressure + retries + worker restart for robust concurrent rollout operation
Gold-answer cache keyed by problem_id and verifier normalization settings
Flexible dataset loading: local JSONL/JSON, Hugging Face Hub, or built-in bootstrap problems
Reward shaping: discount factor, length penalty, and optional score thresholding
Reasoning stripping: configurable delimiters (e.g. <think>...</think>) removed before grading
Multi-step problems: configurable max attempts with per-attempt feedback
Verifier metrics: rollout/staging counters and health signals surfaced in observation metadata, ready for TrackIO / WandB
MCP tool interface: get_problem, submit_proof, get_grading_guidelines

Quick Start

Async (default)

import asyncio
from qed_math_env import QEDMathEnv

async def main():
    async with QEDMathEnv(base_url="http://localhost:8000") as env:
        # Reset to load a problem
        result = await env.reset()
        obs = result.observation
        print(f"Problem: {obs.problem[:100]}...")

        # Submit a proof
        submission = await env.submit_proof(proof="By induction on n...")
        print(f"Score: {submission.score}/7, Reward: {submission.reward:.2f}")

asyncio.run(main())

Sync

from qed_math_env import QEDMathEnv

with QEDMathEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset()
    submission = env.call_tool("submit_proof", proof="By induction on n...")

MCP tool-calling

async with QEDMathEnv(base_url="http://localhost:8000") as env:
    await env.reset()

    # Discover tools
    tools = await env.list_tools()
    print([t.name for t in tools])
    # ['get_problem', 'submit_proof', 'get_grading_guidelines']

    # Call tools by name
    problem = await env.call_tool("get_problem")
    guidelines = await env.call_tool("get_grading_guidelines")
    result = await env.call_tool("submit_proof", proof="...")

Building & Running

# Build Docker image (from project root)
docker build -t qed-math-env:latest -f envs/qed_math_env/server/Dockerfile .

# Run the server
docker run -p 8000:8000 -e OPENAI_API_KEY=$OPENAI_API_KEY qed-math-env:latest

# Or run locally with uvicorn
PYTHONPATH=src:envs uvicorn qed_math_env.server.app:app --port 8000

# Or install and run via uv
cd envs/qed_math_env
uv sync
uv run server

Project Structure

qed_math_env/
├── __init__.py              # Module exports (QEDMathEnv, models)
├── models.py                # ProblemObservation, ProofSubmissionObservation
├── client.py                # QEDMathEnv client (MCPToolClient subclass)
├── openenv.yaml             # OpenEnv manifest with metrics declarations
├── pyproject.toml           # Dependencies
├── uv.lock                  # Locked dependencies
├── README.md
├── prompts/
│   └── evaluator_prompts/
│       ├── v0.md            # Evaluator prompt (QED-Nano v0, uses reference solution)
│       ├── v1.md            # Evaluator prompt (QED-Nano v1, default, full 0–7 range)
│       └── v2.md            # Evaluator prompt (QED-Nano v2, scores constrained to {0,1,6,7})
└── server/
    ├── __init__.py
    ├── app.py               # FastAPI server (create_app factory)
    ├── qed_math_environment.py  # QEDMathEnvironment (MCPEnvironment)
    ├── math_verify_service.py   # Process-pool verifier service + health/metrics
    ├── mcp_server.py        # MCP tool registration
    ├── rubric.py            # MathProofRubric + GradingResult
    └── Dockerfile

Configuration

The environment is configured via QEDMathConfig:

Parameter	Default	Description
`dataset_path`	`None`	Dataset source: local path, Hub ID, or list of specs. `None` uses bootstrap problems.
`grader_model`	`"gemini-3-pro"`	Model identifier for the LLM grader (any OpenAI-compatible endpoint)
`prompt_name`	`"v1"`	Evaluator prompt template name (`v0`, `v1`, or `v2` in `prompts/evaluator_prompts/`). `v1` matches the QED-Nano default (full 0–7 range); `v2` constrains scores to `{0,1,6,7}`
`grader_temperature`	`1.0`	Sampling temperature forwarded to the grader (matches QED-Nano)
`grader_max_output_tokens`	`None`	Optional output-token cap for the grader. `None` uses the provider default to avoid truncating the trailing `<score>` tag
`custom_reward_threshold`	`False`	When `True`, collapses partial-credit scores 1–5 → 1
`answer_reward_preset`	`"pure_success"`	Answer-mode reward table: `pure_success` (correct→1, else 0) or `base` (adds penalties: wrong −0.5, no_answer/unparsable −1)
`max_attempts`	`1`	Max proof attempts per problem (>1 for multi-step)
`discount_factor`	`1.0`	Exponential discount: `reward = discount_factor * output_length_tokens`
`buffer_tokens`	`0`	Length penalty zone width. `0` disables the penalty.
`max_tokens`	`0`	Max token limit for length penalty computation
`reasoning_delimiters`	`["</think>"]`	Delimiter strings to strip reasoning before grading (matches QED-Nano). Set to `None` to grade the full completion.
`verifier_workers`	`max(2, min(8, cpu_count//2))`	Number of process workers used for answer-mode verification
`verifier_queue_size`	`verifier_workers * 32`	Max in-flight verifier requests before backpressure
`verifier_request_timeout_seconds`	`5.0`	Per-request client-side timeout when awaiting worker response
`verifier_max_retries`	`1`	Retry budget for transient verifier infra failures
`verifier_strict`	`True`	Strict `math_verify` equivalence mode
`verifier_numeric_precision`	`5`	Numeric precision setting used in verifier request contract
`verifier_float_rounding`	`10`	Float rounding setting used in verifier request contract

Environment variables:

OPENAI_API_KEY — API key for the grader LLM
OPENAI_BASE_URL — Base URL override (for non-OpenAI providers)

Dataset Format

Local JSONL/JSON

{
  "problem": "Prove that the sum of two even integers is even.",
  "solution": "Let a=2m and b=2n. Then a+b=2(m+n), which is even.",
  "rubrics": [
    {"title": "Definitions", "points": 2, "desc": "Correctly defines even integers."},
    {"title": "Algebra", "points": 3, "desc": "Valid algebraic manipulation."},
    {"title": "Conclusion", "points": 2, "desc": "Correctly concludes evenness."}
  ],
  "dataset": "FineProofs-RL",
  "problem_id": "fp_001"
}

Hugging Face Hub

QEDMathConfig(dataset_path="meta-math/MetaMathQA")
# or with config
QEDMathConfig(dataset_path={"hub_id": "meta-math/MetaMathQA", "split": "train", "config": "default"})

Field Aliases

The environment normalizes many dataset formats automatically:

Canonical Field	Accepted Aliases
`problem`	`task`, `Problem`
`reference_solution`	`solution`, `answer`, `Solution`
`grading_guidelines`	`rubrics`, `schema`, `schema_0`, `Grading guidelines`, `details`
`problem_id`	`id`
`original_problem`	Used for RC-stream problems where the actor prompt differs from grading prompt

Observation Space

ProblemObservation (from reset / get_problem )

Field	Type	Description
`problem`	`str`	Math problem statement
`reference_solution`	`str`	Ground-truth solution
`grading_guidelines`	`str`	Rubric / marking scheme
`problem_id`	`str`	Unique identifier
`problem_type`	`str`	`"proof"`, `"answer"`, or `"multi_step"`
`dataset_source`	`str`	Source dataset name
`metadata`	`dict`	Additional context (e.g. `original_problem`)

ProofSubmissionObservation (from submit_proof )

Field	Type	Description
`proof`	`str`	Submitted proof text
`score`	`int`	Raw grade (0–7)
`feedback`	`str`	Full grader response
`reward`	`float`	Shaped reward in [0, 1]
`done`	`bool`	Whether the episode is over
`is_correct`	`bool`	Whether score >= success threshold (default 7, matching QED-Nano’s `score == 7`)
`attempt_number`	`int`	Current attempt count
`attempts_remaining`	`int`	Remaining attempts
`problem_type`	`str`	Problem type
`metadata`	`dict`	Contains `verifier_metrics`, `base_reward`, `shaped_reward`

MCP Tools

Tool	Description	Parameters
`get_problem`	Return current problem statement and metadata	—
`submit_proof`	Submit a proof for LLM-based rubric grading	`proof` (str, required)
`get_grading_guidelines`	Return the rubric/marking scheme	—

Note: output_length_tokens is not an agent-supplied parameter. Token counts are injected by the training harness via the HTTP step request body (see Reward Shaping) to preserve reward integrity — the agent cannot influence its own discount factor.

Reward Shaping

The reward pipeline follows QED-Nano conventions:

LLM grading: Score 0–7 via evaluator prompt with <score>N</score> parsing
Optional thresholding: Collapses 1–5 → 1 (when custom_reward_threshold=True)
Normalization: reward = score / 7.0
Discount factor: reward *= discount_factor ** output_length_tokens
Length penalty: Linear penalty when output approaches max_tokens

For answer-mode problems (evaluation_mode: "answer"), grading is routed through the process-based verifier service: \boxed{} answers are extracted and verified against cached gold answers, with timeout/retry/backpressure handling for concurrent rollouts. The answer-mode reward is selected from answer_reward_preset keyed on the verifier status (correct, wrong, no_answer, unparsable); transient timeout/internal_error statuses stay neutral (0).

Proof vs. answer routing: when a dataset row does not set an explicit problem_type/evaluation_mode, the mode is auto-detected like QED-Nano’s if "schema" in problem — a row carrying a grading rubric/schema is graded as a proof (LLM judge), while a row with no rubric is treated as an answer problem (boxed gold + math_verify). Set problem_type/evaluation_mode explicitly to override.

Harness-injected token count

Steps 4 and 5 require the full generation length (including any reasoning trace that is stripped before grading). This value cannot come from the agent — it is supplied by the training harness as an out-of-band field in the HTTP step request body, mirroring the StateUsageTracker pattern from PrimeIntellect/verifiers:

# Training harness (pseudocode)
completion_tokens = llm_call.usage.completion_tokens  # from inference API

step_response = await openenv_client.step(
    action=CallToolAction(tool_name="submit_proof", arguments={"proof": proof_text}),
    output_length_tokens=completion_tokens,  # injected here, not via MCP tool
)

When output_length_tokens is absent (local testing, eval without a training loop) shaping is skipped entirely — no estimation is attempted, consistent with verifiers’ behaviour of returning None from StateUsageTracker.snapshot() when no usage was recorded.

Verifier Metrics

Every submit_proof call emits verifier metrics in metadata["verifier_metrics"], compatible with TrackIO and WandB:

Metric	Description
`verifier/rollouts/success`	1 if grading succeeded
`verifier/rollouts/failure`	1 if grading failed
`verifier/failures/timeout`	Count of timeout errors
`verifier/failures/rate_limit`	Count of rate-limit errors
`verifier/failures/no_input`	1 if proof was empty
`verifier/failures/no_score_tag`	1 if LLM response had no `<score>` tag
`verifier/failures/all_attempts_failed`	1 if all retries exhausted
`verifier/failures/num_retries`	Number of retries used
`verifier/runtime/latency_per_request`	Grading wall-clock time (seconds)
`verifier/requests/count`	Total verifier requests processed by the service
`verifier/requests/latency_ms`	Service-level average request latency
`verifier/requests/timeout_count`	Service-level timeout counter
`verifier/requests/error_count`	Service-level internal error counter
`verifier/queue/depth`	Current in-flight verifier queue depth
`verifier/cache/hit_rate`	Gold-answer cache hit rate
`verifier/workers/restart_count`	Worker-pool restart count
`verifier/workers/worker_restarted`	1 if current request required worker restart
`verifier/workers/heartbeat_lag_ms`	Time since last verifier activity
`verifier/runtime/input_tokens`	Grader input tokens (real provider usage when reported, else ~chars/4 estimate)
`verifier/runtime/output_tokens`	Grader output tokens (real provider usage when reported, else ~chars/4 estimate)
`reward/base`	Pre-shaping reward
`reward/shaped`	Post-shaping reward
`reward/score_raw`	Raw integer score (0–7)
`reward/overlong_penalty`	Length penalty applied
`episode/attempt_number`	Current attempt
`episode/is_correct`	1 if correct
`episode/problem_type`	proof / answer / multi_step
`episode/dataset_source`	Source dataset name

TrackIO Integration

import trackio

run = trackio.init(project="qed-math-training")

# After each submit_proof call:
verifier_metrics = result["metadata"]["verifier_metrics"]
numeric = {k: v for k, v in verifier_metrics.items() if isinstance(v, (int, float))}
run.log(numeric, step=global_step)

Or with TRL’s GRPOTrainer:

from trl import GRPOConfig

config = GRPOConfig(
    report_to="trackio",
    trackio_space_id="your-org/qed-math-grpo",
    # ...
)

Deployment

# Optional: run rollout/staging verifier validation first
PYTHONPATH=src:envs uv run python scripts/qed_math_verifier_staging_validation.py \
  --workers 4 --queue-size 128 --concurrency 64 --requests 2000 \
  --max-timeout-rate 0.05 --max-error-rate 0.02

openenv push

Update on GitHub