OpenEnv documentation
QED Math Environment
QED Math Environment
Mathematical proof generation and evaluation environment for OpenEnv, ported from QED-Nano. Agents receive math problems, submit proofs, and receive LLM-based rubric grading on a 0β7 scale with normalized rewards.
Features
- LLM-based rubric grading (0β7 scale) via any OpenAI-compatible endpoint
- Process-based answer verification service (
math_verifyin worker processes) - Backpressure + retries + worker restart for robust concurrent rollout operation
- Gold-answer cache keyed by
problem_idand verifier normalization settings - Flexible dataset loading: local JSONL/JSON, Hugging Face Hub, or built-in bootstrap problems
- Reward shaping: discount factor, length penalty, and optional score thresholding
- Reasoning stripping: configurable delimiters (e.g.
<think>...</think>) removed before grading - Multi-step problems: configurable max attempts with per-attempt feedback
- Verifier metrics: rollout/staging counters and health signals surfaced in observation metadata, ready for TrackIO / WandB
- MCP tool interface:
get_problem,submit_proof,get_grading_guidelines
Quick Start
Async (default)
import asyncio
from qed_math_env import QEDMathEnv
async def main():
async with QEDMathEnv(base_url="http://localhost:8000") as env:
# Reset to load a problem
result = await env.reset()
obs = result.observation
print(f"Problem: {obs.problem[:100]}...")
# Submit a proof
submission = await env.submit_proof(proof="By induction on n...")
print(f"Score: {submission.score}/7, Reward: {submission.reward:.2f}")
asyncio.run(main())Sync
from qed_math_env import QEDMathEnv
with QEDMathEnv(base_url="http://localhost:8000").sync() as env:
result = env.reset()
submission = env.call_tool("submit_proof", proof="By induction on n...")MCP tool-calling
async with QEDMathEnv(base_url="http://localhost:8000") as env:
await env.reset()
# Discover tools
tools = await env.list_tools()
print([t.name for t in tools])
# ['get_problem', 'submit_proof', 'get_grading_guidelines']
# Call tools by name
problem = await env.call_tool("get_problem")
guidelines = await env.call_tool("get_grading_guidelines")
result = await env.call_tool("submit_proof", proof="...")Building & Running
# Build Docker image (from project root)
docker build -t qed-math-env:latest -f envs/qed_math_env/server/Dockerfile .
# Run the server
docker run -p 8000:8000 -e OPENAI_API_KEY=$OPENAI_API_KEY qed-math-env:latest
# Or run locally with uvicorn
PYTHONPATH=src:envs uvicorn qed_math_env.server.app:app --port 8000
# Or install and run via uv
cd envs/qed_math_env
uv sync
uv run serverProject Structure
qed_math_env/
βββ __init__.py # Module exports (QEDMathEnv, models)
βββ models.py # ProblemObservation, ProofSubmissionObservation
βββ client.py # QEDMathEnv client (MCPToolClient subclass)
βββ openenv.yaml # OpenEnv manifest with metrics declarations
βββ pyproject.toml # Dependencies
βββ uv.lock # Locked dependencies
βββ README.md
βββ prompts/
β βββ evaluator_prompts/
β βββ v0.md # Evaluator prompt (QED-Nano v0, uses reference solution)
β βββ v1.md # Evaluator prompt (QED-Nano v1, default, full 0β7 range)
β βββ v2.md # Evaluator prompt (QED-Nano v2, scores constrained to {0,1,6,7})
βββ server/
βββ __init__.py
βββ app.py # FastAPI server (create_app factory)
βββ qed_math_environment.py # QEDMathEnvironment (MCPEnvironment)
βββ math_verify_service.py # Process-pool verifier service + health/metrics
βββ mcp_server.py # MCP tool registration
βββ rubric.py # MathProofRubric + GradingResult
βββ DockerfileConfiguration
The environment is configured via QEDMathConfig:
| Parameter | Default | Description |
|---|---|---|
dataset_path | None | Dataset source: local path, Hub ID, or list of specs. None uses bootstrap problems. |
grader_model | "gemini-3-pro" | Model identifier for the LLM grader (any OpenAI-compatible endpoint) |
prompt_name | "v1" | Evaluator prompt template name (v0, v1, or v2 in prompts/evaluator_prompts/). v1 matches the QED-Nano default (full 0β7 range); v2 constrains scores to {0,1,6,7} |
grader_temperature | 1.0 | Sampling temperature forwarded to the grader (matches QED-Nano) |
grader_max_output_tokens | None | Optional output-token cap for the grader. None uses the provider default to avoid truncating the trailing <score> tag |
custom_reward_threshold | False | When True, collapses partial-credit scores 1β5 β 1 |
answer_reward_preset | "pure_success" | Answer-mode reward table: pure_success (correctβ1, else 0) or base (adds penalties: wrong β0.5, no_answer/unparsable β1) |
max_attempts | 1 | Max proof attempts per problem (>1 for multi-step) |
discount_factor | 1.0 | Exponential discount: reward *= discount_factor ** output_length_tokens |
buffer_tokens | 0 | Length penalty zone width. 0 disables the penalty. |
max_tokens | 0 | Max token limit for length penalty computation |
reasoning_delimiters | ["</think>"] | Delimiter strings to strip reasoning before grading (matches QED-Nano). Set to None to grade the full completion. |
verifier_workers | max(2, min(8, cpu_count//2)) | Number of process workers used for answer-mode verification |
verifier_queue_size | verifier_workers * 32 | Max in-flight verifier requests before backpressure |
verifier_request_timeout_seconds | 5.0 | Per-request client-side timeout when awaiting worker response |
verifier_max_retries | 1 | Retry budget for transient verifier infra failures |
verifier_strict | True | Strict math_verify equivalence mode |
verifier_numeric_precision | 5 | Numeric precision setting used in verifier request contract |
verifier_float_rounding | 10 | Float rounding setting used in verifier request contract |
Environment variables:
OPENAI_API_KEYβ API key for the grader LLMOPENAI_BASE_URLβ Base URL override (for non-OpenAI providers)
Dataset Format
Local JSONL/JSON
{
"problem": "Prove that the sum of two even integers is even.",
"solution": "Let a=2m and b=2n. Then a+b=2(m+n), which is even.",
"rubrics": [
{"title": "Definitions", "points": 2, "desc": "Correctly defines even integers."},
{"title": "Algebra", "points": 3, "desc": "Valid algebraic manipulation."},
{"title": "Conclusion", "points": 2, "desc": "Correctly concludes evenness."}
],
"dataset": "FineProofs-RL",
"problem_id": "fp_001"
}Hugging Face Hub
QEDMathConfig(dataset_path="meta-math/MetaMathQA")
# or with config
QEDMathConfig(dataset_path={"hub_id": "meta-math/MetaMathQA", "split": "train", "config": "default"})Field Aliases
The environment normalizes many dataset formats automatically:
| Canonical Field | Accepted Aliases |
|---|---|
problem | task, Problem |
reference_solution | solution, answer, Solution |
grading_guidelines | rubrics, schema, schema_0, Grading guidelines, details |
problem_id | id |
original_problem | Used for RC-stream problems where the actor prompt differs from grading prompt |
Observation Space
ProblemObservation (from reset / get_problem )
| Field | Type | Description |
|---|---|---|
problem | str | Math problem statement |
reference_solution | str | Ground-truth solution |
grading_guidelines | str | Rubric / marking scheme |
problem_id | str | Unique identifier |
problem_type | str | "proof", "answer", or "multi_step" |
dataset_source | str | Source dataset name |
metadata | dict | Additional context (e.g. original_problem) |
ProofSubmissionObservation (from submit_proof )
| Field | Type | Description |
|---|---|---|
proof | str | Submitted proof text |
score | int | Raw grade (0β7) |
feedback | str | Full grader response |
reward | float | Shaped reward in [0, 1] |
done | bool | Whether the episode is over |
is_correct | bool | Whether score >= success threshold (default 7, matching QED-Nanoβs score == 7) |
attempt_number | int | Current attempt count |
attempts_remaining | int | Remaining attempts |
problem_type | str | Problem type |
metadata | dict | Contains verifier_metrics, base_reward, shaped_reward |
MCP Tools
| Tool | Description | Parameters |
|---|---|---|
get_problem | Return current problem statement and metadata | β |
submit_proof | Submit a proof for LLM-based rubric grading | proof (str, required) |
get_grading_guidelines | Return the rubric/marking scheme | β |
Note:
output_length_tokensis not an agent-supplied parameter. Token counts are injected by the training harness via the HTTP step request body (see Reward Shaping) to preserve reward integrity β the agent cannot influence its own discount factor.
Reward Shaping
The reward pipeline follows QED-Nano conventions:
- LLM grading: Score 0β7 via evaluator prompt with
<score>N</score>parsing - Optional thresholding: Collapses 1β5 β 1 (when
custom_reward_threshold=True) - Normalization:
reward = score / 7.0 - Discount factor:
reward *= discount_factor ** output_length_tokens - Length penalty: Linear penalty when output approaches
max_tokens
For answer-mode problems (evaluation_mode: "answer"), grading is routed through the process-based verifier service: \boxed{} answers are extracted and verified against cached gold answers, with timeout/retry/backpressure handling for concurrent rollouts. The answer-mode reward is selected from answer_reward_preset keyed on the verifier status (correct, wrong, no_answer, unparsable); transient timeout/internal_error statuses stay neutral (0).
Proof vs. answer routing: when a dataset row does not set an explicit problem_type/evaluation_mode, the mode is auto-detected like QED-Nanoβs if "schema" in problem β a row carrying a grading rubric/schema is graded as a proof (LLM judge), while a row with no rubric is treated as an answer problem (boxed gold + math_verify). Set problem_type/evaluation_mode explicitly to override.
Harness-injected token count
Steps 4 and 5 require the full generation length (including any reasoning trace that is stripped before grading). This value cannot come from the agent β it is supplied by the training harness as an out-of-band field in the HTTP step request body, mirroring the StateUsageTracker pattern from PrimeIntellect/verifiers:
# Training harness (pseudocode)
completion_tokens = llm_call.usage.completion_tokens # from inference API
step_response = await openenv_client.step(
action=CallToolAction(tool_name="submit_proof", arguments={"proof": proof_text}),
output_length_tokens=completion_tokens, # injected here, not via MCP tool
)When output_length_tokens is absent (local testing, eval without a training loop) shaping is skipped entirely β no estimation is attempted, consistent with verifiersβ behaviour of returning None from StateUsageTracker.snapshot() when no usage was recorded.
Verifier Metrics
Every submit_proof call emits verifier metrics in metadata["verifier_metrics"], compatible with TrackIO and WandB:
| Metric | Description |
|---|---|
verifier/rollouts/success | 1 if grading succeeded |
verifier/rollouts/failure | 1 if grading failed |
verifier/failures/timeout | Count of timeout errors |
verifier/failures/rate_limit | Count of rate-limit errors |
verifier/failures/no_input | 1 if proof was empty |
verifier/failures/no_score_tag | 1 if LLM response had no <score> tag |
verifier/failures/all_attempts_failed | 1 if all retries exhausted |
verifier/failures/num_retries | Number of retries used |
verifier/runtime/latency_per_request | Grading wall-clock time (seconds) |
verifier/requests/count | Total verifier requests processed by the service |
verifier/requests/latency_ms | Service-level average request latency |
verifier/requests/timeout_count | Service-level timeout counter |
verifier/requests/error_count | Service-level internal error counter |
verifier/queue/depth | Current in-flight verifier queue depth |
verifier/cache/hit_rate | Gold-answer cache hit rate |
verifier/workers/restart_count | Worker-pool restart count |
verifier/workers/worker_restarted | 1 if current request required worker restart |
verifier/workers/heartbeat_lag_ms | Time since last verifier activity |
verifier/runtime/input_tokens | Grader input tokens (real provider usage when reported, else ~chars/4 estimate) |
verifier/runtime/output_tokens | Grader output tokens (real provider usage when reported, else ~chars/4 estimate) |
reward/base | Pre-shaping reward |
reward/shaped | Post-shaping reward |
reward/score_raw | Raw integer score (0β7) |
reward/overlong_penalty | Length penalty applied |
episode/attempt_number | Current attempt |
episode/is_correct | 1 if correct |
episode/problem_type | proof / answer / multi_step |
episode/dataset_source | Source dataset name |
TrackIO Integration
import trackio
run = trackio.init(project="qed-math-training")
# After each submit_proof call:
verifier_metrics = result["metadata"]["verifier_metrics"]
numeric = {k: v for k, v in verifier_metrics.items() if isinstance(v, (int, float))}
run.log(numeric, step=global_step)Or with TRLβs GRPOTrainer:
from trl import GRPOConfig
config = GRPOConfig(
report_to="trackio",
trackio_space_id="your-org/qed-math-grpo",
# ...
)Deployment
# Optional: run rollout/staging verifier validation first
PYTHONPATH=src:envs uv run python scripts/qed_math_verifier_staging_validation.py \
--workers 4 --queue-size 128 --concurrency 64 --requests 2000 \
--max-timeout-rate 0.05 --max-error-rate 0.02
openenv push