τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems
Abstract
A benchmark for agentic recommender systems is introduced that uses verifiable rewards and controlled dialogue constraints to evaluate conversational agent reliability, revealing significant performance gaps among leading models.
As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present τ-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, τ-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
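The abstract does not spell out how pass^k is computed. A minimal sketch, assuming τ-Rec follows the usual τ-bench-style definition: for a task run n times with c successful trials, pass^k is estimated as C(c, k) / C(n, k), the chance that k trials drawn without replacement all succeed, and the benchmark score is the mean over tasks. The function names and the example counts below are hypothetical and are not taken from the τ-Rec code or results.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k for one task: the probability that k trials
    sampled without replacement from `trials` runs all succeeded.

    Assumed definition (tau-bench style): C(successes, k) / C(trials, k).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must lie between 1 and trials")
    # comb(c, k) is 0 when c < k, so a task with fewer than k
    # successful runs contributes 0 to pass^k.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task_successes: list[int], trials: int, k: int) -> float:
    """Average the per-task pass^k estimate over all tasks."""
    if not per_task_successes:
        raise ValueError("need at least one task")
    total = sum(pass_hat_k(c, trials, k) for c in per_task_successes)
    return total / len(per_task_successes)


if __name__ == "__main__":
    # Hypothetical success counts for three tasks, four trials each.
    counts = [4, 2, 0]
    for k in (1, 4):
        print(f"pass^{k} = {benchmark_pass_hat_k(counts, trials=4, k=k):.3f}")
```

Under this definition pass^1 is the plain average success rate, while pass^k for larger k only credits tasks the agent solves on every sampled trial, which is why the reported score drops as k grows.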
Community
τ-Rec: A verifiable benchmark for agentic recommender systems. Dataset: https://huggingface.co/datasets/nbharaths/tau-rec