A way to correctly reproduce a benchmark?

#40
by anonymousmaharaj - opened

Hello dear friend! Thank you very much for this work!

I have been struggling for two days now and cannot correctly reproduce the benchmark specified in the model card (tau2-bench telecom). I just can't get it to work.
Could you tell me how you ran it and with what parameters? I want to run more queries. I think the community would be interested in this.

anonymousmaharaj changed discussion title from Способ корректно воспроизвести Benchmark to A way to correctly reproduce a benchmark?

Happy to help, and thanks! Two days of fighting this is almost always one of two small things: the --jinja flag, or
the served model name not matching what tau2 calls. Here's exactly how I ran it.

  1. Serve the model with llama.cpp (the --jinja flag is essential — without it the Gemma 4 tool-calls come back as raw
    text and every task fails; -a gemma sets the served name so tau2 can find it):

llama-server -m gemma4-v2-Q8_0.gguf -a gemma --host 127.0.0.1 --port 8000 -c 16384 -ngl 99 -fa on --jinja --temp 0.0

  1. Point tau2 at that endpoint (tau2 uses litellm internally; the openai/ prefix + these env vars send it to your
    local server):

OPENAI_API_BASE=http://127.0.0.1:8000/v1
OPENAI_API_KEY=dummy

  1. Run the telecom domain (temp 0, 1 trial — at temp 0 extra trials are identical, so 1 is enough):

tau2 run --domain telecom --agent-llm openai/gemma --user-llm openai/gemma --num-trials 1 --num-tasks 20
--max-concurrency 1 --save-to tau2_telecom_v2

The gotchas that cost people days:

  • --jinja is mandatory. Before trusting any score, open one trace and confirm there are real tool_calls. If you see
    tool calls printed as plain text, your template/--jinja is the problem — fix that first.
  • The gemma in -a gemma (server) must match openai/gemma (tau2 command), and OPENAI_API_BASE must point at the server.
  • If you re-run with the same --save-to, tau2 prompts (y/n) resume?; in a non-interactive shell that crashes with
    EOFError. Delete the save dir first, or use a fresh --save-to name each run.
  • model=gemma isn't mapped yet from litellm is just a cost-estimation warning — harmless, ignore it.
  • Don't use --task-set-name telecom_small — it has a task-split bug; just use --domain telecom --num-tasks N.

Two honest caveats about the number (important if you want to publish):

  • I used the model as its own user-simulator (--user-llm openai/gemma). The user-sim model affects the score, so for a
    clean, comparable benchmark across models, fix --user-llm to one strong model (a cheap cloud API works well) instead
    of self-sim.
  • My published telecom run was effectively N=14 — I hit a handful of infrastructure/API errors that dropped some
    tasks. So the robust signal is the relative pattern (v2 clearly beats base Gemma-4-12B on telecom), more than the
    exact percentage.

To run more queries like you wanted: raise --num-tasks (telecom has more than 20 tasks), and if you want sampling
diversity rather than a single deterministic pass, raise the temperature and --num-trials together. Would love to see
what numbers you get — please share them in the thread.

Sign up or log in to comment