Benchmark Test-Time Scaling of General LLM Agents

1Language Technologies Institute, Carnegie Mellon University    2Meta

❓ Is Test-Time Scaling as Effective as You Think?

(a) Sequential test-time scaling: performance degrades with longer interaction histories.

(b) Parallel test-time scaling: a verification gap persists between generation and self-choice.

Our systematic test-time scaling analysis reveals two fundamental limits:

  1. Sequential scaling is bounded by a context ceiling, beyond which longer interactions become unstable.
  2. Parallel scaling delivers limited practical gains due to a persistent verification gap between generation and self-choice.

🔍 Introduction

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-specific environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains.

Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling.

📏 General AgentBench

We introduce General AgentBench, a benchmark for evaluating whether agents can compose multiple skills and tools to solve open-ended requests from diverse domains under a unified framework, more closely reflecting real-world user interactions.

General AgentBench covers four task domains: Search, Coding, Reasoning, and Tool Use. All tools are consolidated through the Model Context Protocol (MCP) into a unified interface, where agents see only a single shared tool pool across all tasks.

Most models suffer substantial degradation (10–30% on average) when moving from domain-specific evaluations to the general-agent setting.

Two Test-Time Scaling Behaviors

We systematically study two primary test-time scaling paradigms for general LLM agents:

  • Parallel scaling expands the solution space by independently sampling K trajectories for each query.
  • Sequential scaling increases computational depth by extending the interaction horizon.

To measure the gap between the parallel upper bound (pass@K) and real-world effectiveness, we further introduce a self-choice setting: after sampling K trajectories, the agent must also evaluate its own generations and select the best outcome among them.
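The page does not spell out how pass@K is computed; a common unbiased estimator from the code-generation literature (sampling K trajectories without replacement from N generated ones, of which C are correct) can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k trajectories
    drawn (without replacement) from n samples is correct, given that
    c of the n samples are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include at least one correct trajectory.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is correct, pass@1 is 0.5. Whether the benchmark uses this exact estimator is an assumption; it is shown only to make the "parallel upper bound" concrete.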

Test-time scaling behaviors of general LLM agents across five models and four domains. Top: Parallel scaling expands the solution space through increased sampling. Bottom: Sequential scaling allocates additional computation via longer interaction histories, yet exhibits unstable or diminishing returns.

Sequential Scaling — Context Ceiling

Sequential scaling extends the interaction horizon by injecting additional rounds of feedback. Performance initially improves with longer horizons, but plateaus or degrades once the accumulated context exceeds a critical threshold of the model's context length.

This context ceiling varies by model and domain: approximately 112K tokens for Qwen3-235B and 96K for Gemini 2.5-Flash in the search domain, for example. Beyond this threshold, the accumulated history overwhelms the agent's reasoning capacity, leading to instability on long-horizon tasks.
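One practical implication is that a sequential-scaling loop should stop at a model-specific ceiling rather than exhausting its round budget. A minimal sketch, in which the agent interface (`act`, `num_tokens`, `done`) and the default ceiling are hypothetical and the 112K/96K values come from the text above:

```python
# Illustrative ceilings (tokens) reported in the text for the search domain.
CONTEXT_CEILING = {
    "qwen3-235b": 112_000,
    "gemini-2.5-flash": 96_000,
}

def run_sequential(agent, task, model_name, max_rounds=32):
    """Extend the interaction horizon round by round, but stop once the
    accumulated context would cross the model's ceiling."""
    history, tokens = [], 0
    for _ in range(max_rounds):
        step = agent.act(task, history)   # hypothetical agent interface
        tokens += step.num_tokens         # hypothetical token accounting
        if tokens > CONTEXT_CEILING.get(model_name, 128_000):
            break                         # past the ceiling, returns degrade
        history.append(step)
        if step.done:
            break
    return history
```

This is a sketch of the failure mode, not the benchmark's harness: the point is that the loop's useful depth is bounded by context, not by the round budget.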

Sequential scaling behavior of Gemini 2.5-Flash and Qwen3-235B across domains.

Parallel Scaling — Verification Gap

Parallel scaling samples multiple independent trajectories, expanding the solution space. While pass@K increases monotonically, the self-choice accuracy—where agents must identify the correct solution from their own generations—consistently lags behind.

This verification gap limits practical utility: agents can generate correct answers but fail to reliably select them. Even using GPT-5 as an external verifier does not close the gap.
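The gap itself is straightforward to quantify. A minimal sketch, assuming per-task 0/1 correctness labels for each of the K sampled trajectories and the index of the trajectory the agent self-selected (all names hypothetical):

```python
def verification_gap(results, choices):
    """Compare the parallel upper bound with self-choice accuracy.

    results: per-task lists of 0/1 correctness over K sampled trajectories.
    choices: per-task index of the trajectory the agent self-selected.
    Returns (pass_at_K, self_choice_accuracy, gap).
    """
    n = len(results)
    # pass@K: a task counts if any of its K trajectories is correct.
    pass_k = sum(any(r) for r in results) / n
    # Self-choice: a task counts only if the selected trajectory is correct.
    chosen = sum(r[i] for r, i in zip(results, choices)) / n
    return pass_k, chosen, pass_k - chosen
```

A positive gap means the agent generates correct answers it then fails to select, which is exactly the limitation described above.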

Verification gap between generation and self-choice. The dashed and dotted curves represent two self-choice strategies, while the diamond denotes a stronger evaluator, GPT-5.

BibTeX

Coming soon.

Contact

For any questions or feedback, please reach out to xiaochu4 [at] andrew.cmu.edu.