Blog · May 11, 2026

Towards Universal Digital Agents

Empirical evidence from OSWorld-Verified that coding agents like Codex and Claude Code can solve many “GUI” tasks through the command line, and why we need universal digital agents.


Over the past year, pure-vision computer-use agents (CUAs) have improved rapidly. Frontier models such as Claude Opus 4.7 and GPT-5.5 now reach 78.0% and 78.7% on OSWorld-Verified, respectively, surpassing the reported human baseline. This progress provides an important foundation for building truly autonomous agents: systems such as OpenClaw, Claude Code, and Codex are increasingly expected not merely to assist users, but to act on their behalf and complete tasks end-to-end. In this vision, the graphical user interface is the final mile.

The past year has also seen rapid scaling along a second axis: CLI intelligence. Products such as Claude Code and Codex are already reshaping software engineering workflows, solving a substantial fraction of everyday programming and operational tasks through command-line interaction. Together, these two trends suggest a natural path toward general autonomous agents: vision-based GUI competence on one side, and increasingly capable CLI-based agency on the other.

However, the debate around CLI versus GUI agents remains unresolved. It is still unclear which interface, if either, provides the more direct path toward AGI-like autonomous agents, or whether the two must ultimately be combined into a unified system. In this work, we approach this question through the lens of evaluation and benchmarking. By studying how CLI agents behave on tasks originally designed as GUI benchmarks, we aim to provide empirical evidence and analysis for understanding the relationship between these two forms of agency. This perspective may help clarify what each interface is actually testing, where their strengths differ, and how they might be integrated. It also motivates a broader framing: universal digital agents, systems that can fluidly operate across the command line, graphical interfaces, filesystems, applications, and web environments to accomplish user goals.

Our experiment is intentionally simple, but the results may be surprising. We evaluate coding-agent harnesses, including Claude Code with Opus 4.7 and Codex with GPT-5.5, on OSWorld-Verified, a widely adopted benchmark for computer-use agents. Concretely, for each evaluation environment, we deploy the corresponding coding agent inside the environment and directly pass the benchmark task query to the agent for execution, without adding a specialized GUI-control policy. Further details of the evaluation protocol are provided in Appendix A. Table 1 summarizes the resulting performance.
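As a rough illustration of the protocol described above, the sketch below hands a task query to an agent command unchanged and records what comes back. It is a minimal sketch under stated assumptions, not the harness used for the reported runs: the `Task` type, the `run_task` helper, and the `agent_argv` parameter are all hypothetical names, and the actual invocation of each coding agent is left to the caller because the text does not specify it.

```python
import subprocess
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical container for one benchmark task."""
    task_id: str
    query: str  # the natural-language instruction from the benchmark


def build_command(agent_argv: Sequence[str], task: Task) -> list[str]:
    # The query is appended as a single argument, unmodified:
    # no GUI-control policy or extra prompt is wrapped around it.
    return [*agent_argv, task.query]


def run_task(agent_argv: Sequence[str], task: Task, timeout_s: int = 1800) -> dict:
    """Run the agent once inside the current environment.

    agent_argv is an assumption: whatever command starts the coding
    agent in non-interactive mode on the evaluation machine.
    """
    completed = subprocess.run(
        build_command(agent_argv, task),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    # Task success is decided afterwards by the benchmark's own checker,
    # not by the exit code recorded here.
    return {
        "task_id": task.task_id,
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
    }
```

Keeping the agent command as a parameter means the same loop can drive any harness, which matches the idea of swapping the agent while leaving the benchmark query untouched.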

| Model | OSWorld-V. | Audited |
| --- | --- | --- |
| Proprietary CUA models | | |
| Claude Opus 4.7 | 78.0 | - |
| GPT-5.5 | 78.7 | - |
| CLI agent harnesses | | |
| Codex w/ GPT-5.5 | 54.0 | 71.8 |
| Claude Code w/ Opus 4.7 | 49.9 | 68.5 |

Table 1. Main results on OSWorld-Verified. Audit details are provided in Appendix A.

We also collected the full evaluation trajectories for both runs; the embedded viewer below lets readers search, filter, and inspect every recorded case.

[Embedded trajectory viewer: 710 local trajectories, searchable index]