Blog · May 10, 2026 · 20 min read

We Spent a Year Building Verifiable Training Tasks for GUI Agents. Here’s What We Learned.

RLVR has scaled in math and code. We tried to do the same for GUI agents: the problems we hit, the shortcuts that didn't work, and what actually did.

Bryan Wang · Project page ↗ · Code ↗ · Paper (coming soon)

Motivation

A year ago, watching RLVR drive generalization gains across SWE-bench and terminal-use, we became convinced that CUA post-training was seriously underexplored. The GUI surface is far richer than code or a shell: every app, every website, every SaaS tool a knowledge worker touches is a potential training environment, and unlike programming tasks, that surface keeps expanding.

The problem is that building verifiable training data for GUI is hard in ways that have no good analog in code. In SWE research, GitHub provides a natural corpus with built-in tests that serve as ground-truth reward signals. GUI has nothing equivalent. Real applications sit behind accounts, rate limits, and terms of service that make them far less accessible than sandboxed repos. And even when you can access them, there is no universal way to check whether a GUI task actually completed successfully. There is no test runner, no programmatic diff. We called these three barriers corpus scarcity, environment inaccessibility, and task unverifiability.

The obvious starting point was human annotation. We ran a study with PhD-level experts who understood the problem well. Each could produce around 3 to 5 verified tasks per working day. RLVR at any meaningful scale needs tens of thousands of tuples, so the numbers simply did not add up.

What was more interesting was what we observed while those experts were working. Almost without exception, they had Claude Code open in a side window, using it to draft reward scripts, running the scripts against the environment, fixing failures, and iterating. The human was not writing logic; they were acting as a harness, manually orchestrating a loop that the agent was already capable of running. The natural question was whether we could close that loop entirely.

Method

An RLVR training tuple for CUA is a triple (t, s, r): a task instruction, a reproducible initial environment state, and a reward function. The challenge is that if you generate these three components independently, there is no guarantee they are consistent with each other. The reward might pass trivially on the initial state, or the environment setup might make the task impossible to complete. Prior approaches tried to filter or verify consistency after the fact. CUA-Gym instead generates all three from a single shared specification, so consistency is guaranteed by construction.

The key design choice is an information barrier between the Generator and Discriminator. The Generator proposes the full (task, environment, reward) triple; the Discriminator then verifies whether the reward actually tests what the task asks, but it cannot see the environment setup script. This forces the reward function to be grounded in the task semantics rather than the implementation details, which closes off the shortcut of simply re-checking what was just constructed.

Data Synthesis Pipeline

Example task (from Figure 1): Can you please send an email to each client listed in the opened Notion page using the template from Gmail? You should attach their transactions in the 'clients.xlsx' table on desktop.

Example context: There should be 10 clients in the Notion database page, each with email listed, and an existing email template in the 'Sent' folder of Gmail. The spreadsheet 'clients.xlsx' is a complete document containing all clients during 2024 Q1–2025 Q4…

[Figure 1: pipeline diagram. Task generation draws on parallel web search, tutorial and product documents, and prepared asset files (e.g. clients.xlsx). An Orchestrator loads the task and context, labels difficulty, domain, and involved apps, prepares VM environments, and spawns the agent loop. The Generator builds initial and golden environments from initial_setup.py and golden_patch.py and revises on rejection; the Discriminator, behind the information barrier, decomposes the rewarding criteria (e.g. emails exist in Sent, content matches the template, recipients match the Notion DB), drafts reward.py, and verifies it in the real environments, feeding back until consensus. Accepted tuples pass a Filter of LLM majority voting (consistency, executability, hack-risk, clarity, difficulty) and teacher-model rollouts reviewed with a VLM-as-a-judge alignment check.]

Figure 1. The CUA-Gym data synthesis pipeline. Task instructions and grounded context (left) feed an Orchestrator that spawns an adversarial Generator–Discriminator loop under a strict information barrier; accepted tuples pass a two-stage Filter.
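For concreteness, the consensus loop at the center of Figure 1 can be sketched as below. The Proposal and Verdict types and the propose / verify / revise callables are illustrative stand-ins for the Generator and Discriminator agents, not the released orchestrator code.

Python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Proposal:
    task: str
    initial_setup: str   # contents of initial_setup.py
    golden_patch: str    # contents of golden_patch.py

@dataclass
class Verdict:
    consensus: bool      # e.g. reward fails on the initial env but passes on the golden env
    reward_script: str   # contents of reward.py
    feedback: str = ""

def synthesize_tuple(
    propose: Callable[[str], Proposal],           # Generator: sees task context and setup skills
    verify: Callable[[str], Verdict],             # Discriminator: sees the task, never the setup scripts
    revise: Callable[[Proposal, str], Proposal],  # Generator revision step driven by feedback
    seed_task: str,
    max_rounds: int = 5,
) -> Optional[Tuple[str, str, str]]:
    """Adversarial co-generation of (task, setup, reward) under the information barrier."""
    proposal = propose(seed_task)
    for _ in range(max_rounds):
        # The Discriminator drafts reward.py from the task semantics alone and checks it
        # against the provisioned environments (it gets VM handles, not initial_setup.py).
        verdict = verify(proposal.task)
        if verdict.consensus:
            return proposal.task, proposal.initial_setup, verdict.reward_script
        proposal = revise(proposal, verdict.feedback)
    return None  # candidates that never reach consensus are discarded before the filter stage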

We first ran this pipeline against the ten desktop applications in the OSWorld environment pool, generating more than 10,000 verified tuples. Trained on these, OpenCUA-72B climbed from around 50 to 59 on OSWorld, which validated that automated co-generation could work. For a moment, it felt like the problem was solved.

But ten applications is a benchmark, not a training distribution. The whole premise of CUA post-training is that the GUI surface is vast, and if training is confined to ten apps, the model has experienced a negligible fraction of it. We had validated the pipeline; we had not yet validated the thesis.

Expanding coverage along the environment axis meant confronting the same accessibility problem from the motivation section, now at larger scale. Real users spend their days in Gmail, Slack, Notion, Salesforce, and dozens of domain-specific tools that no benchmark has ever covered. We tried to instrument some of these directly, building wrappers around live services and scripting account provisioning. The approach did not hold at training scale: parallel rollouts require fresh, isolated environment state on demand, and real applications simply cannot provide that.

The insight that unblocked us was that CUA models do not actually need authentic environments. They need surfaces that behave like the real thing at the interaction level, with state that can be injected, inspected, and reset programmatically. We applied the same coding-agent loop from the reward-generation pipeline to environment synthesis: a Plan Agent drafts the application spec and UI traversal tree, a Dev Agent implements it as a self-contained SPA, and a Web Agent runs Playwright against the live mock, feeds discrepancies back, and iterates until behavior converges.

[Figure 2: multi-agent pipeline for mock-environment synthesis]
Figure 2. Multi-agent pipeline for mock-environment synthesis. A Plan Agent drafts DESIGN.md and TODO.md from grounded research; a Dev Agent implements the SPA; a Web Agent verifies live behaviour via Playwright, feeding discrepancies back for N rounds until convergence.
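For a flavor of the Web Agent's verification step, here is a minimal Playwright check against a running mock app. The URL, selectors, and expected behaviors are invented for illustration and are not taken from the released environments; in practice the checks would be derived from the drafted spec.

Python
# Minimal sketch of a behavior check a Web Agent might run against a live mock app.
from playwright.sync_api import sync_playwright

def check_compose_flow(base_url: str = "http://localhost:3000") -> list[str]:
    discrepancies: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(base_url)

        # Spec expectation: clicking "Compose" opens an empty draft dialog.
        page.click("text=Compose")
        if not page.is_visible("[data-testid='draft-dialog']"):
            discrepancies.append("Compose does not open the draft dialog")

        # Spec expectation: a closed draft shows up under the Drafts folder.
        page.fill("[data-testid='draft-subject']", "Q1 transactions")
        page.click("[data-testid='close-draft']")
        page.click("text=Drafts")
        if page.locator("text=Q1 transactions").count() == 0:
            discrepancies.append("Closed draft is not listed under Drafts")

        browser.close()
    return discrepancies  # non-empty lists are fed back to the Dev Agent for another round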

This produced 94 mock web applications covering communication, productivity, finance, developer tooling, and e-commerce. Combined with the 16 desktop applications from OSWorld, that gives us 110 environments and 32,122 verified RLVR tuples, the largest open CUA training corpus to date.

One thing the pipeline made viscerally clear: once reward generation is automated and environments are reliable, the bottleneck shifts entirely to the queries. A good query for CUA is not just a well-formed instruction. It requires careful thought about what capability it exercises, what difficulty level it targets, and what the initial environment context looks like before the agent starts. Generating a rich, grounded environment context alongside a realistic task description turned out to be significantly harder and more impactful than getting the reward annotation right.

We partially addressed this by grounding query generation in tutorials, online forums, and prepared asset files, which helped calibrate task realism and difficulty. But there is substantial room to improve. Scaling truly difficult, long-horizon queries that reflect how people actually use computers remains an open problem, and one we believe deserves deeper design and research attention.

| Dataset | Platform | Data size | Env. size | Reward | Open |
|---|---|---|---|---|---|
| GUI-Genesis [7] | Mobile | 969 | 1 | Programmatic | No |
| WebArena-Infinity [38] | Web | 1,260 | 10 | Programmatic | Yes |
| InfiniteWeb [36] | Web | 600 | – | Programmatic | No★ |
| UltraCUA [35] | Desktop | 17,000 | 9 | Programmatic | No★ |
| Gym-Anything [1] | Desktop | 7,277 | 193 | VLM | Yes |
| CUA-Gym | Desktop + Web | 32,122 | 110 | Programmatic | Yes |
Table 1. CUA-Gym versus existing CUA RLVR datasets; the CUA-Gym row is ours. Open indicates whether both data and pipeline are publicly released; ★ marks partial release.
CUA-Gym-Hub — 99 mock applications

CUA-Gym-Hub is a suite of 99 self-contained mock applications, each implemented as a single-page React app backed by a unified state-injection / inspection / reset API. Targets are sampled from O*NET occupational taxonomies and the Anthropic Economic Index, biasing coverage toward applications used in real digital knowledge work.
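To make the state-injection / inspection / reset interface concrete, here is a rough sketch of how a training harness could drive one of the mock apps over HTTP. The endpoint paths and payload shapes are guesses for illustration only, not the documented CUA-Gym-Hub API.

Python
# Hypothetical client for a mock app's state API (endpoints and payloads are illustrative).
import requests

class MockAppState:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def reset(self) -> None:
        """Restore the app to its pristine default state before a rollout."""
        requests.post(f"{self.base_url}/api/state/reset", timeout=10).raise_for_status()

    def inject(self, state: dict) -> None:
        """Load a task-specific initial state (what initial_setup.py would do)."""
        requests.post(f"{self.base_url}/api/state/inject", json=state, timeout=10).raise_for_status()

    def inspect(self) -> dict:
        """Read the full application state for a programmatic reward check."""
        resp = requests.get(f"{self.base_url}/api/state", timeout=10)
        resp.raise_for_status()
        return resp.json()

# Example: provision a mail mock, run the agent, then check what ended up in Sent.
mail = MockAppState("http://localhost:3101")
mail.reset()
mail.inject({"folders": {"Sent": [{"subject": "Template: client outreach", "body": "..."}]}})
# ... agent rollout happens here ...
sent_after = mail.inspect()["folders"]["Sent"]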

Results

Each tuple pairs an initial_setup.py script that provisions the environment with a natural-language task and a reward.py script that verifies completion. Interactive examples with replayable trajectories are on the project homepage.
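For a sense of what such a pair looks like, here is a toy reward.py for a hypothetical task ("export the desktop spreadsheet to ~/Desktop/clients.csv, sorted by date"). The task, path, schema, and binary 0/1 reward convention are all invented for illustration and do not come from the released corpus; a matching initial_setup.py would simply place an unsorted clients.xlsx on the desktop.

Python
# reward.py — toy programmatic verifier for a made-up task, not a tuple from the corpus.
# Hypothetical task: "Export the desktop spreadsheet to ~/Desktop/clients.csv, sorted by date."
import csv
from pathlib import Path

def verify() -> float:
    target = Path.home() / "Desktop" / "clients.csv"
    if not target.exists():
        return 0.0                                   # the agent never produced the export
    with target.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows or "date" not in rows[0]:
        return 0.0                                   # wrong schema or empty file
    dates = [row["date"] for row in rows]
    return 1.0 if dates == sorted(dates) else 0.0    # binary reward: sorted or not

if __name__ == "__main__":
    # The trainer executes this inside the VM after the rollout and reads the scalar.
    print(verify())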

We trained two models using GSPO on the full 32K corpus: CUA-Gym-A3B on the Qwen3.5-35B-A3B base, and CUA-Gym-A17B on Qwen3.5-397B-A17B. The first result that genuinely surprised us was how much the smaller model gained. CUA-Gym-A3B lifted 35B-A3B from 54.5 to 62.1 on OSWorld-Verified, effectively matching the unmodified 397B-A17B base with roughly 10× fewer total parameters. Training had compressed a lot of capability into a much smaller compute budget.
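For readers unfamiliar with GSPO, the per-group update is roughly the sketch below: group-normalized advantages combined with a sequence-level, length-normalized importance ratio. This is our shorthand reading of the published GSPO objective, not the training code we ran, so treat the clipping and normalization details as indicative only.

Python
# Rough sketch of a GSPO-style loss for one group of rollouts on the same task (illustrative).
import torch

def gspo_group_loss(
    logp_new: torch.Tensor,   # (G,) summed token log-probs under the current policy
    logp_old: torch.Tensor,   # (G,) summed token log-probs under the rollout policy
    lengths: torch.Tensor,    # (G,) trajectory lengths in tokens
    rewards: torch.Tensor,    # (G,) scalar rewards from reward.py
    clip_eps: float = 0.2,
) -> torch.Tensor:
    # Group-normalized advantage: each rollout is scored relative to its siblings.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Sequence-level importance ratio, length-normalized as in the GSPO formulation.
    ratio = torch.exp((logp_new - logp_old) / lengths)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Clipped surrogate averaged over the group; negated so optimizers minimize it.
    return -torch.min(ratio * adv, clipped * adv).mean()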

At the larger scale, CUA-Gym-A17B pushed the 397B-A17B base from 62.2 to 70.2. Gains at this scale are typically harder to come by — the base model is already strong, and most RL recipes saturate. The fact that the improvement held, and was consistent across most domains, gave us more confidence that the data distribution was doing real work rather than just tuning to benchmark artifacts. The per-domain breakdown in Figure 4 shows the largest gains on multi-application workflows and office tools, which aligns with where the training data had the broadest coverage.

The result we were most uncertain about going in was WebArena. The training corpus has no WebArena tasks in it — only desktop apps and synthesized web mocks. But both checkpoints improve on the held-out WebArena benchmark (A3B: 40.8 to 44.5, A17B: 54.0 to 56.0), which means the mock environments are teaching something that genuinely transfers to real browser interactions, not just pattern-matching the training pool. That was the clearest validation of the synthetic sandbox approach.

| Domain (Qwen3.5-35B-A3B) | n | Base | CUA-Gym | Δ (pp) |
|---|---|---|---|---|
| overall | 369 | 54.5 | 62.1 | +7.6 |
| writer | 31 | 50.0 | 60.0 | +10.0 |
| calc | 47 | 39.0 | 53.9 | +14.9 |
| impress | 22 | 55.0 | 60.0 | +5.0 |
| gimp | 26 | 33.0 | 40.0 | +7.0 |
| chrome | 38 | 65.0 | 72.0 | +7.0 |
| multi_apps | 24 | 30.0 | 51.5 | +21.5 |
| os | 55 | 70.0 | 70.0 | +0.0 |
| vs_code | 23 | 42.0 | 55.6 | +13.6 |
| thunderbird | 18 | 45.0 | 50.0 | +5.0 |

| Domain (Qwen3.5-397B-A17B) | n | Base | CUA-Gym | Δ (pp) |
|---|---|---|---|---|
| overall | 360 | 62.2 | 70.2 | +8.0 |
| writer | 23 | 65.2 | 91.3 | +26.1 |
| calc | 47 | 66.0 | 80.9 | +14.9 |
| impress | 47 | 69.1 | 82.8 | +13.7 |
| gimp | 26 | 50.0 | 61.5 | +11.5 |
| chrome | 46 | 63.5 | 73.9 | +10.4 |
| multi_apps | 93 | 45.9 | 54.7 | +8.8 |
| os | 24 | 79.2 | 87.5 | +8.3 |
| vs_code | 22 | 68.2 | 68.2 | +0.0 |
| thunderbird | 15 | 80.0 | 66.7 | -13.3 |

Figure 4. Per-domain success on OSWorld-Verified for Qwen3.5-35B-A3B (top) and Qwen3.5-397B-A17B (bottom): base score, score after CUA-Gym training, and lift in percentage points; n is the per-domain task count.
| Model | OSWorld-V. | WebArena |
|---|---|---|
| Proprietary models | | |
| Claude Sonnet 4.6 | 72.9 | 65.6 |
| Claude Opus 4.7 | 78.0 | – |
| GPT-5.5 | 78.7 | – |
| Open-source models | | |
| EvoCUA-8B | 46.1 | – |
| EvoCUA-32B | 56.7 | – |
| OpenCUA-32B | 34.8 | – |
| OpenCUA-72B | 45.0 | – |
| Step-GUI-8B | 40.2 | – |
| Kimi-K2.6 | 73.1 | – |
| Ours | | |
| Qwen3.5-35B-A3B | 54.5 | 40.8 |
| Qwen3.5-397B-A17B | 62.2 | 54.0 |
| CUA-Gym-A3B | 62.1 | 44.5 |
| CUA-Gym-A17B | 70.2 | 56.0 |
Table 2. Main results. The Qwen3.5 rows under "Ours" are the untrained base models; CUA-Gym-A3B and CUA-Gym-A17B are our models trained on CUA-Gym data.
Findings

Three things stood out from the analysis beyond the headline numbers. They each answer a question we were genuinely uncertain about going in.

[Figure 5: data scaling curves]
Figure 5. OSWorld-Verified score and training reward vs. RL step on Qwen3.5-35B-A3B across three data scales (1.4K / 3K / 12K verified tuples), initialized from the same SFT checkpoint.
i.

More data, cleaner signal. We ran three RL experiments at 1.4K, 3K, and 12K tuples to check whether rewards from the automated pipeline invited reward hacking at scale. They did not. Training reward and OSWorld success tracked together across all three scales with no oscillation or collapse, and the 12K curve was still climbing at the end of training. The information barrier was doing its job.

[Figure 6: environment scaling]
Figure 6. OSWorld-Verified score under teacher distillation across environment counts. The broad setting (80 envs, 75 trajectories each) outperforms the narrow setting (10 envs, 300 each) despite 4x fewer trajectories per environment.
ii.

Environments and trajectories are not interchangeable. The question we wanted to answer was whether expanding to 110 environments was worth the engineering cost, or whether we could have gotten the same gains by just generating more trajectories from fewer apps. The answer was clear: environment diversity and trajectory volume improve performance along distinct axes. You cannot trade one for the other. This retroactively justified a lot of the work we put into building the mock application suite.

[Figure 7: emergent action batching]
Figure 7. Average tool calls per model step and effective trajectory length during RL training on CUA-Gym-A3B. The policy shifts from roughly 1 call per step at SFT initialization to a stable 1.4 to 1.9 band under RL.
iii.

The model learned to be efficient without being asked. Partway through training, we noticed the policy was packing multiple actions into single turns — clicking through a sequence of menus in one step rather than one at a time. We never optimized for this explicitly. It emerged because RL with group-normalized rewards penalizes long trajectories relative to shorter ones that achieve the same outcome. The result was a 33 to 45 percent reduction in effective trajectory length at matched task success. It is a small reminder that RL optimizes what you measure, and sometimes finds shortcuts you did not anticipate.

Limitations & Future Work

The most immediate limitation is one we already flagged as a bitter lesson: the quality ceiling of CUA-Gym is currently set by the quality of its task queries, and we do not yet have a good answer for how to scale the hard ones. As base models get stronger, the bar for tasks that are hard enough to learn from will keep creeping upward. Generating genuinely difficult, long-horizon queries that reflect how expert users actually work — multi-step workflows that span applications, require planning, and have ambiguous intermediate states — is an unsolved problem. We think it is probably the most important next step for anyone trying to push CUA post-training further.

On the environment side, there is a version of this work that goes much deeper: building a proper open-source software sandbox ecosystem, maintained by domain experts who understand the applications well enough to write realistic state transitions and edge cases. The 94 mocks in CUA-Gym-Hub are a starting point, but they are all generated by the same pipeline with the same biases. A community-built library of training environments — closer in spirit to what OpenAI Gym was for game-playing agents — would be a meaningful infrastructure contribution to the field.

There is also a scope question we did not address. CUA as a category treats the GUI as the primary interaction surface, but real knowledge workers move fluidly between graphical interfaces and the command line — writing a script, running a query, editing a config file, then switching back to a browser. Modeling this as a unified problem and training an agent that reasons about both surfaces together feels like the right long-term framing, and CUA-Gym's data format is general enough to support it.

Finally, two things we want to pursue with CUA-Gym as the platform. First, the RL rollout infrastructure for GUI agents is still immature: environment reset latency, parallel rollout management, and reward evaluation throughput are all bottlenecks that limit how fast the training loop can run. Better tooling here would benefit everyone working in this space. Second, we have only scratched the surface of the RLVR recipe itself — reward shaping, curriculum design, group composition strategies — and CUA-Gym is a controlled enough environment to run real ablations on these questions. We are planning to use it exactly that way.

Citation

If you find CUA-Gym useful, please cite:

BibTeX
@article{cuagym2026,
  title   = {CUA-Gym: Scaling Verifiable Training Environments
             and Tasks for Computer-Use Agents},
  author  = {Anonymous},
  journal = {arXiv preprint},
  year    = {2026},
}