Blog · May 10, 2026 · 20 min read

We Spent a Year Building Verifiable Training Tasks for GUI Agents. Here’s What We Learned.

RLVR has scaled in math and code. We tried to do the same for GUI agents: the problems we hit, the shortcuts that didn't work, and what actually did.

Bryan Wang · Project page ↗ · Code ↗ · Paper (coming soon)

Motivation

A year ago, watching RLVR drive generalization gains across SWE-bench and terminal-use, we became convinced that CUA post-training was seriously underexplored. The GUI surface is far richer than code or a shell: every app, every website, every SaaS tool a knowledge worker touches is a potential training environment, and unlike programming tasks, that surface keeps expanding.

The problem is that building verifiable training data for GUI is hard in ways that have no good analog in code. In SWE research, GitHub provides a natural corpus with built-in tests that serve as ground-truth reward signals. GUI has nothing equivalent. Real applications sit behind accounts, rate limits, and terms of service that make them far less accessible than sandboxed repos. And even when you can access them, there is no universal way to check whether a GUI task actually completed successfully. There is no test runner, no programmatic diff. We called these three barriers corpus scarcity, environment inaccessibility, and task unverifiability.

The obvious starting point was human annotation. We ran a study with PhD-level experts who understood the problem well. Each could produce around 3 to 5 verified tasks per working day. RLVR at any meaningful scale needs tens of thousands of tuples, so the numbers simply did not add up.

What was more interesting was what we observed while those experts were working. Almost without exception, they had Claude Code open in a side window, using it to draft reward scripts, running the scripts against the environment, fixing failures, and iterating. The human was not writing logic; they were acting as a harness, manually orchestrating a loop that the agent was already capable of running. The natural question was whether we could close that loop entirely.

Method

An RLVR training tuple for CUA is a triple (t, s, r): a task instruction, a reproducible initial environment state, and a reward function. The challenge is that if you generate these three components independently, there is no guarantee they are consistent with each other. The reward might pass trivially on the initial state, or the environment setup might make the task impossible to complete. Prior approaches tried to filter or verify consistency after the fact. CUA-Gym instead generates all three from a single shared specification, so consistency is guaranteed by construction.

The key design choice is an information barrier between the Generator and Discriminator. The Generator proposes the full (task, environment, reward) triple; the Discriminator then verifies whether the reward actually tests what the task asks, but it cannot see the environment setup script. This forces the reward function to be grounded in the task semantics rather than the implementation details, which closes off the shortcut of simply re-checking what was just constructed.

Data Synthesis Pipeline

Example task (from Figure 1): Can you please send an email to each client listed in the opened Notion page using the template from Gmail? You should attach their transactions in the 'clients.xlsx' table on desktop.

Example context: There should be 10 clients in the Notion database page, each with email listed, and an existing email template in the 'Sent' folder of Gmail. The spreadsheet 'clients.xlsx' is a complete document containing all clients during 2024 Q1–2025 Q4…

[Figure 1: pipeline diagram. Task generation draws on parallel web search, tutorial and product documents, and prepared asset files (e.g. clients.xlsx). An Orchestrator loads the task and context, labels difficulty, domain, and involved apps, prepares VM environments, and spawns the agent loop. The Generator builds initial and golden environments from initial_setup.py and golden_patch.py and revises on rejection; the Discriminator, behind the information barrier, decomposes the rewarding criteria (e.g. emails exist in Sent, content matches the template, recipients match the Notion DB), drafts reward.py, and verifies it in the real environments, feeding back until consensus. Accepted tuples pass a Filter of LLM majority voting (consistency, executability, hack-risk, clarity, difficulty) and teacher-model rollouts reviewed with a VLM-as-a-judge alignment check.]

Figure 1. The CUA-Gym data synthesis pipeline. Task instructions and grounded context (left) feed an Orchestrator that spawns an adversarial Generator–Discriminator loop under a strict information barrier; accepted tuples pass a two-stage Filter.
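For concreteness, the consensus loop at the center of Figure 1 can be sketched as below. The Proposal and Verdict types and the propose / verify / revise callables are illustrative stand-ins for the Generator and Discriminator agents, not the released orchestrator code.

Python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Proposal:
    task: str
    initial_setup: str   # contents of initial_setup.py
    golden_patch: str    # contents of golden_patch.py

@dataclass
class Verdict:
    consensus: bool      # e.g. reward fails on the initial env but passes on the golden env
    reward_script: str   # contents of reward.py
    feedback: str = ""

def synthesize_tuple(
    propose: Callable[[str], Proposal],           # Generator: sees task context and setup skills
    verify: Callable[[str], Verdict],             # Discriminator: sees the task, never the setup scripts
    revise: Callable[[Proposal, str], Proposal],  # Generator revision step driven by feedback
    seed_task: str,
    max_rounds: int = 5,
) -> Optional[Tuple[str, str, str]]:
    """Adversarial co-generation of (task, setup, reward) under the information barrier."""
    proposal = propose(seed_task)
    for _ in range(max_rounds):
        # The Discriminator drafts reward.py from the task semantics alone and checks it
        # against the provisioned environments (it gets VM handles, not initial_setup.py).
        verdict = verify(proposal.task)
        if verdict.consensus:
            return proposal.task, proposal.initial_setup, verdict.reward_script
        proposal = revise(proposal, verdict.feedback)
    return None  # candidates that never reach consensus are discarded before the filter stage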

We first ran this pipeline against the ten desktop applications in the OSWorld environment pool, generating more than 10,000 verified tuples. Trained on these, OpenCUA-72B climbed from around 50 to 59 on OSWorld, which validated that automated co-generation could work. For a moment, it felt like the problem was solved.

But ten applications is a benchmark, not a training distribution. The whole premise of CUA post-training is that the GUI surface is vast, and if training is confined to ten apps, the model has experienced a negligible fraction of it. We had validated the pipeline; we had not yet validated the thesis.

Expanding coverage along the environment axis meant confronting the same accessibility problem from the motivation section, now at larger scale. Real users spend their days in Gmail, Slack, Notion, Salesforce, and dozens of domain-specific tools that no benchmark has ever covered. We tried to instrument some of these directly, building wrappers around live services and scripting account provisioning. The approach did not hold at training scale: parallel rollouts require fresh, isolated environment state on demand, and real applications simply cannot provide that.

The insight that unblocked us was that CUA models do not actually need authentic environments. They need surfaces that behave like the real thing at the interaction level, with state that can be injected, inspected, and reset programmatically. We applied the same coding-agent loop from the reward-generation pipeline to environment synthesis: a Plan Agent drafts the application spec and UI traversal tree, a Dev Agent implements it as a self-contained SPA, and a Web Agent runs Playwright against the live mock, feeds discrepancies back, and iterates until behavior converges.

[Figure 2: multi-agent pipeline for mock-environment synthesis]
Figure 2. Multi-agent pipeline for mock-environment synthesis. A Plan Agent drafts DESIGN.md and TODO.md from grounded research; a Dev Agent implements the SPA; a Web Agent verifies live behaviour via Playwright, feeding discrepancies back for N rounds until convergence.
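For a flavor of the Web Agent's verification step, here is a minimal Playwright check against a running mock app. The URL, selectors, and expected behaviors are invented for illustration and are not taken from the released environments; in practice the checks would be derived from the drafted spec.

Python
# Minimal sketch of a behavior check a Web Agent might run against a live mock app.
from playwright.sync_api import sync_playwright

def check_compose_flow(base_url: str = "http://localhost:3000") -> list[str]:
    discrepancies: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(base_url)

        # Spec expectation: clicking "Compose" opens an empty draft dialog.
        page.click("text=Compose")
        if not page.is_visible("[data-testid='draft-dialog']"):
            discrepancies.append("Compose does not open the draft dialog")

        # Spec expectation: a closed draft shows up under the Drafts folder.
        page.fill("[data-testid='draft-subject']", "Q1 transactions")
        page.click("[data-testid='close-draft']")
        page.click("text=Drafts")
        if page.locator("text=Q1 transactions").count() == 0:
            discrepancies.append("Closed draft is not listed under Drafts")

        browser.close()
    return discrepancies  # non-empty lists are fed back to the Dev Agent for another round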

This produced 94 mock web applications covering communication, productivity, finance, developer tooling, and e-commerce. Combined with the 16 desktop applications from OSWorld, that gives us 110 environments and 32,122 verified RLVR tuples, the largest open CUA training corpus to date.

One thing the pipeline made viscerally clear: once reward generation is automated and environments are reliable, the bottleneck shifts entirely to the queries. A good query for CUA is not just a well-formed instruction. It requires careful thought about what capability it exercises, what difficulty level it targets, and what the initial environment context looks like before the agent starts. Generating a rich, grounded environment context alongside a realistic task description turned out to be significantly harder and more impactful than getting the reward annotation right.

We partially addressed this by grounding query generation in tutorials, online forums, and prepared asset files, which helped calibrate task realism and difficulty. But there is substantial room to improve. Scaling truly difficult, long-horizon queries that reflect how people actually use computers remains an open problem, and one we believe deserves deeper design and research attention.

| Dataset | Platform | Data size | Env. size | Reward | Open |
|---|---|---|---|---|---|
| GUI-Genesis [7] | Mobile | 969 | 1 | Programmatic | No |
| WebArena-Infinity [38] | Web | 1,260 | 10 | Programmatic | Yes |
| InfiniteWeb [36] | Web | 600 | – | Programmatic | No★ |
| UltraCUA [35] | Desktop | 17,000 | 9 | Programmatic | No★ |
| Gym-Anything [1] | Desktop | 7,277 | 193 | VLM | Yes |
| CUA-Gym | Desktop + Web | 32,122 | 110 | Programmatic | Yes |
Table 1. CUA-Gym versus existing CUA RLVR datasets; the CUA-Gym row is ours. Open indicates whether both data and pipeline are publicly released; ★ marks partial release.
CUA-Gym-Hub — 99 mock applications

CUA-Gym-Hub is a suite of 99 self-contained mock applications, each implemented as a single-page React app backed by a unified state-injection / inspection / reset API. Targets are sampled from O*NET occupational taxonomies and the Anthropic Economic Index, biasing coverage toward applications used in real digital knowledge work.
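To make the state-injection / inspection / reset interface concrete, here is a rough sketch of how a training harness could drive one of the mock apps over HTTP. The endpoint paths and payload shapes are guesses for illustration only, not the documented CUA-Gym-Hub API.

Python
# Hypothetical client for a mock app's state API (endpoints and payloads are illustrative).
import requests

class MockAppState:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def reset(self) -> None:
        """Restore the app to its pristine default state before a rollout."""
        requests.post(f"{self.base_url}/api/state/reset", timeout=10).raise_for_status()

    def inject(self, state: dict) -> None:
        """Load a task-specific initial state (what initial_setup.py would do)."""
        requests.post(f"{self.base_url}/api/state/inject", json=state, timeout=10).raise_for_status()

    def inspect(self) -> dict:
        """Read the full application state for a programmatic reward check."""
        resp = requests.get(f"{self.base_url}/api/state", timeout=10)
        resp.raise_for_status()
        return resp.json()

# Example: provision a mail mock, run the agent, then check what ended up in Sent.
mail = MockAppState("http://localhost:3101")
mail.reset()
mail.inject({"folders": {"Sent": [{"subject": "Template: client outreach", "body": "..."}]}})
# ... agent rollout happens here ...
sent_after = mail.inspect()["folders"]["Sent"]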

Results

Each tuple pairs an initial_setup.py script that provisions the environment with a natural-language task and a reward.py script that verifies completion. Interactive examples with replayable trajectories are on the project homepage.
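For a sense of what such a pair looks like, here is a toy reward.py for a hypothetical task ("export the desktop spreadsheet to ~/Desktop/clients.csv, sorted by date"). The task, path, schema, and binary 0/1 reward convention are all invented for illustration and do not come from the released corpus; a matching initial_setup.py would simply place an unsorted clients.xlsx on the desktop.

Python
# reward.py — toy programmatic verifier for a made-up task, not a tuple from the corpus.
# Hypothetical task: "Export the desktop spreadsheet to ~/Desktop/clients.csv, sorted by date."
import csv
from pathlib import Path

def verify() -> float:
    target = Path.home() / "Desktop" / "clients.csv"
    if not target.exists():
        return 0.0                                   # the agent never produced the export
    with target.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows or "date" not in rows[0]:
        return 0.0                                   # wrong schema or empty file
    dates = [row["date"] for row in rows]
    return 1.0 if dates == sorted(dates) else 0.0    # binary reward: sorted or not

if __name__ == "__main__":
    # The trainer executes this inside the VM after the rollout and reads the scalar.
    print(verify())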

We trained two models using GSPO on the full 32K corpus: CUA-Gym-A3B on the Qwen3.5-35B-A3B base, and CUA-Gym-A17B on Qwen3.5-397B-A17B. The first result that genuinely surprised us was how much the smaller model gained. CUA-Gym-A3B lifted 35B-A3B from 54.5 to 62.1 on OSWorld-Verified, effectively matching the unmodified 397B-A17B base with roughly 10× fewer total parameters. Training had compressed a lot of capability into a much smaller compute budget.
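For readers unfamiliar with GSPO, the per-group update is roughly the sketch below: group-normalized advantages combined with a sequence-level, length-normalized importance ratio. This is our shorthand reading of the published GSPO objective, not the training code we ran, so treat the clipping and normalization details as indicative only.

Python
# Rough sketch of a GSPO-style loss for one group of rollouts on the same task (illustrative).
import torch

def gspo_group_loss(
    logp_new: torch.Tensor,   # (G,) summed token log-probs under the current policy
    logp_old: torch.Tensor,   # (G,) summed token log-probs under the rollout policy
    lengths: torch.Tensor,    # (G,) trajectory lengths in tokens
    rewards: torch.Tensor,    # (G,) scalar rewards from reward.py
    clip_eps: float = 0.2,
) -> torch.Tensor:
    # Group-normalized advantage: each rollout is scored relative to its siblings.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Sequence-level importance ratio, length-normalized as in the GSPO formulation.
    ratio = torch.exp((logp_new - logp_old) / lengths)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Clipped surrogate averaged over the group; negated so optimizers minimize it.
    return -torch.min(ratio * adv, clipped * adv).mean()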

At the larger scale, CUA-Gym-A17B pushed the 397B-A17B base from 62.2 to 70.2. Gains at this scale are typically harder to come by — the base model is already strong, and most RL recipes saturate. The fact that the improvement held, and was consistent across most domains, gave us more confidence that the data distribution was doing real work rather than just tuning to benchmark artifacts. The per-domain breakdown in Figure 4 shows the largest gains on multi-application workflows and office tools, which aligns with where the training data had the broadest coverage.

The result we were most uncertain about going in was WebArena. The training corpus has no WebArena tasks in it — only desktop apps and synthesized web mocks. But both checkpoints improve on the held-out WebArena benchmark (A3B: 40.8 to 44.5, A17B: 54.0 to 56.0), which means the mock environments are teaching something that genuinely transfers to real browser interactions, not just pattern-matching the training pool. That was the clearest validation of the synthetic sandbox approach.

| Domain (Qwen3.5-35B-A3B) | n | Base | CUA-Gym | Δ (pp) |
|---|---|---|---|---|
| overall | 369 | 54.5 | 62.1 | +7.6 |
| writer | 31 | 50.0 | 60.0 | +10.0 |
| calc | 47 | 39.0 | 53.9 | +14.9 |
| impress | 22 | 55.0 | 60.0 | +5.0 |
| gimp | 26 | 33.0 | 40.0 | +7.0 |
| chrome | 38 | 65.0 | 72.0 | +7.0 |
| multi_apps | 24 | 30.0 | 51.5 | +21.5 |
| os | 55 | 70.0 | 70.0 | +0.0 |
| vs_code | 23 | 42.0 | 55.6 | +13.6 |
| thunderbird | 18 | 45.0 | 50.0 | +5.0 |

| Domain (Qwen3.5-397B-A17B) | n | Base | CUA-Gym | Δ (pp) |
|---|---|---|---|---|
| overall | 360 | 62.2 | 70.2 | +8.0 |
| writer | 23 | 65.2 | 91.3 | +26.1 |
| calc | 47 | 66.0 | 80.9 | +14.9 |
| impress | 47 | 69.1 | 82.8 | +13.7 |
| gimp | 26 | 50.0 | 61.5 | +11.5 |
| chrome | 46 | 63.5 | 73.9 | +10.4 |
| multi_apps | 93 | 45.9 | 54.7 | +8.8 |
| os | 24 | 79.2 | 87.5 | +8.3 |
| vs_code | 22 | 68.2 | 68.2 | +0.0 |
| thunderbird | 15 | 80.0 | 66.7 | -13.3 |

Figure 4. Per-domain success on OSWorld-Verified for Qwen3.5-35B-A3B (top) and Qwen3.5-397B-A17B (bottom): base score, score after CUA-Gym training, and lift in percentage points; n is the per-domain task count.
| Model | OSWorld-V. | WebArena |
|---|---|---|
| Proprietary models | | |
| Claude Sonnet 4.6 | 72.9 | 65.6 |
| Claude Opus 4.7 | 78.0 | – |
| GPT-5.5 | 78.7 | – |
| Open-source models | | |
| EvoCUA-8B | 46.1 | – |
| EvoCUA-32B | 56.7 | – |
| OpenCUA-32B | 34.8 | – |
| OpenCUA-72B | 45.0 | – |
| Step-GUI-8B | 40.2 | – |
| Kimi-K2.6 | 73.1 | – |
| Ours | | |
| Qwen3.5-35B-A3B | 54.5 | 40.8 |
| Qwen3.5-397B-A17B | 62.2 | 54.0 |
| CUA-Gym-A3B | 62.1 | 44.5 |
| CUA-Gym-A17B | 70.2 | 56.0 |
Table 2. Main results. The Qwen3.5 rows under "Ours" are the untrained base models; CUA-Gym-A3B and CUA-Gym-A17B are our models trained on CUA-Gym data.
Findings

Three things stood out from the analysis beyond the headline numbers. They each answer a question we were genuinely uncertain about going in.

[Figure 5: data scaling curves]
Figure 5. OSWorld-Verified score and training reward vs. RL step on Qwen3.5-35B-A3B across three data scales (1.4K / 3K / 12K verified tuples), initialized from the same SFT checkpoint.
i.

More data, cleaner signal. We ran three RL experiments at 1.4K, 3K, and 12K tuples to check whether rewards from the automated pipeline invited reward hacking at scale. They did not. Training reward and OSWorld success tracked together across all three scales with no oscillation or collapse, and the 12K curve was still climbing at the end of training. The information barrier was doing its job.

[Figure 6: environment scaling]
Figure 6. OSWorld-Verified score under teacher distillation across environment counts. The broad setting (80 envs, 75 trajectories each) outperforms the narrow setting (10 envs, 300 each) despite 4x fewer trajectories per environment.
ii.

Environments and trajectories are not interchangeable. The question we wanted to answer was whether expanding to 110 environments was worth the engineering cost, or whether we could have gotten the same gains by just generating more trajectories from fewer apps. The answer was clear: environment diversity and trajectory volume improve performance along distinct axes. You cannot trade one for the other. This retroactively justified a lot of the work we put into building the mock application suite.

[Figure 7: emergent action batching]
Figure 7. Average tool calls per model step and effective trajectory length during RL training on CUA-Gym-A3B. The policy shifts from roughly 1 call per step at SFT initialization to a stable 1.4 to 1.9 band under RL.
iii.

The model learned to be efficient without being asked. Partway through training, we noticed the policy was packing multiple actions into single turns — clicking through a sequence of menus in one step rather than one at a time. We never optimized for this explicitly. It emerged because RL with group-normalized rewards penalizes long trajectories relative to shorter ones that achieve the same outcome. The result was a 33 to 45 percent reduction in effective trajectory length at matched task success. It is a small reminder that RL optimizes what you measure, and sometimes finds shortcuts you did not anticipate.

Limitations & Future Work

The most immediate limitation is one we already flagged as a bitter lesson: the quality ceiling of CUA-Gym is currently set by the quality of its task queries, and we do not yet have a good answer for how to scale the hard ones. As base models get stronger, the bar for tasks that are hard enough to learn from will keep creeping upward. Generating genuinely difficult, long-horizon queries that reflect how expert users actually work — multi-step workflows that span applications, require planning, and have ambiguous intermediate states — is an unsolved problem. We think it is probably the most important next step for anyone trying to push CUA post-training further.

On the environment side, there is a version of this work that goes much deeper: building a proper open-source software sandbox ecosystem, maintained by domain experts who understand the applications well enough to write realistic state transitions and edge cases. The 94 mocks in CUA-Gym-Hub are a starting point, but they are all generated by the same pipeline with the same biases. A community-built library of training environments — closer in spirit to what OpenAI Gym was for game-playing agents — would be a meaningful infrastructure contribution to the field.

There is also a scope question we did not address. CUA as a category treats the GUI as the primary interaction surface, but real knowledge workers move fluidly between graphical interfaces and the command line — writing a script, running a query, editing a config file, then switching back to a browser. Modeling this as a unified problem and training an agent that reasons about both surfaces together feels like the right long-term framing, and CUA-Gym's data format is general enough to support it.

Finally, two things we want to pursue with CUA-Gym as the platform. First, the RL rollout infrastructure for GUI agents is still immature: environment reset latency, parallel rollout management, and reward evaluation throughput are all bottlenecks that limit how fast the training loop can run. Better tooling here would benefit everyone working in this space. Second, we have only scratched the surface of the RLVR recipe itself — reward shaping, curriculum design, group composition strategies — and CUA-Gym is a controlled enough environment to run real ablations on these questions. We are planning to use it exactly that way.

Citation

If you find CUA-Gym useful, please cite:

BibTeX
@article{cuagym2026,
  title   = {CUA-Gym: Scaling Verifiable Training Environments
             and Tasks for Computer-Use Agents},
  author  = {Anonymous},
  journal = {arXiv preprint},
  year    = {2026},
}