CUA-Gym

Scaling Training for Computer-Use Agents · Bowen Wang

1

CUA-Gym

Scaling Verifiable Training Environments and Tasks
for Computer-Use Agents

https://cua-gym.xlang.ai

2

RLVR works for math, SWE, terminal-use

MATH: reward = boxed match

Problem

Evaluate the definite integral:

$I = \int_{0}^{\pi/2} \frac{\sin x}{\sin x + \cos x}\, dx$

Agent trajectory

(1) Let $u = \tfrac{\pi}{2} - x$:

$I = \int_{0}^{\pi/2} \frac{\cos u}{\cos u + \sin u}\, du$

(2) Add the two forms:

$2I = \int_{0}^{\pi/2} 1\, dx = \tfrac{\pi}{2}$

(3) $\therefore\; I = \boxed{\dfrac{\pi}{4}}$

$\dfrac{\pi}{4}$ match $\dfrac{\pi}{4}$

reward = 1
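
A boxed-match reward of this kind can be sketched in a few lines. This is a minimal illustration, not the verifier used in any particular RLVR pipeline: the function names and the light normalization (whitespace removal, treating `\dfrac` and `\tfrac` as `\frac`) are assumptions made for the example.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, handling nested braces."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def normalize(answer: str) -> str:
    """Hypothetical normalization: drop whitespace, unify fraction macros."""
    answer = re.sub(r"\s+", "", answer)
    return answer.replace(r"\dfrac", r"\frac").replace(r"\tfrac", r"\frac")


def boxed_match_reward(response: str, reference: str) -> float:
    """1.0 if the boxed answer equals the reference after normalization, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

String matching like this is cheap and deterministic, which is what makes the math setting easy to verify; it would miss mathematically equal answers written in a different form.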

SWE: reward = tests pass

bash — pytest

$ pytest test_cognitoidp.py::test_global_sign_out
FAILED test_global_sign_out
  AssertionError: expected NotAuthorizedException,
    got successful response — token not revoked.

# agent patches cognitoidp/models.py:
$ pytest test_cognitoidp.py::test_global_sign_out
PASSED test_global_sign_out
1 passed in 0.42s

reward = 1

🤔 but how about computer-use agents?

3

Take a real computer-use task.

— example —

Can you please send an email to each client listed in the opened Notion page using the template from Gmail? You should attach their transactions in the Excel 'clients.xlsx' table on Desktop.

Inbox — mail.google.com

reward = ??? → How to scale this up?

4

We hand reward writing off to a coding agent.

Orchestrator
$ Loading Task
$ Loading Context
$ Label Properties
  · difficulty
  · domain
  · involved apps
$ Prepare VM environments
$ Spawn Agent Loop
while True:
    setup = generator.run()        # initial_setup.py & golden_patch.py
    reward = discriminator.run()   # reward.py
    if consensus.check():
        return (task, setup, reward)
    # no consensus yet: feed back and retry
Generator
> Loading Task and Skills
> Build envs from initial_setup.py & golden_patch.py
  Initial Env
  Golden Env
> Revise & retry if rejected
Discriminator
> Loading Task and Reward Skills
  Decompose rewarding criteria
  · emails exist in Sent...
  · email content matches template...
  · recipients match Notion DB...
> Draft reward.py & verify in real envs.
> If not fulfilled ⟶ feedback & retry
Information Isolation
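
The generator/discriminator loop above can be sketched as plain Python. Everything here is an assumption made for illustration: the function names, the toy agents, and in particular the consensus criterion (the drafted reward must score the initial env 0 and the golden env 1), which the slide does not spell out.

```python
from typing import Callable, Dict, Optional, Tuple

State = Dict[str, str]
RewardFn = Callable[[State], float]


def agent_loop(
    task: str,
    generate: Callable[[str, str], Tuple[State, State]],
    draft_reward: Callable[[str, str], RewardFn],
    max_rounds: int = 5,
) -> Optional[Tuple[str, Tuple[State, State], RewardFn]]:
    """Retry until the drafted reward separates the initial env from the golden env."""
    feedback = ""
    for _ in range(max_rounds):
        initial_env, golden_env = generate(task, feedback)
        # Information isolation: the reward drafter gets the task text and
        # feedback only, never the generator's golden patch.
        reward = draft_reward(task, feedback)
        low, high = reward(initial_env), reward(golden_env)
        if low == 0.0 and high == 1.0:  # assumed consensus criterion
            return task, (initial_env, golden_env), reward
        feedback = f"initial env scored {low}, golden env scored {high}"
    return None


# Toy stand-ins for the two coding agents (hypothetical behaviour).
def toy_generate(task: str, feedback: str) -> Tuple[State, State]:
    return {"vendor": "BasicWear"}, {"vendor": "UnifiedBrands"}


def toy_draft_reward(task: str, feedback: str) -> RewardFn:
    if not feedback:
        return lambda state: 1.0  # first draft is hackable: it always passes
    return lambda state: 1.0 if state["vendor"] == "UnifiedBrands" else 0.0


result = agent_loop("Consolidate vendors", toy_generate, toy_draft_reward)
```

In this toy run the first reward draft is rejected because it also rewards the untouched initial env, and the second round returns a reward that separates the two environments.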

FILTER

> LLM MAJORITY VOTING

consistency 92
executability 88
hack-risk 95
clarity 90
difficulty 76

> TEACHER MODEL ROLLOUTS

· Calculate $\cos(t,\, r)$
· Check $\mathrm{reward}_{\mathrm{log}}$
· Use VLM-as-a-judge to review $s_n$
· Alignment check

Pass

$\mathcal{D} = \{(t,\, s,\, r)\}$
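
Two of the filter's ingredients, majority voting and $\cos(t,\, r)$, can be sketched as follows. The slide does not say how $t$ and $r$ are embedded or which thresholds are used, so the vector inputs, the `min_cos` value, and all function names below are assumptions.

```python
import math
from typing import Dict, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True when strictly more than half of the judges vote to keep."""
    return sum(votes) * 2 > len(votes)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def passes_filter(
    votes_by_criterion: Dict[str, Sequence[bool]],
    task_vec: Sequence[float],
    reward_vec: Sequence[float],
    min_cos: float = 0.5,  # hypothetical threshold
) -> bool:
    """Keep a (t, s, r) triple only if every criterion wins its vote and t, r align."""
    voted_ok = all(majority_vote(v) for v in votes_by_criterion.values())
    return voted_ok and cosine(task_vec, reward_vec) >= min_cos
```

A triple would be kept by `passes_filter({"clarity": [True, True, False]}, [1.0, 0.0], [1.0, 0.0])` and dropped if any criterion lost its vote or the two vectors were orthogonal.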

5

CUA RLVR still needs another scale axis: environments.

Three limitations stand in the way of training computer-use agents at scale.

We build CUA-Gym-Hub: 98 self-contained mock applications across 7 domains — all state-injectable and parallel-isolated.

(1) OSWorld & WebArena cover only limited applications, drifting further from the authentic domains where computer-use agents are expected to operate.

(2) Agents need to transfer quickly to unseen domains and apps, so the training distribution has to cover the long tail, not just a few canonical desktop tools.

(3) Real apps can't be deployed as RL environments: they are not sandboxes (no state injection, no parallel isolation, no resettability).

Communication & Social (18)

discord

dingtalk

facebook

feishu

gmail

instagram

microsoft-teams

outlook-web

slack

twitter

wechat

weibo

xiaohongshu

zhihu

zoom-web

Productivity & Documents (16)

airtable

asana

canvas-lms

google-calendar

google-docs

google-drive

google-sheets

jira

lattice

linear

lucidchart

miro

monday

notion

openreview

trello

Development & Cloud (12)

aws-console

azure

aliyun

circleci

cloudflare

datadog

github

gitlab

postman

sentry

vercel

wandb

Finance & Enterprise (19)

bamboohr

clio

coinbase

contractbook

docusign

expensify

greenhouse

gusto

hubspot

hubspot-marketing

paypal

quickbooks

robinhood

salesforce

sap

servicenow

stripe-dashboard

tradingview

workday

E-commerce & Travel (11)

amazon

amazon-seller

booking-com

ebay

expedia

instacart

shopify-admin

taobao-seller

tripadvisor

uber-eats

woocommerce

Analytics & Marketing (10)

amplitude

google-ads

google-analytics

hotjar

klaviyo

looker-studio

mailchimp

meta-ads

mixpanel

tableau

Other (8)

12306

adp

epic-health

google-flights

westlaw

youtube

zendesk

zillow

6

How do we scale environments?

(1) Sources: Start from real-world software-use distributions sampled from O*NET and the Anthropic Economic Index, biasing coverage toward authentic digital knowledge work.

(2) Plan / Dev / Web agents: A multi-agent coding pipeline plans the spec, codes the frontend, and Playwright-tests the UI/UX over N rounds of dev ↔ web feedback.

(3) Engineering: An engineering pass fixes the contract (data schema, state-injection endpoints, and a SKILL.md) so every mock is a deterministic, resettable RL sandbox.

7

How do we combine environments and tasks?

(1) Session isolation: Every URL carries its own session id, so parallel RL workers training on the same mock never see one another's changes. This is critical for distributed rollouts.

(2) State injection: When a task is created, the synthesis pipeline ships a JSON initial state alongside its reward.py and posts it to the mock. Loading ?sid=<task_id> then renders that exact world: emails, project boards, calendars, customer tickets, whatever the task description calls for. A single mock can therefore host arbitrarily many distinct task worlds with no code change.
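
The two properties can be illustrated with an in-memory stand-in for a mock's server-side store. This is a sketch under assumptions, not the CUA-Gym-Hub implementation: the class, its method names, and the toy inbox states are hypothetical, and a real mock would expose this behind HTTP endpoints.

```python
import copy
import uuid
from typing import Any, Dict, Optional

State = Dict[str, Any]


class MockAppStore:
    """In-memory stand-in for one mock app's per-session state (hypothetical)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Dict[str, State]] = {}

    def set_state(self, sid: str, state: State) -> None:
        # Deep copies keep the injected snapshot and the live state independent.
        self._sessions[sid] = {
            "initial_state": copy.deepcopy(state),
            "current_state": copy.deepcopy(state),
        }

    def get(self, sid: str) -> Optional[Dict[str, State]]:
        return self._sessions.get(sid)

    def reset(self, sid: str) -> None:
        # Restore the live state from the injected snapshot.
        session = self._sessions[sid]
        session["current_state"] = copy.deepcopy(session["initial_state"])


store = MockAppStore()
sid_a, sid_b = str(uuid.uuid4()), str(uuid.uuid4())

# State injection: two different task worlds on the same mock.
store.set_state(sid_a, {"inbox": []})
store.set_state(sid_b, {"inbox": ["newsletter"] * 3})

# Session isolation: a change under sid_a is invisible under sid_b.
store.get(sid_a)["current_state"]["inbox"].append("draft")
```

Keeping both an initial and a current snapshot per session is what lets a reward script compare the two after a rollout, and lets a worker reset the world without touching other sessions.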

Monday, inbox zero

The user logged off Friday with a clean inbox. Three system newsletters trickled in over the weekend. Nothing unread, no drafts.

Mid-sprint, paper deadline approaching

Twenty-five threads piled up over twelve hours — PR review pings, Sentry alerts, an urgent RFC sign-off. One draft sits in Compose.

First morning back from a week off

Eighty-one emails waiting. Forty-one in Primary alone — coworker pings, customer follow-ups, conference invites, recruiters.

8

Example of a CUA-Gym task.

initial_setup.py

"""
Initial Setup: Vendor consolidation - set up products under BasicWear and HomeGoods vendors
Task ID: shopify_adv_005
Domain: shopify_admin_mock
Mock: shopify_admin_mock
"""
import json
import os
import shlex
import subprocess
import time
import uuid
import requests
# --- Config ---
BASE_URL = 'https://cua-gym-shopify-admin.xlang.ai'
sid = str(uuid.uuid4())
# Persist sid for golden_patch.py and reward.py
with open('/tmp/task_web_sid', 'w') as f:
    f.write(sid)
print(f'SID generated: {sid}')
# --- Build initial state ---
# 4 products: prod-001 (BasicWear), prod-002 (LeatherCo), prod-003 (SportStep), prod-004 (HomeGoods)
# MUST NOT have 'UnifiedBrands' as vendor — the task is to consolidate them
state = {
    "store": {
        "id": "store_1",
        "name": "Urban Market",
        "email": "admin@urbanmarket.myshopify.com",
        "phone": "+1-503-555-0100",
        "domain": "urbanmarket.myshopify.com",
        "customDomain": "www.urbanmarket.com",
        "address": {
            "address1": "456 Commerce Ave",
            "city": "Portland",
            "province": "Oregon",
            "provinceCode": "OR",
            "country": "United States",
            "countryCode": "US",
            "zip": "97201"
        },
        "currency": "USD",
        "timezone": "(GMT-08:00) Pacific Time",
        "weightUnit": "lb",
        "plan": "Shopify",
        "owner": {
            "firstName": "Jordan",
            "lastName": "Park",
            "email": "jordan@urbanmarket.com"
        },
        "createdAt": "2023-03-10T09:00:00Z"
    },
    "products": [
        {
            "id": "prod-001",
            "title": "Classic T-Shirt",
            "bodyHtml": "<p>Comfortable cotton t-shirt</p>",
            "vendor": "BasicWear",
            "productType": "Apparel",
            "handle": "classic-t-shirt",
            "status": "active",
            "tags": ["cotton", "basics", "casual"],
            "images": [
                {
                    "id": "img_001",
                    "src": "https://placehold.co/400x400/e8f5e9/2e7d32?text=T-Shirt",
                    "alt": "Classic T-Shirt",
                    "position": 1
                }
            ],
            "variants": [
                {
                    "id": "var_001_s",
                    "productId": "prod-001",
                    "title": "Small",
                    "price": "19.99",
                    "compareAtPrice": "24.99",
                    "sku": "BW-TS-SM",
                    "inventoryQuantity": 45,
                    "option1": "Small",
                    "option2": None,
                    "position": 1
                },
                {
                    "id": "var_001_m",
                    "productId": "prod-001",
                    "title": "Medium",
                    "price": "19.99",
                    "compareAtPrice": "24.99",
                    "sku": "BW-TS-MD",
                    "inventoryQuantity": 60,
                    "option1": "Medium",
                    "option2": None,
                    "position": 2
                },
                {
                    "id": "var_001_l",
                    "productId": "prod-001",
                    "title": "Large",
                    "price": "19.99",
                    "compareAtPrice": "24.99",
                    "sku": "BW-TS-LG",
                    "inventoryQuantity": 35,
                    "option1": "Large",
                    "option2": None,
                    "position": 3
                }
            ],
            "options": [
                {"id": "opt_001", "name": "Size", "position": 1, "values": ["Small", "Medium", "Large"]}
            ],
            "collections": ["col_001"],
            "createdAt": "2023-05-12T10:00:00Z",
            "updatedAt": "2024-11-20T14:30:00Z"
        },
        {
            "id": "prod-002",
            "title": "Leather Wallet",
            "bodyHtml": "<p>Premium leather bifold wallet</p>",
            "vendor": "LeatherCo",
            "productType": "Accessories",
            "handle": "leather-wallet",
            "status": "active",
            "tags": ["leather", "wallet", "accessories"],
            "images": [
                {
                    "id": "img_002",
                    "src": "https://placehold.co/400x400/fce8d5/7b3f00?text=Wallet",
                    "alt": "Leather Wallet",
                    "position": 1
                }
            ],
            "variants": [
                {
                    "id": "var_002_bk",
                    "productId": "prod-002",
                    "title": "Black",
                    "price": "49.99",
                    "compareAtPrice": "65.00",
                    "sku": "LC-WL-BK",
                    "inventoryQuantity": 28,
                    "option1": "Black",
                    "option2": None,
                    "position": 1
                },
                {
                    "id": "var_002_br",
                    "productId": "prod-002",
                    "title": "Brown",
                    "price": "49.99",
                    "compareAtPrice": "65.00",
                    "sku": "LC-WL-BR",
                    "inventoryQuantity": 22,
                    "option1": "Brown",
                    "option2": None,
                    "position": 2
                }
            ],
            "options": [
                {"id": "opt_002", "name": "Color", "position": 1, "values": ["Black", "Brown"]}
            ],
            "collections": ["col_002"],
            "createdAt": "2023-06-05T11:00:00Z",
            "updatedAt": "2024-10-15T09:45:00Z"
        },
        {
            "id": "prod-003",
            "title": "Running Shoes",
            "bodyHtml": "<p>Lightweight running shoes for everyday use</p>",
            "vendor": "SportStep",
            "productType": "Footwear",
            "handle": "running-shoes",
            "status": "active",
            "tags": ["running", "shoes", "sport", "athletic"],
            "images": [
                {
                    "id": "img_003",
                    "src": "https://placehold.co/400x400/e3f2fd/1565c0?text=Running+Shoes",
                    "alt": "Running Shoes",
                    "position": 1
                }
            ],
            "variants": [
                {
                    "id": "var_003_8",
                    "productId": "prod-003",
                    "title": "Size 8",
                    "price": "89.99",
                    "compareAtPrice": "110.00",
                    "sku": "SS-RS-8",
                    "inventoryQuantity": 18,
                    "option1": "Size 8",
                    "option2": None,
                    "position": 1
                },
                {
                    "id": "var_003_9",
                    "productId": "prod-003",
                    "title": "Size 9",
                    "price": "89.99",
                    "compareAtPrice": "110.00",
                    "sku": "SS-RS-9",
                    "inventoryQuantity": 24,
                    "option1": "Size 9",
                    "option2": None,
                    "position": 2
                },
                {
                    "id": "var_003_10",
                    "productId": "prod-003",
                    "title": "Size 10",
                    "price": "89.99",
                    "compareAtPrice": "110.00",
                    "sku": "SS-RS-10",
                    "inventoryQuantity": 15,
                    "option1": "Size 10",
                    "option2": None,
                    "position": 3
                }
            ],
            "options": [
                {"id": "opt_003", "name": "Size", "position": 1, "values": ["Size 8", "Size 9", "Size 10"]}
            ],
            "collections": ["col_003"],
            "createdAt": "2023-07-20T08:00:00Z",
            "updatedAt": "2024-09-30T16:00:00Z"
        },
        {
            "id": "prod-004",
            "title": "Ceramic Mug",
            "bodyHtml": "<p>Hand-crafted ceramic mug</p>",
            "vendor": "HomeGoods",
            "productType": "Kitchen",
            "handle": "ceramic-mug",
            "status": "active",
            "tags": ["ceramic", "mug", "kitchen", "handmade"],
            "images": [
                {
                    "id": "img_004",
                    "src": "https://placehold.co/400x400/fff3e0/e65100?text=Ceramic+Mug",
                    "alt": "Ceramic Mug",
                    "position": 1
                }
            ],
            "variants": [
                {
                    "id": "var_004_wh",
                    "productId": "prod-004",
                    "title": "White",
                    "price": "22.00",
                    "compareAtPrice": "28.00",
                    "sku": "HG-MG-WH",
                    "inventoryQuantity": 40,
                    "option1": "White",
                    "option2": None,
                    "position": 1
                },
                {
                    "id": "var_004_bl",
                    "productId": "prod-004",
                    "title": "Blue",
                    "price": "22.00",
                    "compareAtPrice": "28.00",
                    "sku": "HG-MG-BL",
                    "inventoryQuantity": 30,
                    "option1": "Blue",
                    "option2": None,
                    "position": 2
                }
            ],
            "options": [
                {"id": "opt_004", "name": "Color", "position": 1, "values": ["White", "Blue"]}
            ],
            "collections": ["col_004"],
            "createdAt": "2023-08-01T12:00:00Z",
            "updatedAt": "2024-12-05T11:20:00Z"
        }
    ],
    "collections": [
        {
            "id": "col_001",
            "title": "Apparel",
            "bodyHtml": "<p>Everyday clothing essentials</p>",
            "handle": "apparel",
            "collectionType": "manual",
            "productIds": ["prod-001"],
            "productsCount": 1,
            "sortOrder": "best-selling",
            "publishedAt": "2023-05-12T10:00:00Z",
            "updatedAt": "2024-11-20T14:30:00Z",
            "image": None
        },
        {
            "id": "col_002",
            "title": "Accessories",
            "bodyHtml": "<p>Premium accessories for every occasion</p>",
            "handle": "accessories",
            "collectionType": "manual",
            "productIds": ["prod-002"],
            "productsCount": 1,
            "sortOrder": "best-selling",
            "publishedAt": "2023-06-05T11:00:00Z",
            "updatedAt": "2024-10-15T09:45:00Z",
            "image": None
        },
        {
            "id": "col_003",
            "title": "Footwear",
            "bodyHtml": "<p>Sport and casual footwear</p>",
            "handle": "footwear",
            "collectionType": "manual",
            "productIds": ["prod-003"],
            "productsCount": 1,
            "sortOrder": "best-selling",
            "publishedAt": "2023-07-20T08:00:00Z",
            "updatedAt": "2024-09-30T16:00:00Z",
            "image": None
        },
        {
            "id": "col_004",
            "title": "Kitchen & Home",
            "bodyHtml": "<p>Hand-crafted home essentials</p>",
            "handle": "kitchen-home",
            "collectionType": "manual",
            "productIds": ["prod-004"],
            "productsCount": 1,
            "sortOrder": "best-selling",
            "publishedAt": "2023-08-01T12:00:00Z",
            "updatedAt": "2024-12-05T11:20:00Z",
            "image": None
        }
    ],
    "orders": [
        {
            "id": "order_001",
            "name": "#1001",
            "orderNumber": 1001,
            "email": "maya.thompson@example.com",
            "financialStatus": "paid",
            "fulfillmentStatus": "fulfilled",
            "currency": "USD",
            "subtotalPrice": "19.99",
            "totalShippingPrice": "5.99",
            "totalTax": "1.60",
            "totalDiscounts": "0.00",
            "totalPrice": "27.58",
            "lineItems": [
                {
                    "id": "li_001",
                    "productId": "prod-001",
                    "variantId": "var_001_m",
                    "title": "Classic T-Shirt",
                    "variantTitle": "Medium",
                    "quantity": 1,
                    "price": "19.99",
                    "sku": "BW-TS-MD",
                    "fulfillmentStatus": "fulfilled"
                }
            ],
            "customer": {"id": "cust_001", "firstName": "Maya", "lastName": "Thompson", "email": "maya.thompson@example.com"},
            "shippingAddress": {"address1": "789 Oak Street", "city": "Seattle", "province": "Washington", "provinceCode": "WA", "country": "United States", "countryCode": "US", "zip": "98101"},
            "billingAddress": {"address1": "789 Oak Street", "city": "Seattle", "province": "Washington", "provinceCode": "WA", "country": "United States", "countryCode": "US", "zip": "98101"},
            "note": "",
            "tags": [],
            "discountCodes": [],
            "timeline": [
                {"id": "evt_001_1", "type": "created", "message": "Order placed", "createdAt": "2025-01-10T14:30:00Z", "user": None},
                {"id": "evt_001_2", "type": "fulfilled", "message": "Order fulfilled", "createdAt": "2025-01-11T10:00:00Z", "user": "jordan@urbanmarket.com"}
            ],
            "createdAt": "2025-01-10T14:30:00Z",
            "updatedAt": "2025-01-11T10:00:00Z"
        },
        {
            "id": "order_002",
            "name": "#1002",
            "orderNumber": 1002,
            "email": "carlos.rivera@example.com",
            "financialStatus": "paid",
            "fulfillmentStatus": None,
            "currency": "USD",
            "subtotalPrice": "49.99",
            "totalShippingPrice": "7.99",
            "totalTax": "4.00",
            "totalDiscounts": "0.00",
            "totalPrice": "61.98",
            "lineItems": [
                {
                    "id": "li_002",
                    "productId": "prod-002",
                    "variantId": "var_002_bk",
                    "title": "Leather Wallet",
                    "variantTitle": "Black",
                    "quantity": 1,
                    "price": "49.99",
                    "sku": "LC-WL-BK",
                    "fulfillmentStatus": None
                }
            ],
            "customer": {"id": "cust_002", "firstName": "Carlos", "lastName": "Rivera", "email": "carlos.rivera@example.com"},
            "shippingAddress": {"address1": "321 Pine Ave", "city": "San Francisco", "province": "California", "provinceCode": "CA", "country": "United States", "countryCode": "US", "zip": "94102"},
            "billingAddress": {"address1": "321 Pine Ave", "city": "San Francisco", "province": "California", "provinceCode": "CA", "country": "United States", "countryCode": "US", "zip": "94102"},
            "note": "",
            "tags": [],
            "discountCodes": [],
            "timeline": [
                {"id": "evt_002_1", "type": "created", "message": "Order placed", "createdAt": "2025-01-15T09:00:00Z", "user": None}
            ],
            "createdAt": "2025-01-15T09:00:00Z",
            "updatedAt": "2025-01-15T09:00:00Z"
        }
    ],
    "customers": [
        {
            "id": "cust_001",
            "firstName": "Maya",
            "lastName": "Thompson",
            "email": "maya.thompson@example.com",
            "phone": "+1-206-555-0123",
            "state": "enabled",
            "ordersCount": 1,
            "totalSpent": "27.58",
            "note": "",
            "tags": [],
            "taxExempt": False,
            "verifiedEmail": True,
            "acceptsMarketing": True,
            "defaultAddress": {
                "address1": "789 Oak Street",
                "city": "Seattle",
                "province": "Washington",
                "provinceCode": "WA",
                "country": "United States",
                "countryCode": "US",
                "zip": "98101"
            },
            "createdAt": "2024-12-01T10:00:00Z",
            "updatedAt": "2025-01-10T14:30:00Z"
        },
        {
            "id": "cust_002",
            "firstName": "Carlos",
            "lastName": "Rivera",
            "email": "carlos.rivera@example.com",
            "phone": "+1-415-555-0456",
            "state": "enabled",
            "ordersCount": 1,
            "totalSpent": "61.98",
            "note": "",
            "tags": [],
            "taxExempt": False,
            "verifiedEmail": True,
            "acceptsMarketing": False,
            "defaultAddress": {
                "address1": "321 Pine Ave",
                "city": "San Francisco",
                "province": "California",
                "provinceCode": "CA",
                "country": "United States",
                "countryCode": "US",
                "zip": "94102"
            },
            "createdAt": "2024-11-15T08:00:00Z",
            "updatedAt": "2025-01-15T09:00:00Z"
        }
    ],
    "discounts": [
        {
            "id": "disc_001",
            "title": "Spring Sale",
            "code": "SPRING10",
            "type": "percentage",
            "value": "10",
            "status": "active",
            "appliesTo": "all",
            "appliesToIds": [],
            "minimumRequirement": "none",
            "minimumValue": "0",
            "customerEligibility": "all",
            "usageLimit": None,
            "usageCount": 12,
            "oncePerCustomer": False,
            "startsAt": "2025-03-01T00:00:00Z",
            "endsAt": "2025-04-30T23:59:59Z",
            "createdAt": "2025-02-20T10:00:00Z"
        }
    ],
    "draftOrders": [],
    "giftCards": [],
    "analytics": {
        "dailyMetrics": [
            {
                "date": "2025-01-15",
                "totalSales": 320.50,
                "ordersCount": 5,
                "onlineStoreSessions": 142,
                "returningCustomerRate": 0.28,
                "conversionRate": 0.035,
                "averageOrderValue": 64.10,
                "topProducts": [
                    {"productId": "prod-002", "title": "Leather Wallet", "quantity": 3, "revenue": 149.97},
                    {"productId": "prod-003", "title": "Running Shoes", "quantity": 2, "revenue": 179.98}
                ],
                "topReferrers": [{"source": "google", "sessions": 67}, {"source": "instagram", "sessions": 34}],
                "sessionsByLocation": [{"country": "United States", "sessions": 118}, {"country": "Canada", "sessions": 24}]
            }
        ],
        "totalSalesThisMonth": 4250.75,
        "totalOrdersThisMonth": 68,
        "totalSessionsThisMonth": 2840
    },
    "pages": [
        {
            "id": "page_001",
            "title": "About Us",
            "handle": "about-us",
            "bodyHtml": "<h2>Our Story</h2><p>Urban Market was founded in 2023 to bring together the best in everyday essentials from trusted vendors.</p>",
            "published": True,
            "createdAt": "2023-03-15T10:00:00Z",
            "updatedAt": "2024-08-20T14:00:00Z"
        },
        {
            "id": "page_002",
            "title": "Contact",
            "handle": "contact",
            "bodyHtml": "<h2>Get In Touch</h2><p>Email us at support@urbanmarket.com or call +1-503-555-0100.</p>",
            "published": True,
            "createdAt": "2023-03-15T10:00:00Z",
            "updatedAt": "2024-06-10T09:30:00Z"
        }
    ],
    "blogPosts": [
        {
            "id": "blog_001",
            "title": "Spring Collection Now Available",
            "author": "Jordan Park",
            "bodyHtml": "<p>We're thrilled to announce our spring collection is now live on the store!</p>",
            "handle": "spring-collection",
            "tags": ["news", "collection"],
            "published": True,
            "publishedAt": "2025-03-01T09:00:00Z",
            "createdAt": "2025-02-28T16:00:00Z"
        }
    ],
    "navigationMenus": [
        {
            "id": "nav_001",
            "title": "Main Menu",
            "handle": "main-menu",
            "items": [
                {"id": "nav_item_001", "title": "Home", "url": "/", "position": 1, "children": []},
                {"id": "nav_item_002", "title": "Shop", "url": "/collections/all", "position": 2, "children": []},
                {"id": "nav_item_003", "title": "About", "url": "/pages/about-us", "position": 3, "children": []}
            ]
        }
    ],
    "settings": {
        "storeName": "Urban Market",
        "storeEmail": "admin@urbanmarket.myshopify.com",
        "senderEmail": "noreply@urbanmarket.com",
        "storePhone": "+1-503-555-0100",
        "currency": "USD",
        "timezone": "(GMT-08:00) Pacific Time",
        "weightUnit": "lb"
    }
}
# --- Inject state ---
resp = requests.post(
    f'{BASE_URL}/post?sid={sid}',
    json={'action': 'set', 'state': state},
    timeout=30
)
assert resp.status_code == 200, f'State injection failed: {resp.text}'
print(f'State injected: sid={sid}')
# --- Verify ---
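go_resp = requests.get(f'{BASE_URL}/go?sid={sid}', timeout=10)
go_resp.raise_for_status()
go = go_resp.json()
# Check both snapshots, since reward.py later compares them
assert go.get('initial_state') is not None, 'initial_state is None after injection'
assert go.get('current_state') is not None, 'current_state is None after injection'
print('Verified: initial_state and current_state are set')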
# --- Launch browser ---
def launch_gui(command, delay_sec=1.0):
    env = os.environ.copy()
    env['DISPLAY'] = ':0'
    subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        env=env,
    )
    time.sleep(delay_sec)
launch_gui(f'google-chrome "{BASE_URL}/?sid={sid}"', delay_sec=2.0)
print(f'GUI_READY: launched browser at {BASE_URL}/?sid={sid}')
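
The setup script persists the sid for golden_patch.py, which is not shown here. As a rough idea of what a golden solution for this task amounts to, the following sketch applies the expected edits to a state dictionary in memory. It is an assumption-laden illustration: the function name, the choice to append the phrase as its own `<p>` element, and the two-product toy state are all hypothetical, and the real script presumably talks to the mock rather than to a local dict.

```python
import copy
from typing import Any, Dict

PHRASE = "Now part of the UnifiedBrands family"
SOURCE_VENDORS = {"BasicWear", "HomeGoods"}
TARGET_VENDOR = "UnifiedBrands"


def apply_golden_patch(state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of state with the vendor consolidation applied."""
    patched = copy.deepcopy(state)
    for product in patched.get("products", []):
        if product.get("vendor") in SOURCE_VENDORS:
            product["vendor"] = TARGET_VENDOR
            # Append the phrase at the end of the existing description.
            product["bodyHtml"] = f'{product.get("bodyHtml", "")}<p>{PHRASE}</p>'
    return patched


# Toy state with one product to change and one to leave alone.
toy_state = {
    "products": [
        {"id": "prod-001", "vendor": "BasicWear", "bodyHtml": "<p>Comfortable cotton t-shirt</p>"},
        {"id": "prod-002", "vendor": "LeatherCo", "bodyHtml": "<p>Premium leather bifold wallet</p>"},
    ]
}
golden_state = apply_golden_patch(toy_state)
```

Products from other vendors pass through unchanged, which is the behaviour the task's "Don't touch products from other vendors" instruction asks for.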

Trajectory

Task Instruction (hard)

I want to do a vendor consolidation. Change the vendor for all products currently under 'BasicWear' and 'HomeGoods' to 'UnifiedBrands'. Then update the product descriptions for those items to include 'Now part of the UnifiedBrands family' at the end. Don't touch products from other vendors.

step 01 / 19

CLICK

Thought

I can see the Shopify Mock Admin dashboard. Let me navigate to the Products section to see all products.

reward.py

"""
Reward Script: Vendor consolidation — change vendor for BasicWear & HomeGoods products to
UnifiedBrands, and append 'Now part of the UnifiedBrands family' to their descriptions.
Task ID: shopify_adv_005
Domain: shopify_admin_mock
Scoring:
  Component 1: prod-001 vendor changed to 'UnifiedBrands'            (0.25 pts)
  Component 2: prod-004 vendor changed to 'UnifiedBrands'            (0.25 pts)
  Component 3: prod-001 bodyHtml contains consolidation phrase        (0.20 pts)
  Component 4: prod-004 bodyHtml contains consolidation phrase        (0.20 pts)
  Component 5: prod-002 and prod-003 are unchanged (negative guard)  (0.10 pts)
  Total: 1.0
"""
import sys
import requests
BASE_URL = 'https://cua-gym-shopify-admin.xlang.ai'
CONSOLIDATION_PHRASE = 'Now part of the UnifiedBrands family'
TARGET_VENDOR = 'UnifiedBrands'
# --- Read SID ---
try:
    with open('/tmp/task_web_sid') as f:
        sid = f.read().strip()
    if not sid:
        raise ValueError('sid is empty')
except Exception as e:
    print(f'CRITICAL: Cannot read sid from /tmp/task_web_sid: {e}')
    print('REWARD: 0.0')
    sys.exit(0)
# --- Fetch state ---
try:
    resp = requests.get(f'{BASE_URL}/go?sid={sid}', timeout=15)
    resp.raise_for_status()
    data = resp.json()
except Exception as e:
    print(f'CRITICAL: Cannot fetch state from {BASE_URL}/go?sid={sid}: {e}')
    print('REWARD: 0.0')
    sys.exit(0)
# `or {}` also covers a JSON null, which .get(..., {}) would pass through as None
initial_state = data.get('initial_state') or {}
current_state = data.get('current_state') or {}
if not current_state:
    print('CRITICAL: current_state is empty — no state injected for this sid')
    print('REWARD: 0.0')
    sys.exit(0)
if initial_state == current_state:
    print('INFO: current_state == initial_state — no changes applied (agent did nothing)')
    print('REWARD: 0.0')
    sys.exit(0)
def get_product_by_id(products, prod_id):
    for p in products:
        if p.get('id') == prod_id:
            return p
    return None
def verify_task():
    total_score = 0.0
    cur_products = current_state.get('products', [])
    init_products = initial_state.get('products', [])
    # Component 1: prod-001 vendor changed to 'UnifiedBrands' (0.25 pts)
    try:
        cur_p001 = get_product_by_id(cur_products, 'prod-001')
        init_p001 = get_product_by_id(init_products, 'prod-001')
        if cur_p001 is None:
            print('FAIL: Component 1 — prod-001 not found in current_state')
        elif init_p001 is None:
            print('FAIL: Component 1 — prod-001 not found in initial_state (data integrity issue)')
        else:
            init_vendor = init_p001.get('vendor', '')
            cur_vendor = cur_p001.get('vendor', '')
            # Verify the vendor was changed FROM initial (BasicWear) TO UnifiedBrands
            if init_vendor != TARGET_VENDOR and cur_vendor == TARGET_VENDOR:
                print(f'PASS: Component 1 — prod-001 vendor changed from "{init_vendor}" to "{cur_vendor}" (0.25 pts)')
                total_score += 0.25
            elif cur_vendor == TARGET_VENDOR and init_vendor == TARGET_VENDOR:
                print(f'FAIL: Component 1 — prod-001 vendor was already "{TARGET_VENDOR}" in initial_state (precondition, not task change)')
            else:
                print(f'FAIL: Component 1 — prod-001 vendor is "{cur_vendor}", expected "{TARGET_VENDOR}" (was "{init_vendor}")')
    except Exception as e:
        print(f'ERROR: Component 1 — {e}')
    # Component 2: prod-004 vendor changed to 'UnifiedBrands' (0.25 pts)
    try:
        cur_p004 = get_product_by_id(cur_products, 'prod-004')
        init_p004 = get_product_by_id(init_products, 'prod-004')
        if cur_p004 is None:
            print('FAIL: Component 2 — prod-004 not found in current_state')
        elif init_p004 is None:
            print('FAIL: Component 2 — prod-004 not found in initial_state (data integrity issue)')
        else:
            init_vendor = init_p004.get('vendor', '')
            cur_vendor = cur_p004.get('vendor', '')
            if init_vendor != TARGET_VENDOR and cur_vendor == TARGET_VENDOR:
                print(f'PASS: Component 2 — prod-004 vendor changed from "{init_vendor}" to "{cur_vendor}" (0.25 pts)')
                total_score += 0.25
            elif cur_vendor == TARGET_VENDOR and init_vendor == TARGET_VENDOR:
                print(f'FAIL: Component 2 — prod-004 vendor was already "{TARGET_VENDOR}" in initial_state (precondition, not task change)')
            else:
                print(f'FAIL: Component 2 — prod-004 vendor is "{cur_vendor}", expected "{TARGET_VENDOR}" (was "{init_vendor}")')
    except Exception as e:
        print(f'ERROR: Component 2 — {e}')
    # Component 3: prod-001 bodyHtml contains consolidation phrase (0.20 pts)
    try:
        cur_p001 = get_product_by_id(cur_products, 'prod-001')
        init_p001 = get_product_by_id(init_products, 'prod-001')
        if cur_p001 is None:
            print('FAIL: Component 3 — prod-001 not found in current_state')
        else:
            cur_body = cur_p001.get('bodyHtml', '')
            init_body = init_p001.get('bodyHtml', '') if init_p001 else ''
            # Phrase must be in current but NOT in initial (verifying it was added by the task)
            phrase_in_init = CONSOLIDATION_PHRASE in init_body
            phrase_in_cur = CONSOLIDATION_PHRASE in cur_body
            if not phrase_in_init and phrase_in_cur:
                print(f'PASS: Component 3 — prod-001 bodyHtml contains "{CONSOLIDATION_PHRASE}" (0.20 pts)')
                total_score += 0.20
            elif phrase_in_init:
                print(f'FAIL: Component 3 — phrase already existed in initial_state (precondition, not task change)')
            else:
                print(f'FAIL: Component 3 — prod-001 bodyHtml does not contain "{CONSOLIDATION_PHRASE}". Current: {cur_body[:120]}')
    except Exception as e:
        print(f'ERROR: Component 3 — {e}')
    # Component 4: prod-004 bodyHtml contains consolidation phrase (0.20 pts)
    try:
        cur_p004 = get_product_by_id(cur_products, 'prod-004')
        init_p004 = get_product_by_id(init_products, 'prod-004')
        if cur_p004 is None:
            print('FAIL: Component 4 — prod-004 not found in current_state')
        else:
            cur_body = cur_p004.get('bodyHtml', '')
            init_body = init_p004.get('bodyHtml', '') if init_p004 else ''
            phrase_in_init = CONSOLIDATION_PHRASE in init_body
            phrase_in_cur = CONSOLIDATION_PHRASE in cur_body
            if not phrase_in_init and phrase_in_cur:
                print(f'PASS: Component 4 — prod-004 bodyHtml contains "{CONSOLIDATION_PHRASE}" (0.20 pts)')
                total_score += 0.20
            elif phrase_in_init:
                print(f'FAIL: Component 4 — phrase already existed in initial_state (precondition, not task change)')
            else:
                print(f'FAIL: Component 4 — prod-004 bodyHtml does not contain "{CONSOLIDATION_PHRASE}". Current: {cur_body[:120]}')
    except Exception as e:
        print(f'ERROR: Component 4 — {e}')
    # Component 5: prod-002 (LeatherCo) and prod-003 (SportStep) are unchanged (0.10 pts)
    # Negative guard: ensure we did not accidentally modify non-target vendors
    try:
        cur_p002 = get_product_by_id(cur_products, 'prod-002')
        cur_p003 = get_product_by_id(cur_products, 'prod-003')
        init_p002 = get_product_by_id(init_products, 'prod-002')
        init_p003 = get_product_by_id(init_products, 'prod-003')
        if cur_p002 is None or cur_p003 is None:
            print('FAIL: Component 5 — prod-002 or prod-003 missing from current_state')
        else:
            p002_vendor_ok = cur_p002.get('vendor') == (init_p002.get('vendor') if init_p002 else 'LeatherCo')
            p003_vendor_ok = cur_p003.get('vendor') == (init_p003.get('vendor') if init_p003 else 'SportStep')
            p002_body_ok = cur_p002.get('bodyHtml') == (init_p002.get('bodyHtml') if init_p002 else '')
            p003_body_ok = cur_p003.get('bodyHtml') == (init_p003.get('bodyHtml') if init_p003 else '')
            if p002_vendor_ok and p003_vendor_ok and p002_body_ok and p003_body_ok:
                print(f'PASS: Component 5 — prod-002 (vendor={cur_p002.get("vendor")}) and prod-003 (vendor={cur_p003.get("vendor")}) are unchanged (0.10 pts)')
                total_score += 0.10
            else:
                issues = []
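                if not p002_vendor_ok:
                    # Fall back to the expected vendor when the product is missing from initial_state
                    issues.append(f'prod-002 vendor changed: {init_p002.get("vendor") if init_p002 else "LeatherCo"} → {cur_p002.get("vendor")}')
                if not p003_vendor_ok:
                    issues.append(f'prod-003 vendor changed: {init_p003.get("vendor") if init_p003 else "SportStep"} → {cur_p003.get("vendor")}')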
                if not p002_body_ok:
                    issues.append('prod-002 bodyHtml changed')
                if not p003_body_ok:
                    issues.append('prod-003 bodyHtml changed')
                print(f'FAIL: Component 5 — non-target products were modified: {"; ".join(issues)}')
    except Exception as e:
        print(f'ERROR: Component 5 — {e}')
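    # Round away float accumulation error: 0.25 + 0.25 + 0.20 + 0.20 + 0.10 sums to 0.9999999999999999
    final_score = round(min(total_score, 1.0), 2)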
    print(f'\nScore: {total_score:.2f}/1.0')
    print(f'REWARD: {final_score}')
    return final_score
verify_task()

Score breakdown

Total reward: 1.00
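The same state-diff logic can be exercised offline before it is run against a live environment. The sketch below is a hypothetical harness, not part of the CUA-Gym codebase: the `score` function, the `by_id` helper, and the product `bodyHtml` values are illustrative assumptions, while the product IDs, vendors, phrase, and weights follow the reward.py docstring above.

```python
# Hypothetical offline harness; names and sample data are illustrative assumptions.
import copy

PHRASE = "Now part of the UnifiedBrands family"
TARGET = "UnifiedBrands"
TARGET_IDS = ("prod-001", "prod-004")
GUARD_IDS = ("prod-002", "prod-003")


def by_id(state):
    """Index a state's product list by product id."""
    return {p["id"]: p for p in state.get("products", [])}


def score(initial, current):
    """Return the reward in [0, 1], accumulated in integer hundredths."""
    if initial == current:
        return 0.0  # agent did nothing
    init, cur = by_id(initial), by_id(current)
    points = 0
    for pid in TARGET_IDS:
        before, after = init.get(pid, {}), cur.get(pid, {})
        # Vendor must have changed to the target during the episode.
        if before.get("vendor") != TARGET and after.get("vendor") == TARGET:
            points += 25
        # Phrase must be newly added, not present from the start.
        if PHRASE not in before.get("bodyHtml", "") and PHRASE in after.get("bodyHtml", ""):
            points += 20
    # Negative guard: vendor and bodyHtml of non-target products must be unchanged.
    if all(
        pid in cur and cur[pid].get(key) == init.get(pid, {}).get(key)
        for pid in GUARD_IDS
        for key in ("vendor", "bodyHtml")
    ):
        points += 10
    return points / 100


initial = {"products": [
    {"id": "prod-001", "vendor": "BasicWear", "bodyHtml": "<p>Tee</p>"},
    {"id": "prod-002", "vendor": "LeatherCo", "bodyHtml": "<p>Belt</p>"},
    {"id": "prod-003", "vendor": "SportStep", "bodyHtml": "<p>Shoe</p>"},
    {"id": "prod-004", "vendor": "HomeGoods", "bodyHtml": "<p>Lamp</p>"},
]}

# Build a "golden" state by applying the intended edits to a copy.
golden = copy.deepcopy(initial)
for product in golden["products"]:
    if product["id"] in TARGET_IDS:
        product["vendor"] = TARGET
        product["bodyHtml"] += " " + PHRASE

print(score(initial, initial))  # 0.0
print(score(initial, golden))   # 1.0
```

Accumulating in integer hundredths is a deliberate choice here: in binary floating point, 0.25 + 0.25 + 0.20 + 0.20 + 0.10 sums to 0.9999999999999999 rather than 1.0.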

9

The largest open CUA RLVR dataset, by every axis.

32,122 verifiable RLVR tuples, spanning every major occupational domain — productivity, creative, OS, multi-app, browsing, and beyond.

All built on top of 110 diverse environments — 16 desktop applications + 94 mock web apps, every one a deterministic, resettable sandbox.

Dataset	Platform	Data size	Env. size	Reward	Open
GUI-Genesis	Mobile	969	1	Programmatic	No
WebArena-Infinity	Web	1,260	10	Programmatic	Yes
InfiniteWeb	Web	600	—	Programmatic	No★
UltraCUA	Desktop	17,000	9	Programmatic	No★
Gym-Anything	Desktop	7,277	193	VLM	Yes
▸CUA-Gym	Desktop + Web	32,122	110	Programmatic	Yes

▸All tasks ship with verifiable setups and rewards.
▸Mirrors the OSWorld evaluation environment — plug-and-play.
▸The largest and most diverse open CUA RLVR dataset to date.

10

Training details — long-horizon scaffolding with trajectory slicing.

i.

Full supervision coverage. A sliding window updates only the last N rounds, and truncation discards precisely the late turns where success or failure is decided. Slicing instead emits M training samples per rollout, each anchored at a different starting point, so the union of all slices gives every assistant turn a gradient.

ii.

Only screenshots collapse, never thoughts. The placeholder <image collapsed> swaps out a single field — the screenshot. User prompts and the assistant's entire chain-of-thought and tool calls remain verbatim, so the reasoning trace is intact as context for every later turn.

iii.

Deterministic and cache-friendly. Slicing buys the same context relief as summarization without the extra LM call, the quality variance, or the lost screenshot→pixel grounding. Adjacent slices share an identical prefix, so the KV cache is reusable when computing policy log-probabilities across slices.
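The three points above can be made concrete with a minimal sketch. It assumes one particular slicing scheme (non-overlapping windows of a fixed size, with every screenshot before the window collapsed), which may differ from the scheme actually used in training. The `Turn` record, `slice_rollout`, and the `window` parameter are hypothetical names introduced for illustration; only the `<image collapsed>` placeholder comes from the slide.

```python
# Illustrative sketch of trajectory slicing; the scheme and names are assumptions.
from dataclasses import dataclass
from typing import Dict, List

PLACEHOLDER = "<image collapsed>"


@dataclass
class Turn:
    screenshot: str  # stand-in for image data
    thought: str     # assistant chain-of-thought, always kept verbatim
    action: str      # tool call, always kept verbatim


def slice_rollout(turns: List[Turn], window: int) -> List[Dict]:
    """Emit one training sample per window of `window` consecutive turns."""
    samples = []
    for start in range(0, len(turns), window):
        end = min(start + window, len(turns))
        # Keep thoughts and actions; collapse only screenshots before the window.
        context = [
            Turn(PLACEHOLDER if i < start else t.screenshot, t.thought, t.action)
            for i, t in enumerate(turns[:end])
        ]
        # Indices of the assistant turns that receive a loss in this sample.
        samples.append({"context": context, "trained": list(range(start, end))})
    return samples


turns = [Turn(f"img{i}", f"thought{i}", f"action{i}") for i in range(7)]
samples = slice_rollout(turns, window=3)
covered = sorted(i for s in samples for i in s["trained"])
print(len(samples), covered)
```

With 7 turns and a window of 3, the sketch yields 3 samples whose trained indices together cover every turn exactly once, which is the full-coverage property described in point i.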

11

We GSPO-train on Qwen3.5 — and it scales.

Model	OSWorld-V.	WebArena	OSWorld-2.0	ScienceBoard	Spider2V
Proprietary models
Claude Sonnet 4.6	72.9	65.6	27.8	43.2	—
Claude Opus 4.7	78.0	—	—	—	—
GPT-5.5	78.7	—	—	—	—
Open-source models
EvoCUA-32B	56.7	—	—	—	—
OpenCUA-72B	45.0	—	—	—	—
Kimi-K2.6	73.1	—	—	—	—
Ours
Qwen3.5-35B-A3B	54.5	40.8	—	—	—
Qwen3.5-397B-A17B	62.2	54.0	10.9	23.7	29.9
▸CUA-Gym-A3B	62.1	44.5	—	—	—
▸CUA-Gym-A17B	72.6	56.0	24.0	35.0	45.0

12

Both axes scale — more data and more environments.

(a) Data scaling

More RL tuples → higher OSWorld score

Curves: 12K tuples, 3K tuples, 1.4K tuples

12K tuples consistently dominate 3K and 1.4K throughout training; the gap opens after step 30 and never closes.

(b) Environment scaling

More distinct envs → higher OSWorld score

Annotations: +1.0 pp (envs), +1.9 pp (data)

Scaling from 10 to 80 envs lifts the score by +1.0 pp at a fixed budget; adding more trajectories on the same 80 envs adds another +1.9 pp.

13

CUA-Gym

The largest open CUA RLVR dataset — 32,122 tasks · 110 envs · all open.
Try it out →

https://cua-gym.xlang.ai

14

CUA-Gym

And what’s next?

https://cua-gym.xlang.ai

15

We observe unexpected model behavior during RL training.

Terminal-Usage

A heuristic metric: whether the agent explicitly opens a terminal and interacts with the CLI during a rollout, instead of completing the task through the GUI alone.

On OSWorld-Verified

20%→50%

terminal-usage rate, before vs after RL training
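A heuristic of this kind could be computed roughly as follows. This is a crude sketch under stated assumptions: the action schema (dicts with `type`, `keys`, `text`), the Ctrl+Alt+T hotkey as the terminal-open signal, and the function names are all hypothetical, and the actual detection rule behind the reported numbers is not given on the slide.

```python
# Crude sketch of a terminal-usage heuristic; schema and detection rule are assumptions.
TERMINAL_HOTKEY = ("ctrl", "alt", "t")


def uses_terminal(rollout):
    """True if the agent opens a terminal and then types something into it."""
    opened = False
    for step in rollout:
        kind = step.get("type")
        if kind == "hotkey":
            keys = tuple(k.lower() for k in step.get("keys", ()))
            if keys == TERMINAL_HOTKEY:
                opened = True
        elif opened and kind == "type" and step.get("text", "").strip():
            # Any typing after the terminal opens is counted as CLI interaction.
            return True
    return False


def terminal_usage_rate(rollouts):
    """Fraction of rollouts flagged by the heuristic; 0.0 for an empty list."""
    if not rollouts:
        return 0.0
    return sum(uses_terminal(r) for r in rollouts) / len(rollouts)


gui_only = [{"type": "click", "x": 10, "y": 20}, {"type": "type", "text": "hello"}]
cli = [{"type": "hotkey", "keys": ["ctrl", "alt", "t"]}, {"type": "type", "text": "ls\n"}]
print(terminal_usage_rate([gui_only, cli]))  # 0.5
```

The rule over-counts typing that lands in another window after the terminal opens, so it should be read as an upper-bound style signal, not a precise measurement.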

Not reward hacking, but not natural GUI behavior either: the model increasingly solves GUI tasks via the CLI once it discovers that terminal use is reliably rewarded.

Then why not just let the agent access the CLI tool?

Step 5 / 47 of a PDF-rendering task: instead of clicking through the PDF viewer, the agent opens a terminal and pipes a python3 << 'EOF' heredoc into fitz.open(…).

16

Evaluating CLI harness in GUI benchmarks.

We drop a general-purpose CLI agent (Codex, Claude Code) into the OSWorld VM and hand it the GUI benchmark task verbatim — no screenshots, no mouse, just a shell. How far can CLI alone go on a benchmark built for GUI agents?

Model	OSWorld-V. (audited*)	OSWorld-2.0
Proprietary CUA models
Claude Sonnet 4.6	72.6	28.0
Claude Opus 4.7	78.0	—
GPT-5.5	78.7	—
CLI agent harnesses
▸Codex w/ GPT-5.5	54.0 (71.8*)	36.8
▸Claude Code w/ Opus 4.7	49.9 (68.5*)	24.6

* Audited: score after a manual review of the trajectories, correcting OSWorld task annotations and false-negative grader judgments.

General-purpose CLI agents reach ~70% on OSWorld-Verified after audit — closing most of the gap to dedicated CUA frontier models, without ever touching the GUI.

17

More importantly...

CLI harnesses sit at the top-left of the Pareto frontier: higher OSWorld-2.0 score in an order of magnitude fewer steps than a visual CUA agent.

In wall-clock time the gap is even wider — CLI finishes in minutes while a visual agent grinds for hours per task.

18

Introducing the Unified Digital Agent.

Modeling

Combine CLI and GUI actions into a single harness, and directly optimize the policy against downstream task performance — let the model pick whichever modality is cheapest for each step.

CLI

terminal, scripts, APIs

GUI

click, type, screenshot

→

Unified Digital Agent

one policy, one reward
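One way to picture a single harness over both modalities is a shared action type with one dispatcher. The sketch below is purely illustrative: `CliAction`, `GuiAction`, `dispatch`, and the stub backends are hypothetical names, and nothing here describes an existing implementation.

```python
# Illustrative sketch of a unified CLI + GUI action space; all names are assumptions.
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class CliAction:
    command: str  # shell command to run


@dataclass(frozen=True)
class GuiAction:
    kind: str     # e.g. "click", "type", "screenshot"
    x: int = 0
    y: int = 0
    text: str = ""


Action = Union[CliAction, GuiAction]


def dispatch(action: Action, run_shell: Callable, run_gui: Callable):
    """Route one policy output to the matching backend."""
    if isinstance(action, CliAction):
        return run_shell(action.command)
    if isinstance(action, GuiAction):
        return run_gui(action)
    raise TypeError(f"unsupported action: {action!r}")


# Stub backends standing in for a real shell and a real GUI controller.
calls = []


def fake_shell(command):
    calls.append(("cli", command))
    return "shell-result"


def fake_gui(action):
    calls.append(("gui", action.kind))
    return "gui-result"


plan = [CliAction("ls ~/Desktop"), GuiAction("click", x=120, y=48)]
results = [dispatch(a, fake_shell, fake_gui) for a in plan]
print(results)  # ['shell-result', 'gui-result']
```

Because both action kinds flow through one dispatcher, a single policy can emit either at each step and a single task-level reward can score the whole trajectory.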

Evaluation

Across the breadth of digital-agent benchmarks — desktop, web, and native applications.

OSWorld-2.0: desktop apps

CocoaBench: macOS native

WildClawBench: open-domain web

ClawBench: curated browser tasks

…and more
