Bowen Wang
CUA-Gym
Scaling Training for Computer-Use Agents · Bowen Wang
1

CUA-Gym

Scaling Verifiable Training Environments and Tasks
for Computer-Use Agents

https://cua-gym.xlang.ai
2
RLVR works for math, SWE, terminal-use
MATH · reward = boxed match
Problem

Evaluate the definite integral:

$$I = \int_{0}^{\pi/2} \frac{\sin x}{\sin x + \cos x}\, dx$$
Agent trajectory
(1) Let $u = \tfrac{\pi}{2} - x$:
$$I = \int_{0}^{\pi/2} \frac{\cos u}{\cos u + \sin u}\, du$$
(2) Add the two forms:
$$2I = \int_{0}^{\pi/2} 1\, dx = \tfrac{\pi}{2}$$
(3) $I = \boxed{\dfrac{\pi}{4}}$
$\dfrac{\pi}{4}$ · match · $\dfrac{\pi}{4}$
reward = 1
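A minimal sketch of what a "boxed match" reward could look like, assuming the reference answer is a LaTeX string and that a light normalization (whitespace, `\dfrac`/`\tfrac` vs `\frac`) is acceptable. The function names and the normalization rules are illustrative assumptions, not the checker used by any particular benchmark.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in text, or None if absent."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces


def normalize(answer: str) -> str:
    # Assumed normalization: strip whitespace, unify fraction macros.
    answer = re.sub(r"\s+", "", answer)
    return answer.replace(r"\dfrac", r"\frac").replace(r"\tfrac", r"\frac")


def boxed_match_reward(response: str, reference: str) -> float:
    """1.0 if the boxed answer equals the reference after normalization, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

For the trajectory above, `boxed_match_reward(r"I = \boxed{\dfrac{\pi}{4}}", r"\frac{\pi}{4}")` would compare `\frac{\pi}{4}` against itself after normalization.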
SWE · reward = tests pass
bash — pytest
$ pytest test_cognitoidp.py::test_global_sign_out
FAILED test_global_sign_out
  AssertionError: expected NotAuthorizedException,
    got successful response — token not revoked.

# agent patches cognitoidp/models.py:
$ pytest test_cognitoidp.py::test_global_sign_out
PASSED test_global_sign_out
1 passed in 0.42s
reward =1
🤔 but how about computer-use agents?
3
Take a real computer-use task.
— example —

Can you please send an email to each client listed in the opened Notion page using the template from Gmail? You should attach their transactions in the Excel 'clients.xlsx' table on Desktop.

Inbox — mail.google.com
reward = ???
How to scale this up?
4
We hand reward writing off to a coding agent.
Orchestrator
$ Loading Task
$ Loading Context
$ Label Properties
  • difficulty
  • domain
  • involved apps
$ Prepare VM environments
$ Spawn Agent Loop
while True:
    setup = generator.run()        # build initial and golden envs
    reward = discriminator.run()   # draft and verify reward.py
    if consensus.check():
        return (task, setup, reward)
    # no consensus yet: pass feedback back and retry
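The orchestrator loop can be sketched as runnable Python under a few assumptions: the generator, discriminator, and consensus check are passed in as plain callables, rejection feedback is a string, and there is a retry cap. None of these names or signatures come from the CUA-Gym code; they are placeholders for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Verdict:
    accepted: bool
    feedback: str = ""  # hypothetical: reason for rejection, fed to the next round


def synthesize(
    task: Any,
    generate: Callable[[Any, str], Any],          # (task, feedback) -> setup
    write_reward: Callable[[Any, Any], Any],      # (task, setup) -> reward script
    check: Callable[[Any, Any, Any], Verdict],    # consensus between both agents
    max_rounds: int = 5,                          # assumed retry cap
) -> Optional[Tuple[Any, Any, Any]]:
    """Loop generator -> discriminator until consensus; return (task, setup, reward)."""
    feedback = ""
    for _ in range(max_rounds):
        setup = generate(task, feedback)
        reward = write_reward(task, setup)
        verdict = check(task, setup, reward)
        if verdict.accepted:
            return (task, setup, reward)
        feedback = verdict.feedback  # revise and retry
    return None  # no consensus within the budget: drop the task
```

Returning `None` after the cap is one possible design choice; it keeps unverifiable tasks out of the dataset instead of looping indefinitely.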
Generator
> Loading Task and Skills
> Build envs from initial_setup.py & golden_patch.py
Initial Env
Golden Env
> Revise & retry if rejected
Discriminator
> Loading Task and Reward Skills
Decompose rewarding criteria
  • emails exist in Sent...
  • reward email content match template...
  • recipients match Notion DB...
> Draft reward.py & verify in real envs.
> If not fulfilled ⟶ feedback & retry
FILTER
> LLM MAJORITY VOTING
  • consistency: 92
  • executability: 88
  • hack-risk: 95
  • clarity: 90
  • difficulty: 76
> TEACHER MODEL ROLLOUTS
  • Calculate $\cos(t, r)$
  • Check $\mathrm{reward}_{\mathrm{log}}$
  • Use VLM-as-a-judge to review $s_n$
  • Alignment check
Pass
$\mathcal{D} = \{(t, s, r)\}$
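One way to read the LLM majority-voting stage, sketched under stated assumptions: several judge models each return a pass/fail verdict per criterion, and a tuple survives only if every criterion wins a strict majority. The data layout and the "all criteria must pass" rule are assumptions for illustration; the slide does not specify the aggregation rule.

```python
from typing import Dict, List, Sequence

# Criterion names taken from the slide; the voting rule below is assumed.
CRITERIA = ("consistency", "executability", "hack-risk", "clarity", "difficulty")


def majority_vote(votes: Sequence[bool]) -> bool:
    """True if strictly more than half of the votes are True."""
    return 2 * sum(votes) > len(votes)


def passes_filter(judgments: List[Dict[str, bool]],
                  criteria: Sequence[str] = CRITERIA) -> bool:
    """judgments holds one dict per judge, mapping criterion -> pass/fail."""
    return all(majority_vote([j[c] for j in judgments]) for c in criteria)
```

A strict majority means ties fail, which errs on the side of discarding borderline tuples.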
5
CUA RLVR still needs another scale axis: environments.
Three limitations stand in the way of training computer-use agents at scale.
We build CUA-Gym-Hub: 94 self-contained mock applications across 7 domains, all state-injectable and parallel-isolated.
(1) OSWorld & WebArena cover only limited applications, drifting further from the authentic domains where computer-use agents are expected to operate.
(2) Agents need to transfer quickly to unseen domains and apps, so the training distribution has to cover the long tail, not just a few canonical desktop tools.
(3) Real apps can't be deployed as RL environments: they are not sandboxes (no state injection, no parallel isolation, no resettability).
Communication & Social (18)
discord
dingtalk
facebook
feishu
gmail
instagram
linkedin
microsoft-teams
outlook-web
pinterest
reddit
slack
twitter
wechat
weibo
xiaohongshu
zhihu
zoom-web
Productivity & Documents (16)
airtable
asana
canvas-lms
google-calendar
google-docs
google-drive
google-sheets
jira
lattice
linear
lucidchart
miro
monday
notion
openreview
trello
Development & Cloud (12)
aws-console
azure
aliyun
circleci
cloudflare
datadog
github
gitlab
postman
sentry
vercel
wandb
Finance & Enterprise (19)
bamboohr
clio
coinbase
contractbook
docusign
expensify
greenhouse
gusto
hubspot
hubspot-marketing
paypal
quickbooks
robinhood
salesforce
sap
servicenow
stripe-dashboard
tradingview
workday
E-commerce & Travel (11)
amazon
amazon-seller
booking-com
ebay
expedia
instacart
shopify-admin
taobao-seller
tripadvisor
uber-eats
woocommerce
Analytics & Marketing (10)
amplitude
google-ads
google-analytics
hotjar
klaviyo
looker-studio
mailchimp
meta-ads
mixpanel
tableau
Other (8)
12306
adp
epic-health
google-flights
westlaw
youtube
zendesk
zillow
6
How do we scale environments?
(1) Sources: Start from real-world software-use distributions sampled from O*NET and the Anthropic Economic Index, biasing coverage toward authentic digital knowledge work.
(2) Plan / Dev / Web agents: A multi-agent coding pipeline plans the spec, codes the frontend, and Playwright-tests the UI/UX over N rounds of dev ↔ web feedback.
(3) Engineering: An engineering pass fixes the contract: data schema, state-injection endpoints, and a SKILL.md so every mock is a deterministic, resettable RL sandbox.
7
How do we combine environments and tasks?
(1) Session isolation: Every URL carries its own session id; parallel RL workers training on the same mock never see one another's changes, which is critical for distributed rollouts.
(2) State injection: When a task is created, the synthesis pipeline ships a JSON initial state alongside its reward.py and posts it to the mock. Loading ?sid=<task_id> then renders that exact world (emails, project boards, calendars, customer tickets, whatever the task description calls for), so a single mock can host arbitrarily many distinct task worlds with no code change.
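To make the two properties concrete, here is a minimal in-memory sketch of what a mock's session store might look like. The listings elsewhere in this deck show a set action posted per sid and a read that returns `initial_state` and `current_state`; the class, method names, and the deep-copy strategy below are assumptions, not the actual server implementation.

```python
import copy
from typing import Any, Callable, Dict


class MockStateStore:
    """Hypothetical per-session store: each sid owns its own initial/current state."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Dict[str, Any]] = {}

    def set_state(self, sid: str, state: dict) -> None:
        # State injection: both snapshots start from the same injected world.
        self._sessions[sid] = {
            "initial_state": copy.deepcopy(state),
            "current_state": copy.deepcopy(state),
        }

    def mutate(self, sid: str, fn: Callable[[dict], None]) -> None:
        # UI actions only ever touch the current state of their own session.
        fn(self._sessions[sid]["current_state"])

    def get(self, sid: str) -> dict:
        empty = {"initial_state": None, "current_state": None}
        return copy.deepcopy(self._sessions.get(sid, empty))

    def reset(self, sid: str) -> None:
        session = self._sessions[sid]
        session["current_state"] = copy.deepcopy(session["initial_state"])
```

Keeping `initial_state` alongside `current_state` is what lets a reward script diff the two after the rollout, and what makes a reset a cheap copy rather than a redeploy.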
Monday, inbox zero
The user logged off Friday with a clean inbox. Three system newsletters trickled in over the weekend. Nothing unread, no drafts.
Mid-sprint, paper deadline approaching
Twenty-five threads piled up over twelve hours — PR review pings, Sentry alerts, an urgent RFC sign-off. One draft sits in Compose.
First morning back from a week off
Eighty-one emails waiting. Forty-one in Primary alone — coworker pings, customer follow-ups, conference invites, recruiters.
8
Example of a CUA-Gym task.
initial_setup.py
"""
Initial Setup: Vendor consolidation - set up products under BasicWear and HomeGoods vendors
Task ID: shopify_adv_005
Domain: shopify_admin_mock
Mock: shopify_admin_mock
"""
import json
import os
import shlex
import subprocess
import time
import uuid
import requests
# --- Config ---
BASE_URL = 'https://cua-gym-shopify-admin.xlang.ai'
sid = str(uuid.uuid4())
# Persist sid for golden_patch.py and reward.py
with open('/tmp/task_web_sid', 'w') as f:
    f.write(sid)
print(f'SID generated: {sid}')
# --- Build initial state ---
# 4 products: prod-001 (BasicWear), prod-002 (LeatherCo), prod-003 (SportStep), prod-004 (HomeGoods)
# MUST NOT have 'UnifiedBrands' as vendor — the task is to consolidate them
state = {
"store": {
"id": "store_1",
"name": "Urban Market",
"email": "admin@urbanmarket.myshopify.com",
"phone": "+1-503-555-0100",
"domain": "urbanmarket.myshopify.com",
"customDomain": "www.urbanmarket.com",
"address": {
"address1": "456 Commerce Ave",
"city": "Portland",
"province": "Oregon",
"provinceCode": "OR",
"country": "United States",
"countryCode": "US",
"zip": "97201"
},
"currency": "USD",
"timezone": "(GMT-08:00) Pacific Time",
"weightUnit": "lb",
"plan": "Shopify",
"owner": {
"firstName": "Jordan",
"lastName": "Park",
"email": "jordan@urbanmarket.com"
},
"createdAt": "2023-03-10T09:00:00Z"
},
"products": [
{
"id": "prod-001",
"title": "Classic T-Shirt",
"bodyHtml": "<p>Comfortable cotton t-shirt</p>",
"vendor": "BasicWear",
"productType": "Apparel",
"handle": "classic-t-shirt",
"status": "active",
"tags": ["cotton", "basics", "casual"],
"images": [
{
"id": "img_001",
"src": "https://placehold.co/400x400/e8f5e9/2e7d32?text=T-Shirt",
"alt": "Classic T-Shirt",
"position": 1
}
],
"variants": [
{
"id": "var_001_s",
"productId": "prod-001",
"title": "Small",
"price": "19.99",
"compareAtPrice": "24.99",
"sku": "BW-TS-SM",
"inventoryQuantity": 45,
"option1": "Small",
"option2": None,
"position": 1
},
{
"id": "var_001_m",
"productId": "prod-001",
"title": "Medium",
"price": "19.99",
"compareAtPrice": "24.99",
"sku": "BW-TS-MD",
"inventoryQuantity": 60,
"option1": "Medium",
"option2": None,
"position": 2
},
{
"id": "var_001_l",
"productId": "prod-001",
"title": "Large",
"price": "19.99",
"compareAtPrice": "24.99",
"sku": "BW-TS-LG",
"inventoryQuantity": 35,
"option1": "Large",
"option2": None,
"position": 3
}
],
"options": [
{"id": "opt_001", "name": "Size", "position": 1, "values": ["Small", "Medium", "Large"]}
],
"collections": ["col_001"],
"createdAt": "2023-05-12T10:00:00Z",
"updatedAt": "2024-11-20T14:30:00Z"
},
{
"id": "prod-002",
"title": "Leather Wallet",
"bodyHtml": "<p>Premium leather bifold wallet</p>",
"vendor": "LeatherCo",
"productType": "Accessories",
"handle": "leather-wallet",
"status": "active",
"tags": ["leather", "wallet", "accessories"],
"images": [
{
"id": "img_002",
"src": "https://placehold.co/400x400/fce8d5/7b3f00?text=Wallet",
"alt": "Leather Wallet",
"position": 1
}
],
"variants": [
{
"id": "var_002_bk",
"productId": "prod-002",
"title": "Black",
"price": "49.99",
"compareAtPrice": "65.00",
"sku": "LC-WL-BK",
"inventoryQuantity": 28,
"option1": "Black",
"option2": None,
"position": 1
},
{
"id": "var_002_br",
"productId": "prod-002",
"title": "Brown",
"price": "49.99",
"compareAtPrice": "65.00",
"sku": "LC-WL-BR",
"inventoryQuantity": 22,
"option1": "Brown",
"option2": None,
"position": 2
}
],
"options": [
{"id": "opt_002", "name": "Color", "position": 1, "values": ["Black", "Brown"]}
],
"collections": ["col_002"],
"createdAt": "2023-06-05T11:00:00Z",
"updatedAt": "2024-10-15T09:45:00Z"
},
{
"id": "prod-003",
"title": "Running Shoes",
"bodyHtml": "<p>Lightweight running shoes for everyday use</p>",
"vendor": "SportStep",
"productType": "Footwear",
"handle": "running-shoes",
"status": "active",
"tags": ["running", "shoes", "sport", "athletic"],
"images": [
{
"id": "img_003",
"src": "https://placehold.co/400x400/e3f2fd/1565c0?text=Running+Shoes",
"alt": "Running Shoes",
"position": 1
}
],
"variants": [
{
"id": "var_003_8",
"productId": "prod-003",
"title": "Size 8",
"price": "89.99",
"compareAtPrice": "110.00",
"sku": "SS-RS-8",
"inventoryQuantity": 18,
"option1": "Size 8",
"option2": None,
"position": 1
},
{
"id": "var_003_9",
"productId": "prod-003",
"title": "Size 9",
"price": "89.99",
"compareAtPrice": "110.00",
"sku": "SS-RS-9",
"inventoryQuantity": 24,
"option1": "Size 9",
"option2": None,
"position": 2
},
{
"id": "var_003_10",
"productId": "prod-003",
"title": "Size 10",
"price": "89.99",
"compareAtPrice": "110.00",
"sku": "SS-RS-10",
"inventoryQuantity": 15,
"option1": "Size 10",
"option2": None,
"position": 3
}
],
"options": [
{"id": "opt_003", "name": "Size", "position": 1, "values": ["Size 8", "Size 9", "Size 10"]}
],
"collections": ["col_003"],
"createdAt": "2023-07-20T08:00:00Z",
"updatedAt": "2024-09-30T16:00:00Z"
},
{
"id": "prod-004",
"title": "Ceramic Mug",
"bodyHtml": "<p>Hand-crafted ceramic mug</p>",
"vendor": "HomeGoods",
"productType": "Kitchen",
"handle": "ceramic-mug",
"status": "active",
"tags": ["ceramic", "mug", "kitchen", "handmade"],
"images": [
{
"id": "img_004",
"src": "https://placehold.co/400x400/fff3e0/e65100?text=Ceramic+Mug",
"alt": "Ceramic Mug",
"position": 1
}
],
"variants": [
{
"id": "var_004_wh",
"productId": "prod-004",
"title": "White",
"price": "22.00",
"compareAtPrice": "28.00",
"sku": "HG-MG-WH",
"inventoryQuantity": 40,
"option1": "White",
"option2": None,
"position": 1
},
{
"id": "var_004_bl",
"productId": "prod-004",
"title": "Blue",
"price": "22.00",
"compareAtPrice": "28.00",
"sku": "HG-MG-BL",
"inventoryQuantity": 30,
"option1": "Blue",
"option2": None,
"position": 2
}
],
"options": [
{"id": "opt_004", "name": "Color", "position": 1, "values": ["White", "Blue"]}
],
"collections": ["col_004"],
"createdAt": "2023-08-01T12:00:00Z",
"updatedAt": "2024-12-05T11:20:00Z"
}
],
"collections": [
{
"id": "col_001",
"title": "Apparel",
"bodyHtml": "<p>Everyday clothing essentials</p>",
"handle": "apparel",
"collectionType": "manual",
"productIds": ["prod-001"],
"productsCount": 1,
"sortOrder": "best-selling",
"publishedAt": "2023-05-12T10:00:00Z",
"updatedAt": "2024-11-20T14:30:00Z",
"image": None
},
{
"id": "col_002",
"title": "Accessories",
"bodyHtml": "<p>Premium accessories for every occasion</p>",
"handle": "accessories",
"collectionType": "manual",
"productIds": ["prod-002"],
"productsCount": 1,
"sortOrder": "best-selling",
"publishedAt": "2023-06-05T11:00:00Z",
"updatedAt": "2024-10-15T09:45:00Z",
"image": None
},
{
"id": "col_003",
"title": "Footwear",
"bodyHtml": "<p>Sport and casual footwear</p>",
"handle": "footwear",
"collectionType": "manual",
"productIds": ["prod-003"],
"productsCount": 1,
"sortOrder": "best-selling",
"publishedAt": "2023-07-20T08:00:00Z",
"updatedAt": "2024-09-30T16:00:00Z",
"image": None
},
{
"id": "col_004",
"title": "Kitchen & Home",
"bodyHtml": "<p>Hand-crafted home essentials</p>",
"handle": "kitchen-home",
"collectionType": "manual",
"productIds": ["prod-004"],
"productsCount": 1,
"sortOrder": "best-selling",
"publishedAt": "2023-08-01T12:00:00Z",
"updatedAt": "2024-12-05T11:20:00Z",
"image": None
}
],
"orders": [
{
"id": "order_001",
"name": "#1001",
"orderNumber": 1001,
"email": "maya.thompson@example.com",
"financialStatus": "paid",
"fulfillmentStatus": "fulfilled",
"currency": "USD",
"subtotalPrice": "19.99",
"totalShippingPrice": "5.99",
"totalTax": "1.60",
"totalDiscounts": "0.00",
"totalPrice": "27.58",
"lineItems": [
{
"id": "li_001",
"productId": "prod-001",
"variantId": "var_001_m",
"title": "Classic T-Shirt",
"variantTitle": "Medium",
"quantity": 1,
"price": "19.99",
"sku": "BW-TS-MD",
"fulfillmentStatus": "fulfilled"
}
],
"customer": {"id": "cust_001", "firstName": "Maya", "lastName": "Thompson", "email": "maya.thompson@example.com"},
"shippingAddress": {"address1": "789 Oak Street", "city": "Seattle", "province": "Washington", "provinceCode": "WA", "country": "United States", "countryCode": "US", "zip": "98101"},
"billingAddress": {"address1": "789 Oak Street", "city": "Seattle", "province": "Washington", "provinceCode": "WA", "country": "United States", "countryCode": "US", "zip": "98101"},
"note": "",
"tags": [],
"discountCodes": [],
"timeline": [
{"id": "evt_001_1", "type": "created", "message": "Order placed", "createdAt": "2025-01-10T14:30:00Z", "user": None},
{"id": "evt_001_2", "type": "fulfilled", "message": "Order fulfilled", "createdAt": "2025-01-11T10:00:00Z", "user": "jordan@urbanmarket.com"}
],
"createdAt": "2025-01-10T14:30:00Z",
"updatedAt": "2025-01-11T10:00:00Z"
},
{
"id": "order_002",
"name": "#1002",
"orderNumber": 1002,
"email": "carlos.rivera@example.com",
"financialStatus": "paid",
"fulfillmentStatus": None,
"currency": "USD",
"subtotalPrice": "49.99",
"totalShippingPrice": "7.99",
"totalTax": "4.00",
"totalDiscounts": "0.00",
"totalPrice": "61.98",
"lineItems": [
{
"id": "li_002",
"productId": "prod-002",
"variantId": "var_002_bk",
"title": "Leather Wallet",
"variantTitle": "Black",
"quantity": 1,
"price": "49.99",
"sku": "LC-WL-BK",
"fulfillmentStatus": None
}
],
"customer": {"id": "cust_002", "firstName": "Carlos", "lastName": "Rivera", "email": "carlos.rivera@example.com"},
"shippingAddress": {"address1": "321 Pine Ave", "city": "San Francisco", "province": "California", "provinceCode": "CA", "country": "United States", "countryCode": "US", "zip": "94102"},
"billingAddress": {"address1": "321 Pine Ave", "city": "San Francisco", "province": "California", "provinceCode": "CA", "country": "United States", "countryCode": "US", "zip": "94102"},
"note": "",
"tags": [],
"discountCodes": [],
"timeline": [
{"id": "evt_002_1", "type": "created", "message": "Order placed", "createdAt": "2025-01-15T09:00:00Z", "user": None}
],
"createdAt": "2025-01-15T09:00:00Z",
"updatedAt": "2025-01-15T09:00:00Z"
}
],
"customers": [
{
"id": "cust_001",
"firstName": "Maya",
"lastName": "Thompson",
"email": "maya.thompson@example.com",
"phone": "+1-206-555-0123",
"state": "enabled",
"ordersCount": 1,
"totalSpent": "27.58",
"note": "",
"tags": [],
"taxExempt": False,
"verifiedEmail": True,
"acceptsMarketing": True,
"defaultAddress": {
"address1": "789 Oak Street",
"city": "Seattle",
"province": "Washington",
"provinceCode": "WA",
"country": "United States",
"countryCode": "US",
"zip": "98101"
},
"createdAt": "2024-12-01T10:00:00Z",
"updatedAt": "2025-01-10T14:30:00Z"
},
{
"id": "cust_002",
"firstName": "Carlos",
"lastName": "Rivera",
"email": "carlos.rivera@example.com",
"phone": "+1-415-555-0456",
"state": "enabled",
"ordersCount": 1,
"totalSpent": "61.98",
"note": "",
"tags": [],
"taxExempt": False,
"verifiedEmail": True,
"acceptsMarketing": False,
"defaultAddress": {
"address1": "321 Pine Ave",
"city": "San Francisco",
"province": "California",
"provinceCode": "CA",
"country": "United States",
"countryCode": "US",
"zip": "94102"
},
"createdAt": "2024-11-15T08:00:00Z",
"updatedAt": "2025-01-15T09:00:00Z"
}
],
"discounts": [
{
"id": "disc_001",
"title": "Spring Sale",
"code": "SPRING10",
"type": "percentage",
"value": "10",
"status": "active",
"appliesTo": "all",
"appliesToIds": [],
"minimumRequirement": "none",
"minimumValue": "0",
"customerEligibility": "all",
"usageLimit": None,
"usageCount": 12,
"oncePerCustomer": False,
"startsAt": "2025-03-01T00:00:00Z",
"endsAt": "2025-04-30T23:59:59Z",
"createdAt": "2025-02-20T10:00:00Z"
}
],
"draftOrders": [],
"giftCards": [],
"analytics": {
"dailyMetrics": [
{
"date": "2025-01-15",
"totalSales": 320.50,
"ordersCount": 5,
"onlineStoreSessions": 142,
"returningCustomerRate": 0.28,
"conversionRate": 0.035,
"averageOrderValue": 64.10,
"topProducts": [
{"productId": "prod-002", "title": "Leather Wallet", "quantity": 3, "revenue": 149.97},
{"productId": "prod-003", "title": "Running Shoes", "quantity": 2, "revenue": 179.98}
],
"topReferrers": [{"source": "google", "sessions": 67}, {"source": "instagram", "sessions": 34}],
"sessionsByLocation": [{"country": "United States", "sessions": 118}, {"country": "Canada", "sessions": 24}]
}
],
"totalSalesThisMonth": 4250.75,
"totalOrdersThisMonth": 68,
"totalSessionsThisMonth": 2840
},
"pages": [
{
"id": "page_001",
"title": "About Us",
"handle": "about-us",
"bodyHtml": "<h2>Our Story</h2><p>Urban Market was founded in 2023 to bring together the best in everyday essentials from trusted vendors.</p>",
"published": True,
"createdAt": "2023-03-15T10:00:00Z",
"updatedAt": "2024-08-20T14:00:00Z"
},
{
"id": "page_002",
"title": "Contact",
"handle": "contact",
"bodyHtml": "<h2>Get In Touch</h2><p>Email us at support@urbanmarket.com or call +1-503-555-0100.</p>",
"published": True,
"createdAt": "2023-03-15T10:00:00Z",
"updatedAt": "2024-06-10T09:30:00Z"
}
],
"blogPosts": [
{
"id": "blog_001",
"title": "Spring Collection Now Available",
"author": "Jordan Park",
"bodyHtml": "<p>We're thrilled to announce our spring collection is now live on the store!</p>",
"handle": "spring-collection",
"tags": ["news", "collection"],
"published": True,
"publishedAt": "2025-03-01T09:00:00Z",
"createdAt": "2025-02-28T16:00:00Z"
}
],
"navigationMenus": [
{
"id": "nav_001",
"title": "Main Menu",
"handle": "main-menu",
"items": [
{"id": "nav_item_001", "title": "Home", "url": "/", "position": 1, "children": []},
{"id": "nav_item_002", "title": "Shop", "url": "/collections/all", "position": 2, "children": []},
{"id": "nav_item_003", "title": "About", "url": "/pages/about-us", "position": 3, "children": []}
]
}
],
"settings": {
"storeName": "Urban Market",
"storeEmail": "admin@urbanmarket.myshopify.com",
"senderEmail": "noreply@urbanmarket.com",
"storePhone": "+1-503-555-0100",
"currency": "USD",
"timezone": "(GMT-08:00) Pacific Time",
"weightUnit": "lb"
}
}
# --- Inject state ---
resp = requests.post(
    f'{BASE_URL}/post?sid={sid}',
    json={'action': 'set', 'state': state},
    timeout=30,
)
assert resp.status_code == 200, f'State injection failed: {resp.text}'
print(f'State injected: sid={sid}')
# --- Verify ---
go = requests.get(f'{BASE_URL}/go?sid={sid}', timeout=10).json()
assert go['initial_state'] is not None, 'initial_state is None after injection'
assert go['current_state'] is not None, 'current_state is None after injection'
print('Verified: initial_state and current_state are set')
# --- Launch browser ---
def launch_gui(command, delay_sec=1.0):
    # Start the GUI process on display :0 without blocking, then give it time to come up.
    env = os.environ.copy()
    env['DISPLAY'] = ':0'
    subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        env=env,
    )
    time.sleep(delay_sec)


launch_gui(f'google-chrome "{BASE_URL}/?sid={sid}"', delay_sec=2.0)
print(f'GUI_READY: launched browser at {BASE_URL}/?sid={sid}')
Trajectory
Task Instruction (hard)

I want to do a vendor consolidation. Change the vendor for all products currently under 'BasicWear' and 'HomeGoods' to 'UnifiedBrands'. Then update the product descriptions for those items to include 'Now part of the UnifiedBrands family' at the end. Don't touch products from other vendors.

Step 2
step 01 / 19
CLICK
Thought

I can see the Shopify Mock Admin dashboard. Let me navigate to the Products section to see all products.

reward.py
"""
Reward Script: Vendor consolidation - change vendor for BasicWear & HomeGoods products to
UnifiedBrands, and append 'Now part of the UnifiedBrands family' to their descriptions.
Task ID: shopify_adv_005
Domain: shopify_admin_mock
Scoring:
  Component 1: prod-001 vendor changed to 'UnifiedBrands' (0.25 pts)
  Component 2: prod-004 vendor changed to 'UnifiedBrands' (0.25 pts)
  Component 3: prod-001 bodyHtml contains consolidation phrase (0.20 pts)
  Component 4: prod-004 bodyHtml contains consolidation phrase (0.20 pts)
  Component 5: prod-002 and prod-003 are unchanged (negative guard) (0.10 pts)
  Total: 1.0
"""
import sys
import requests
BASE_URL = 'https://cua-gym-shopify-admin.xlang.ai'
CONSOLIDATION_PHRASE = 'Now part of the UnifiedBrands family'
TARGET_VENDOR = 'UnifiedBrands'
# --- Read SID ---
try:
    with open('/tmp/task_web_sid') as f:
        sid = f.read().strip()
    if not sid:
        raise ValueError('sid is empty')
except Exception as e:
    print(f'CRITICAL: Cannot read sid from /tmp/task_web_sid: {e}')
    print('REWARD: 0.0')
    sys.exit(0)

# --- Fetch state ---
try:
    resp = requests.get(f'{BASE_URL}/go?sid={sid}', timeout=15)
    resp.raise_for_status()
    data = resp.json()
except Exception as e:
    print(f'CRITICAL: Cannot fetch state from {BASE_URL}/go?sid={sid}: {e}')
    print('REWARD: 0.0')
    sys.exit(0)

# "or {}" also covers a key that is present but null in the JSON response
initial_state = data.get('initial_state') or {}
current_state = data.get('current_state') or {}
if not current_state:
    print('CRITICAL: current_state is empty - no state injected for this sid')
    print('REWARD: 0.0')
    sys.exit(0)
if initial_state == current_state:
    print('INFO: current_state == initial_state - no changes applied (agent did nothing)')
    print('REWARD: 0.0')
    sys.exit(0)


def get_product_by_id(products, prod_id):
    for p in products:
        if p.get('id') == prod_id:
            return p
    return None


def verify_task():
    total_score = 0.0
    cur_products = current_state.get('products', [])
    init_products = initial_state.get('products', [])

    # Component 1: prod-001 vendor changed to 'UnifiedBrands' (0.25 pts)
    try:
        cur_p001 = get_product_by_id(cur_products, 'prod-001')
        init_p001 = get_product_by_id(init_products, 'prod-001')
        if cur_p001 is None:
            print('FAIL: Component 1 - prod-001 not found in current_state')
        elif init_p001 is None:
            print('FAIL: Component 1 - prod-001 not found in initial_state (data integrity issue)')
        else:
            init_vendor = init_p001.get('vendor', '')
            cur_vendor = cur_p001.get('vendor', '')
            # Verify the vendor was changed FROM initial (BasicWear) TO UnifiedBrands
            if init_vendor != TARGET_VENDOR and cur_vendor == TARGET_VENDOR:
                print(f'PASS: Component 1 - prod-001 vendor changed from "{init_vendor}" to "{cur_vendor}" (0.25 pts)')
                total_score += 0.25
            elif cur_vendor == TARGET_VENDOR and init_vendor == TARGET_VENDOR:
                print(f'FAIL: Component 1 - prod-001 vendor was already "{TARGET_VENDOR}" in initial_state (precondition, not task change)')
            else:
                print(f'FAIL: Component 1 - prod-001 vendor is "{cur_vendor}", expected "{TARGET_VENDOR}" (was "{init_vendor}")')
    except Exception as e:
        print(f'ERROR: Component 1 - {e}')

    # Component 2: prod-004 vendor changed to 'UnifiedBrands' (0.25 pts)
    try:
        cur_p004 = get_product_by_id(cur_products, 'prod-004')
        init_p004 = get_product_by_id(init_products, 'prod-004')
        if cur_p004 is None:
            print('FAIL: Component 2 - prod-004 not found in current_state')
        elif init_p004 is None:
            print('FAIL: Component 2 - prod-004 not found in initial_state (data integrity issue)')
        else:
            init_vendor = init_p004.get('vendor', '')
            cur_vendor = cur_p004.get('vendor', '')
            if init_vendor != TARGET_VENDOR and cur_vendor == TARGET_VENDOR:
                print(f'PASS: Component 2 - prod-004 vendor changed from "{init_vendor}" to "{cur_vendor}" (0.25 pts)')
                total_score += 0.25
            elif cur_vendor == TARGET_VENDOR and init_vendor == TARGET_VENDOR:
                print(f'FAIL: Component 2 - prod-004 vendor was already "{TARGET_VENDOR}" in initial_state (precondition, not task change)')
            else:
                print(f'FAIL: Component 2 - prod-004 vendor is "{cur_vendor}", expected "{TARGET_VENDOR}" (was "{init_vendor}")')
    except Exception as e:
        print(f'ERROR: Component 2 - {e}')

    # Component 3: prod-001 bodyHtml contains consolidation phrase (0.20 pts)
    try:
        cur_p001 = get_product_by_id(cur_products, 'prod-001')
        init_p001 = get_product_by_id(init_products, 'prod-001')
        if cur_p001 is None:
            print('FAIL: Component 3 - prod-001 not found in current_state')
        else:
            cur_body = cur_p001.get('bodyHtml', '')
            init_body = init_p001.get('bodyHtml', '') if init_p001 else ''
            # Phrase must be in current but NOT in initial (verifying it was added by the task)
            phrase_in_init = CONSOLIDATION_PHRASE in init_body
            phrase_in_cur = CONSOLIDATION_PHRASE in cur_body
            if not phrase_in_init and phrase_in_cur:
                print(f'PASS: Component 3 - prod-001 bodyHtml contains "{CONSOLIDATION_PHRASE}" (0.20 pts)')
                total_score += 0.20
            elif phrase_in_init:
                print('FAIL: Component 3 - phrase already existed in initial_state (precondition, not task change)')
            else:
                print(f'FAIL: Component 3 - prod-001 bodyHtml does not contain "{CONSOLIDATION_PHRASE}". Current: {cur_body[:120]}')
    except Exception as e:
        print(f'ERROR: Component 3 - {e}')

    # Component 4: prod-004 bodyHtml contains consolidation phrase (0.20 pts)
    try:
        cur_p004 = get_product_by_id(cur_products, 'prod-004')
        init_p004 = get_product_by_id(init_products, 'prod-004')
        if cur_p004 is None:
            print('FAIL: Component 4 - prod-004 not found in current_state')
        else:
            cur_body = cur_p004.get('bodyHtml', '')
            init_body = init_p004.get('bodyHtml', '') if init_p004 else ''
            phrase_in_init = CONSOLIDATION_PHRASE in init_body
            phrase_in_cur = CONSOLIDATION_PHRASE in cur_body
            if not phrase_in_init and phrase_in_cur:
                print(f'PASS: Component 4 - prod-004 bodyHtml contains "{CONSOLIDATION_PHRASE}" (0.20 pts)')
                total_score += 0.20
            elif phrase_in_init:
                print('FAIL: Component 4 - phrase already existed in initial_state (precondition, not task change)')
            else:
                print(f'FAIL: Component 4 - prod-004 bodyHtml does not contain "{CONSOLIDATION_PHRASE}". Current: {cur_body[:120]}')
    except Exception as e:
        print(f'ERROR: Component 4 - {e}')

    # Component 5: prod-002 (LeatherCo) and prod-003 (SportStep) are unchanged (0.10 pts)
    # Negative guard: ensure we did not accidentally modify non-target vendors
    try:
        cur_p002 = get_product_by_id(cur_products, 'prod-002')
        cur_p003 = get_product_by_id(cur_products, 'prod-003')
        init_p002 = get_product_by_id(init_products, 'prod-002')
        init_p003 = get_product_by_id(init_products, 'prod-003')
        if cur_p002 is None or cur_p003 is None:
            print('FAIL: Component 5 - prod-002 or prod-003 missing from current_state')
        else:
            p002_vendor_ok = cur_p002.get('vendor') == (init_p002.get('vendor') if init_p002 else 'LeatherCo')
            p003_vendor_ok = cur_p003.get('vendor') == (init_p003.get('vendor') if init_p003 else 'SportStep')
            p002_body_ok = cur_p002.get('bodyHtml') == (init_p002.get('bodyHtml') if init_p002 else '')
            p003_body_ok = cur_p003.get('bodyHtml') == (init_p003.get('bodyHtml') if init_p003 else '')
            if p002_vendor_ok and p003_vendor_ok and p002_body_ok and p003_body_ok:
                print(f'PASS: Component 5 - prod-002 (vendor={cur_p002.get("vendor")}) and prod-003 (vendor={cur_p003.get("vendor")}) are unchanged (0.10 pts)')
                total_score += 0.10
            else:
                issues = []
                if not p002_vendor_ok:
                    issues.append(f'prod-002 vendor changed: {init_p002.get("vendor")} → {cur_p002.get("vendor")}')
                if not p003_vendor_ok:
                    issues.append(f'prod-003 vendor changed: {init_p003.get("vendor")} → {cur_p003.get("vendor")}')
                if not p002_body_ok:
                    issues.append('prod-002 bodyHtml changed')
                if not p003_body_ok:
                    issues.append('prod-003 bodyHtml changed')
                print(f'FAIL: Component 5 - non-target products were modified: {"; ".join(issues)}')
    except Exception as e:
        print(f'ERROR: Component 5 - {e}')

    # Round to two decimals to guard against floating-point accumulation error
    # in the printed reward (the component weights are not exact binary fractions).
    final_score = round(min(total_score, 1.0), 2)
    print(f'\nScore: {total_score:.2f}/1.0')
    print(f'REWARD: {final_score}')
    return final_score


verify_task()
Score breakdown
Total reward: 1.00
9
The largest open CUA RLVR dataset, by every axis.
32,122 verifiable RLVR tuples, spanning every major occupational domain — productivity, creative, OS, multi-app, browsing, and beyond.
All built on top of 110 diverse environments (16 desktop applications + 94 mock web apps), every one a deterministic, resettable sandbox.
Dataset | Platform | Data size | Env. size | Reward | Open
GUI-Genesis | Mobile | 9691 | Programmatic | No
WebArena-Infinity | Web | 1,260 | 10 | Programmatic | Yes
InfiniteWeb | Web | 600 | Programmatic | No★
UltraCUA | Desktop | 17,000 | 9 | Programmatic | No★
Gym-Anything | Desktop | 7,277 | 193 | VLM | Yes
CUA-Gym | Desktop + Web | 32,122 | 110 | Programmatic | Yes
  • All tasks ship with verifiable setups and rewards.
  • Mirrors the OSWorld evaluation environment — plug-and-play.
  • The largest and most diverse open CUA RLVR dataset to date.
10
Training details — long-horizon scaffolding with trajectory slicing.
i.
Full supervision coverage. Sliding window only updates the last N rounds — truncation discards precisely the late turns where success or failure is decided. Slicing emits M training samples per rollout, each anchored at a different starting point, so every assistant turn receives a gradient through the union of all slices.
ii.
Only screenshots collapse, never thoughts. The placeholder <image collapsed> swaps out a single field — the screenshot. User prompts and the assistant's entire chain-of-thought and tool calls remain verbatim, so the reasoning trace is intact as context for every later turn.
iii.
Deterministic and cache-friendly. Slicing buys the same context relief as summarization without the extra LM call, the quality variance, or the lost screenshot→pixel grounding. Adjacent slices share an identical prefix, so the KV cache is reusable when computing policy log-probabilities across slices.
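The three points above can be sketched in a few lines. The slides do not spell out the windowing scheme, so this sketch assumes one: slice k trains on a fixed-size window of turns and sees every earlier turn as context with its screenshot replaced by the placeholder. The names `Turn`, `slice_rollout`, and `window` are hypothetical.

```python
from dataclasses import dataclass

PLACEHOLDER = "<image collapsed>"  # placeholder string named on the slide


@dataclass
class Turn:
    screenshot: str  # stand-in for the image payload
    assistant: str   # chain-of-thought plus tool call, kept verbatim


def slice_rollout(turns, window):
    """Split one rollout into samples that each train on `window` turns.

    Assumed scheme (not confirmed by the slides): slice k trains on turns
    [k*window, (k+1)*window) and keeps all earlier turns as context with
    their screenshots collapsed.
    """
    samples = []
    for start in range(0, len(turns), window):
        end = min(start + window, len(turns))
        context = [
            (t.screenshot if i >= start else PLACEHOLDER, t.assistant)
            for i, t in enumerate(turns[:end])
        ]
        # Only turns inside the current window contribute to the loss.
        loss_mask = [i >= start for i in range(end)]
        samples.append({"context": context, "loss_mask": loss_mask})
    return samples


rollout = [Turn(f"shot{i}.png", f"thought+action {i}") for i in range(5)]
samples = slice_rollout(rollout, window=2)
```

Under this assumption a 5-turn rollout with `window=2` yields three samples. Every turn is trained on exactly once, assistant text is never altered, and the collapsed turns before a slice's window are identical in the next slice, which is what makes the prefix reusable.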
11
We GSPO-train on Qwen3.5 — and it scales.
Model | OSWorld-V. | WebArena | OSWorld-2.0 | ScienceBoard | Spider2V
Proprietary models
Claude Sonnet 4.6 | 72.9 | 65.6 | 27.8 | 43.2
Claude Opus 4.7 | 78.0
GPT-5.5 | 78.7
Open-source models
EvoCUA-32B | 56.7
OpenCUA-72B | 45.0
Kimi-K2.6 | 73.1
Ours
Qwen3.5-35B-A3B | 54.5 | 40.8
Qwen3.5-397B-A17B | 62.2 | 54.0 | 10.9 | 23.7 | 29.9
CUA-Gym-A3B | 62.1 | 44.5
CUA-Gym-A17B | 72.6 | 56.0 | 24.0 | 35.0 | 45.0
12
Both axes scale — more data and more environments.
(a) Data scaling
More RL tuples → higher OSWorld score
12K tuples
3K tuples
1.4K tuples
12K tuples consistently dominate 3K and 1.4K throughout training; the gap opens after step 30 and never closes.
(b) Environment scaling
More distinct envs → higher OSWorld score
+1.0 pp envs · +1.9 pp data
Scaling from 10 to 80 envs lifts the score by 1.0 pp at a fixed budget; adding more trajectories on the same 80 envs adds another 1.9 pp.
13

CUA-Gym

The largest open CUA RLVR dataset — 32,122 tasks · 110 envs · all open.
Try it out →

https://cua-gym.xlang.ai
14

CUA-Gym

And what’s next?

https://cua-gym.xlang.ai
15
We observe unexpected model behavior during RL training.
Terminal-Usage
A heuristic metric: whether the agent explicitly opens a terminal and interacts with the CLI during a rollout, instead of completing the task through the GUI alone.
On OSWorld-Verified
20% → 50%
terminal-usage rate, before vs after RL training
Not reward hacking, but not natural GUI behavior either: the model increasingly chooses to solve GUI tasks via the CLI once it discovers that the terminal is reliably rewarded.
Then why not just let the agent access the CLI tool?
Step 5 / 47 of a PDF-rendering task: instead of clicking through the PDF viewer, the agent opens a terminal and runs a python3 << 'EOF' heredoc that calls fitz.open(…).
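The terminal-usage heuristic can be approximated as a per-rollout boolean averaged over rollouts. The slides do not give the detection rule, so the sketch below makes assumptions: actions are logged as strings, and the marker list `TERMINAL_HINTS` is a hypothetical stand-in for whatever signals a terminal launch in a given harness.

```python
# Hypothetical markers of a terminal launch; a real harness would need its own.
TERMINAL_HINTS = ("ctrl+alt+t", "gnome-terminal", "xterm")


def uses_terminal(actions):
    """Return True if any logged action string contains a terminal marker."""
    return any(hint in action.lower() for action in actions for hint in TERMINAL_HINTS)


def terminal_usage_rate(rollouts):
    """Fraction of rollouts in which the agent opened a terminal at least once."""
    if not rollouts:
        return 0.0
    return sum(uses_terminal(r) for r in rollouts) / len(rollouts)


# Two toy rollouts: one opens a terminal, one stays in the GUI.
rollouts = [
    ["click(120, 340)", "hotkey('ctrl+alt+t')", "type('python3 script.py')"],
    ["click(50, 60)", "type('hello')"],
]
rate = terminal_usage_rate(rollouts)
```

Counting each rollout at most once matches the slide's framing of the metric as whether a terminal was opened during a rollout, not how often.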
16
Evaluating CLI harnesses on GUI benchmarks.
We drop a general-purpose CLI agent (Codex, Claude Code) into the OSWorld VM and hand it the GUI benchmark task verbatim — no screenshots, no mouse, just a shell. How far can CLI alone go on a benchmark built for GUI agents?
Model | OSWorld-V. (audited*) | OSWorld-2.0
Proprietary CUA models
Claude Sonnet 4.6 | 72.6 | 28.0
Claude Opus 4.7 | 78.0
GPT-5.5 | 78.7
CLI agent harnesses
Codex w/ GPT-5.5 | 54.0 (71.8*) | 36.8
Claude Code w/ Opus 4.7 | 49.9 (68.5*) | 24.6
* Audited: score after a manual review of the trajectories, correcting OSWorld task annotations and false-negative grader judgments.
General-purpose CLI agents reach ~70% on OSWorld-Verified after audit — closing most of the gap to dedicated CUA frontier models, without ever touching the GUI.
17
More importantly...
[Figure: OSWorld-2.0 score (%) against steps (CLI tool internal actions, or visual-agent rollout length) and time. Labels: Claude Sonnet 4.6 (13.9%, 24.9%); Claude Opus 4.7 (22.5%, 28.4%); Claude Code w/ Opus 4.7 (36 steps · 24.6%); Codex w/ GPT-5.5 (43 steps · 36.8%).]
CLI harnesses sit at the top-left of the Pareto frontier: higher OSWorld-2.0 score in an order of magnitude fewer steps than a visual CUA agent.
In wall-clock time the gap is even wider — CLI finishes in minutes while a visual agent grinds for hours per task.
18
Introducing the Unified Digital Agent.
Modeling
Combine CLI and GUI actions into a single harness, and directly optimize the policy against downstream task performance — let the model pick whichever modality is cheapest for each step.
CLI
terminal, scripts, APIs
+
GUI
click, type, screenshot
Unified Digital Agent
one policy, one reward
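One way to read "one policy, one reward" is a single action type that covers both modalities, with the environment dispatching on the variant. The sketch below is illustrative only: `CliAction`, `GuiAction`, `RecordingEnv`, and `step` are hypothetical names, and the toy environment logs calls without executing anything.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class CliAction:
    command: str  # shell command to run


@dataclass
class GuiAction:
    kind: str  # e.g. "click", "type", "screenshot"
    payload: dict = field(default_factory=dict)


# The policy emits one of these per step, so both modalities share one action space.
Action = Union[CliAction, GuiAction]


class RecordingEnv:
    """Toy environment that only logs which modality handled each action."""

    def __init__(self):
        self.log = []

    def run_shell(self, command):
        self.log.append(("cli", command))

    def gui_event(self, kind, payload):
        self.log.append(("gui", kind))


def step(env, action):
    """Route an action to the matching backend of the environment."""
    if isinstance(action, CliAction):
        env.run_shell(action.command)
    elif isinstance(action, GuiAction):
        env.gui_event(action.kind, action.payload)
    else:
        raise TypeError(f"unsupported action: {action!r}")


env = RecordingEnv()
for a in [CliAction("ls ~/Desktop"), GuiAction("click", {"x": 10, "y": 20})]:
    step(env, a)
```

Because the reward is computed on the final task state, not on which branch ran, the policy is free to mix shell commands and GUI events within one trajectory.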
Evaluation
Across the breadth of digital-agent benchmarks — desktop, web, and native applications.
OSWorld-2.0: desktop apps
CocoaBench: macOS native
WildClawBench: open-domain web
ClawBench: curated browser tasks
and more
