Introducing OS-Genesis, an annotation-free data pipeline for synthesizing GUI agent trajectories. OS-Genesis is characterized by the following core features:
We introduce OS-Genesis, an interaction-driven pipeline that synthesizes high-quality and diverse GUI agent trajectory data without human supervision. By leveraging reverse task synthesis, OS-Genesis enables effective training of GUI agents to achieve superior performance on dynamic benchmarks such as AndroidWorld and WebArena.
OS-Genesis is built around the following components, each aimed at synthesizing high-quality trajectory data for GUI agents:
We will release all checkpoints, code, synthesized trajectories, and further details. We hope OS-Genesis can inspire and accelerate future research on building GUI agents.
Training with high-quality GUI trajectories is essential for enhancing agentic capabilities. Ideal GUI agent trajectories include the following key components:
Interaction-Driven Functional Discovery is a rule-based exploration process in OS-Genesis that systematically traverses dynamic GUI environments by interacting with UI elements (e.g., clicking, typing, scrolling). This approach uncovers diverse functionalities without relying on predefined tasks, collecting interaction triples that capture pre- and post-action states along with executed actions. These triples form the foundation for reverse task synthesis, enabling the generation of meaningful and executable task instructions.
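The exploration loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `ToyGUIEnv` and its methods are hypothetical stand-ins for a real GUI environment (e.g. an Android emulator exposing screenshots and accessibility trees), and the action enumeration is deliberately simplistic.

```python
import random

class ToyGUIEnv:
    """Hypothetical minimal GUI environment used only for illustration."""

    def __init__(self):
        self.state = {"screen": "home", "query": ""}

    def observe(self):
        return dict(self.state)  # snapshot of the current GUI state

    def actions(self):
        # Rule-based enumeration of interactable elements: no predefined
        # task is needed to decide what can be clicked or typed.
        if self.state["screen"] == "home":
            return [("click", "search_bar"), ("click", "settings")]
        return [("type", "weather"), ("click", "back")]

    def step(self, action):
        kind, target = action
        if action == ("click", "search_bar"):
            self.state["screen"] = "search"
        elif action == ("click", "back"):
            self.state["screen"] = "home"
        elif kind == "type":
            self.state["query"] = target

def explore(env, steps=10, seed=0):
    """Traverse the GUI and collect (pre-state, action, post-state) triples."""
    rng = random.Random(seed)
    triples = []
    for _ in range(steps):
        pre = env.observe()
        action = rng.choice(env.actions())
        env.step(action)
        triples.append((pre, action, env.observe()))
    return triples

triples = explore(ToyGUIEnv())
```

The collected triples are exactly the raw material that reverse task synthesis consumes in the next stage.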
Reverse Task Synthesis is a core component of OS-Genesis that transforms observed interaction triples—comprising pre- and post-action states and actions—into meaningful low-level and high-level task instructions. By retroactively interpreting changes in the GUI environment caused by actions, this process generates executable low-level instructions, which are then aggregated into broader, goal-oriented high-level tasks. This human-free approach bridges the gap between abstract instructions and dynamic GUI functionalities, enabling the creation of diverse, high-quality trajectories without reliance on predefined tasks or human supervision.
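The data flow of reverse task synthesis can be sketched as below. In the actual pipeline a multimodal LLM interprets the pre/post screenshots and the action; the rule-based templates here are assumed placeholders that only illustrate how triples become low-level instructions and how those are aggregated into a high-level task.

```python
# Toy stand-in for reverse task synthesis: turn an observed interaction
# triple into a low-level instruction, then compose instructions into a
# high-level task. Templates replace the LLM calls used in the real system.
def triple_to_low_level(pre_state, action, post_state):
    kind, target = action
    if kind == "click":
        return f"Click the '{target}' element to open the {post_state['screen']} screen."
    if kind == "type":
        return f"Type '{target}' into the focused field."
    return f"Perform {kind} on {target}."

def aggregate_high_level(low_level_steps, goal_hint):
    # In OS-Genesis an LLM composes low-level instructions into a broader,
    # goal-oriented task; here we simply bundle them under a hint.
    return {"high_level": goal_hint, "low_level": low_level_steps}

step = triple_to_low_level(
    {"screen": "home"}, ("click", "search_bar"), {"screen": "search"}
)
task = aggregate_high_level(
    [step, "Type 'weather' into the focused field."],
    goal_hint="Search for the weather in the app",
)
```

The key design point survives even in this sketch: instructions are derived *after* observing what an action actually did, so every generated task is grounded in a state change that is known to be reachable.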
After reverse task synthesis generates high-level and low-level task instructions, these instructions are executed within the GUI environment to create complete trajectories.
To ensure the quality and utility of these trajectories, OS-Genesis employs a Trajectory Reward Model (TRM). Built upon GPT-4o, TRM evaluates each trajectory based on completion (task fulfillment) and coherence (logical sequence of actions), assigning a graded reward score from 1 to 5. Unlike traditional binary filtering methods, TRM allows even incomplete but valuable trajectories to contribute to training.
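The graded-reward idea can be illustrated as follows. The scoring function here is an assumed toy placeholder (a weighted average of completion and coherence mapped to 1-5), not the GPT-4o judge the paper uses; the point is the contrast between binary filtering and reward weighting.

```python
# Sketch: using a graded Trajectory Reward Model (TRM) score instead of
# binary filtering. trm_score is a hypothetical stand-in for the GPT-4o
# judge that rates completion and coherence on a 1-5 scale.
def trm_score(trajectory):
    completion = trajectory["completion"]  # in [0, 1], task fulfillment
    coherence = trajectory["coherence"]    # in [0, 1], logical action flow
    return round(1 + 4 * (0.5 * completion + 0.5 * coherence))  # 1..5

def binary_filter(trajectories, threshold=4):
    # Traditional approach: keep only trajectories above a cutoff.
    return [t for t in trajectories if trm_score(t) >= threshold]

def reward_weighted(trajectories):
    # TRM approach: incomplete-but-coherent trajectories still contribute
    # to training, just with a lower sampling weight.
    return [(t, trm_score(t) / 5.0) for t in trajectories]

trajs = [
    {"completion": 1.0, "coherence": 0.9},  # finished, clean execution
    {"completion": 0.2, "coherence": 0.8},  # unfinished but sensible steps
]
kept = binary_filter(trajs)          # drops the second trajectory
weighted = reward_weighted(trajs)    # keeps both, down-weighting the second
```

A hard threshold would discard the second trajectory entirely, whereas the graded reward lets its coherent prefix still inform training.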
We first evaluate OS-Genesis on mobile tasks, covering AndroidWorld and AndroidControl.
| Base Model | Strategy | AndroidWorld SR | AndroidControl-High SR | AndroidControl-High Type | AndroidControl-Low SR | AndroidControl-Low Type |
|---|---|---|---|---|---|---|
| GPT-4o | Zero-Shot (M3A) | 23.70 | 53.04 | 69.14 | 69.59 | 80.27 |
| InternVL2-4B | Zero-Shot | 0.00 | 16.62 | 39.96 | 33.69 | 60.65 |
| InternVL2-4B | Task-Driven | 4.02 | 27.37 | 47.08 | 66.48 | 90.37 |
| InternVL2-4B | Task-Driven w. Self-Instruct | 7.14 | 24.95 | 44.27 | 66.70 | 90.79 |
| InternVL2-4B | OS-Genesis | 15.18 | 33.39 | 56.20 | 73.38 | 91.32 |
| InternVL2-8B | Zero-Shot | 2.23 | 17.89 | 38.22 | 47.69 | 66.67 |
| InternVL2-8B | Task-Driven | 4.46 | 23.79 | 43.94 | 64.43 | 89.83 |
| InternVL2-8B | Task-Driven w. Self-Instruct | 5.36 | 23.43 | 44.43 | 64.69 | 89.85 |
| InternVL2-8B | OS-Genesis | 16.96 | 35.77 | 64.57 | 71.37 | 91.27 |
| Qwen2-VL-7B | Zero-Shot | 0.89 | 28.92 | 61.39 | 46.37 | 72.78 |
| Qwen2-VL-7B | Task-Driven | 6.25 | 38.84 | 58.08 | 71.33 | 88.71 |
| Qwen2-VL-7B | Task-Driven w. Self-Instruct | 9.82 | 39.36 | 58.28 | 71.57 | 89.73 |
| Qwen2-VL-7B | OS-Genesis | 17.41 | 44.54 | 66.15 | 74.17 | 90.72 |
AndroidWorld: OS-Genesis demonstrates exceptional performance on the AndroidWorld benchmark, significantly narrowing the gap between open-source agents and the state-of-the-art GPT-4o-based M3A agent. Training with OS-Genesis-synthesized data achieves nearly double the success rates compared to task-driven methods, with a success rate improvement from 9.82% to 17.41% for Qwen2-VL-7B and substantial gains for other backbones like InternVL2-8B.
AndroidControl: On AndroidControl, OS-Genesis showcases strong out-of-distribution capability, outperforming baselines on both high- and low-level tasks despite encountering only 20 of the benchmark's 833 apps during synthesis. It achieves superior action prediction and planning, validating our exploration-first approach for generating diverse, high-quality tasks and adapting effectively to unseen environments.
Next, we evaluate OS-Genesis on web tasks using the challenging online benchmark WebArena.
WebArena: On WebArena, OS-Genesis delivers notable performance improvements across five diverse navigation scenarios, outperforming task-driven baselines and achieving significant gains with the InternVL2-8B and Qwen2-VL-7B backbones. By leveraging reverse task synthesis, OS-Genesis effectively explores the rich interactive elements of web environments, producing more meaningful and diverse trajectories.
| Base Model | Strategy | Shopping | CMS | Reddit | Gitlab | Maps | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4o | Zero-Shot | 14.28 | 21.05 | 6.25 | 14.29 | 20.00 | 16.25 |
| InternVL2-4B | Zero-Shot | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| InternVL2-4B | Task-Driven | 5.36 | 1.76 | 0.00 | 9.52 | 5.00 | 4.98 |
| InternVL2-4B | Task-Driven w. Self-Instruct | 5.36 | 3.51 | 0.00 | 9.52 | 7.50 | 5.81 |
| InternVL2-4B | OS-Genesis | 10.71 | 7.02 | 3.13 | 7.94 | 7.50 | 7.88 |
| InternVL2-8B | Zero-Shot | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| InternVL2-8B | Task-Driven | 3.57 | 7.02 | 0.00 | 6.35 | 2.50 | 4.56 |
| InternVL2-8B | Task-Driven w. Self-Instruct | 8.93 | 10.53 | 6.25 | 7.94 | 0.00 | 7.05 |
| InternVL2-8B | OS-Genesis | 7.14 | 15.79 | 9.34 | 6.35 | 10.00 | 9.96 |
| Qwen2-VL-7B | Zero-Shot | 12.50 | 7.02 | 6.25 | 6.35 | 5.00 | 7.47 |
| Qwen2-VL-7B | Task-Driven | 8.93 | 7.02 | 6.25 | 6.35 | 5.00 | 7.05 |
| Qwen2-VL-7B | Task-Driven w. Self-Instruct | 8.93 | 1.76 | 3.13 | 4.84 | 7.50 | 5.39 |
| Qwen2-VL-7B | OS-Genesis | 7.14 | 8.77 | 15.63 | 15.87 | 5.00 | 10.79 |
We investigate the impact of data scale on building GUI agents. To explore this, we partition the data synthesized by OS-Genesis into subsets, ranging from small-scale trajectory sets to ones exceeding the size used in the main experiments. Using AndroidWorld as our testbed, we focus on two primary questions: (1) How does performance improve as the data scale increases? (2) Does performance saturate at higher data scales?
As shown above, task performance generally improves as the number of trajectories increases, while saturation emerges at larger data scales.
We also investigate the gap between OS-Genesis-synthesized data and human-annotated data. (1) Trajectories from OS-Genesis vs. human-annotated trajectories. We select 1K crowdsourced trajectories from the AndroidControl training set for comparison. As shown below, we significantly narrow the performance gap between synthetic and human-annotated trajectories. This is especially evident in high-level tasks, demonstrating that agents trained on OS-Genesis trajectories plan and solve problems in a manner more closely aligned with humans. In terms of average success rate, taking human-annotated data as the gold standard, the performance retention rate of OS-Genesis data surpasses 80%.
(2) High-level instructions synthesized by OS-Genesis vs. human-written instructions. For comparison, we match 500 human-written tasks from the AndroidControl training set and use GPT-4o for exploration. Even when high-level instructions are written by humans, their performance falls short of OS-Genesis's instructions. This can be attributed to two main factors: (a) pre-defined tasks sometimes fail to align with the dynamic environment, and (b) models may introduce errors when interpreting the intentions of human annotators. In contrast, OS-Genesis generates data progressively, grounded in low-level interactions, which makes it inherently better suited to unsupervised exploration and adaptation.
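The retention-rate metric used in comparison (1) is simply the synthetic-data success rate relative to the human-annotated gold standard; a minimal sketch, with illustrative numbers rather than the paper's reported values:

```python
def retention_rate(synthetic_sr, human_sr):
    """Percentage of the human-gold success rate retained by synthetic data."""
    return 100.0 * synthetic_sr / human_sr

# Assumed example values, chosen only to illustrate a >80% retention rate.
rate = retention_rate(synthetic_sr=57.0, human_sr=68.0)
```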
OS-Genesis is a data synthesis pipeline designed to revolutionize the construction of GUI agent trajectories. Reverse task synthesis enables the generation of diverse and coherent tasks by retroactively deriving instructions from observed interactions, while TRM ensures the quality of trajectories through graded evaluations. Together, these components address critical challenges in GUI agent trajectory construction, paving the way for high-quality agentic data generation. We hope OS-Genesis can provide a promising direction for generating high-quality trajectory data for GUI agents, bringing the community one step closer to achieving digital automation.
We would like to thank the AndroidWorld authors for helping us tackle various issues in dynamic environments, as well as the Cambrian authors for providing this webpage template.
@article{sun2024osgenesis,
title={OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis},
author={Sun, Qiushi and Cheng, Kanzhi and Ding, Zichen and Jin, Chuanyang and Wang, Yian and Xu, Fangzhi and Wu, Zhenyu and Jia, Chengyou and Chen, Liheng and Liu, Zhoumianze and others},
journal={arXiv preprint arXiv:2412.19723},
year={2024}
}