OS-Genesis

Automating GUI Agent Trajectory Construction
via Reverse Task Synthesis

Introducing OS-Genesis, a manual-free data pipeline for synthesizing GUI agent trajectories. OS-Genesis is characterized by the following core features:

Interaction-driven: Agents actively explore GUI environments through stepwise interactions to discover functionalities and generate data.
Reverse Task Synthesis: OS-Genesis retroactively derives meaningful low/high-level task instructions from observed interactions and state changes, enabling the construction of diverse and executable trajectories without pre-defined tasks.
Trajectory Data: We construct and release high-quality mobile and web trajectories to accelerate GUI agent research.
Performance: OS-Genesis significantly outperforms other synthesis methods on benchmarks like AndroidWorld and WebArena.

We introduce OS-Genesis, an interaction-driven pipeline that synthesizes high-quality and diverse GUI agent trajectory data without human supervision. By leveraging reverse task synthesis, OS-Genesis enables effective training of GUI agents to achieve superior performance on dynamic benchmarks such as AndroidWorld and WebArena.

OS-Genesis is built around the following components, which together aim to synthesize high-quality trajectory data for GUI agents:

  1. Interaction-Driven Functional Discovery: A rule-based process systematically interacts with GUIs to uncover functionalities without pre-defined tasks.
  2. Reverse Task Synthesis: Converts triples (pre-action state, action, post-action state) into low-level instructions and associates them with high-level tasks. This retrospective approach bridges the gap between dynamic GUI functionalities and executable trajectories.
  3. Trajectory Reward Model: Evaluates synthesized trajectories using a graded scoring system that measures task completion and action coherence. This retains valuable incomplete trajectories while maintaining high data quality.
  4. Performance: We evaluate on existing online benchmarks such as AndroidWorld and WebArena. OS-Genesis achieves notable performance gains, narrowing the gap with human-annotated data.


We will release all checkpoints, code, synthesized trajectories, and more details. We hope OS-Genesis can inspire and boost future research in building GUI agents.

GUI Trajectory Data

A GUI agent trajectory consists of the following components:

  1. a high-level instruction that defines the overall goal the agent aims to accomplish;
  2. a series of low-level instructions, each describing a specific step required to reach that goal;
  3. actions (e.g., CLICK, TYPE);
  4. states, which include visual representations such as screenshots and textual representations such as the accessibility tree (a11ytree).
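
The four components listed above can be pictured as a small record type. The following is only an illustrative sketch: the class and field names (`State`, `Step`, `Trajectory`, `action_args`, and so on) are assumptions made for this example and are not the schema of the released data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class State:
    """One observation of the GUI (field names are hypothetical)."""
    screenshot_path: Optional[str] = None  # visual representation
    a11y_tree: Optional[str] = None        # textual representation


@dataclass
class Step:
    """One step: the low-level instruction, the action, and the state it acts on."""
    low_level_instruction: str
    action_type: str   # e.g. "CLICK" or "TYPE"
    action_args: dict  # hypothetical container for action parameters
    state: State


@dataclass
class Trajectory:
    """A high-level goal plus the ordered steps that pursue it."""
    high_level_instruction: str
    steps: List[Step] = field(default_factory=list)

    def action_types(self) -> List[str]:
        return [step.action_type for step in self.steps]


# A made-up two-step trajectory, for illustration only.
traj = Trajectory(
    high_level_instruction="Add a contact named Alice",
    steps=[
        Step("Tap the 'New contact' button", "CLICK", {"element_id": 3},
             State(a11y_tree="<Button id=3 text='New contact'/>")),
        Step("Type the name into the name field", "TYPE",
             {"element_id": 7, "text": "Alice"},
             State(a11y_tree="<EditText id=7/>")),
    ],
)
print(traj.action_types())
```

Keeping the state inside each step makes every (state, action) pair self-contained, which is convenient when steps are later used as individual training examples.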

Interaction-Driven Functional Discovery

Interaction-Driven Functional Discovery is a rule-based exploration process in OS-Genesis that systematically traverses dynamic GUI environments by interacting with UI elements (e.g., clicking, typing, scrolling). This approach uncovers diverse functionalities without relying on predefined tasks, collecting interaction triples that capture pre- and post-action states along with executed actions. These triples form the foundation for reverse task synthesis, enabling the generation of meaningful and executable task instructions.
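
To make the idea of rule-based traversal concrete, here is a minimal sketch on a toy environment. Everything in it is an assumption for illustration: the GUI is reduced to a hypothetical dictionary (`TOY_APP`) mapping each screen to its clickable elements, states are plain screen names rather than screenshots or accessibility trees, and only click actions are modeled.

```python
from typing import Dict, List, Tuple

# Toy GUI (hypothetical): screen name -> {element label -> screen reached by clicking}.
TOY_APP: Dict[str, Dict[str, str]] = {
    "home": {"Settings": "settings", "Contacts": "contacts"},
    "settings": {"Back": "home", "Wi-Fi": "wifi"},
    "contacts": {"Back": "home"},
    "wifi": {"Back": "settings"},
}

# (pre-action state, action, post-action state)
Triple = Tuple[str, str, str]


def explore(app: Dict[str, Dict[str, str]], start: str) -> List[Triple]:
    """Click every element on every reachable screen once and record the triples."""
    triples: List[Triple] = []
    visited = set()
    frontier = [start]
    while frontier:
        screen = frontier.pop()
        if screen in visited:
            continue
        visited.add(screen)
        # Fixed rule: try each element in alphabetical order, no task in mind.
        for element, next_screen in sorted(app[screen].items()):
            triples.append((screen, f"CLICK[{element}]", next_screen))
            if next_screen not in visited:
                frontier.append(next_screen)
    return triples


triples = explore(TOY_APP, "home")
for pre, action, post in triples:
    print(f"{pre} --{action}--> {post}")
```

No task description appears anywhere in the loop: the triples are a by-product of interaction alone, which is what allows instructions to be derived from them afterwards.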

Reverse Task Synthesis

Reverse Task Synthesis is a core component of OS-Genesis that transforms observed interaction triples—comprising pre- and post-action states and actions—into meaningful low-level and high-level task instructions. By retroactively interpreting changes in the GUI environment caused by actions, this process generates executable low-level instructions, which are then aggregated into broader, goal-oriented high-level tasks. This human-free approach bridges the gap between abstract instructions and dynamic GUI functionalities, enabling the creation of diverse, high-quality trajectories without reliance on predefined tasks or human supervision.
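
The two-stage flow (triple to low-level instruction, then low-level instruction to high-level task) can be sketched as below. This is a hedged illustration, not the pipeline's implementation: the annotators are passed in as callables, and the two template functions are hypothetical stand-ins for whatever model actually interprets the state change.

```python
from typing import Callable, Dict, List, Tuple

# (pre-action state, action, post-action state)
Triple = Tuple[str, str, str]


def template_low_level(triple: Triple) -> str:
    """Stand-in annotator: describe the step from the observed state change."""
    pre, action, post = triple
    return f"On the '{pre}' screen, perform {action} to reach the '{post}' screen"


def template_high_level(low_level: str, triple: Triple) -> str:
    """Stand-in annotator: propose a broader goal the step could belong to."""
    return f"Open the '{triple[2]}' screen"


def reverse_synthesize(
    triples: List[Triple],
    low_fn: Callable[[Triple], str] = template_low_level,
    high_fn: Callable[[str, Triple], str] = template_high_level,
) -> List[Dict[str, object]]:
    """Derive instructions after the fact, from interactions already observed."""
    tasks = []
    for triple in triples:
        low = low_fn(triple)          # stage 1: low-level instruction
        high = high_fn(low, triple)   # stage 2: associated high-level task
        tasks.append({"triple": triple, "low_level": low, "high_level": high})
    return tasks


tasks = reverse_synthesize([("home", "CLICK[Settings]", "settings")])
print(tasks[0]["low_level"])
print(tasks[0]["high_level"])
```

The point of the sketch is the direction of inference: the instruction is written to fit an interaction that is known to be executable, rather than an interaction being sought to fit a pre-written instruction.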

Exploration and Reward Modeling

After reverse task synthesis generates high-level and low-level task instructions, these instructions are executed within the GUI environment to create complete trajectories.

To ensure the quality and utility of these trajectories, OS-Genesis employs a Trajectory Reward Model (TRM). Built upon GPT-4o, TRM evaluates each trajectory based on completion (task fulfillment) and coherence (logical sequence of actions), assigning a graded reward score from 1 to 5. Unlike traditional binary filtering methods, TRM allows even incomplete but valuable trajectories to contribute to training.
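
The difference between graded scoring and binary filtering can be shown with a small sketch. The text above states only that TRM assigns a score from 1 to 5; sampling trajectories in proportion to that score, as done here, is an assumption made for illustration, as are all function names.

```python
import random
from typing import List, Sequence


def sampling_probs(rewards: Sequence[int]) -> List[float]:
    """Turn graded rewards (1..5) into sampling probabilities (assumed scheme)."""
    if any(r < 1 or r > 5 for r in rewards):
        raise ValueError("rewards must lie in the range 1..5")
    total = sum(rewards)
    return [r / total for r in rewards]


def binary_filter(rewards: Sequence[int], threshold: int = 5) -> List[int]:
    """Baseline for contrast: keep only trajectories at or above the threshold."""
    return [i for i, r in enumerate(rewards) if r >= threshold]


def sample_indices(rewards: Sequence[int], k: int, seed: int = 0) -> List[int]:
    """Draw k trajectory indices with replacement, weighted by reward."""
    rng = random.Random(seed)
    return rng.choices(range(len(rewards)), weights=rewards, k=k)


rewards = [5, 3, 2]  # made-up scores for three trajectories
probs = sampling_probs(rewards)
kept = binary_filter(rewards)
drawn = sample_indices(rewards, k=10)
print(probs, kept, drawn)
```

Under the binary filter only the first trajectory survives, whereas under weighted sampling the partially successful ones still have a nonzero chance of being drawn, just less often.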


Main Experiments

Mobile

We first evaluate OS-Genesis on mobile tasks, covering AndroidWorld and AndroidControl. The results are shown in Table 1.

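| Base Model | Strategies | AndroidWorld SR | AndroidControl-High SR | AndroidControl-High Type | AndroidControl-Low SR | AndroidControl-Low Type |
|---|---|---|---|---|---|---|
| GPT-4o | Zero-Shot (M3A) | 23.70 | 53.04 | 69.14 | 69.59 | 80.27 |
| InternVL2-4B | Zero-Shot | 0.00 | 16.62 | 39.96 | 33.69 | 60.65 |
| InternVL2-4B | Task-Driven | 4.02 | 27.37 | 47.08 | 66.48 | 90.37 |
| InternVL2-4B | Task-Driven w. Self Instruct | 7.14 | 24.95 | 44.27 | 66.70 | 90.79 |
| InternVL2-4B | OS-Genesis | 15.18 | 33.39 | 56.20 | 73.38 | 91.32 |
| InternVL2-8B | Zero-Shot | 2.23 | 17.89 | 38.22 | 47.69 | 66.67 |
| InternVL2-8B | Task-Driven | 4.46 | 23.79 | 43.94 | 64.43 | 89.83 |
| InternVL2-8B | Task-Driven w. Self Instruct | 5.36 | 23.43 | 44.43 | 64.69 | 89.85 |
| InternVL2-8B | OS-Genesis | 16.96 | 35.77 | 64.57 | 71.37 | 91.27 |
| Qwen2-VL-7B | Zero-Shot | 0.89 | 28.92 | 61.39 | 46.37 | 72.78 |
| Qwen2-VL-7B | Task-Driven | 6.25 | 38.84 | 58.08 | 71.33 | 88.71 |
| Qwen2-VL-7B | Task-Driven w. Self Instruct | 9.82 | 39.36 | 58.28 | 71.57 | 89.73 |
| Qwen2-VL-7B | OS-Genesis | 17.41 | 44.54 | 66.15 | 74.17 | 90.72 |

Table 1: Performance on the AndroidWorld and AndroidControl benchmarks (SR: success rate).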

AndroidWorld: OS-Genesis performs strongly on the AndroidWorld benchmark, significantly narrowing the gap between open-source agents and the state-of-the-art GPT-4o-based M3A agent. Training on OS-Genesis-synthesized data roughly doubles the success rate relative to task-driven methods: for Qwen2-VL-7B it rises from 9.82% to 17.41%, and for InternVL2-8B from 5.36% to 16.96%.
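
The size of the AndroidWorld gains can be checked directly from Table 1. The short calculation below is only a convenience sketch: the numbers are copied from the table, the dictionary layout and key names are made up for this example, and "best task-driven" means the higher of the two task-driven rows for each backbone.

```python
# AndroidWorld success rates (%) copied from Table 1.
android_world_sr = {
    "InternVL2-4B": {"best_task_driven": 7.14, "os_genesis": 15.18},
    "InternVL2-8B": {"best_task_driven": 5.36, "os_genesis": 16.96},
    "Qwen2-VL-7B": {"best_task_driven": 9.82, "os_genesis": 17.41},
}

summary = {}
for model, sr in android_world_sr.items():
    ratio = sr["os_genesis"] / sr["best_task_driven"]  # relative improvement
    gain = sr["os_genesis"] - sr["best_task_driven"]   # absolute gain in points
    summary[model] = (round(ratio, 2), round(gain, 2))
    print(f"{model}: x{ratio:.2f} ({gain:+.2f} points)")
```

The ratio is smallest for Qwen2-VL-7B and largest for InternVL2-8B, so "nearly double" is the conservative end of the range across the three backbones.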

AndroidControl: On AndroidControl, OS-Genesis shows strong out-of-distribution (OOD) generalization, outperforming the baselines on both high-level and low-level tasks even though only 20 of the 833 apps were encountered during synthesis. It improves both planning and action execution, which supports our exploration-first approach to generating diverse, high-quality tasks and shows that it adapts effectively to unseen environments.

Web

Next, we evaluate OS-Genesis on web tasks, using the challenging online benchmark WebArena as the testbed. The results are shown in Table 2.

WebArena: On WebArena, OS-Genesis delivers notable improvements in overall performance across five diverse navigation scenarios, outperforming the task-driven baselines and achieving the largest overall gains with the InternVL2-8B and Qwen2-VL-7B backbones. By leveraging reverse task synthesis, OS-Genesis effectively explores the rich interactive elements of web environments, producing more meaningful and diverse trajectories.

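| Base Model | Strategies | Shopping | CMS | Reddit | Gitlab | Maps | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4o | Zero-Shot | 14.28 | 21.05 | 6.25 | 14.29 | 20.00 | 16.25 |
| InternVL2-4B | Zero-Shot | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| InternVL2-4B | Task-Driven | 5.36 | 1.76 | 0.00 | 9.52 | 5.00 | 4.98 |
| InternVL2-4B | Task-Driven w. Self Instruct | 5.36 | 3.51 | 0.00 | 9.52 | 7.50 | 5.81 |
| InternVL2-4B | OS-Genesis | 10.71 | 7.02 | 3.13 | 7.94 | 7.50 | 7.88 |
| InternVL2-8B | Zero-Shot | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| InternVL2-8B | Task-Driven | 3.57 | 7.02 | 0.00 | 6.35 | 2.50 | 4.56 |
| InternVL2-8B | Task-Driven w. Self Instruct | 8.93 | 10.53 | 6.25 | 7.94 | 0.00 | 7.05 |
| InternVL2-8B | OS-Genesis | 7.14 | 15.79 | 9.34 | 6.35 | 10.00 | 9.96 |
| Qwen2-VL-7B | Zero-Shot | 12.50 | 7.02 | 6.25 | 6.35 | 5.00 | 7.47 |
| Qwen2-VL-7B | Task-Driven | 8.93 | 7.02 | 6.25 | 6.35 | 5.00 | 7.05 |
| Qwen2-VL-7B | Task-Driven w. Self Instruct | 8.93 | 1.76 | 3.13 | 4.84 | 7.50 | 5.39 |
| Qwen2-VL-7B | OS-Genesis | 7.14 | 8.77 | 15.63 | 15.87 | 5.00 | 10.79 |

Table 2: Performance on the WebArena benchmark.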

Analysis

How Does Scaling Trajectory Data Improve Agentic Ability?

How Far Are We from Human Data?

Conclusion

OS-Genesis is a data synthesis pipeline for constructing GUI agent trajectories without human supervision. Reverse task synthesis enables the generation of diverse and coherent tasks by retroactively deriving instructions from observed interactions, while TRM ensures the quality of trajectories through graded evaluation. Together, these components address critical challenges in GUI agent trajectory construction. We hope OS-Genesis offers a promising direction for generating high-quality agentic data, bringing the community one step closer to digital automation.

Acknowledgement

We would like to thank the AndroidWorld authors for helping us tackle various issues in dynamic environments, as well as the Cambrian authors for providing this webpage template.

BibTeX

@article{sun2024osgenesis,
  title={OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis},
  author={Qiushi Sun and Kanzhi Cheng and Zichen Ding and Chuanyang Jin and Yian Wang and Fangzhi Xu and Zhenyu Wu and Chengyou Jia and Liheng Chen and Zhoumianze Liu and Ben Kao and Guohao Li and Junxian He and Yu Qiao and Zhiyong Wu},
  journal={arXiv preprint arXiv:2412.19723},
  year={2024}
}