Towards Safety-Enhanced Mobile GUI Agents
via Hybrid Validation in Realistic Workflows
Introducing OS-Sentinel, a novel hybrid safety detection framework, and MobileRisk-Live, a pioneering testbed for advancing safety research on autonomous mobile GUI agents. This work is characterized by the following core features:
Realistic Testbed & Benchmark: We introduce MobileRisk-Live, a dynamic sandbox environment for real-time safety studies, and MobileRisk, a benchmark of fine-grained agent trajectories with safety annotations, laying the groundwork for future research.
Novel Hybrid Framework: We propose OS-Sentinel, a hybrid framework that integrates a formal verifier for explicit system-level detection with a model-based contextual judge to handle multifaceted safety challenges.
Multi-Granularity Detection: The framework operates at both the step-level to function as a real-time safety guard and at the trajectory-level for comprehensive post-hoc analysis.
Comprehensive & Effective Evaluation: Extensive experiments show that OS-Sentinel consistently surpasses traditional baselines, achieving 10%-30% improvements.
We present a study of safety-enhanced mobile GUI agents, featuring OS-Sentinel, a novel hybrid detection framework, alongside MobileRisk-Live, a dynamic sandbox environment, and MobileRisk, a corresponding safety benchmark. This suite of tools enables robust evaluation and detection of multifaceted safety risks in realistic mobile workflows.
This work is built around the following components, aiming at the systematic evaluation and detection of safety risks in mobile agents:
§Safety for Mobile GUI Agents:
OS-Sentinel addresses the critical, underexplored challenge of agent safety on mobile platforms, tackling multifaceted risks ranging from inadvertent privacy leakage and offensive content to system compromise.
§Dynamic Sandbox Environment: This work provides MobileRisk-Live, a dynamic and extendable Android emulator-based sandbox. It supports real-time agent interaction while recording both GUI observations and underlying system state traces, which are critical for safety analysis.
§Realistic Safety Benchmark: We derive MobileRisk, a benchmark of fine-grained mobile agent trajectories built from frozen trajectories captured in MobileRisk-Live. These trajectories underwent human inspection, calibration, and task validation before being annotated at both the step and trajectory levels, covering a comprehensive taxonomy of 10 distinct risk categories.
§Hybrid Detection and Validation: We propose a novel hybrid detection framework that combines a Formal Verifier for deterministic system-level violations with a VLM-based Contextual Judge for nuanced, context-dependent risks (a minimal sketch of this two-stage design follows this list). Extensive experiments show this approach achieves 10%-30% improvements over baselines, providing critical insights for developing safer mobile agents.
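To make the division of labor concrete, below is a minimal Python sketch of a step-level check under this two-stage design. Every name here (StepRecord, FORBIDDEN_CALLS, the vlm client) is our own illustration for exposition, not the released OS-Sentinel API.

```python
# Minimal sketch of the hybrid step-level check (names are illustrative,
# not the released API): a deterministic rule pass runs first, and only
# ambiguous steps are escalated to a VLM-based contextual judge.
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    action: str                                        # e.g. "tap(confirm_delete)"
    screenshot: str                                    # path to the GUI observation
    system_trace: dict = field(default_factory=dict)   # recorded system-level events

# Illustrative deterministic rules; a real verifier would encode many more.
FORBIDDEN_CALLS = ("pm uninstall", "rm -rf", "settings put secure")

def formal_verify(step: StepRecord) -> bool:
    """Stage 1: deterministically flag explicit system-level violations."""
    calls = step.system_trace.get("shell_commands", [])
    return any(rule in call for call in calls for rule in FORBIDDEN_CALLS)

def contextual_judge(step: StepRecord, vlm) -> bool:
    """Stage 2: escalate nuanced, context-dependent cases to a VLM judge."""
    prompt = (f"The agent is about to execute: {step.action}. "
              "Based on the attached screenshot, is this step unsafe? "
              "Answer YES or NO.")
    return "YES" in vlm.generate(prompt, image=step.screenshot)

def is_unsafe(step: StepRecord, vlm) -> bool:
    # Cheap deterministic coverage first, model-based judgment second.
    return formal_verify(step) or contextual_judge(step, vlm)
```

Running the deterministic pass first keeps the common case cheap; only steps that clear the explicit rules are escalated to the model-based judge.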
We will release all code for the infrastructure, methods, and benchmarking pipelines, along with further details. We hope this work can inspire and accelerate future research advancing the safety of mobile GUI agents.
2025-10-29: Initial release of our paper, environment, benchmark, and Project Website. Check it out!
Safety Issues of Mobile GUI Agents
Large Language Models (LLMs) and Vision-Language Models (VLMs) have significantly advanced the development of autonomous agents capable of interacting with digital systems. Recent progress in computer-using agents, which operate through GUI and CLI interfaces, has unlocked new possibilities for automating complex workflows in domains like software engineering and web navigation.
Mobile GUI agents bring this paradigm to smartphones: they observe the screen, then tap, type, and navigate across apps on a user's behalf. Unlike sandboxed benchmarks of pure capability, their actions carry real consequences. A single misguided step can leak private information, publish offensive content, or compromise the system itself. Safety on mobile platforms therefore remains a critical yet underexplored challenge, and judging whether a step is safe often requires reasoning jointly over the GUI observation, the agent's action, and the underlying system state. Closing this gap is the goal of OS-Sentinel and the testbed introduced below.
MobileRisk-Live: Testbed for Mobile Agent Safety
MobileRisk-Live is a dynamic, extendable sandbox built on the Android emulator, offering a realistic testbed for studying mobile agent safety in real time. Agents interact with the environment much as they would with a physical device, while the sandbox records both GUI observations and the underlying system state traces that accompany each action.
Capturing both signals is what makes fine-grained safety analysis possible: risks that leave no visible trace on screen can still surface in the system state. Because the environment is extendable, new apps, tasks, and risk scenarios can be added to support future safety studies.
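To give a concrete picture of the interface such a sandbox implies, here is a short sketch of a rollout loop in which every step yields a paired GUI observation and system state trace. The class and method names are our assumptions for illustration, not the released MobileRisk-Live API.

```python
# Hypothetical interaction loop: the point is that each step yields BOTH
# a GUI observation and a system state trace for downstream detectors.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Observation:
    screenshot: bytes                                  # GUI-level view of the screen
    system_trace: dict = field(default_factory=dict)   # e.g. logged API calls, file ops

class SandboxEnv(Protocol):
    """Minimal interface a MobileRisk-Live-style sandbox could expose."""
    def reset(self, task: str) -> Observation: ...
    def step(self, action: str) -> Observation: ...

def run_episode(env: SandboxEnv, agent, task: str, max_steps: int = 30) -> list:
    """Roll out an agent while keeping both signals for safety analysis."""
    trajectory = []
    obs = env.reset(task)
    for _ in range(max_steps):
        action = agent.act(obs)            # the agent decides from the GUI view
        trajectory.append((obs, action))   # ...but the system trace is kept too
        obs = env.step(action)
    return trajectory
```

Keeping the system trace alongside each screenshot is what later allows a verifier to check properties that never appear on screen.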
MobileRisk: A Benchmark of Realistic Trajectories
MobileRisk is a benchmark of fine-grained mobile agent trajectories, built by freezing trajectories captured in MobileRisk-Live. Every trajectory underwent human inspection and calibration before being annotated, and the benchmark covers a comprehensive taxonomy of 10 distinct risk categories, spanning threats from inadvertent privacy leakage and offensive content to system compromise. Freezing the trajectories makes evaluation reproducible: different detectors can be compared on exactly the same agent behavior.
Annotations are provided at two granularities: step-level labels mark where a trajectory turns unsafe, allowing a detector to act as a real-time safety guard, while trajectory-level labels support comprehensive post-hoc analysis of complete runs. Because each frozen trajectory also carries its recorded system state traces, the benchmark can evaluate detectors that reason beyond screenshots alone.
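For concreteness, here is a hypothetical record illustrating how the two annotation granularities could coexist in one entry; the field names, labels, and task below are our own invention, not the released schema.

```python
# Illustrative only: field names, labels, and the task are invented here.
trajectory_record = {
    "task": "Share my recent photos in the group chat",  # hypothetical instruction
    "risk_category": "privacy_leakage",   # one of the 10 categories (name assumed)
    "trajectory_label": "unsafe",         # trajectory-level annotation
    "steps": [
        {"action": "open(gallery)",        "label": "safe"},
        {"action": "select_all(photos)",   "label": "safe"},
        {"action": "tap(share_publicly)",  "label": "unsafe"},  # first unsafe step
    ],
}
```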
Evaluations
Main Settings
We evaluate OS-Sentinel on MobileRisk against traditional safety-detection baselines, at both the step level and the trajectory level. Across these settings, OS-Sentinel consistently surpasses the baselines, achieving 10%-30% improvements. Detailed numbers and experimental settings are reported in the paper.
Complementary Signals: The two components of OS-Sentinel cover different failure modes. The Formal Verifier deterministically catches explicit system-level violations that purely model-based judges can miss, while the VLM-based Contextual Judge handles nuanced, context-dependent risks that no fixed rule set can enumerate. Unifying the two yields broader and deeper coverage than prior approaches.
Multi-Granularity Detection: Operating at the step level lets OS-Sentinel serve as a real-time safety guard that flags unsafe actions as they occur, while trajectory-level operation supports comprehensive post-hoc analysis of full runs. Beyond the headline numbers, these analyses yield critical insights for developing safer mobile agents.
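A useful property of this setup is that a step-level detector composes directly into a trajectory-level one. A minimal sketch, assuming the simple aggregation rule that a trajectory is unsafe if any of its steps is flagged (the helper names are ours):

```python
# Hypothetical helpers (ours, not the released pipeline) showing how one
# detector can serve both granularities.
from typing import Callable, Dict, List

Step = Dict  # e.g. {"action": ..., "screenshot": ..., "system_trace": ...}

def guard_step(detector: Callable[[Step], bool], step: Step) -> bool:
    """Step level: run the detector online, before the action takes effect."""
    return detector(step)

def audit_trajectory(detector: Callable[[Step], bool], steps: List[Step]) -> bool:
    """Trajectory level: post-hoc review flags a run if any step is unsafe."""
    return any(detector(step) for step in steps)
```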
Conclusion
In this work, we conduct a comprehensive study of safety issues in mobile GUI agents. To support realistic safety research, we introduce MobileRisk-Live and MobileRisk, a dynamic sandbox environment and a benchmark with fine-grained annotations that together enable general-purpose, reproducible evaluation. We then propose OS-Sentinel, a novel hybrid detection framework that unifies deterministic verification with contextual risk assessment over system state traces, multimodal content, and agent actions, offering broader and deeper coverage than prior approaches. Extensive experiments and analyses demonstrate the value and reliability of the proposed testbed and detection strategies. By contributing infrastructure, methodology, and empirical insights, this work moves the field toward safety-enhanced mobile GUI agents.
Acknowledgement
We would like to thank the AndroidWorld and MobileSafetyBench authors for providing infrastructure and baselines, as well as the Cambrian authors for providing this webpage template.
BibTeX
@article{sun2025sentinel,
title={OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows},
author={Qiushi Sun and Mukai Li and Zhoumianze Liu and Zhihui Xie and Fangzhi Xu and Zhangyue Yin and Kanzhi Cheng and Zehao Li and Zichen Ding and Qi Liu and Zhiyong Wu and Zhuosheng Zhang and Ben Kao and Lingpeng Kong},
year={2025},
journal={arXiv preprint arXiv:2510.24411}
}