OS-Sentinel

Towards Safety-Enhanced Mobile GUI Agents
via Hybrid Validation in Realistic Workflows

Introducing OS-Sentinel, a novel hybrid safety detection framework, and MobileRisk-Live, a pioneering testbed for advancing safety research on autonomous mobile GUI agents. This work is characterized by the following core features:

Realistic Testbed & Benchmark: We introduce MobileRisk-Live, a dynamic sandbox environment for real-time safety studies, and MobileRisk, a benchmark of fine-grained agent trajectories with safety annotations, laying the groundwork for future research.
Novel Hybrid Framework: We propose OS-Sentinel, a hybrid framework that integrates a formal verifier for explicit system-level detection with a model-based contextual judge to handle multifaceted safety challenges.
Multi-Granularity Detection: The framework operates at both the step level, functioning as a real-time safety guard, and the trajectory level, enabling comprehensive post-hoc analysis.
Comprehensive & Effective Evaluation: Extensive experiments validate the superiority of our approach, showing OS-Sentinel consistently surpasses traditional baselines, achieving 10%-30% improvements.

We introduce a study focused on safety-enhanced mobile GUI agents. This work features OS-Sentinel, a novel hybrid detection framework. We also present MobileRisk-Live, a dynamic sandbox environment, and MobileRisk, a corresponding safety benchmark. Together, this suite of tools enables robust evaluation and detection of multifaceted safety risks in realistic mobile workflows.

OS-Sentinel is built around the following components, aiming for the systematic evaluation and detection of safety risks in mobile agents:

  1. §Safety for Mobile GUI Agents: OS-Sentinel addresses the critical, underexplored challenge of agent safety on mobile platforms, tackling multifaceted risks ranging from inadvertent privacy leakage and offensive content to system compromise.
  2. §Dynamic Sandbox Environment: This work provides MobileRisk-Live, a dynamic and extendable Android emulator-based sandbox. It supports real-time agent interaction while recording both GUI observations and underlying system state traces, which are critical for safety analysis.
  3. §Realistic Safety Benchmark: We derive MobileRisk, a benchmark of fine-grained mobile agent trajectories, built from frozen rollouts captured in MobileRisk-Live. These trajectories underwent human inspection and calibration before being annotated at both the step and trajectory level, covering a comprehensive taxonomy of 10 distinct risk categories.
  4. §Hybrid Detection and Validation: We propose a novel hybrid detection framework that combines a Formal Verifier for deterministic system-level violations with a VLM-based Contextual Judge for nuanced, context-dependent risks (see the sketch below). Extensive experiments show this approach achieves 10-30% improvement over baselines, providing critical insights for developing safer mobile agents.
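To make the division of labor concrete, here is a minimal sketch of how such a two-stage check could be composed. Everything in it, including the step layout, the example rule, and the `vlm.ask` interface, is an illustrative assumption rather than the released implementation.

```python
# Minimal sketch of a hybrid safety check: a deterministic formal verifier
# runs first, and a VLM-based contextual judge handles what rules cannot.
# The step layout, the example rule, and `vlm.ask` are assumptions.

def formal_verify(system_trace: list) -> bool:
    """Deterministic, rule-based checks over system-level events."""
    for event in system_trace:
        # Hypothetical rule: granting a sensitive permission is flagged
        # as an explicit system-level violation.
        if event.get("type") == "permission_grant" and event.get("sensitive"):
            return True
    return False

def contextual_judge(vlm, step: dict) -> bool:
    """Model-based judgment for nuanced, context-dependent risks."""
    verdict = vlm.ask(  # hypothetical VLM interface
        image=step["screenshot"],
        question=(
            f"The user asked: {step['instruction']}. "
            f"The agent is about to perform: {step['action']}. "
            "Is this safe in context? Answer SAFE or UNSAFE."
        ),
    )
    return "UNSAFE" in verdict

def hybrid_detect(vlm, step: dict) -> bool:
    """A step is flagged as unsafe if either component detects a risk."""
    if formal_verify(step["system_trace"]):
        return True  # explicit violation, no model call needed
    return contextual_judge(vlm, step)
```

Running the verifier first keeps the common case cheap: deterministic rules short-circuit before any model call, and the contextual judge is reserved for risks that rules cannot express.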


We will release all code for the infrastructure, method, and benchmarking pipelines, along with further details. We hope this work can inspire and boost future research advancing the safety of mobile GUI agents.

2025-10-29 Initial release of our paper, environment, benchmark, and 🌐 Project Website. Check it out! 🚀

Safety Issues of Mobile GUI Agents

Large Language Models (LLMs) and Vision-Language Models (VLMs) have significantly advanced the development of autonomous agents capable of interacting with digital systems. Recent progress in mobile GUI agents, which operate smartphones by perceiving the screen and issuing human-like actions such as taps and text input, has unlocked new possibilities for automating everyday workflows on users' devices. With this autonomy, however, come multifaceted safety risks, ranging from inadvertent privacy leakage and offensive content to system compromise.

These risks are heterogeneous in nature: some leave explicit traces at the system level, while others are context-dependent and only become apparent when an action is judged against the user's intent and the on-screen content. Despite its importance, agent safety on mobile platforms remains critically underexplored. OS-Sentinel targets this gap by pairing a realistic, dynamic testbed with a hybrid detection framework, so that safety risks can be studied in realistic workflows rather than in isolation.

MobileRisk-Live: Testbed for Mobile Agent Safety

MobileRisk-Live is a dynamic, extendable sandbox built on the Android emulator. It supports real-time agent interaction: agents complete tasks just as human users would, by observing the screen and issuing GUI actions against live applications running in a controlled virtual device.

Crucially, the environment records more than what is visible on screen. Alongside GUI observations, it captures the underlying system state traces produced as the agent acts. These traces are critical for safety analysis, since violations such as system compromise manifest at the operating-system level rather than in the UI.
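As a rough illustration of the interaction loop such a sandbox supports, consider the sketch below. The `MobileRiskLive` class, its `reset`/`step` methods, and the `Observation` fields are hypothetical stand-ins, not the released API.

```python
# Hypothetical sketch of a MobileRisk-Live interaction loop; class and
# method names are illustrative assumptions, not the released API.
from dataclasses import dataclass, field

@dataclass
class Observation:
    screenshot: bytes                                 # current screen from the emulator
    system_trace: list = field(default_factory=list)  # OS-level events since the last step

class MobileRiskLive:
    """Hypothetical stand-in for the Android-emulator-based sandbox."""
    def reset(self, task: str) -> Observation:
        # A real implementation would launch the relevant apps and return the first screen.
        return Observation(screenshot=b"")
    def step(self, action: dict) -> Observation:
        # A real implementation would forward the action to the emulator and
        # collect system-level events (file writes, permission grants, intents).
        return Observation(screenshot=b"")

def run_episode(env: MobileRiskLive, agent, task: str, max_steps: int = 30) -> list:
    """Roll out an agent while logging GUI observations, actions, and system traces."""
    obs = env.reset(task)
    trajectory = []
    for _ in range(max_steps):
        action = agent.act(obs)          # e.g. {"type": "tap", "x": 120, "y": 480}
        obs = env.step(action)
        trajectory.append({"action": action, "obs": obs})  # keep everything for safety analysis
        if action.get("type") == "terminate":
            break
    return trajectory
```

The point of the design is that every step stores the action, the GUI observation, and the system trace side by side, so safety detectors can later inspect any single step, or the whole run, without re-executing it.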

MobileRisk: A Benchmark of Realistic Trajectories

MobileRisk is a curated benchmark of fine-grained mobile agent trajectories, derived from frozen rollouts captured in MobileRisk-Live. Every trajectory underwent human inspection and calibration before being annotated with safety labels at both the step level and the trajectory level, covering a comprehensive taxonomy of 10 distinct risk categories.

Because the trajectories are recorded in a live sandbox, each step preserves the full context the agent faced: the task instruction, the GUI observation, the action taken, and the accompanying system state trace. The two annotation granularities support complementary uses: detectors can be evaluated as real-time safety guards that flag unsafe steps as they occur, or as post-hoc auditors that judge completed trajectories.
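To give a sense of what the annotations might look like in practice, here is a hypothetical record layout; the field names are assumptions for illustration, not the released schema.

```python
# Hypothetical layout of a single MobileRisk record; field names are
# illustrative assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StepRecord:
    index: int
    action: dict                         # the GUI action the agent executed
    screenshot_path: str                 # GUI observation captured in the sandbox
    system_trace: list = field(default_factory=list)  # OS-level events for this step
    is_unsafe: bool = False              # step-level safety annotation
    risk_category: Optional[str] = None  # one of the 10 risk categories, if unsafe

@dataclass
class TrajectoryRecord:
    task_instruction: str
    steps: list                          # ordered StepRecord objects
    is_unsafe: bool                      # trajectory-level safety annotation
```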


Evaluations

Main Settings

Evaluated on MobileRisk, OS-Sentinel consistently surpasses traditional baselines, achieving 10%-30% improvements in detecting unsafe agent behaviors; detailed results are reported in the paper.

Complementary strengths: Explicit system-level violations leave deterministic traces in the recorded system state, which the Formal Verifier catches reliably and cheaply. Nuanced, context-dependent risks (for instance, whether sharing a particular piece of information constitutes privacy leakage) depend on the user's intent and the on-screen content, and are handled by the VLM-based Contextual Judge.

Impact of signals: Grounding detection in system state traces, multimodal content, and agent actions together offers broader and deeper coverage than prior approaches; violations that never surface in the GUI can still be caught at the system level.

Step-Level Guarding and Trajectory-Level Auditing

Two operating modes: At the step level, OS-Sentinel serves as a real-time safety guard, flagging unsafe actions as the agent executes them. At the trajectory level, it performs comprehensive post-hoc analysis of completed runs, drawing on the full sequence of observations, actions, and system traces.

Implication: The same hybrid detector can be deployed both online, to intervene before harm occurs, and offline, to audit agent behavior at scale. This combination provides critical insights for developing safer mobile agents; a minimal sketch of the two modes follows.
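The two granularities reduce to a small amount of glue around any step-level detector. The sketch below assumes the hybrid check sketched earlier and a simple aggregation rule (a trajectory is unsafe if any step is flagged), which is an assumption for illustration rather than the paper's exact criterion.

```python
# Minimal sketch of the two detection granularities. `detect_step` is any
# step-level detector, e.g. the hybrid check above; the any-step
# aggregation rule is a simplifying assumption.

def guard(detect_step, step) -> bool:
    """Step level: run as a real-time guard and block flagged actions."""
    return detect_step(step)  # True => intervene before the action executes

def audit(detect_step, trajectory) -> bool:
    """Trajectory level: post-hoc analysis over a completed run."""
    return any(detect_step(step) for step in trajectory)
```

Online guarding must be fast enough to sit in the agent's action loop, while offline auditing can afford heavier analysis over the full trajectory, which is why the framework exposes both granularities.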

Conclusion

In this work, we conduct a comprehensive study of safety issues in mobile GUI agents. To support realistic safety research, we introduce MobileRisk-Live and MobileRisk, which provide a dynamic sandbox environment and a benchmark with fine-grained annotations, thereby enabling general-purpose and reproducible evaluation. We then propose OS-Sentinel, a novel hybrid detection framework that unifies deterministic verification with contextual risk assessment based on system state traces, multimodal contents, and agent actions, offering broader and deeper coverage than prior approaches. Extensive experiments and analyses demonstrate the value and reliability of our newly proposed testbeds and strategies. By contributing infrastructure, methodology, and empirical insights, this work establishes a new paradigm and moves the field toward safety-enhanced mobile GUI agents.

Acknowledgement

We would like to thank the AndroidWorld and MobileSafetyBench authors for providing infrastructure and baselines, as well as the Cambrian authors for providing this webpage template.

BibTeX

@article{sun2025sentinel,
   title={OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows},
   author={Qiushi Sun and Mukai Li and Zhoumianze Liu and Zhihui Xie and Fangzhi Xu and Zhangyue Yin and Kanzhi Cheng and Zehao Li and Zichen Ding and Qi Liu and Zhiyong Wu and Zhuosheng Zhang and Ben Kao and Lingpeng Kong},
   year={2025},
   journal={arXiv preprint arXiv:2510.24411}
}