We introduce ScienceBoard, a realistic and multimodal environment designed to evaluate and advance computer-using agents for scientific discovery. By integrating domain-specific software and curating a benchmark of validated workflows, ScienceBoard enables rigorous assessment of agents’ abilities to operate in real scientific settings.

ScienceBoard is built around the following components, aiming for synthesizing high-quality trajectory data for GUI agents:

We will release all codes for infra, benchmark evaluation pipelines and more details. We hope ScienceBoard can inspire and boost future research advancing computer-using agents in scientific workflows.

2025-08-28 We release CODA, a dual-brain agent that achieves SOTA on ScienceBoard 🧠.

2025-06-30 We release new evaluation results (GUI-Actor, UI-TARS-1.5) and agent trajectories. 🎊

2025-06-08 ScienceBoard will be presented at WCUA@ICML 2025 as an oral paper! 🚀

2025-05-27 Initial release of our paper, environment, benchmark, and 🌐 Project Website. Check it out! 🚀

Computer-using Agents for Science

Large Language Models (LLMs) and Vision-Language Models (VLMs) have significantly advanced the development of autonomous agents capable of interacting with digital systems. Recent progress in computer-using agents—agents that operate through GUI and CLI interfaces—has unlocked new possibilities for automating complex workflows in domains like software engineering and web navigation. ScienceBoard is the first to extend this paradigm to scientific discovery, a domain where agents must not only operate software, but also demonstrate domain understanding, precise control, and multi-step reasoning. From simulating planetary motion in Celestia to manipulating molecular structures in ChimeraX, scientific tasks require agents to execute precise interactions grounded in both visual perception and scientific knowledge.

computer-using agents in ScienceBoard can simulate the real behavior of human scientists operating complex tools. This foundation sets the stage for developing AI co-scientists that can assist, or even automate the scientific research lifecycle.

ScienceBoard Environment

The ScienceBoard Environment offers a dynamic, multimodal platform where agents interact with real scientific software in a virtualized desktop. Built on an Ubuntu-based virtual machine, this environment integrates graphical and command-line interfaces, enabling agents to complete tasks just as human researchers would—by clicking, typing, or issuing terminal commands.

Each software is carefully integrated and adapted to expose internal application states, allowing for fine-grained state tracking and automatic evaluation. Agents perceive the environment through textual (a11ytree), visual (screenshot), or hybrid observation modalities, and operate via a unified action space covering GUI actions, CLI commands, and external API calls.

ScienceBoard Benchmark

The ScienceBoard Benchmark is a curated suite of 169 high-quality tasks spanning six scientific domains, designed to rigorously evaluate the capabilities of computer-using agents in realistic research scenarios. Each task reflects real-world challenges encountered in scientific workflows, ranging from molecular structure analysis to geospatial modeling and formal proof construction.

Task Type	Statistics
Total Tasks	169 (100%)
- GUI	38 (22.5%)
- CLI	33 (19.5%)
- GUI + CLI	98 (58.0%)
Difficulty
- Easy	91 (53.8%)
- Medium	48 (28.4%)
- Hard	28 (16.6%)
- Open Problems	2 (1.2%)
Instructions
Average length of task instructions	20.0
Average length of agentic prompts	374.9
Execution
Average steps	9.0
Average time comsumption	124(s)

Table: Statistics of ScienceBoard

Each task is validated and categorized by interface modality (GUI, CLI, or hybrid) and difficulty level, ensuring both diversity and challenge. Notably, ScienceBoard supports cross-application workflows, such as generating reports in TeXstudio based on prior analysis in ChimeraX—enabling end-to-end scientific automation. The benchmark pushes beyond QA or code generation, setting a new bar for evaluating agents on tool use, coding, visual/textual reasoning, and domain-specific knowledge in real research contexts.

Evaluations

Main Settings

Even state-of-the-art models like GPT-4o and Claude achieve only ~15% average success on ScienceBoard. Open-source agents perform slightly worse, often dropping below 12%, and sometimes approaching 0% in specific task categories—highlighting a significant gap compared to human performance. The results are shown in Table 1

Domain-Specific Challenges: Agents perform relatively well on algebra and biochemistry tasks, but struggle significantly in geospatial and astronomy domains. This gap stems from two factors: (1) GUI-heavy interactions in GIS/astronomy are harder to ground visually than CLI-based tasks, and (2) these domains feature dense, complex visuals (e.g., maps, star charts) that strain current models' spatial reasoning abilities.

Impact of Observations: Multimodal input improves performance. The best results come from combining screenshots with a11ytree representations, offering both visual grounding and structured element data. In contrast, Set-of-Mark (SoM) sometimes introduces noise, especially in visually crowded interfaces like Celestia.

Disentangled Planning and Action

Modular Design Boosts Performance: ScienceBoard experiments reveal that separating planning from execution significantly improves performance. When GPT-4o is used solely as a planner and paired with a specialized VLM or GUI action model like OS-ATLAS or UI-TARS as the executor, success rates increase notably across domains.

Implication: These findings highlight the potential of building multi-agent systems where different components specialize in distinct subtasks—planning, grounding, or domain understanding—paving the way for scalable and adaptable scientific agents.

Conclusion

ScienceBoard represents a major step toward building intelligent, computer-using agents for real scientific workflows. By combining a realistic, multimodal environment with a challenging benchmark grounded in domain expertise, ScienceBoard enables rigorous evaluation of agents in tasks far beyond static QA or code generation. Our findings reveal that even top-tier models fall short of human performance, especially in visually complex or domain-specific scenarios. However, modular designs, multimodal input, and agent specialization show promising gains—pointing to a path forward. ScienceBoard lays the foundation for the next generation of AI research assistants. We invite the community to explore, evaluate, and build upon this platform to accelerate progress toward truly autonomous scientific discovery.

ScienceBoard

Evaluating Multimodal Autonomous Agents
in Realistic Scientific Workflows

Computer-using Agents for Science

ScienceBoard Environment

ScienceBoard Benchmark

Evaluations

Main Settings

Disentangled Planning and Action

Conclusion

Acknowledgement

BibTeX

ScienceBoard

Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Computer-using Agents for Science

ScienceBoard Environment

ScienceBoard Benchmark

Evaluations

Main Settings

Disentangled Planning and Action

Conclusion

Acknowledgement

BibTeX

Evaluating Multimodal Autonomous Agents
in Realistic Scientific Workflows