Evaluating Multimodal Autonomous Agents
in Realistic Scientific Workflows
Introducing ScienceBoard, a first-of-its-kind evaluation platform for multimodal agents in
scientific workflows.
ScienceBoard is characterized by the following core features:
Pioneering Application: ScienceBoard is the first to bring computer-using agents
into the domain of scientific discovery, enabling autonomous research assistants across disciplines.
Realistic Environment: We provide a dynamic, visually grounded virtual environment
integrated with professional scientific software, supporting both GUI and CLI interaction in real-time
workflows.
Challenging Benchmark: We introduce a new benchmark of 169 rigorously validated tasks across 6
core domains, capturing real-world challenges.
Comprehensive Evaluations: We present systematic evaluations across a wide range of
agents powered by LLMs, VLMs, and GUI action models.
We introduce ScienceBoard, a realistic and multimodal environment designed to evaluate and advance computer-using
agents for scientific discovery. By integrating domain-specific software and curating a benchmark of validated
workflows, ScienceBoard enables rigorous assessment of agents’ abilities to operate in real scientific settings.
ScienceBoard is built around the following components:
§Computer-using Agents for Scientific Discovery:
ScienceBoard pioneers the application of computer-using agents to real-world scientific discovery, enabling
digital automation for complex science tasks.
§Infra for Scientific Discovery: ScienceBoard provides a
visually rich and dynamically configurable Ubuntu-based VM that serves as the scientific playground, supporting both
GUI and CLI interactions and enabling agents to perform end-to-end tasks such as simulation and computation.
§Challenging Benchmark: We curate a benchmark of 169
human-annotated tasks across 6 scientific domains, each paired with automated evaluation functions for reliable
task validation.
§Systematic Evaluation and Analysis: Comprehensive
experiments are conducted with state-of-the-art LLMs, VLMs, and GUI action models. We further provide insights
into current agent limitations and directions for future research.
We will release all code for the infrastructure, benchmark, and evaluation pipelines, along with further
details. We hope ScienceBoard can inspire and accelerate future research on computer-using agents in scientific
workflows.
Computer-using Agents for Science
Large Language Models (LLMs) and Vision-Language Models (VLMs) have significantly advanced the development of
autonomous agents capable of interacting with digital systems. Recent progress in computer-using agents—agents
that operate through GUI and CLI interfaces—has unlocked new possibilities for automating complex workflows in
domains like software engineering and web navigation.
ScienceBoard is the first to extend this paradigm to scientific discovery, a domain where agents must not only
operate software, but also demonstrate domain understanding, precise control, and multi-step reasoning. From
simulating planetary motion in Celestia to manipulating molecular structures in ChimeraX, scientific tasks require
agents to execute precise interactions grounded in both visual perception and scientific knowledge.
Computer-using agents in ScienceBoard can emulate how human scientists actually operate complex tools.
This foundation sets the stage for developing AI co-scientists that can assist with, or even automate, the scientific
research lifecycle.
ScienceBoard Environment
The ScienceBoard Environment offers a dynamic, multimodal platform where agents interact with real scientific
software in a virtualized desktop. Built on an Ubuntu-based virtual machine, this environment integrates graphical
and command-line interfaces, enabling agents to complete tasks just as human researchers would—by clicking,
typing, or issuing terminal commands.
Each application is carefully integrated and adapted to expose its internal state, allowing for fine-grained
state tracking and automatic evaluation. Agents perceive the environment through textual (a11ytree), visual
(screenshot), or hybrid observation modalities, and operate via a unified action space covering GUI actions, CLI
commands, and external API calls.
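As a rough illustration of this interaction loop (the class and method names below are hypothetical stand-ins, not the actual ScienceBoard interface), an episode could be sketched as follows:

# Hypothetical sketch of an agent-environment loop; class and method names
# are illustrative assumptions, not the released ScienceBoard API.
from dataclasses import dataclass, field

@dataclass
class Observation:
    screenshot: bytes = b""      # raw screenshot of the VM desktop (visual modality)
    a11y_tree: str = ""          # serialized accessibility tree (textual modality)
    terminal: str = ""           # latest CLI output, if any

@dataclass
class Action:
    kind: str                    # "gui", "cli", "api", or "done"
    payload: dict = field(default_factory=dict)

class FakeScienceEnv:
    """Stand-in for a VM-backed environment exposing GUI/CLI/API actions."""
    def reset(self, task_id: str) -> Observation:
        return Observation(a11y_tree="<desktop/>")
    def step(self, action: Action) -> Observation:
        # A real environment would click, type, or run a command inside the
        # Ubuntu VM and return the refreshed observation.
        return Observation(a11y_tree="<desktop state='updated'/>")
    def evaluate(self) -> bool:
        # Task-specific checker reading the application's exposed internal state.
        return False

def run_episode(env: FakeScienceEnv, policy, task_id: str, max_steps: int = 15) -> bool:
    obs = env.reset(task_id)
    for _ in range(max_steps):
        action = policy(obs)                 # agent decides from screenshot/a11ytree
        if action.kind == "done":
            break
        obs = env.step(action)
    return env.evaluate()

# Trivial policy used only to make the sketch runnable end to end.
def trivial_policy(obs: Observation) -> Action:
    return Action(kind="done")

if __name__ == "__main__":
    print("task solved:", run_episode(FakeScienceEnv(), trivial_policy, task_id="demo"))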
ScienceBoard Benchmark
The ScienceBoard Benchmark is a curated suite of 169 high-quality tasks spanning six scientific domains, designed
to rigorously evaluate the capabilities of computer-using agents in realistic research scenarios. Each task
reflects real-world challenges encountered in scientific workflows, ranging from molecular structure analysis to
geospatial modeling and formal proof construction.
Task Type                                   Statistics
Total Tasks                                 169 (100%)
- GUI                                       38 (22.5%)
- CLI                                       33 (19.5%)
- GUI + CLI                                 98 (58.0%)
Difficulty
- Easy                                      91 (53.8%)
- Medium                                    48 (28.4%)
- Hard                                      28 (16.6%)
- Open Problems                             2 (1.2%)
Instructions
- Average length of task instructions       20.0
- Average length of agentic prompts         374.9
Execution
- Average steps                             9.0
- Average time consumption                  124 s
Table: Statistics of ScienceBoard
Each task is validated and categorized by interface modality (GUI, CLI, or hybrid) and difficulty level, ensuring
both diversity and challenge. Notably, ScienceBoard supports cross-application workflows, such as generating
reports in TeXstudio based on prior analysis in ChimeraX—enabling end-to-end scientific automation.
The benchmark pushes beyond QA or code generation, setting a new bar for evaluating agents on tool use, coding,
visual/textual reasoning, and domain-specific knowledge in real research contexts.
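To make the automated validation concrete, here is a hedged sketch of what a task entry paired with an evaluation function could look like; the field names, task id, and the expected-state values are illustrative assumptions, not the released task schema:

# Illustrative task entry and checker; field names and values are assumptions,
# not the actual ScienceBoard task format.
import json

task = {
    "id": "chimerax_open_structure_demo",   # hypothetical task id
    "domain": "biochemistry",
    "interface": "GUI + CLI",
    "difficulty": "easy",
    "instruction": "Open the PDB structure 1ubq in ChimeraX and color it by chain.",
    "evaluator": {"type": "state_match",
                  "expected": {"opened_models": ["1ubq"], "coloring": "bychain"}},
}

def evaluate(final_state: dict, evaluator: dict) -> bool:
    """Compare the application's exposed internal state against the expected values."""
    if evaluator["type"] != "state_match":
        raise ValueError(f"unknown evaluator type: {evaluator['type']}")
    expected = evaluator["expected"]
    return all(final_state.get(key) == value for key, value in expected.items())

# Example: a state snapshot the environment might expose after the episode.
final_state = {"opened_models": ["1ubq"], "coloring": "bychain"}
print(json.dumps(task, indent=2))
print("success:", evaluate(final_state, task["evaluator"]))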
Evaluations
Main Settings
Even state-of-the-art models like GPT-4o and Claude achieve only ~15% average success on ScienceBoard.
Open-source agents perform slightly worse, often dropping below 12%, and sometimes approaching 0% in specific
task categories—highlighting a significant gap compared to human performance.
The results are shown in Table 1.
Domain-Specific Challenges: Agents perform relatively well on algebra and biochemistry tasks,
but struggle significantly in geospatial and astronomy domains. This gap stems from two factors: (1) GUI-heavy
interactions in GIS/astronomy are harder to ground visually than CLI-based tasks, and (2) these domains feature
dense, complex visuals (e.g., maps, star charts) that strain current models' spatial reasoning abilities.
Impact of Observations: Multimodal input improves performance. The best results come from
combining screenshots with a11ytree representations, offering both visual grounding and structured element data.
In contrast, Set-of-Mark (SoM) sometimes introduces noise, especially in visually crowded interfaces like
Celestia.
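The observation ablation can be pictured as a choice of how the model input is assembled. The sketch below shows the textual and visual modalities being combined into a single multimodal message; the helper name and message layout follow common chat-API conventions and are assumptions, not the exact prompt format used in the paper:

# Hypothetical assembly of observation modalities into one model input.
import base64

def build_messages(instruction: str, screenshot_png: bytes | None,
                   a11y_tree: str | None) -> list[dict]:
    content = [{"type": "text", "text": f"Task: {instruction}"}]
    if a11y_tree is not None:                      # textual modality
        content.append({"type": "text",
                        "text": f"Accessibility tree:\n{a11y_tree}"})
    if screenshot_png is not None:                 # visual modality
        b64 = base64.b64encode(screenshot_png).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]

msgs = build_messages("Center the view on Mars in Celestia",
                      screenshot_png=b"\x89PNG...",        # placeholder bytes
                      a11y_tree="<window name='Celestia'>...</window>")
print(len(msgs[0]["content"]), "content blocks")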
Disentangled Planning and Action
We further evaluate a disentangled setting in which a general-purpose model handles high-level planning while a
specialized model executes the low-level actions.
The results are shown in Table 2.
Modular Design Boosts Performance: ScienceBoard experiments reveal that separating planning
from execution significantly improves performance. When GPT-4o is used solely as a planner and paired with a
specialized VLM or GUI action model as the executor, success rates increase notably across domains.
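A minimal sketch of this planner-executor split is shown below; the planner and grounding functions are stubbed stand-ins (their interfaces are assumptions), meant only to illustrate how the two roles hand off to each other:

# Minimal sketch of separating high-level planning from low-level grounding;
# both models are stubbed out and their interfaces are assumptions.
from typing import Callable

def plan(instruction: str, observation: str) -> list[str]:
    """Planner (e.g., a general VLM) decomposes the task into natural-language subgoals."""
    return [f"locate the relevant menu for: {instruction}",
            "perform the action and verify the result"]

def ground(subgoal: str, observation: str) -> str:
    """Executor (e.g., a GUI action model) maps one subgoal to a concrete action string."""
    return f"click(element_matching='{subgoal[:30]}')"

def run_disentangled(instruction: str, get_observation: Callable[[], str],
                     execute: Callable[[str], None]) -> None:
    obs = get_observation()
    for subgoal in plan(instruction, obs):
        action = ground(subgoal, obs)    # grounding is delegated to the specialized model
        execute(action)
        obs = get_observation()          # refresh state before the next subgoal

# Toy usage with in-memory stand-ins for the environment hooks.
log: list[str] = []
run_disentangled("color the protein by chain in ChimeraX",
                 get_observation=lambda: "<a11y tree snapshot>",
                 execute=log.append)
print("\n".join(log))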
Implication: These findings highlight the potential of building multi-agent systems where
different components specialize in distinct subtasks—planning, grounding, or domain understanding—paving the way
for scalable and adaptable scientific agents.
Conclusion
ScienceBoard represents a major step toward building intelligent, computer-using agents for real scientific
workflows. By combining a realistic, multimodal environment with a challenging benchmark grounded in domain
expertise, ScienceBoard enables rigorous evaluation of agents in tasks far beyond static QA or code generation.
Our findings reveal that even top-tier models fall short of human performance, especially in visually complex or
domain-specific scenarios. However, modular designs, multimodal input, and agent specialization show promising
gains—pointing to a path forward. ScienceBoard lays the foundation for the next generation of AI research
assistants. We invite the community to explore, evaluate, and build upon this platform to accelerate progress
toward truly autonomous scientific discovery.
Acknowledgement
We would like to thank the OSWorld authors for
helping us tackle various issues in building the infrastructure and task evaluation, as well as the Cambrian authors for providing this webpage
template.
BibTeX
@article{sun2025scienceboard,
  title={ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows},
  author={Qiushi Sun and Zhoumianze Liu and Chang Ma and Zichen Ding and Fangzhi Xu and Zhangyue Yin
          and Haiteng Zhao and Zhenyu Wu and Kanzhi Cheng and Zhaoyang Liu and others},
  year={2025},
  journal={arXiv preprint arXiv:2505.19897}
}