ScienceBoard

Evaluating Multimodal Autonomous Agents
in Realistic Scientific Workflows

Introducing ScienceBoard, a first-of-its-kind evaluation platform for multimodal agents in scientific workflows. ScienceBoard is characterized by the following core features:

Pioneering Application: ScienceBoard is the first to bring computer-using agents into the domain of scientific discovery, enabling autonomous research assistants across disciplines.
Realistic Environment: We provide a dynamic, visually grounded virtual environment integrated with professional scientific software, supporting both GUI and CLI interaction in real-time workflows.
Challenging Benchmark: A new benchmark of 169 rigorously validated tasks across 6 core domains is introduced, capturing real-world challenges.
Comprehensive Evaluations: We present systematic evaluations across a wide range of agents powered by LLMs, VLMs, and GUI action models.

We introduce ScienceBoard, a realistic and multimodal environment designed to evaluate and advance computer-using agents for scientific discovery. By integrating domain-specific software and curating a benchmark of validated workflows, ScienceBoard enables rigorous assessment of agents’ abilities to operate in real scientific settings.

ScienceBoard is built around the following components, which together support rigorous evaluation of computer-using agents in scientific workflows:

  1. §Computer-using Agents for Scientific Discovery: ScienceBoard pioneers the application of computer-using agents to real-world scientific discovery, enabling digital automation for complex science tasks.
  2. §Infra for Scientific Discovery: ScienceBoard provides a visually rich and dynamically configurable Ubuntu-based VM that serves as the scientific playground, supporting both GUI and CLI interactions and enabling agents to perform end-to-end tasks such as simulation and computation.
  3. §Challenging Benchmark: We curate a benchmark of 169 human-annotated tasks across 6 scientific domains, each paired with automated evaluation functions for reliable task validation.
  4. §Systematic Evaluation and Analysis: Comprehensive experiments are conducted with state-of-the-art LLMs, VLMs, and GUI action models. We further provide insights into current agent limitations and directions for future research.


We will release all code for the infrastructure, the benchmark, and the evaluation pipelines, along with further details. We hope ScienceBoard inspires and accelerates future research on computer-using agents in scientific workflows.

Computer-using Agents for Science

Large Language Models (LLMs) and Vision-Language Models (VLMs) have significantly advanced the development of autonomous agents capable of interacting with digital systems. Recent progress in computer-using agents—agents that operate through GUI and CLI interfaces—has unlocked new possibilities for automating complex workflows in domains like software engineering and web navigation. ScienceBoard is the first to extend this paradigm to scientific discovery, a domain where agents must not only operate software, but also demonstrate domain understanding, precise control, and multi-step reasoning. From simulating planetary motion in Celestia to manipulating molecular structures in ChimeraX, scientific tasks require agents to execute precise interactions grounded in both visual perception and scientific knowledge.

Computer-using agents in ScienceBoard simulate the real behavior of human scientists operating complex tools. This foundation sets the stage for developing AI co-scientists that can assist with, or even automate, the scientific research lifecycle.

ScienceBoard Environment

The ScienceBoard Environment offers a dynamic, multimodal platform where agents interact with real scientific software in a virtualized desktop. Built on an Ubuntu-based virtual machine, this environment integrates graphical and command-line interfaces, enabling agents to complete tasks just as human researchers would—by clicking, typing, or issuing terminal commands.

Each application is carefully integrated and adapted to expose its internal state, allowing fine-grained state tracking and automatic evaluation. Agents perceive the environment through textual (a11ytree), visual (screenshot), or hybrid observation modalities, and operate via a unified action space covering GUI actions, CLI commands, and external API calls.
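
To make the interface concrete, the snippet below sketches what one interaction loop over such an environment could look like in Python. The class and field names (env.reset, obs["screenshot"], obs["a11ytree"], step_gui, step_cli, evaluate) are illustrative assumptions for exposition, not the released ScienceBoard API.

    # Illustrative sketch of an agent-environment loop; all names here are
    # assumptions for exposition, not the released ScienceBoard API.
    def run_episode(env, agent, task_config, max_steps=15):
        obs = env.reset(task_config)           # boot the VM snapshot prepared for this task
        for _ in range(max_steps):
            # Hybrid observation: screenshot (pixels) plus a11ytree (structured text)
            screenshot = obs["screenshot"]
            a11ytree = obs["a11ytree"]

            # The agent returns one action from the unified action space:
            # a GUI event, a CLI command, or a termination signal.
            action = agent.act(screenshot=screenshot, a11ytree=a11ytree)

            if action["type"] == "gui":         # e.g. {"type": "gui", "op": "click", "x": 412, "y": 233}
                obs = env.step_gui(action)
            elif action["type"] == "cli":       # e.g. {"type": "cli", "command": "gmt begin map png"}
                obs = env.step_cli(action["command"])
            elif action["type"] == "done":
                break

        # Automatic evaluation reads the exposed internal application state.
        return env.evaluate(task_config)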

ScienceBoard Benchmark

The ScienceBoard Benchmark is a curated suite of 169 high-quality tasks spanning six scientific domains, designed to rigorously evaluate the capabilities of computer-using agents in realistic research scenarios. Each task reflects real-world challenges encountered in scientific workflows, ranging from molecular structure analysis to geospatial modeling and formal proof construction.

Task Type
  Total Tasks                           169 (100%)
  - GUI                                  38 (22.5%)
  - CLI                                  33 (19.5%)
  - GUI + CLI                            98 (58.0%)
Difficulty
  - Easy                                 91 (53.8%)
  - Medium                               48 (28.4%)
  - Hard                                 28 (16.6%)
  - Open Problems                         2 (1.2%)
Instructions
  Average length of task instructions    20.0
  Average length of agentic prompts     374.9
Execution
  Average steps                           9.0
  Average time consumption              124 s
Table: Statistics of ScienceBoard

Each task is validated and categorized by interface modality (GUI, CLI, or hybrid) and difficulty level, ensuring both diversity and challenge. Notably, ScienceBoard supports cross-application workflows, such as generating reports in TeXstudio based on prior analysis in ChimeraX—enabling end-to-end scientific automation. The benchmark pushes beyond QA or code generation, setting a new bar for evaluating agents on tool use, coding, visual/textual reasoning, and domain-specific knowledge in real research contexts.
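
As a rough illustration of how a validated task pairs with its automated evaluation function, consider the sketch below. The task fields and the get_app_state accessor are hypothetical stand-ins, not the actual benchmark schema.

    # Hypothetical shape of a benchmark task and its evaluation function.
    # Field names and the state accessor are illustrative, not the real schema.
    task = {
        "id": "chimerax-0042",
        "domain": "biochemistry",
        "interface": "GUI+CLI",
        "difficulty": "medium",
        "instruction": "Open structure 1ABC and color all beta sheets red.",
    }

    def evaluate(env, task):
        """Return 1.0 if the exposed application state satisfies the goal, else 0.0."""
        state = env.get_app_state("chimerax")   # internal state exposed by the adapted app
        sheets = [r for r in state["residues"] if r["secondary_structure"] == "sheet"]
        return float(bool(sheets) and all(r["color"] == "red" for r in sheets))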


Evaluations

Main Settings

Even state-of-the-art models like GPT-4o and Claude achieve only around 15% average success on ScienceBoard. Open-source agents perform slightly worse, often dropping below 12% and sometimes approaching 0% in specific task categories, highlighting a significant gap compared to human performance. The results are shown in Table 1.

Domain-Specific Challenges: Agents perform relatively well on algebra and biochemistry tasks, but struggle significantly in geospatial and astronomy domains. This gap stems from two factors: (1) GUI-heavy interactions in GIS/astronomy are harder to ground visually than CLI-based tasks, and (2) these domains feature dense, complex visuals (e.g., maps, star charts) that strain current models' spatial reasoning abilities.

Impact of Observations: Multimodal input improves performance. The best results come from combining screenshots with a11ytree representations, offering both visual grounding and structured element data. In contrast, Set-of-Mark (SoM) sometimes introduces noise, especially in visually crowded interfaces like Celestia.
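
A minimal sketch of how such a hybrid observation might be assembled into a multimodal prompt is shown below, assuming an OpenAI-style chat message format; the pruning heuristic and field names are illustrative assumptions rather than the actual prompt construction.

    # Sketch: combine a screenshot with a pruned a11ytree in one multimodal message.
    # Message layout follows the common OpenAI-style chat format; all field names
    # on the a11ytree nodes are assumptions for exposition.
    import base64

    def build_messages(instruction, screenshot_png: bytes, a11ytree_nodes):
        # Keep only visible, interactable elements to limit prompt length.
        elements = [
            f'[{i}] {n["role"]} "{n["name"]}" @ {n["bbox"]}'
            for i, n in enumerate(a11ytree_nodes)
            if n.get("visible") and n["role"] in {"button", "menu item", "text field"}
        ]
        image_b64 = base64.b64encode(screenshot_png).decode()
        return [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction + "\n\nUI elements:\n" + "\n".join(elements)},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }]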

Disentangled Planning and Action

We further evaluate a disentangled setting in which a general-purpose model handles planning while a specialized model executes the actions. The results are shown in Table 2.

Modular Design Boosts Performance: ScienceBoard experiments reveal that separating planning from execution significantly improves performance. When GPT-4o is used solely as a planner and paired with a specialized VLM or GUI action model as the executor, success rates increase notably across domains.
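
The sketch below illustrates this disentangled setup under assumed planner and executor interfaces (generate and ground are hypothetical wrappers, not a specific model API): the generalist model proposes a natural-language sub-goal, and the GUI action model grounds it into a concrete action.

    # Minimal sketch of the disentangled planner/executor step; both model
    # wrappers and their method names are assumptions for exposition.
    def disentangled_step(planner, executor, obs, instruction, history):
        # 1) Planning: the strong generalist model reasons over the task and history.
        plan = planner.generate(
            instruction=instruction,
            screenshot=obs["screenshot"],
            a11ytree=obs["a11ytree"],
            history=history,
        )  # e.g. "Open the File menu and choose 'Save Image As...'"

        # 2) Acting: the GUI action model only has to ground the sub-goal
        #    into coordinates or keystrokes, which it is specialized for.
        action = executor.ground(sub_goal=plan, screenshot=obs["screenshot"])

        history.append((plan, action))
        return action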

Implication: These findings highlight the potential of building multi-agent systems where different components specialize in distinct subtasks—planning, grounding, or domain understanding—paving the way for scalable and adaptable scientific agents.

Conclusion

ScienceBoard represents a major step toward building intelligent, computer-using agents for real scientific workflows. By combining a realistic, multimodal environment with a challenging benchmark grounded in domain expertise, ScienceBoard enables rigorous evaluation of agents in tasks far beyond static QA or code generation. Our findings reveal that even top-tier models fall short of human performance, especially in visually complex or domain-specific scenarios. However, modular designs, multimodal input, and agent specialization show promising gains—pointing to a path forward. ScienceBoard lays the foundation for the next generation of AI research assistants. We invite the community to explore, evaluate, and build upon this platform to accelerate progress toward truly autonomous scientific discovery.

Acknowledgement

We would like to thank the OSWorld authors for helping us tackle various issues in building the infrastructure and task evaluation, as well as the Cambrian authors for providing this webpage template.

BibTeX

@article{sun2025scienceboard,
   title={ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows},
   author={Qiushi Sun and Zhoumianze Liu and Chang Ma and Zichen Ding and Fangzhi Xu and Zhangyue Yin and Haiteng Zhao and Zhenyu Wu and Kanzhi Cheng and Zhaoyang Liu and others},
   year={2025},
   journal={arXiv preprint arXiv:2505.19897}
}