MechEvalAgent: Grounded Evaluation of Research Agents in Mechanistic Interpretability

<aside> 💡

We introduce MechEvalAgents as the first step towards rethinking research evaluation.

</aside>

Academic peer review still treats the paper as the single source of truth. Reviewers read the narrative, comment on the narrative, and score the narrative, even when the underlying code is right there. But science is more than a story. It’s something we can test. In this work, we aim to verify the science, not just the story.

This is especially relevant as AI agents are increasingly used in scientific research, assisting with idea generation, experiment design, and even writing full papers. Events like the Agent4Science conference have gone further, requiring researchers to collaborate with AI systems and even listing AI as the first author of submitted work. We need new ways of evaluating research in the era of AI agents.

Several recent works tackle this problem. For example, ResearchRubrics introduces a benchmark of prompts and fine-grained rubrics for evaluating deep research agents on the quality of their long-form answers. DeepResearch Bench presents 100 PhD-level tasks across 22 domains, focused on report generation and retrieval/citation quality. While these frameworks mark important progress, they still evaluate agents primarily at the knowledge level, judging the final written output rather than validating the full research trace from hypothesis to experiment to code to result to conclusion.

In this work, we address this bottleneck in two steps. First, we propose a standardized research-output format for AI research agents, so their work can be inspected and compared consistently. Second, we introduce a grounded, automatic evaluation pipeline that assesses the coherence of the experimental process, the reproducibility of results, and the generalizability of the agent’s scientific understanding in mechanistic interpretability. Finally, we present case studies revealing common failure modes in current research agents and discuss the remaining limitations of automated evaluation.

Created with the help of Gemini and ChatGPT

A Standard for Unified Research Agent Outputs

If we want to evaluate research agents systematically, we first need to make their work comparable (in fact, this applies to human research outputs as well). At the moment, different agents produce very different kinds of outputs. Some write paper-style reports; others generate GitHub repositories or scattered Python scripts. Each experiment ends up looking unique, which makes it difficult to tell whether two agents truly reached the same insight or just presented it differently. Without a shared structure, evaluation becomes driven more by presentation style than by scientific substance.

We argue that research agents should produce a unified set of outputs, organized around the same scientific reasoning process that humans follow. Humans typically present their work in the “rational reconstruction” style, which shows only an idealized history of the exploration. Research agents, however, naturally record the entire research process, including detours that do not directly lead to the answer. These detours are still valuable for evaluation because they reveal whether the agent can update and refine its hypotheses based on empirical findings. Therefore, we argue that a research trace should include:

Plans outlining the hypothesis, methodology, and expected outcomes, along with how these evolve.
Code Implementation that executes the plan and produces interpretable intermediate outputs.
Code Walkthrough explaining how the code works and how to run it.
Research Report documenting the goal, data, methods, results, analysis, and final conclusions.
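
As a rough illustration of what such a trace could look like in machine-readable form, here is a minimal Python sketch. The class names, field names, and the `missing_components` helper are all hypothetical and chosen only for this example; they are not the format MechEvalAgents actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class PlanVersion:
    """One snapshot of the plan; later versions record how it evolved."""
    hypothesis: str
    methodology: str
    expected_outcomes: str
    revision_note: str = ""  # why the plan changed (empty for the first version)


@dataclass
class ResearchTrace:
    """Hypothetical container for the four components listed above."""
    plans: list[PlanVersion] = field(default_factory=list)
    code_files: list[str] = field(default_factory=list)  # paths to the implementation
    code_walkthrough: str = ""  # how the code works and how to run it
    research_report: str = ""   # goal, data, methods, results, analysis, conclusions

    def missing_components(self) -> list[str]:
        """Return the names of components that are still empty."""
        checks = {
            "plans": bool(self.plans),
            "code_files": bool(self.code_files),
            "code_walkthrough": bool(self.code_walkthrough.strip()),
            "research_report": bool(self.research_report.strip()),
        }
        return [name for name, present in checks.items() if not present]


# Example: a trace with two plan versions but no report yet.
trace = ResearchTrace(
    plans=[
        PlanVersion("Head X copies tokens", "Ablate head X", "Accuracy drops"),
        PlanVersion(
            "Heads X and Y jointly copy tokens",
            "Ablate both heads",
            "Accuracy drops only under joint ablation",
            revision_note="Single-head ablation had little effect",
        ),
    ],
    code_files=["run_experiment.py"],
    code_walkthrough="Run `python run_experiment.py` to reproduce the ablations.",
)
print(trace.missing_components())  # ['research_report']
```

Keeping every plan version, rather than only the final one, is what lets an evaluator see the detours and check whether the agent revised its hypothesis in response to evidence.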

This shared structure makes evaluation grounded: we can see whether the implementation follows the plan, whether the results truly support the claims, and whether the documentation accurately reflects the process. By unifying these outputs, we make research agents easier to compare and build upon. Over time, this shared format allows agents to speak a common research language, so we can start to measure their progress as scientists rather than as text generators.

MechEvalAgents for Grounded Evaluation in Mechanistic Interpretability

Once we have a unified format, we need a domain where that structure can actually be tested and scored. Mechanistic interpretability (Mech Interp) offers a good testbed. Two key properties make Mech Interp well suited for evaluating research agents: