~/Benchmarks
Listing 5 items (4 available, 1 coming soon)
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
A comprehensive benchmark for evaluating deep research agents on complex scientific research tasks.
DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Reports
A diagnostic benchmark for deep research agents using expert-derived rubrics to evaluate report quality.
Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles
A live benchmark leveraging Wikipedia Good Articles as expert-level references to evaluate deep research agents.
WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora
A benchmark that assesses GraphRAG performance in the wild, using Wikipedia's external references as the retrieval corpus. It features 1,197 questions across 12 topics.
MCP Agent Bench
Details will be announced soon.