DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Reports
1University of Science and Technology of China 2Metastone Technology, Beijing, China
{imlrz, dumingxuan}@mail.ustc.edu.cn
*Equal contribution †Project lead §Corresponding author
Abstract
In recent years, the integration of Large Language Models (LLMs) with deep research capabilities has led to the development of systems that can autonomously conduct multi-step online research and generate comprehensive reports. However, existing evaluation benchmarks for such systems fail to fully capture the complexity of real-world research tasks.
We introduce DeepResearch Bench II, a benchmark designed to evaluate deep research systems across three critical dimensions: Information Recall, Analysis, and Presentation. These dimensions are assessed using fine-grained rubrics derived from 132 expert-authored research reports, ensuring that evaluations are both comprehensive and verifiable.
Benchmark Statistics
- Research Tasks: 132, each derived from an expert-authored research report
- Total Rubrics: 9,430 fine-grained, verifiable rubrics
- Topic Domains: comprehensive coverage
- Expert Hours: invested in review and refinement
Three-Dimensional Evaluation Framework
We deconstruct deep research tasks into three key dimensions to comprehensively evaluate system capabilities:
Information Recall
Evaluates whether the model can accurately and comprehensively retrieve relevant information from the internet.
- Understand what information should be collected
- Find relevant data from vast sources
- Ensure accuracy through source validation
Analysis
Examines whether the system can synthesize gathered information and extract new insights.
- Synthesize information from multiple sources
- Extract hidden insights (trends, paradigms)
- Generate conclusions beyond raw data
Presentation
Assesses whether findings are presented in an appropriate, user-friendly way.
- Enable user trust and verification
- Use tables, charts for clarity
- Consider user's knowledge level
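Each dimension is scored as the fraction of its rubrics a report satisfies. The sketch below shows one minimal way such per-dimension pass rates could be aggregated; the `(dimension, passed)` judgment format is an assumption for illustration, not the benchmark's actual data schema.

```python
from collections import defaultdict

def dimension_pass_rates(judgments):
    """Compute per-dimension rubric pass rates (%) from a list of
    (dimension, passed) judgments, e.g. as emitted by a rubric grader."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for dimension, ok in judgments:
        total[dimension] += 1
        passed[dimension] += bool(ok)
    return {d: 100.0 * passed[d] / total[d] for d in total}

# Toy example: 2 InfoRecall rubrics (1 passed), 1 Analysis, 2 Presentation.
judgments = [
    ("InfoRecall", True), ("InfoRecall", False),
    ("Analysis", True),
    ("Presentation", True), ("Presentation", True),
]
print(dimension_pass_rates(judgments))
# → {'InfoRecall': 50.0, 'Analysis': 100.0, 'Presentation': 100.0}
```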
Rubric Statistics
- Avg. InfoRecall rubrics per task
- Avg. Analysis rubrics per task
- Avg. Presentation rubrics per task
Citation
@article{li2026deepresearchbenchii,
title={DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Reports},
author={Li, Ruizhe and Du, Mingxuan and Xu, Benfeng and Zhu, Chiwei and Wang, Xiaorui and Mao, Zhendong},
journal={arXiv preprint arXiv:2601.08536},
year={2026}
}
Leaderboard
Comparison of SOTA Deep Research Agents based on 9,430 expert-written rubrics across 132 research tasks. Scores represent rubric pass rates (%).
Scoring Metrics
- InfoRecall: Accurately and comprehensively retrieve relevant information from the internet
- Analysis: Synthesize information and extract new, hidden insights beyond raw data
- Presentation: Present findings in a trustworthy, verifiable, and user-accessible manner
- TotalScore: Weighted overall performance score
| # | Model | InfoRecall | Analysis | Presentation | TotalScore |
|---|---|---|---|---|---|
| 1 | NVIDIA | 49.23 | 61.55 | 93.15 | 54.50 |
| 2 | Huawei, 2026 | 43.94 | 56.12 | 90.08 | 49.34 |
| 3 | Huawei, 2026 | 40.07 | 60.44 | 86.54 | 46.90 |
| 4 | OpenAI-GPT-o3 Deep Research (OpenAI, 2025) | 39.98 | 49.85 | 89.16 | 45.40 |
| 5 | Gemini-3-Pro Deep Research (Google, 2025) | 39.09 | 48.94 | 91.85 | 44.60 |
| 6 | Gemini-2.5-Pro Deep Research (Google, 2024) | 34.91 | 51.91 | 90.24 | 41.98 |
| 7 | Doubao Deep Research (ByteDance) | 34.83 | 49.43 | 83.51 | 40.99 |
| 8 | Qwen3-Max Deep Research (Alibaba Cloud, 2025) | 34.18 | 48.04 | 74.59 | 39.25 |
| 9 | Grok Deep Search (xAI) | 33.52 | 42.50 | 91.42 | 39.23 |
| 10 | Perplexity Research (Perplexity AI, 2025) | 33.05 | 44.47 | 79.34 | 38.58 |
| 11 | Tongyi Deep Research (Alibaba, 2025) | 22.95 | 35.89 | 86.13 | 29.89 |
* All evaluations conducted under identical conditions. Scores are rubric pass rates (%).
Methodology Notes
- InfoRecall: Assesses whether the model can utilize its planning and reasoning abilities, along with search tools, to accurately and comprehensively retrieve relevant information from the internet.
- Analysis: The model needs to synthesize all gathered information and extract new, hidden insights (such as trends or paradigms) from the data.
- Presentation: After retrieving and analyzing information, the model must present findings in an appropriate way—ensuring users can trust and verify the information.
- TotalScore: Weighted combination of all metrics, emphasizing content quality over presentation.
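The exact weights behind TotalScore are not published on this page. The sketch below shows the general shape of such a weighted combination; the `(0.6, 0.3, 0.1)` weights are purely illustrative placeholders chosen to emphasize content quality over presentation, not the benchmark's actual values.

```python
def total_score(info_recall, analysis, presentation,
                weights=(0.6, 0.3, 0.1)):
    """Weighted combination of the three dimension pass rates (%).

    The default weights are illustrative only: they favor content
    quality (InfoRecall, Analysis) over Presentation, as described
    in the methodology notes, but the benchmark's real weights are
    not stated here.
    """
    w_i, w_a, w_p = weights
    assert abs(w_i + w_a + w_p - 1.0) < 1e-9, "weights must sum to 1"
    return w_i * info_recall + w_a * analysis + w_p * presentation

# Using the top leaderboard row's dimension scores:
print(round(total_score(49.23, 61.55, 93.15), 2))  # → 57.32
```

With these placeholder weights the combined score (57.32) does not reproduce the published TotalScore of 54.50, which is expected: the real weighting is unknown.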
Join the Leaderboard
If you would like to add your model to the leaderboard, please contact us at imlrz@mail.ustc.edu.cn or dumingxuan@mail.ustc.edu.cn.
Data Viewer
Browse Tasks, Rubrics & Research Content