
BENCHMARK Deep Research
PREPRINT 2026

DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Reports

Ruizhe Li1,* · Mingxuan Du1,* · Benfeng Xu1,2,† · Chiwei Zhu1 · Xiaorui Wang2 · Zhendong Mao1,§

1University of Science and Technology of China 2Metastone Technology, Beijing, China

{imlrz, dumingxuan}@mail.ustc.edu.cn

*Equal contribution †Project lead §Corresponding author

Figure: DeepResearch Bench II Framework Overview

Abstract

In recent years, integrating Large Language Models (LLMs) with deep research capabilities has produced systems that autonomously conduct multi-step online research and generate comprehensive reports. However, existing evaluation benchmarks for such systems fail to capture the full complexity of real-world research tasks.

We introduce DeepResearch Bench II, a benchmark designed to evaluate deep research systems across three critical dimensions: Information Recall, Analysis, and Presentation. These dimensions are assessed using fine-grained rubrics derived from 132 expert-authored research reports, ensuring that evaluations are both comprehensive and verifiable.

Benchmark Statistics

  • 132 research tasks, derived from expert-authored reports
  • 9,430 total rubrics, fine-grained and verifiable
  • 22 topic domains, providing comprehensive coverage
  • 300+ expert hours spent on review and refinement

Three-Dimensional Evaluation Framework

We deconstruct deep research tasks into three key dimensions to comprehensively evaluate system capabilities (a minimal scoring sketch follows the three lists below):


Information Recall

Evaluates whether the model can accurately and comprehensively retrieve relevant information from the internet.

  • Understand what information should be collected
  • Find relevant data from vast sources
  • Ensure accuracy through source validation

Analysis

Examines whether the system can synthesize gathered information and extract new insights.

  • Synthesize information from multiple sources
  • Extract hidden insights (trends, paradigms)
  • Generate conclusions beyond raw data

Presentation

Assesses whether findings are presented in an appropriate, user-friendly way.

  • Enable user trust and verification
  • Use tables, charts for clarity
  • Consider user's knowledge level
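
This page does not show the benchmark's concrete rubric format or judging pipeline, so the following is only a minimal sketch of rubric-based scoring: it assumes each rubric is a binary pass/fail check attached to one dimension, and that a dimension score is the percentage of its rubrics a generated report passes (matching the "rubric pass rates" reported in the leaderboard below). The Rubric dataclass and pass_rate helper are illustrative names, not benchmark code.

from dataclasses import dataclass

# Illustrative schema only: the benchmark's actual rubric format is not
# specified on this page. We assume a rubric is a binary, verifiable
# check attached to one of the three evaluation dimensions.
@dataclass
class Rubric:
    dimension: str   # "InfoRecall" | "Analysis" | "Presentation"
    criterion: str   # verifiable statement the report must satisfy
    passed: bool     # judged outcome for a given generated report

def pass_rate(rubrics: list[Rubric], dimension: str) -> float:
    """Return the percentage of rubrics in `dimension` that passed."""
    relevant = [r for r in rubrics if r.dimension == dimension]
    if not relevant:
        return 0.0
    return 100.0 * sum(r.passed for r in relevant) / len(relevant)

# Toy example: two InfoRecall rubrics and one Analysis rubric.
task_rubrics = [
    Rubric("InfoRecall", "Cites the primary source for the key statistic", True),
    Rubric("InfoRecall", "Covers all major vendors named in the task", False),
    Rubric("Analysis", "Identifies the underlying trend across sources", True),
]
print(pass_rate(task_rubrics, "InfoRecall"))  # 50.0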

Rubric Statistics

  • Avg. InfoRecall rubrics per task: 52.9
  • Avg. Analysis rubrics per task: 12.8
  • Avg. Presentation rubrics per task: 5.7
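
These averages are consistent with the headline totals: 52.9 + 12.8 + 5.7 ≈ 71.4 rubrics per task, and 71.4 × 132 tasks ≈ 9,430 rubrics overall. InfoRecall accounts for roughly three quarters of all rubrics, which is worth keeping in mind when reading the TotalScore column below.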

Citation

@article{li2026deepresearchbenchii,
  title={DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Reports},
  author={Li, Ruizhe and Du, Mingxuan and Xu, Benfeng and Zhu, Chiwei and Wang, Xiaorui and Mao, Zhendong},
  journal={arXiv preprint arXiv:2601.08536},
  year={2026}
}

Leaderboard

Comparison of SOTA Deep Research Agents based on 9,430 expert-written rubrics across 132 research tasks. Scores represent rubric pass rates (%).

Scoring Metrics

  • InfoRecall: accurately and comprehensively retrieve relevant information from the internet.
  • Analysis: synthesize information and extract new, hidden insights beyond raw data.
  • Presentation: present findings in a trustworthy, verifiable, and user-accessible manner.
  • TotalScore: weighted overall performance score.

Rows are ranked by TotalScore (best first).

#  | Model                                          | InfoRecall | Analysis | Presentation | TotalScore
1  | NVIDIA                                         | 49.23      | 61.55    | 93.15        | 54.50
2  | Huawei, 2026                                   | 43.94      | 56.12    | 90.08        | 49.34
3  | Huawei, 2026                                   | 40.07      | 60.44    | 86.54        | 46.90
4  | OpenAI-GPT-o3 Deep Research (OpenAI, 2025)     | 39.98      | 49.85    | 89.16        | 45.40
5  | Gemini-3-Pro Deep Research (Google, 2025)      | 39.09      | 48.94    | 91.85        | 44.60
6  | Gemini-2.5-Pro Deep Research (Google, 2024)    | 34.91      | 51.91    | 90.24        | 41.98
7  | Doubao Deep Research (ByteDance)               | 34.83      | 49.43    | 83.51        | 40.99
8  | Qwen3-Max Deep Research (Alibaba Cloud, 2025)  | 34.18      | 48.04    | 74.59        | 39.25
9  | Grok Deep Search (xAI)                         | 33.52      | 42.50    | 91.42        | 39.23
10 | Perplexity Research (Perplexity AI, 2025)      | 33.05      | 44.47    | 79.34        | 38.58
11 | Tongyi Deep Research (Alibaba, 2025)           | 22.95      | 35.89    | 86.13        | 29.89

* All evaluations conducted under identical conditions. Scores are rubric pass rates (%).

Methodology Notes

  • InfoRecall: Assesses whether the model can utilize its planning and reasoning abilities, along with search tools, to accurately and comprehensively retrieve relevant information from the internet.
  • Analysis: The model needs to synthesize all gathered information and extract new, hidden insights (such as trends or paradigms) from the data.
  • Presentation: After retrieving and analyzing information, the model must present findings in an appropriate way—ensuring users can trust and verify the information.
  • TotalScore: Weighted combination of all metrics, emphasizing content quality over presentation (a hypothetical weighting sketch follows this list).
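
The exact TotalScore weights are not given on this page. One reading consistent with "emphasizing content quality over presentation" is to weight each dimension by its average rubric count from the Rubric Statistics above, so that InfoRecall dominates. The sketch below is a hypothesis under that assumption, not the paper's formula: the AVG_RUBRICS weights come from this page's averages, and the residual gap against the reported scores suggests the real computation likely uses each task's actual rubric counts.

# Hypothetical weighting derived from the average rubric counts per
# dimension (52.9 / 12.8 / 5.7); NOT the paper's confirmed formula.
AVG_RUBRICS = {"InfoRecall": 52.9, "Analysis": 12.8, "Presentation": 5.7}

def total_score(scores: dict[str, float]) -> float:
    """Average of dimension pass rates, weighted by avg. rubric count."""
    weight_sum = sum(AVG_RUBRICS.values())
    return sum(scores[d] * AVG_RUBRICS[d] for d in AVG_RUBRICS) / weight_sum

# Leaderboard row 1 as a sanity check:
print(round(total_score(
    {"InfoRecall": 49.23, "Analysis": 61.55, "Presentation": 93.15}), 2))
# -> 54.94, close to (but not exactly) the reported 54.50, so the real
#    formula probably weights each task's actual rubric counts instead.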

Join the Leaderboard

If you would like to add your model to the leaderboard, please contact us at imlrz@mail.ustc.edu.cn or dumingxuan@mail.ustc.edu.cn.

Data Viewer

Browse Tasks, Rubrics & Research Content
