DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Reports
1University of Science and Technology of China 2Metastone Technology, Beijing, China
{imlrz, dumingxuan}@mail.ustc.edu.cn
*Equal contribution †Project lead §Corresponding author
Abstract
In recent years, the integration of Large Language Models (LLMs) with deep research capabilities has led to the development of systems that can autonomously conduct multi-step online research and generate comprehensive reports. However, existing evaluation benchmarks for such systems fail to fully capture the complexity of real-world research tasks.
We introduce DeepResearch Bench II, a benchmark designed to evaluate deep research systems across three critical dimensions: Information Recall, Analysis, and Presentation. These dimensions are assessed using fine-grained rubrics derived from 132 expert-authored research reports, ensuring that evaluations are both comprehensive and verifiable.
Benchmark Statistics
- Research Tasks: 132, each derived from an expert-authored research report
- Total Rubrics: 9,430 fine-grained, verifiable rubrics
- Topic Domains: comprehensive coverage
- Expert Hours: invested in review and refinement
Three-Dimensional Evaluation Framework
We deconstruct deep research tasks into three key dimensions to comprehensively evaluate system capabilities:
Information Recall
Evaluates whether the model can accurately and comprehensively retrieve relevant information from the internet.
- Understand what information should be collected
- Find relevant data from vast sources
- Ensure accuracy through source validation
Analysis
Examines whether the system can synthesize gathered information and extract new insights.
- Synthesize information from multiple sources
- Extract hidden insights (trends, paradigms)
- Generate conclusions beyond raw data
Presentation
Assesses whether findings are presented in an appropriate, user-friendly way.
- Enable user trust and verification
- Use tables, charts for clarity
- Consider user's knowledge level
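Each dimension is scored as the fraction of its rubrics a report satisfies. The sketch below shows one minimal way such per-dimension pass rates could be aggregated; the `(dimension, passed)` judgment format is an assumption for illustration, not the benchmark's actual data schema.

```python
from collections import defaultdict

def dimension_pass_rates(judgments):
    """Compute per-dimension rubric pass rates (%) from a list of
    (dimension, passed) judgments, e.g. as emitted by a rubric grader."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for dimension, ok in judgments:
        total[dimension] += 1
        passed[dimension] += bool(ok)
    return {d: 100.0 * passed[d] / total[d] for d in total}

# Toy example: 2 InfoRecall rubrics (1 passed), 1 Analysis, 2 Presentation.
judgments = [
    ("InfoRecall", True), ("InfoRecall", False),
    ("Analysis", True),
    ("Presentation", True), ("Presentation", True),
]
print(dimension_pass_rates(judgments))
# → {'InfoRecall': 50.0, 'Analysis': 100.0, 'Presentation': 100.0}
```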
Rubric Statistics
- Avg. InfoRecall rubrics per task
- Avg. Analysis rubrics per task
- Avg. Presentation rubrics per task
Citation
@article{li2026deepresearchbenchii,
title={DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Reports},
author={Li, Ruizhe and Du, Mingxuan and Xu, Benfeng and Zhu, Chiwei and Wang, Xiaorui and Mao, Zhendong},
journal={arXiv preprint arXiv:2601.08536},
year={2026}
}
Leaderboard
Comparison of SOTA Deep Research Agents based on 9,430 expert-written rubrics across 132 research tasks. Scores represent rubric pass rates (%).
Scoring Metrics
- InfoRecall: Accurately and comprehensively retrieve relevant information from the internet
- Analysis: Synthesize information and extract new, hidden insights beyond raw data
- Presentation: Present findings in a trustworthy, verifiable, and user-accessible manner
- TotalScore: Weighted overall performance score
| # | Model | InfoRecall | Analysis | Presentation | TotalScore |
|---|---|---|---|---|---|
| 1 | NVIDIA | 49.23 | 61.55 | 93.15 | 54.50 |
| 2 | Huawei, 2026 | 43.94 | 56.12 | 90.08 | 49.34 |
| 3 | Huawei, 2026 | 40.07 | 60.44 | 86.54 | 46.90 |
| 4 | OpenAI-GPT-o3 Deep Research (OpenAI, 2025) | 39.98 | 49.85 | 89.16 | 45.40 |
| 5 | Gemini-3-Pro Deep Research (Google, 2025) | 39.09 | 48.94 | 91.85 | 44.60 |
| 6 | Gemini-2.5-Pro Deep Research (Google, 2024) | 34.91 | 51.91 | 90.24 | 41.98 |
| 7 | Doubao Deep Research (ByteDance) | 34.83 | 49.43 | 83.51 | 40.99 |
| 8 | Qwen3-Max Deep Research (Alibaba Cloud, 2025) | 34.18 | 48.04 | 74.59 | 39.25 |
| 9 | Grok Deep Search (xAI) | 33.52 | 42.50 | 91.42 | 39.23 |
| 10 | Perplexity Research (Perplexity AI, 2025) | 33.05 | 44.47 | 79.34 | 38.58 |
| 11 | Tongyi Deep Research (Alibaba, 2025) | 22.95 | 35.89 | 86.13 | 29.89 |
* All evaluations conducted under identical conditions. Scores are rubric pass rates (%).
Methodology Notes
- InfoRecall: Assesses whether the model can utilize its planning and reasoning abilities, along with search tools, to accurately and comprehensively retrieve relevant information from the internet.
- Analysis: The model needs to synthesize all gathered information and extract new, hidden insights (such as trends or paradigms) from the data.
- Presentation: After retrieving and analyzing information, the model must present findings in an appropriate way—ensuring users can trust and verify the information.
- TotalScore: Weighted combination of all metrics, emphasizing content quality over presentation.
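The exact weights behind TotalScore are not published on this page. The sketch below shows the general shape of such a weighted combination; the `(0.6, 0.3, 0.1)` weights are purely illustrative placeholders chosen to emphasize content quality over presentation, not the benchmark's actual values.

```python
def total_score(info_recall, analysis, presentation,
                weights=(0.6, 0.3, 0.1)):
    """Weighted combination of the three dimension pass rates (%).

    The default weights are illustrative only: they favor content
    quality (InfoRecall, Analysis) over Presentation, as described
    in the methodology notes, but the benchmark's real weights are
    not stated here.
    """
    w_i, w_a, w_p = weights
    assert abs(w_i + w_a + w_p - 1.0) < 1e-9, "weights must sum to 1"
    return w_i * info_recall + w_a * analysis + w_p * presentation

# Using the top leaderboard row's dimension scores:
print(round(total_score(49.23, 61.55, 93.15), 2))  # → 57.32
```

With these placeholder weights the combined score (57.32) does not reproduce the published TotalScore of 54.50, which is expected: the real weighting is unknown.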
Join the Leaderboard
If you would like to add your model to the leaderboard, please contact us at imlrz@mail.ustc.edu.cn or dumingxuan@mail.ustc.edu.cn.
Data Viewer
Browse Tasks, Rubrics & Research Content