WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora
1University of Science and Technology of China 2Metastone Technology, Beijing, China
{wangpengyu, benfeng, zlczlc}@mail.ustc.edu.cn
*Work done during internship †Project lead §Corresponding author
Abstract
Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large-scale heterogeneous documents.
We introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia's unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real-world scenarios.
Specifically, we sample articles across 12 top-level topics, using their external references as the retrieval corpus and citation-linked statements as ground truth, resulting in 1,197 questions spanning three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization.
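Concretely, each benchmark item can be pictured as a record like the following. This is a minimal illustrative sketch: the field and value names are assumptions for exposition, not the released WildGraphBench schema.

```python
from dataclasses import dataclass

# Illustrative sketch of one benchmark item; field names are
# assumptions, not the released WildGraphBench schema.
@dataclass
class WildGraphItem:
    question: str
    question_type: str        # "single_fact", "multi_fact", or "section_summary"
    topic: str                # one of the 12 top-level Wikipedia topics
    gold_statements: list     # citation-linked statements serving as ground truth
    reference_ids: list       # external reference documents backing the statements

# Hypothetical example item (question text and reference id are invented).
item = WildGraphItem(
    question="In what year was the observatory founded?",
    question_type="single_fact",
    topic="History",
    gold_statements=["The observatory was founded in 1839."],
    reference_ids=["ref_004"],
)
# Single-fact items are grounded by one gold statement and one reference.
assert len(item.gold_statements) == 1 and len(item.reference_ids) == 1
```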
Benchmark Statistics
- Questions: 1,197 (3 question types)
- Topics: 12 Wikipedia categories
- Tokens: reference corpus size
- Avg. References: per Wikipedia article
Three Question Types
We design three question types that span the spectrum from precise retrieval to broad factual coverage:
- Single-Fact QA: questions grounded by a single gold statement and one reference.
- Multi-Fact QA: questions requiring evidence aggregation across multiple statements and references.
- Section Summary: section-level summary questions evaluated at the statement level.
Topic Coverage (questions per topic)
- Culture: 155
- Geography: 98
- Health: 150
- History: 36
- Human Act.: 140
- Mathematics: 33
- Nature: 28
- People: 154
- Philosophy: 70
- Religion: 106
- Society: 114
- Technology: 113
Evaluation Framework
QA Accuracy
For single-fact and multi-fact questions:
- Each question has one gold statement
- LLM judge checks factual equivalence
- Binary score: 1 (correct) or 0 (incorrect)
Summary Score
Statement-level evaluation metrics:
- Extract predicted statements from output
- Match against gold statement set
- Compute Precision, Recall, and F1
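The statement-level metrics above can be sketched as follows. This is a minimal, runnable approximation: `is_match` stands in for the LLM judge that decides factual equivalence between a predicted and a gold statement, and is replaced here by normalized exact match so the sketch executes.

```python
def statement_f1(predicted, gold, is_match):
    """Statement-level precision/recall/F1 (sketch).

    `is_match(p, g)` stands in for the LLM judge of factual
    equivalence; each gold statement may be matched at most once.
    """
    matched_gold = set()
    tp = 0
    for p in predicted:
        for i, g in enumerate(gold):
            if i not in matched_gold and is_match(p, g):
                matched_gold.add(i)
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy stand-in for the judge: exact match after light normalization.
norm = lambda s: s.lower().strip(".")
pred = ["The bridge opened in 1932.", "It spans the harbour."]
gold = ["The bridge opened in 1932.",
        "It was designed by J. Bradfield.",
        "It spans the harbour."]
p, r, f1 = statement_f1(pred, gold, lambda a, b: norm(a) == norm(b))
# Here precision = 1.0, recall = 2/3, F1 = 0.8.
```

For single-fact and multi-fact QA the same judge reduces to a binary score: the prediction is correct (1) if it matches the gold statement, else incorrect (0).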
Key Findings
- GraphRAG is not always advantageous: it can cost more than NaiveRAG or BM25 without clear gains on single-fact lookup.
- Graph-based retrieval shines on multi-fact questions: Microsoft GraphRAG (global) achieves the best accuracy (47.64%) on questions requiring cross-document evidence aggregation.
- Summary questions remain challenging: all methods obtain low statement-level scores, with NaiveRAG achieving the highest recall and best F1 thanks to broader context coverage.
- Hub-and-spoke patterns: the constructed graph exhibits a very large maximum node degree (967), indicating hub entities that stress cross-document, multi-source summarization.
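The hub-and-spoke finding can be checked on any entity graph by computing the maximum node degree. A minimal sketch on a toy edge list (the node names are invented; in the benchmark's graph the maximum degree reaches 967):

```python
from collections import defaultdict

# Toy entity co-occurrence graph as an undirected edge list.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

# Count each node's degree.
degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# The hub is the node with maximum degree.
hub, max_deg = max(degree.items(), key=lambda kv: kv[1])
assert (hub, max_deg) == ("A", 3)  # "A" touches three edges in this toy graph
```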
Citation
@misc{wang2026wildgraphbenchbenchmarkinggraphragwildsource,
title={WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora},
author={Pengyu Wang and Benfeng Xu and Licheng Zhang and Shaohan Wang and Mingxuan Du and Chiwei Zhu and Zhendong Mao},
year={2026},
eprint={2602.02053},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.02053},
}