WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora
1University of Science and Technology of China 2Metastone Technology, Beijing, China
{wangpengyu, benfeng, zlczlc}@mail.ustc.edu.cn
*Work done during internship †Project lead §Corresponding author
Abstract
Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large-scale heterogeneous documents.
We introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia's unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real-world scenarios.
Specifically, we sample articles across 12 top-level topics, using their external references as the retrieval corpus and citation-linked statements as ground truth, resulting in 1,197 questions spanning three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization.
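Concretely, each benchmark item can be pictured as a record like the following. This is a minimal illustrative sketch: the field and value names are assumptions for exposition, not the released WildGraphBench schema.

```python
from dataclasses import dataclass

# Illustrative sketch of one benchmark item; field names are
# assumptions, not the released WildGraphBench schema.
@dataclass
class WildGraphItem:
    question: str
    question_type: str        # "single_fact", "multi_fact", or "section_summary"
    topic: str                # one of the 12 top-level Wikipedia topics
    gold_statements: list     # citation-linked statements serving as ground truth
    reference_ids: list       # external reference documents backing the statements

# Hypothetical example item (question text and reference id are invented).
item = WildGraphItem(
    question="In what year was the observatory founded?",
    question_type="single_fact",
    topic="History",
    gold_statements=["The observatory was founded in 1839."],
    reference_ids=["ref_004"],
)
# Single-fact items are grounded by one gold statement and one reference.
assert len(item.gold_statements) == 1 and len(item.reference_ids) == 1
```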
Benchmark Statistics
- Questions: 1,197 (3 question types)
- Topics: 12 Wikipedia categories
- Tokens: reference corpus size
- Avg. References: per Wikipedia article
Three Question Types
We design three question types that span the spectrum from precise retrieval to broad factual coverage:
- Single-Fact QA: questions grounded by a single gold statement and one reference.
- Multi-Fact QA: questions requiring evidence aggregation across multiple statements and references.
- Section Summary: section-level summary questions evaluated at the statement level.
Topic Coverage (questions per topic)
- Culture: 155
- Geography: 98
- Health: 150
- History: 36
- Human Act.: 140
- Mathematics: 33
- Nature: 28
- People: 154
- Philosophy: 70
- Religion: 106
- Society: 114
- Technology: 113
Evaluation Framework
QA Accuracy
For single-fact and multi-fact questions:
- Each question has one gold statement
- LLM judge checks factual equivalence
- Binary score: 1 (correct) or 0 (incorrect)
Summary Score
Statement-level evaluation metrics:
- Extract predicted statements from output
- Match against gold statement set
- Compute Precision, Recall, and F1
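The statement-level metrics above can be sketched as follows. This is a minimal, runnable approximation: `is_match` stands in for the LLM judge that decides factual equivalence between a predicted and a gold statement, and is replaced here by normalized exact match so the sketch executes.

```python
def statement_f1(predicted, gold, is_match):
    """Statement-level precision/recall/F1 (sketch).

    `is_match(p, g)` stands in for the LLM judge of factual
    equivalence; each gold statement may be matched at most once.
    """
    matched_gold = set()
    tp = 0
    for p in predicted:
        for i, g in enumerate(gold):
            if i not in matched_gold and is_match(p, g):
                matched_gold.add(i)
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy stand-in for the judge: exact match after light normalization.
norm = lambda s: s.lower().strip(".")
pred = ["The bridge opened in 1932.", "It spans the harbour."]
gold = ["The bridge opened in 1932.",
        "It was designed by J. Bradfield.",
        "It spans the harbour."]
p, r, f1 = statement_f1(pred, gold, lambda a, b: norm(a) == norm(b))
# Here precision = 1.0, recall = 2/3, F1 = 0.8.
```

For single-fact and multi-fact QA the same judge reduces to a binary score: the prediction is correct (1) if it matches the gold statement, else incorrect (0).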
Key Findings
- GraphRAG is not always advantageous: it can cost more than NaiveRAG or BM25 without clear gains on single-fact lookup.
- Graph-based retrieval shines on multi-fact questions: Microsoft GraphRAG (global) achieves the best accuracy (47.64%) on questions requiring cross-document evidence aggregation.
- Summary questions remain challenging: all methods obtain low statement-level scores, with NaiveRAG achieving the highest recall and best F1 thanks to broader context coverage.
- Hub-and-spoke patterns: the constructed graph exhibits a very large maximum node degree (967), indicating hub entities that stress cross-document, multi-source summarization.
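The hub-and-spoke finding can be checked on any entity graph by computing the maximum node degree. A minimal sketch on a toy edge list (the node names are invented; in the benchmark's graph the maximum degree reaches 967):

```python
from collections import defaultdict

# Toy entity co-occurrence graph as an undirected edge list.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

# Count each node's degree.
degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# The hub is the node with maximum degree.
hub, max_deg = max(degree.items(), key=lambda kv: kv[1])
assert (hub, max_deg) == ("A", 3)  # "A" touches three edges in this toy graph
```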
Citation
@misc{wang2026wildgraphbenchbenchmarkinggraphragwildsource,
title={WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora},
author={Pengyu Wang and Benfeng Xu and Licheng Zhang and Shaohan Wang and Mingxuan Du and Chiwei Zhu and Zhendong Mao},
year={2026},
eprint={2602.02053},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.02053},
}