Existing Multimodal Large Language Models (MLLMs) suffer significant performance degradation on long document understanding as document length increases. This stems from two fundamental challenges: (1) a low Signal-to-Noise Ratio (SNR), where crucial evidence is easily buried among irrelevant pages, and (2) supervision scarcity, since datasets that provide only final short answers offer a weak learning signal. In this paper, we address these challenges with DocSeeker, a model trained to execute a structured Analysis–Localization–Reasoning (ALR) workflow. To instill this capability, we design a two-stage training framework. We first perform supervised fine-tuning on high-quality data generated through an efficient knowledge distillation strategy. We then employ Evidence-aware Group Relative Policy Optimization (GRPO) to jointly optimize evidence localization and answer accuracy. In addition, we introduce an Evidence-Guided Resolution Allocation (EGRA) strategy to mitigate the memory constraints of training on multi-page documents. Extensive experiments show that DocSeeker achieves superior performance on both in-domain and out-of-domain benchmarks. The model generalizes robustly from training on short documents to ultra-long documents and integrates naturally with visual Retrieval-Augmented Generation systems, providing a strong foundation for building practical long-document reasoning pipelines.
DocSeeker follows a structured visual reasoning pipeline. Instead of directly predicting an answer from a long document, the model is trained to first analyze the question, then localize the most relevant evidence pages, and finally perform grounded reasoning over the identified evidence. This explicit reasoning path improves interpretability and helps the model resist the heavy noise introduced by long multi-page inputs.
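To make the workflow concrete, the sketch below shows one plausible way to represent and parse a tagged ALR trace. The tag names (`<analysis>`, `<evidence>`, `<answer>`) and the `parse_alr` helper are illustrative assumptions, not the paper's exact output format.

```python
import re
from dataclasses import dataclass

@dataclass
class ALRTrace:
    analysis: str              # what the question asks for and where to look
    evidence_pages: list[int]  # page indices the model cites as evidence
    answer: str                # final answer grounded in the cited pages

def parse_alr(response: str) -> ALRTrace:
    """Parse a tagged ALR response, e.g.
    <analysis>...</analysis> <evidence>3, 7</evidence> <answer>...</answer>
    Missing tags yield empty fields rather than raising, so malformed
    generations can be penalized downstream instead of crashing."""
    def block(tag: str) -> str:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        return m.group(1).strip() if m else ""

    pages = [int(p) for p in re.findall(r"\d+", block("evidence"))]
    return ALRTrace(block("analysis"), pages, block("answer"))
```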
Training proceeds in two stages. Stage I instills the ALR reasoning paradigm through supervised fine-tuning on distilled ALR chain-of-thought data. Stage II applies Evidence-aware GRPO to jointly optimize output format, evidence localization, and final answer accuracy. Throughout training, EGRA keeps evidence pages at high resolution while selectively reducing computation on non-evidence pages, enabling more efficient long-document learning without sacrificing critical visual detail. Both mechanisms are sketched below.
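As a concrete illustration, here is a hedged sketch of the kind of evidence-aware reward Stage II could use: a weighted sum of format validity, evidence-page F1, and answer exact match, reusing the `ALRTrace` structure from the sketch above. The weights and the exact-match answer metric are assumptions, not the paper's reported configuration.

```python
def evidence_aware_reward(trace: ALRTrace,
                          gold_pages: set[int],
                          gold_answer: str,
                          w_format: float = 0.2,  # assumed weights,
                          w_loc: float = 0.3,     # not the paper's values
                          w_ans: float = 0.5) -> float:
    # Format reward: all three ALR fields must be present.
    r_format = float(bool(trace.analysis and trace.evidence_pages and trace.answer))

    # Localization reward: F1 between predicted and gold evidence pages.
    pred = set(trace.evidence_pages)
    hits = len(pred & gold_pages)
    if hits:
        p, r = hits / len(pred), hits / len(gold_pages)
        r_loc = 2 * p * r / (p + r)
    else:
        r_loc = 0.0

    # Answer reward: case-insensitive exact match as a simple proxy metric.
    r_ans = float(trace.answer.strip().lower() == gold_answer.strip().lower())

    return w_format * r_format + w_loc * r_loc + w_ans * r_ans
```

In GRPO, this scalar reward would be computed for each sampled completion in a group, with advantages taken relative to the group mean. Likewise, EGRA can be sketched as a simple resolution policy: evidence pages keep full resolution while the remaining pages are downscaled to bound the visual-token budget. The 0.25 scale factor below is an illustrative assumption.

```python
from PIL import Image

def allocate_resolution(pages: list[Image.Image],
                        evidence_idx: set[int],
                        low_scale: float = 0.25) -> list[Image.Image]:
    out = []
    for i, page in enumerate(pages):
        if i in evidence_idx:
            out.append(page)  # evidence pages stay at full resolution
        else:
            w, h = page.size  # downscale non-evidence pages to save memory
            out.append(page.resize((max(1, int(w * low_scale)),
                                    max(1, int(h * low_scale)))))
    return out
```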
DocSeeker delivers consistent gains across both in-domain and out-of-domain benchmarks. Compared with its 7B baseline, it achieves clear improvements on DUDE, MP-DocVQA, MMLongBench-Doc, LongDocURL, and SlideVQA. These results indicate that the ALR reasoning paradigm and evidence-aware optimization substantially strengthen long-document understanding and generalization.
@inproceedings{yan2026docseeker,
  title     = {DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding},
  author    = {Hao Yan and Yuliang Liu and Xingchen Liu and Yuyi Zhang and Minghui Liao and Jihao Wu and Wei Chen and Xiang Bai},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}