|
|
--- |
|
|
datasets: |
|
|
- letxbe/BoundingDocs |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: visual-question-answering |
|
|
tags: |
|
|
- Visual-Question-Answering |
|
|
- Question-Answering |
|
|
- Document |
|
|
license: apache-2.0 |
|
|
--- |
|
|
|
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
<h1>DocExplainer: Document VQA with Bounding Box Localization</h1> |
|
|
|
|
|
</div> |
|
|
|
|
|
DocExplainer is an approach to Document Visual Question Answering (Document VQA) with bounding box localization.
|
|
Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable. |
|
|
It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding. |
|
|
|
|
|
- **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai |
|
|
- **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it) |
|
|
- **License:** apache-2.0 |
|
|
- **Paper:** ["Towards Reliable and Interpretable Document Question Answering via VLMs"](https://arxiv.org/abs/2509.10129) by Alessio Chen et al. |
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://cdn.prod.website-files.com/655f447668b4ad1dd3d4b3d9/664cc272c3e176608bc14a4c_LOGO%20v0%20-%20LetXBebicolore.svg" alt="letxbe ai logo" width="200"> |
|
|
<img src="https://www.dinfo.unifi.it/upload/notizie/Logo_Dinfo_web%20(1).png" alt="Logo Unifi" width="200"> |
|
|
</div> |
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
DocExplainer is a fine-tuned regressor built on [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384) that predicts bounding box coordinates for answer localization in document images. The system operates in two stages:
|
|
|
|
|
1. **Question Answering**: Any VLM is used as a black-box component to generate a textual answer from a document image and a question.
|
|
2. **Bounding Box Explanation**: DocExplainer takes the image, question, and generated answer to predict the coordinates of the supporting evidence. |
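In code, the two stages reduce to two calls. A minimal sketch, with `vlm_answer` standing in for any black-box VLM wrapper (a hypothetical helper; see the Quick Start below for a full runnable example):

```python
from typing import Callable, List, Tuple
from PIL import Image

def answer_with_evidence(
    image: Image.Image,
    question: str,
    vlm_answer: Callable[[Image.Image, str], str],  # any black-box VLM wrapper
    explainer,                                      # the DocExplainer model
) -> Tuple[str, List[float]]:
    answer = vlm_answer(image, question)        # Stage 1: textual answer
    bbox = explainer.predict(image, answer)     # Stage 2: normalized [x1, y1, x2, y2]
    return answer, bbox
```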
|
|
|
|
|
|
|
|
## Model Architecture |
|
|
DocExplainer builds on [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings. |
|
|
|
|
|
 |
|
|
|
|
|
## Training Procedure |
|
|
- Visual and textual embeddings from SigLIP2 are projected into a shared latent space and fused via fully connected layers (see the sketch after this list).
|
|
- A regression head outputs normalized coordinates `[x1, y1, x2, y2]`. |
|
|
- **Backbone**: SigLIP2 Giant (frozen).
|
|
- **Loss Function**: Smooth L1 (Huber loss) applied to normalized coordinates in [0,1]. |
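As an illustration of this design, here is a minimal PyTorch sketch of the fusion-and-regression head. Layer sizes and names are illustrative assumptions, not the exact released architecture:

```python
import torch
import torch.nn as nn

class BBoxRegressionHead(nn.Module):
    """Illustrative fusion + regression head; dimensions are assumptions."""
    def __init__(self, img_dim: int = 1536, txt_dim: int = 1536, hidden: int = 512):
        super().__init__()
        # Project frozen SigLIP2 image/text embeddings into a shared latent space.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Fuse via fully connected layers and regress 4 normalized coordinates.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4), nn.Sigmoid(),  # keep outputs in [0, 1]
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1)
        return self.fusion(fused)  # [x1, y1, x2, y2]

# Smooth L1 (Huber) loss on normalized coordinates.
criterion = nn.SmoothL1Loss()
```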
|
|
|
|
|
### Training Setup
|
|
- **Dataset**: [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs) |
|
|
- **Epochs**: 20 |
|
|
- **Optimizer**: AdamW |
|
|
- **Hardware**: 1 × NVIDIA L40S-1-48G GPU |
|
|
- **Model Selection**: Best checkpoint chosen by highest mean IoU on the validation split. |
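For reference, the mean IoU used for model selection can be computed as in this minimal sketch (normalized `[x1, y1, x2, y2]` boxes assumed):

```python
def iou(box_a, box_b):
    """Intersection over Union between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(predictions, ground_truths):
    """Average IoU across a validation split."""
    return sum(iou(p, g) for p, g in zip(predictions, ground_truths)) / len(predictions)
```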
|
|
|
|
|
|
|
|
|
|
|
## Quick Start |
|
|
|
|
|
Here is a simple example of how to use `DocExplainer` to get an answer and its corresponding bounding box from a document image. |
|
|
|
|
|
```python
from PIL import Image
import requests
import torch
from transformers import AutoModel, AutoModelForImageTextToText, AutoProcessor
import json

# Load a sample document image and define the question.
url = "https://i.postimg.cc/BvftyvS3/image-1d100e9.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "What is the invoice number?"

# -----------------------
# 1. Load SmolVLM2-2.2B for answer generation
# -----------------------
vlm_model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package; remove if unavailable
)
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

PROMPT = """Based only on the document image, answer the following question:
Question: {QUESTION}
Provide ONLY a JSON response in the following format (no trailing commas!):
{{
    "content": "answer"
}}
"""

prompt_text = PROMPT.format(QUESTION=question)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt_text},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(vlm_model.device, dtype=torch.bfloat16)

# Generate, then keep only the newly produced tokens.
input_length = inputs['input_ids'].shape[1]
generated_ids = vlm_model.generate(**inputs, do_sample=False, max_new_tokens=2056)
output_ids = generated_ids[:, input_length:]
generated_texts = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
)

# Strip the chat prefix and parse the JSON answer.
decoded_output = generated_texts[0].replace("Assistant:", "", 1).strip()
answer = json.loads(decoded_output)['content']

print(f"Answer: {answer}")

# -----------------------
# 2. Load DocExplainer for bounding box prediction
# -----------------------
explainer = AutoModel.from_pretrained("letxbe/DocExplainer", trust_remote_code=True)
bbox = explainer.predict(image, answer)
print(f"Predicted bounding box (normalized): {bbox}")
```
|
|
|
|
|
|
|
|
<table> |
|
|
<tr> |
|
|
<td width="50%" valign="top"> |
|
|
Example Output: |
|
|
|
|
|
**Question**: What is the invoice number? <br> |
|
|
**Answer**: 3Y8M2d-846<br><br> |
|
|
**Predicted BBox**: [0.6353235244750977, 0.03685223311185837, 0.8617828488349915, 0.058749228715896606] <br> |
|
|
</td> |
|
|
<td width="50%" valign="top"> |
|
|
Visualized Answer Location: |
|
|
<img src="https://i.postimg.cc/0NmBM0b1/invoice-explained.png" alt="Invoice with predicted bounding box" width="100%"> |
|
|
</td> |
|
|
</tr> |
|
|
</table> |
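To reproduce a visualization like the one above, the normalized box can be scaled to pixel coordinates and drawn with Pillow. A minimal sketch, assuming `image` and `bbox` come from the Quick Start snippet:

```python
from PIL import ImageDraw

def draw_bbox(image, bbox, color="red", width=3):
    """Draw a normalized [x1, y1, x2, y2] box on a copy of a PIL image."""
    w, h = image.size
    pixel_box = [bbox[0] * w, bbox[1] * h, bbox[2] * w, bbox[3] * h]
    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(pixel_box, outline=color, width=width)
    return annotated

draw_bbox(image, bbox).save("invoice_explained.png")
```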
|
|
|
|
|
|
|
|
## Performance |
|
|
|
|
|
| Architecture                 | Prompting | ANLS  | MeanIoU |
|------------------------------|-----------|-------|---------|
| SmolVLM2-2.2B                | Zero-shot | 0.527 | 0.011   |
|                              | Anchors   | 0.543 | 0.026   |
|                              | CoT       | 0.561 | 0.011   |
| Qwen2-VL-7B                  | Zero-shot | 0.691 | 0.048   |
|                              | Anchors   | 0.694 | 0.051   |
|                              | CoT       | <ins>0.720</ins> | 0.038 |
| Claude Sonnet 4              | Zero-shot | **0.737** | 0.031 |
| SmolVLM2-2.2B + DocExplainer | Zero-shot | 0.572 | 0.175   |
| Qwen2-VL-7B + DocExplainer   | Zero-shot | 0.689 | 0.188   |
| SmolVLM2-2.2B + Naive OCR    | Zero-shot | 0.556 | <ins>0.405</ins> |
| Qwen2-VL-7B + Naive OCR      | Zero-shot | 0.690 | **0.494** |
|
|
|
|
|
|
|
|
Document VQA performance of different models and prompting strategies on the [BoundingDocs v2.0 dataset](https://huggingface.co/datasets/letxbe/BoundingDocs). <br> |
|
|
The best value is shown in **bold**, the second-best value is <ins>underlined</ins>. |
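ANLS (Average Normalized Levenshtein Similarity) scores each prediction as one minus the normalized edit distance to the ground truth, zeroing out low-similarity matches. A minimal sketch of the per-sample score, assuming the commonly used 0.5 threshold:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (single-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    """Normalized Levenshtein similarity, thresholded at tau."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    nl = levenshtein(pred, gold) / max(len(pred), len(gold), 1)
    return 1.0 - nl if nl < tau else 0.0
```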
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use `DocExplainer`, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{chen2025reliableinterpretabledocumentquestion,
      title={Towards Reliable and Interpretable Document Question Answering via VLMs},
      author={Alessio Chen and Simone Giovannini and Andrea Gemelli and Fabio Coppini and Simone Marinai},
      year={2025},
      eprint={2509.10129},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.10129},
}
|
|
``` |
|
|
|
|
|
## Limitations |
|
|
- **Prototype only**: Intended as a first approach, not a production-ready solution. |
|
|
- **Dataset constraints**: Current evaluation is limited to cases where an answer fits in a single bounding box. Answers that require reasoning over multiple regions, or that are not fully captured by OCR, cannot be properly localized.
|
|
|
|
|
|
|
|
|
|
|
|