AdithyaSK committed (verified)
Commit 64a5609 · Parent: 7d37ff2

Update README.md

Files changed (1): README.md (+65 -5)
README.md CHANGED
@@ -12,9 +12,18 @@ pipeline_tag: visual-document-retrieval
  base_model:
  - google/gemma-3-4b-it
  ---
-
  # ColNetraEmbed

+ ![Group 54 (1)](https://cdn-uploads.huggingface.co/production/uploads/6442d975ad54813badc1ddf7/-fYMikXhSuqRqm-UIdulK.png)
+
+
+ [![Paper](https://img.shields.io/badge/arXiv-2512.03514-b31b1b.svg)](https://arxiv.org/abs/2512.03514)
+ [![GitHub](https://img.shields.io/badge/GitHub-colpali-181717?logo=github)](https://github.com/adithya-s-k/colpali)
+ [![Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-yellow)](https://huggingface.co/Cognitive-Lab/ColNetraEmbed)
+ [![Blog](https://img.shields.io/badge/Blog-CognitiveLab-blue)](https://www.cognitivelab.in/blog/introducing-netraembed)
+ [![Demo](https://img.shields.io/badge/Demo-Try%20it%20out-green)](https://cloud.cognitivelab.in)
+
+
  **ColNetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, powered by the Gemma3 backbone and using Colbert-style multi-vector representations.

  ## Model Description
@@ -48,7 +57,7 @@ from colpali_engine.models import ColGemma3, ColGemmaProcessor3
  model_name = "Cognitive-Lab/ColNetraEmbed"
  model = ColGemma3.from_pretrained(
      model_name,
-     dtype=torch.bfloat16,
+     torch_dtype=torch.bfloat16,
      device_map="cuda",
  )
  processor = ColGemmaProcessor3.from_pretrained(model_name)
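The hunk above only renames the `dtype` keyword to `torch_dtype` in the README's loading snippet. For readers skimming the diff, here is a minimal sketch of how the loaded model and processor are typically used downstream; it assumes `ColGemmaProcessor3` follows the standard `colpali_engine` processor interface (`process_images`, `process_queries`, `score_multi_vector`), and the page images and queries are placeholders rather than lines from the README.

```python
import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Illustrative continuation of the snippet changed above (interface assumed to
# match the upstream colpali_engine processors; file paths are placeholders).
model = ColGemma3.from_pretrained(
    "Cognitive-Lab/ColNetraEmbed",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
processor = ColGemmaProcessor3.from_pretrained("Cognitive-Lab/ColNetraEmbed")

images = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["What was the quarterly revenue?", "Which languages are supported?"]

with torch.no_grad():
    # Each page and each query becomes a bag of token-level vectors (multi-vector).
    image_batch = processor.process_images(images).to(model.device)
    query_batch = processor.process_queries(queries).to(model.device)
    image_embeddings = model(**image_batch)
    query_embeddings = model(**query_batch)

# Late-interaction (MaxSim) scores: one score per (query, page) pair.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores.shape)  # (len(queries), len(images))
```

The scoring loop referenced in the next hunk's header (`for i, query in enumerate(queries):`) would then iterate over `scores` to pick the best-matching page per query.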
@@ -94,7 +103,7 @@ for i, query in enumerate(queries):

  ## Model Details

- - **Base Model:** Gemma3-2B
+ - **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
  - **Vision Encoder:** SigLIP
  - **Training Data:** Multilingual document datasets
  - **Embedding Strategy:** Multi-vector (Late Interaction)
@@ -102,7 +111,56 @@ for i, query in enumerate(queries):

  ## Performance

- ColNetraEmbed achieves state-of-the-art results on visual document retrieval benchmarks. See our [paper](https://arxiv.org/abs/2512.03514) for detailed evaluation metrics.
+ ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks. Evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and ViDoRe v2.
+
+ ### Benchmark Results
+
+ **Nayana-IR Cross-Lingual**
+
+ | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
+ |-------|:------:|:---------:|:------:|:------:|
+ | **ColNetraEmbed** | **0.637** | **0.700** | **0.610** | **0.610** |
+ | Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
+ | ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
+ | ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
+ | GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
+ | ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
+ | ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |
+
+ **Nayana-IR Monolingual**
+
+ | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
+ |-------|:------:|:---------:|:------:|:------:|
+ | **ColNetraEmbed** | **0.670** | **0.764** | **0.645** | **0.686** |
+ | ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
+ | ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
+ | GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
+ | ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
+ | ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |
+
+ **ViDoRe v2**
+
+ | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
+ |-------|:------:|:---------:|:------:|:------:|
+ | ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
+ | Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
+ | GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
+ | ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
+ | **ColNetraEmbed** | **0.551** | **0.664** | **0.445** | **0.445** |
+ | ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
+ | ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |
+
+ **Key Results:**
+ - 🏆 **Strong multilingual performance** with ColBERT-style late interaction
+ - 📈 **124% improvement** over ColPali-v1.3 on cross-lingual tasks
+ - 🌍 Supports **22 languages** across diverse script families
+ - 🔍 **Fine-grained matching** through token-level MaxSim scoring
+
+ **Comparison: Multi-vector vs Single-vector**
+ - ColNetraEmbed (multi-vector): More interpretable with token-level attribution
+ - NetraEmbed (single-vector): Higher accuracy (0.716 vs 0.637) and 250x more efficient storage
+
+ See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and architectural comparisons.

  ## Citation

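Note on the scoring mentioned above: the new Performance section credits the fine-grained matching to token-level MaxSim scoring. The sketch below is a minimal, self-contained illustration of that ColBERT-style late-interaction rule; the tensor sizes are made up, and `colpali_engine`'s `score_multi_vector` is the batched equivalent used in the README's quick-start.

```python
import torch

def maxsim_score(query_vectors: torch.Tensor, page_vectors: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: each query token is matched to its most
    similar page token (MaxSim), and those maxima are summed over query tokens."""
    # (num_query_tokens, num_page_tokens) similarity matrix from dot products
    # on L2-normalised token embeddings.
    similarity = query_vectors @ page_vectors.T
    return similarity.max(dim=1).values.sum()

# Toy sizes only; the model's real token counts and embedding width differ.
query_vectors = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
page_vectors = torch.nn.functional.normalize(torch.randn(700, 128), dim=-1)
print(maxsim_score(query_vectors, page_vectors))
```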
@@ -124,4 +182,6 @@ This model is released under the same license as the base Gemma3 model.

  ## Acknowledgments

- Built on top of the ColPali framework and Gemma3 architecture.
+ This work benefited from compute credits for training, inference, and evaluation provided by [Modal](https://modal.com), acknowledged as a compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We appreciate Meta for continued support of our research efforts at [CognitiveLab](https://www.cognitivelab.in).
+
+ Built on top of the ColPali framework and Gemma3 architecture.