AdithyaSK committed (verified)
Commit 64a5609 · Parent: 7d37ff2

Update README.md

Files changed (1): README.md (+65 -5)
README.md CHANGED
@@ -12,9 +12,18 @@ pipeline_tag: visual-document-retrieval
  base_model:
  - google/gemma-3-4b-it
  ---
-
  # ColNetraEmbed

+ ![Group 54 (1)](https://cdn-uploads.huggingface.co/production/uploads/6442d975ad54813badc1ddf7/-fYMikXhSuqRqm-UIdulK.png)
+
+
+ [![Paper](https://img.shields.io/badge/arXiv-2512.03514-b31b1b.svg)](https://arxiv.org/abs/2512.03514)
+ [![GitHub](https://img.shields.io/badge/GitHub-colpali-181717?logo=github)](https://github.com/adithya-s-k/colpali)
+ [![Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-yellow)](https://huggingface.co/Cognitive-Lab/ColNetraEmbed)
+ [![Blog](https://img.shields.io/badge/Blog-CognitiveLab-blue)](https://www.cognitivelab.in/blog/introducing-netraembed)
+ [![Demo](https://img.shields.io/badge/Demo-Try%20it%20out-green)](https://cloud.cognitivelab.in)
+
+
  **ColNetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, powered by the Gemma3 backbone and using Colbert-style multi-vector representations.

  ## Model Description
@@ -48,7 +57,7 @@ from colpali_engine.models import ColGemma3, ColGemmaProcessor3
  model_name = "Cognitive-Lab/ColNetraEmbed"
  model = ColGemma3.from_pretrained(
      model_name,
-     dtype=torch.bfloat16,
+     torch_dtype=torch.bfloat16,
      device_map="cuda",
  )
  processor = ColGemmaProcessor3.from_pretrained(model_name)
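The hunk above only renames the `dtype` keyword to `torch_dtype` in the README's loading snippet. For readers skimming the diff, here is a minimal sketch of how the loaded model and processor are typically used downstream; it assumes `ColGemmaProcessor3` follows the standard `colpali_engine` processor interface (`process_images`, `process_queries`, `score_multi_vector`), and the page images and queries are placeholders rather than lines from the README.

```python
import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Illustrative continuation of the snippet changed above (interface assumed to
# match the upstream colpali_engine processors; file paths are placeholders).
model = ColGemma3.from_pretrained(
    "Cognitive-Lab/ColNetraEmbed",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
processor = ColGemmaProcessor3.from_pretrained("Cognitive-Lab/ColNetraEmbed")

images = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["What was the quarterly revenue?", "Which languages are supported?"]

with torch.no_grad():
    # Each page and each query becomes a bag of token-level vectors (multi-vector).
    image_batch = processor.process_images(images).to(model.device)
    query_batch = processor.process_queries(queries).to(model.device)
    image_embeddings = model(**image_batch)
    query_embeddings = model(**query_batch)

# Late-interaction (MaxSim) scores: one score per (query, page) pair.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores.shape)  # (len(queries), len(images))
```

The scoring loop referenced in the next hunk's header (`for i, query in enumerate(queries):`) would then iterate over `scores` to pick the best-matching page per query.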
@@ -94,7 +103,7 @@ for i, query in enumerate(queries):

  ## Model Details

- - **Base Model:** Gemma3-2B
+ - **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
  - **Vision Encoder:** SigLIP
  - **Training Data:** Multilingual document datasets
  - **Embedding Strategy:** Multi-vector (Late Interaction)
@@ -102,7 +111,56 @@ for i, query in enumerate(queries):

  ## Performance

- ColNetraEmbed achieves state-of-the-art results on visual document retrieval benchmarks. See our [paper](https://arxiv.org/abs/2512.03514) for detailed evaluation metrics.
+ ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks. Evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and ViDoRe v2.
+
+ ### Benchmark Results
+
+ **Nayana-IR Cross-Lingual**
+
+ | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
+ |-------|:------:|:---------:|:------:|:------:|
+ | **ColNetraEmbed** | **0.637** | **0.700** | **0.610** | **0.610** |
+ | Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
+ | ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
+ | ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
+ | GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
+ | ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
+ | ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |
+
+ **Nayana-IR Monolingual**
+
+ | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
+ |-------|:------:|:---------:|:------:|:------:|
+ | **ColNetraEmbed** | **0.670** | **0.764** | **0.645** | **0.686** |
+ | ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
+ | ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
+ | GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
+ | ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
+ | ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |
+
+ **ViDoRe v2**
+
+ | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
+ |-------|:------:|:---------:|:------:|:------:|
+ | ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
+ | Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
+ | GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
+ | ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
+ | **ColNetraEmbed** | **0.551** | **0.664** | **0.445** | **0.445** |
+ | ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
+ | ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |
+
+ **Key Results:**
+ - 🏆 **Strong multilingual performance** with ColBERT-style late interaction
+ - 📈 **124% improvement** over ColPali-v1.3 on cross-lingual tasks
+ - 🌍 Supports **22 languages** across diverse script families
+ - 🔍 **Fine-grained matching** through token-level MaxSim scoring
+
+ **Comparison: Multi-vector vs Single-vector**
+ - ColNetraEmbed (multi-vector): More interpretable with token-level attribution
+ - NetraEmbed (single-vector): Higher accuracy (0.716 vs 0.637) and 250x more efficient storage
+
+ See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and architectural comparisons.

  ## Citation

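Note on the scoring mentioned above: the new Performance section credits the fine-grained matching to token-level MaxSim scoring. The sketch below is a minimal, self-contained illustration of that ColBERT-style late-interaction rule; the tensor sizes are made up, and `colpali_engine`'s `score_multi_vector` is the batched equivalent used in the README's quick-start.

```python
import torch

def maxsim_score(query_vectors: torch.Tensor, page_vectors: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: each query token is matched to its most
    similar page token (MaxSim), and those maxima are summed over query tokens."""
    # (num_query_tokens, num_page_tokens) similarity matrix from dot products
    # on L2-normalised token embeddings.
    similarity = query_vectors @ page_vectors.T
    return similarity.max(dim=1).values.sum()

# Toy sizes only; the model's real token counts and embedding width differ.
query_vectors = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
page_vectors = torch.nn.functional.normalize(torch.randn(700, 128), dim=-1)
print(maxsim_score(query_vectors, page_vectors))
```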
@@ -124,4 +182,6 @@ This model is released under the same license as the base Gemma3 model.

  ## Acknowledgments

- Built on top of the ColPali framework and Gemma3 architecture.
+ This work benefited from compute credits for training, inference, and evaluation provided by [Modal](https://modal.com), acknowledged as a compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We appreciate Meta for continued support of our research efforts at [CognitiveLab](https://www.cognitivelab.in).
+
+ Built on top of the ColPali framework and Gemma3 architecture.