Vintern-3B-beta 🇻🇳❄️ - The LLaVA 🌋 Challenger
What's new in Vintern-3B-beta!
- We successfully reproduced the training process of InternVL from scratch.
- The model is the result of integrating Qwen/Qwen2.5-3B-Instruct and InternViT-300M-448px through an MLP projection layer (see the sketch after this list).
- Trained with more than 10 million Vietnamese QnA pairs and descriptions, plus 10% English and Chinese data from OpenGVLab/InternVL-Chat-V1-2-SFT-Data.
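For readers curious what the MLP integration looks like, below is a minimal sketch of an InternVL-style projector. The hidden sizes (1024 for InternViT-300M, 2048 for Qwen2.5-3B) and the 0.5 pixel-shuffle downsample factor are our assumptions, not details confirmed by this card.

```python
import torch.nn as nn

# Hedged sketch: all dimensions and the pixel-shuffle factor are assumptions.
vit_hidden = 1024   # InternViT-300M hidden size (assumed)
llm_hidden = 2048   # Qwen2.5-3B hidden size (assumed)
downsample = 0.5    # pixel-shuffle factor typical of InternVL-style models (assumed)
scale = int(1 / downsample) ** 2  # 4 neighboring patches stacked into one token

projector = nn.Sequential(
    nn.LayerNorm(vit_hidden * scale),
    nn.Linear(vit_hidden * scale, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)
```

The projected visual tokens are then interleaved with the text embeddings consumed by the language model.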
Model Details
| Model Name | Vision Part | Language Part |
|---|---|---|
| Vintern-3B-beta | InternViT-300M-448px | Qwen2.5-3B-Instruct |
Bytedance/MTVQA Benchmark
We surpass GPT-4o and approach Gemini 1.5 Pro on the Vietnamese (VI) split of the MTVQA dataset. The benchmark results below are taken from the open_vlm_leaderboard.
| Rank | Method | Param (B) | Language Model | Vision Model | VI |
|---|---|---|---|---|---|
| 1 | Gemini-1.5-Pro | - | - | - | 41.3 |
| 2 | Vintern-3B-beta | 3 | Qwen2.5-3B-Instruct | InternViT-300M | 41.289 |
| 3 | GPT-4o (0513, detail-h...) | - | - | - | 39.6 |
| 4 | GPT-4o (0806, detail-h...) | - | - | - | 38.9 |
| 5 | Gemini-1.5-Flash | - | - | - | 38.9 |
| 6 | Qwen-VL-Max-0809 | 72 | Qwen2-72B | ViT-600M | 36.9 |
| 7 | Qwen-VL-Plus-0809 | - | - | - | 27.8 |
| 8 | InternVL2-Llama3-76B | 76 | Llama-3-70B-Instruct | InternViT-6B | 26.7 |
| 9 | GLM-4v-9B | 9 | GLM-4-9B | EVA-02-5B | 26.6 |
| 10 | GPT-4o (0513, detail-lo...) | - | - | - | 26.1 |
| 11 | InternVL2-40B | 40 | Nous-Hermes-2-Yi-34B | InternViT-6B | 21.2 |
| 12 | Pixtral-12B | 13 | Nemo-12B | ViT-400M | 19.7 |
| 13 | Step-1.5V | - | Step-1.5 | stepencoder | 18.4 |
Zalo VMLU Benchmark
Vintern-3B-beta achieved a score of 54.81 on the Zalo VMLU Benchmark. Each multiple-choice question was posed to the model with a prompt like the following:

```python
generation_config = dict(max_new_tokens=64, do_sample=False, num_beams=1, repetition_penalty=1.5)

# The prompt translates to: "You are an AI assistant that solves multiple-choice
# questions very accurately. You know the single best answer for certain. Only
# output the letter preceding the correct answer to the following question:
# Which of the following bodies are judicial bodies? Choices:
# A. The People's Procuracy  B. The People's Court  C. The Government
# D. Both A and B. The best answer is:"
question = "Bạn là trợ lý AI giải trắc nghiệm rất chính xác. Bạn biết chắc chắn đáp án đúng nhất. Chỉ đưa ra chữ cái đứng trước câu trả lời đúng của câu hỏi trắc nghiệm sau: Các cơ quan nào sau đây là cơ quan tư pháp? Lựa Chọn:\nA. Viện kiểm sát nhân dân\nB. Tòa án nhân dân\nC. Chính phủ\nD. Cả A và B\nCâu trả lời đúng nhất là:"

model.chat(tokenizer, None, question, generation_config)
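For scoring, the predicted letter has to be pulled out of the model's free-form reply. A minimal sketch of such a parser (our own helper, not part of the model card):

```python
import re

def extract_choice(response: str):
    """Return the first standalone answer letter (A-D) in the reply, or None."""
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else None

# e.g. extract_choice("D. Cả A và B") -> "D"
```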
OpenCompass Benchmark
Vintern-3B-beta is now on the open_vlm_leaderboard; you can visit it to view more detailed evaluations.
The current results are at a reasonably good level, and we are expanding the training set in English and other languages to close the gap with models in a comparable parameter range.
"The table is referenced from the repo Qwen/Qwen2-VL-2B-Instruct."
| Benchmark | InternVL2-2B | MiniCPM-V 2.0 | Qwen2-VL-2B | Vintern-3B-beta |
|---|---|---|---|---|
| MMMU (val) | 36.3 | 38.2 | 41.1 | 43.55 |
| DocVQA (test) | 86.9 | - | 90.1 | 80.47 |
| InfoVQA (test) | 58.9 | - | 65.5 | 48.28 |
| ChartQA (test) | 76.2 | - | 73.5 | 68.32 |
| TextVQA (val) | 73.4 | - | 79.7 | 67.09 |
| OCRBench | 781 | 605 | 794 | 619 |
| MTVQA | 10.9 | 8.8 | 20.0 | 23.58 |
| Vi-MTVQA | 9.3 | 8.4 | - | 41.29 |
| RealWorldQA | 57.3 | 55.8 | 62.9 | 57.9 |
| MME (sum) | 1876.8 | 1808.6 | 1872.0 | 1772.9 |
| MMBench-EN (test) | 73.2 | 69.1 | 74.9 | 70.62 |
| MMStar | 49.8 | 39.1 | 48.0 | 47.6 |
| HallBench (avg) | 38.0 | 36.1 | 41.7 | 43.22 |
| MathVista (testmini) | 46.0 | 39.8 | 43.0 | 43.9 |
SEED-Bench-2-Plus CVPR2024
We achieved Rank 1 among models under 4 billion parameters on the SEED-Bench-2-Plus benchmark, published by Tencent at CVPR 2024.
Examples
User: <image>
Describe the image in detail.
Assistant: The image shows the interface of an online food-ordering app. The interface is divided into several sections:
- The first section displays the notification "Sẽ xong lúc 10:50 SA" ("Will be done at 10:50 AM") and the line "Bếp đang thực hiện món." ("The kitchen is preparing your order."). Next to it is an icon of a kitchen at work with a green progress bar.
- Next is the line "Vui lòng đến lấy món trước 12:50" ("Please come pick up your order before 12:50").
- The next section shows information about the coffee shop "Wego Coffee - Mapletree Business Centre", together with the text "#GF-29".
- The line "Đừng quên kiểm tra lại để chắc rằng bạn nhận đúng món nhé!" ("Don't forget to double-check that you received the right items!") is displayed in a beige rectangular box.
- The last section shows the order details, including the coffee shop's logo, the text "Wego Coffee - Mapletree Business Centre", the label "Tổng cộng" ("Total"), the order value (70.000đ), the amount paid (120.000đ), and the text "Merchants has not started a chat".
- Finally, there are a "Xem tóm tắt đơn hàng" ("View order summary") button and a green "Đã lấy món" ("Order picked up") button.
User: <image>
List the entire poem in the image.
Assistant: Đi khắp thế gian không ai tốt bằng mẹ / Gánh nặng cuộc đời không ai khổ bằng cha ("Travel the whole world, no one is as good as mother / Among life's burdens, no one toils like father")
User: <image>
List the entire poem in the image.
Assistant: The poem reads:
- Mẹ như một ngọn hải đăng ("Mother is like a lighthouse")
- Như ông mặt trời, như ông mặt trăng ("Like the sun, like the moon")
- Ngày ngày vất vả, tảo tần ("Toiling and laboring day after day")
- Chăm lo con cái, làm việc siêng năng. ("Caring for her children, working diligently.")
User: <image>
Describe the image in detail.
Assistant: The photo shows a blackboard with mathematical formulas written in white chalk. In the top-left corner, the text "20/19/SGK" is written in pale yellow. Next comes part a) with two equations:
3x + y = 3
2x - y = 7
Below these two equations is an arrow pointing to the right and two new equations:
5x = 10
3x + y = 3
3.2 + y = 3
Then comes the line "x = 2"
y = -3
Finally, the line "Vậy hpt đã cho có no! (x; y) = (2; -3)" ("So the given system of equations has the solution (x; y) = (2; -3)").
User: <image>
Describe the image in detail.
Assistant: The image illustrates image processing with a convolutional neural network (CNN). The image is divided into 3 main parts:
1. **First part:** Shows a picture of a white car with the caption "input image" underneath.
2. **Second part:** Shows 6 blocks, each representing a layer of the CNN, ordered from left to right:
* Convolutional layer
* Pooling layer
* Convolutional layer
* Pooling layer
* Fully-connected layer
* Predicted class
3. **Third part:** Shows an image with the caption "car" underneath.
A black arrow connects the blocks, representing the flow of data between the layers.
Quickstart
Below is a code snippet showing how to load the tokenizer and model and how to generate content. To run inference with the model, follow the steps outlined in our Colab inference notebook.
```python
import numpy as np
import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
transform = T.Compose([
T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
T.ToTensor(),
T.Normalize(mean=MEAN, std=STD)
])
return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
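    # Scan all candidate (cols, rows) grids and keep the one whose aspect
    # ratio is closest to the input image; on ties, prefer the finer grid
    # when the image is large enough to fill more than half of it.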
best_ratio_diff = float('inf')
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
if ratio_diff < best_ratio_diff:
best_ratio_diff = ratio_diff
best_ratio = ratio
elif ratio_diff == best_ratio_diff:
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
best_ratio = ratio
return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
i * j <= max_num and i * j >= min_num)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio, target_ratios, orig_width, orig_height, image_size)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
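        # (left, upper, right, lower) pixel box of the i-th tile, row-major.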
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
def load_image(image_file, input_size=448, max_num=12):
image = Image.open(image_file).convert('RGB')
transform = build_transform(input_size=input_size)
images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
pixel_values = [transform(image) for image in images]
pixel_values = torch.stack(pixel_values)
return pixel_values
model = AutoModel.from_pretrained(
"5CD-AI/Vintern-3B-beta",
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-3B-beta", trust_remote_code=True, use_fast=False)
test_image = 'test-image.jpg'
pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=512, do_sample=False, num_beams=3, repetition_penalty=3.5)
question = '<image>\nMô tả hình ảnh một cách chi tiết.'  # "Describe the image in detail."
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# question = "Câu hỏi khác ......"  # "Another question ......"
# response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
# print(f'User: {question}\nAssistant: {response}')
```
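As a quick sanity check of the dynamic tiling above (our own illustration, not from the card), you can count the crops produced for a 16:9 image:

```python
from PIL import Image

# A 1600x900 image: both (2, 1) and (4, 2) grids have aspect ratio 2.0, the
# closest to 16:9 among grids with at most 12 tiles; the tie-break prefers the
# finer 4x2 grid because the image area exceeds half the grid's pixel capacity.
img = Image.new('RGB', (1600, 900))
tiles = dynamic_preprocess(img, image_size=448, use_thumbnail=True, max_num=12)
print(len(tiles))  # 8 tiles + 1 thumbnail = 9
```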
Bias, Risks, and Limitations
The model may reflect biases present in its training data. Users should be aware of these potential biases when using the model.
Citation
```bibtex
@misc{doan2024vintern1befficientmultimodallarge,
title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese},
author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
year={2024},
eprint={2408.12480},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2408.12480},
}
```