Vintern-3B-beta 🇻🇳 ❄️ - The LLaVA 🌋 Challenger

What's new in Vintern-3B-beta!

Model Details

| Model Name | Vision Part | Language Part |
| --- | --- | --- |
| Vintern-3B-beta | InternViT-300M-448px | Qwen2.5-3B-Instruct |

Bytedance/MTVQA Benchmark

We surpassed GPT-4o and are approaching Gemini 1.5 Pro on the Vietnamese subset of the MTVQA dataset. The results below are taken from the MTVQA benchmark on open_vlm_leaderboard.

| Rank | Method | Param (B) | Language Model | Vision Model | VI |
| --- | --- | --- | --- | --- | --- |
| 1 | Gemini-1.5-Pro | | | | 41.3 |
| 2 | Vintern-3B-beta | 3 | Qwen2.5-3B-Instruct | InternViT-300M | 41.289 |
| 3 | GPT-4o (0513, detail-h...) | | | | 39.6 |
| 4 | GPT-4o (0806, detail-h...) | | | | 38.9 |
| 5 | Gemini-1.5-Flash | | | | 38.9 |
| 6 | Qwen-VL-Max-0809 | 72 | Qwen2-72B | ViT-600M | 36.9 |
| 7 | GPT-4o (0513, detail-lo...) | | | | 26.1 |
| 8 | Qwen-VL-Plus-0809 | | | | 27.8 |
| 9 | GLM-4v-9B | 9 | GLM-4-9B | EVA-02-5B | 26.6 |
| 10 | InternVL2-Llama3-76B | 76 | Llama-3-70B-Instruct | InternViT-6B | 26.7 |
| 11 | Step-1.5V | | Step-1.5 | stepencoder | 18.4 |
| 12 | InternVL2-40B | 40 | Nous-Hermes-2-Yi-34B | InternViT-6B | 21.2 |
| 13 | Pixtral-12B | 13 | Nemo-12B | ViT-400M | 19.7 |

Zalo VMLU Benchmark

Vintern-3B-beta achieved a score of 54.81 on the Zalo VMLU Benchmark.

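generation_config = dict(max_new_tokens=64, do_sample=False, num_beams=1, repetition_penalty=1.5)
# Prompt (English): "You are an AI assistant that answers multiple-choice questions very accurately. You know the correct answer for certain. Give only the letter preceding the correct answer to the following multiple-choice question: Which of the following are judicial bodies? Options: A. The People's Procuracy B. The People's Court C. The Government D. Both A and B. The most correct answer is:"
question = "Bạn là trợ lý AI giải trắc nghiệm rất chính xác. Bạn biết chắc chắn đáp án đúng nhất. Chỉ đưa ra chữ cái đứng trước câu trả lời đúng của câu hỏi trắc nghiệm sau: Các cơ quan nào sau đây là cơ quan tư pháp? Lựa Chọn:\nA. Viện kiểm sát nhân dân\nB. Tòa án nhân dân\nC. Chính phủ\nD. Cả A và B\nCâu trả lời đúng nhất là:"
# Text-only query: no image is passed, so pixel_values is None.
model.chat(tokenizer, None, question, generation_config)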
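
Because the prompt asks the model to return only the letter of the correct option, scoring a multiple-choice run comes down to pulling that letter out of each response and comparing it with the answer key. The sketch below shows one way this could be done. `extract_choice` and `accuracy` are hypothetical helpers, not part of the Vintern or VMLU tooling, the sample responses and answer key are made up for illustration, and the benchmark's official scoring may differ.

```python
import re

# Hypothetical helper pattern: a standalone option letter A-D.
_CHOICE_RE = re.compile(r"\b([A-D])\b")

def extract_choice(response: str):
    """Return the first standalone A-D letter in the response, or None if there is none."""
    match = _CHOICE_RE.search(response.strip())
    return match.group(1) if match else None

def accuracy(responses, answer_key):
    """Fraction of responses whose extracted letter matches the answer key."""
    if not responses:
        return 0.0
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answer_key))
    return correct / len(responses)

# Made-up responses and answer key, for illustration only.
sample_responses = ["D", "D. Cả A và B", "Đáp án: B", "Không rõ"]
sample_key = ["D", "D", "C", "A"]
print(accuracy(sample_responses, sample_key))
```

Here two of the four made-up responses match the key, so the script prints 0.5. Matching only a standalone capital letter keeps words such as "Cả" from being counted as option C.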

OpenCompass Benchmark

Vintern-3B-beta is now on open_vlm_leaderboard. You can visit the leaderboard to view more detailed evaluations.

The current results are competitive, and we are expanding the training set in English and other languages to close the gap with models in a comparable parameter range.

The table below is referenced from the Qwen/Qwen2-VL-2B-Instruct repository.

| Benchmark | InternVL2-2B | MiniCPM-V 2.0 | Qwen2-VL-2B | Vintern-3B-beta |
| --- | --- | --- | --- | --- |
| MMMU (val) | 36.3 | 38.2 | 41.1 | 43.55 |
| DocVQA (test) | 86.9 | - | 90.1 | 80.47 |
| InfoVQA (test) | 58.9 | - | 65.5 | 48.28 |
| ChartQA (test) | 76.2 | - | 73.5 | 68.32 |
| TextVQA (val) | 73.4 | - | 79.7 | 67.09 |
| OCRBench | 781 | 605 | 794 | 619 |
| MTVQA | 10.9 | 8.8 | 20.0 | 23.58 |
| Vi-MTVQA | 9.3 | 8.4 | - | 41.29 |
| RealWorldQA | 57.3 | 55.8 | 62.9 | 57.9 |
| MME (sum) | 1876.8 | 1808.6 | 1872.0 | 1772.9 |
| MMBench-EN (test) | 73.2 | 69.1 | 74.9 | 70.62 |
| MMStar | 49.8 | 39.1 | 48.0 | 47.6 |
| HallBench (avg) | 38.0 | 36.1 | 41.7 | 43.22 |
| MathVista (testmini) | 46.0 | 39.8 | 43.0 | 43.9 |

SEED-Bench-2-Plus CVPR2024

We achieved Rank 1 among models under 4 billion parameters on the SEED-Bench-2-Plus benchmark, recently published by Tencent at CVPR 2024.

Examples

The prompts and responses below are translated from Vietnamese.

User: <image>
Describe the image in detail.
Assistant: The image shows the interface of an online food-ordering app. The interface is divided into the following sections:
- The first section shows the notice "Will be ready at 10:50 AM" and the text "The kitchen is preparing your order." Next to it is an icon of a working kitchen with a green progress bar.
- Next is the text "Please pick up your order before 12:50".
- The next section shows information about the coffee shop "Wego Coffee - Mapletree Business Centre", along with the text "#GF-29".
- The text "Don't forget to double-check that you received the right order!" is displayed in a beige rectangular box.
- The last section shows the order information, including the coffee shop's logo, the text "Wego Coffee - Mapletree Business Centre", the text "Total", the order value (70.000đ), the payment value (120.000đ), and the text "Merchants has not started a chat".
- Finally, there is a "View order summary" button and a green "Order picked up" button.

User: <image>
List the entire poem shown in the image.
Assistant: Travel the whole world and no one is as good as Mother. Of all life's heavy burdens, no one has it as hard as Father.

User: <image>
List the entire poem shown in the image.
Assistant: The poem reads:
- Mother is like a lighthouse
- Like the sun, like the moon
- Toiling day after day, working tirelessly
- Caring for her children, working diligently.

User: <image>
Describe the image in detail.
Assistant: The photo shows a blackboard with mathematical formulas written in white chalk. In the top-left corner, the text "20/19/SGK" is written in pale yellow. Next is part a) with two equations:
3x + y = 3
2x - y = 7
Below these two equations is an arrow pointing to the right and two new equations:
5x = 10
3x + y = 3
3.2 + y = 3
Then comes the text "x = 2"
y = -3
Finally, there is the text "So the given system of equations has the solution (x; y) = (2; -3)"

User: <image>
Describe the image in detail.
Assistant: The image illustrates how an image is processed by an artificial neural network (CNN). The image is divided into 3 main parts:

1. **First part:** Shows a picture of a white car with the text "input image" below it.
2. **Second part:** Shows 6 square blocks, each representing a layer in the CNN. The layers are numbered from left to right:
    * Convolutional layer
    * Pooling layer
    * Convolutional layer
    * Pooling layer
    * Fully-connected layer
    * Predicted class
3. **Third part:** Shows a picture with the text "car" below it.

A black arrow connects the blocks to one another, showing how data flows between the layers.

Quickstart

The code snippet below shows how to load the tokenizer and model and how to generate content. To run inference with the model, follow the steps outlined in our Colab inference notebook.

import numpy as np
import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    # calculate the aspect ratio of the original image
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (columns, rows) whose tile count lies in [min_num, max_num]
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    # sort the grids by tile count, smallest first
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model = AutoModel.from_pretrained(
    "5CD-AI/Vintern-3B-beta",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-3B-beta", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'

pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=512, do_sample=False, num_beams=3, repetition_penalty=3.5)

# Prompt (English): "Describe the image in detail."
question = '<image>\nMô tả hình ảnh một cách chi tiết.'

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

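# Follow-up turn: pass the returned history back in to continue the conversation.
# ("Câu hỏi khác" means "Another question".)
#question = "Câu hỏi khác ......"
#response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
#print(f'User: {question}\nAssistant: {response}')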
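
To make the tiling step in dynamic_preprocess more concrete, the sketch below reproduces only the grid-selection arithmetic in plain Python, without PIL or torch. The helper names `candidate_grids`, `choose_grid`, and `tile_count` are hypothetical and introduced just for this illustration; the logic is a simplified mirror of find_closest_aspect_ratio and the thumbnail rule above, not a replacement for them.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num], smallest first."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    return sorted(grids, key=lambda g: g[0] * g[1])

def choose_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Pick the grid whose aspect ratio is closest to the image's aspect ratio."""
    aspect_ratio = width / height
    area = width * height
    best_diff, best = float("inf"), (1, 1)
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # On a tie, move to the larger grid only if the image has enough pixels for it.
            best = (cols, rows)
    return best

def tile_count(width, height, max_num=12, use_thumbnail=True):
    """Number of 448x448 tiles produced, including the optional thumbnail."""
    cols, rows = choose_grid(width, height, max_num=max_num)
    tiles = cols * rows
    # A thumbnail of the whole image is appended only when there is more than one tile.
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles

print(choose_grid(1920, 1080, max_num=6), tile_count(1920, 1080, max_num=6))
print(choose_grid(1920, 1080, max_num=12), tile_count(1920, 1080, max_num=12))
```

For a 1920x1080 image this prints `(2, 1) 3` with max_num=6 and `(4, 2) 9` with max_num=12, so raising max_num increases the number of tiles the vision encoder has to process. Under these assumptions, the first dimension of the pixel_values tensor returned by load_image should equal this tile count.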

Bias, Risks, and Limitations

The model might have biases because it learned from data that could be biased. Users should be cautious of these possible biases when using the model.

Citation

@misc{doan2024vintern1befficientmultimodallarge,
      title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese}, 
      author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
      year={2024},
      eprint={2408.12480},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.12480}, 
}