The journey of an image in a VLM
My understanding of how images in VLMs are processed.
A VLM has three pieces: the vision encoder, the pooler/projector, and the LLM (sketched in code below).
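
To make the data flow concrete, here is a minimal runnable sketch of those three stages in PyTorch. Everything in it is an illustrative stand-in (a toy MLP for the encoder, made-up hidden sizes), not Qwen 2.5 VL's actual architecture:

```python
import torch
import torch.nn as nn

class ToyVLMPipeline(nn.Module):
    def __init__(self, d_vision=1280, d_llm=3584, merge=2):
        super().__init__()
        self.merge = merge
        # Stand-in vision encoder: turns each patch embedding into a feature vector
        self.vision_encoder = nn.Sequential(
            nn.Linear(d_vision, d_vision), nn.GELU(), nn.Linear(d_vision, d_vision)
        )
        # Pooler/projector: a group of merge*merge patch features -> one LLM-space token
        self.projector = nn.Linear(d_vision * merge * merge, d_llm)

    def forward(self, patch_embeds):  # [batch, n_patches, d_vision]
        feats = self.vision_encoder(patch_embeds)
        b, n, d = feats.shape
        # For brevity this groups 4 consecutive vectors of the flattened sequence;
        # a spatially correct 2x2 merge is sketched further below.
        grouped = feats.reshape(b, n // self.merge**2, d * self.merge**2)
        return self.projector(grouped)  # [batch, n_tokens, d_llm], fed to the LLM

patches = torch.randn(1, 1564, 1280)    # 46x34 patches from the 644x476 example
print(ToyVLMPipeline()(patches).shape)  # torch.Size([1, 391, 3584])
```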
- The input image has size 644x476
- The vision encoder's patch embedding has kernel size 14x14
- The total number of image patches is therefore (644/14)·(476/14)=46·34=1564
- These patches are processed into feature vectors by the vision encoder
- The pooling/projection layer reduces the number of image feature vectors
- Qwen 2.5 VL merges each 2x2 block of image feature vectors/patches into a single vector
- Pooling/projection packs more information into each token and shortens the sequence the LLM has to attend over
- The result is (46/2)·(34/2)=23·17=391 image tokens
- 391 matches the number of image tokens created by the tokenizer (reproduced in the snippets below)
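
The whole count fits in a few lines of Python, using only the numbers from the list above:

```python
image_w, image_h = 644, 476  # input image size in pixels
patch = 14                   # vision encoder kernel/patch size
merge = 2                    # Qwen 2.5 VL merges 2x2 patches into one token

grid_w, grid_h = image_w // patch, image_h // patch
print(grid_w, grid_h)                         # 46 34
print(grid_w * grid_h)                        # 1564 patches
print((grid_w // merge) * (grid_h // merge))  # 391 image tokens
```

As I understand it, Qwen 2.5 VL resizes inputs so that both sides are multiples of 28 (patch size times merge size), which is why all the divisions here come out even.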
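
The pipeline sketch near the top grouped four consecutive vectors of the flattened sequence, which is not a true spatial merge. Below is a spatially correct 2x2 grouping, where the four features concatenated into one vector really are neighbours in the patch grid; the reshape pattern is my own illustration, not Qwen's actual implementation:

```python
import torch

def merge_2x2(feats: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Concatenate each 2x2 block of patch features into one vector.

    feats:   [grid_h * grid_w, d], patches in row-major (row by row) order
    returns: [(grid_h // 2) * (grid_w // 2), 4 * d]
    """
    d = feats.shape[-1]
    x = feats.reshape(grid_h // 2, 2, grid_w // 2, 2, d)  # split rows and cols into 2-blocks
    x = x.permute(0, 2, 1, 3, 4)                          # [H/2, W/2, 2, 2, d]
    return x.reshape(-1, 4 * d)                           # one vector per 2x2 block

feats = torch.randn(34 * 46, 1280)                   # the 46x34 grid from the 644x476 image
print(merge_2x2(feats, grid_h=34, grid_w=46).shape)  # torch.Size([391, 5120])
```

A projector (e.g. an MLP from 4·d to the LLM's hidden size, as in the pipeline sketch) then maps each merged vector into the LLM's embedding space.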