The journey of an image in a VLM

LLM
VLM
My understanding of how images are processed in VLMs.
Published

August 11, 2025

The VLM has three pieces: the vision encoder, the pooler/projector, and the LLM.

  1. The input image has size 644x476 pixels
  2. The vision encoder splits the image into patches of 14x14 pixels (the kernel size of its patch embedding)
  3. The total number of image patches is therefore (644/14)·(476/14)=46·34=1564
  4. These patches are processed into feature vectors by the vision encoder
  5. The pooling/projection layer reduces the number of image feature vectors by a factor of 2 along each side of the patch grid
     1. The result is (46/2)·(34/2)=23·17=391
     2. This matches the 391 image tokens created by the tokenizer
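
The arithmetic in the steps above can be checked with a few lines of Python. This is only a minimal sketch: the function name `image_token_count` and its parameters are hypothetical, and the reduction factor of 2 per side is inferred from the (46/2)·(34/2) step rather than taken from any particular model's code. It also assumes the image sides are exact multiples of the patch size, as they are in this example.

```python
def image_token_count(width, height, patch_size=14, merge=2):
    """Return (num_patches, num_image_tokens) for one image.

    Hypothetical helper: `merge` is the assumed per-side reduction
    applied by the pooler/projector (2 in the example above).
    """
    if width % patch_size or height % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    # Patch grid produced by the vision encoder.
    cols = width // patch_size
    rows = height // patch_size
    if cols % merge or rows % merge:
        raise ValueError("patch grid must be divisible by merge")

    num_patches = cols * rows
    # Grid after the pooler/projector shrinks each side by `merge`.
    num_tokens = (cols // merge) * (rows // merge)
    return num_patches, num_tokens


patches, tokens = image_token_count(644, 476)
print(patches, tokens)  # 1564 391
```

Real preprocessing pipelines usually resize or pad the image first so that both divisibility checks hold; this sketch raises an error instead of guessing how a given model does that.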