Pulling Back the Curtain on VLM Attention

Understanding LLM Attention and Vision Encoder Attention

Conversation
Computer Vision
LLMs
Transformers
An LLM narrates our conversation.
Author

Salman Naqvi

Published

Monday, 04 August 2025

Foreword

In this conversation, I attempted to understand how this notebook, by user zjysteven, visualizes the attention a VLM places on an image. My main confusion stemmed from the two different attentions that are calculated in the notebook.

The Confusion of Two Attentions

We observed the code first calculating an llm_attn_matrix. At the time, we thought: “Great, this must cover all attention, including both text and images.” But then, the code computed a separate vis_attn_matrix.

“Why recalculate visual attention?” you asked. “Didn’t we already calculate attention for all tokens, including the image tokens?” This was a brilliant question—and the key to understanding the entire process. We gradually realized these two attention types originate from different parts of the model and address fundamentally distinct questions.

The Language Model, the Clever “Manager”

We first grasped the role of the attention inside the language model (LLM). It acts like a strategic manager. When the model generates a word like “cat,” this “manager” examines the question and the image, then decides: “To output ‘cat,’ the image token representing ‘ears’ is most critical, while ‘whiskers’ is moderately important.” Thus, the LLM’s attention answers the “which” question: it assigns an importance score to each image token. But here was the catch: these scores are “coarse.” They reveal which large patches matter, but not what the model focuses on within each patch. Plotting them alone would color entire patches uniformly, yielding unusable visualizations.
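To make the “manager” concrete, here is a minimal sketch of how such coarse scores can be read out of a Hugging Face LLaVA-style model. It is not the notebook’s code: the helper name, the `step` argument, and the image-token positions are my own assumptions; only `generate(..., output_attentions=True, return_dict_in_generate=True)` and the shape of `outputs.attentions` come from the transformers API.

```python
import torch

# A hypothetical helper, not the notebook's code. It assumes `outputs` came from a
# Hugging Face LLaVA-style model via:
#   outputs = model.generate(**inputs, output_attentions=True,
#                            return_dict_in_generate=True, max_new_tokens=...)
# and that the image tokens occupy a known, contiguous span of the input sequence
# (576 tokens for LLaVA-1.5; `image_token_start` must be read off the processed prompt).

def llm_attention_over_image(outputs, step, image_token_start, num_image_tokens=576):
    """Importance the LLM assigns to each image token while generating the
    `step`-th new token, averaged over all layers and heads."""
    # outputs.attentions[step] holds one tensor per layer,
    # each shaped (batch, num_heads, query_len, key_len).
    per_layer = [layer[0, :, -1, :] for layer in outputs.attentions[step]]
    attn = torch.stack(per_layer).mean(dim=(0, 1))  # average layers and heads
    return attn[image_token_start : image_token_start + num_image_tokens]  # (576,)
```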

The Vision Encoder, the Diligent “Analyst”

So what does the attention inside the vision encoder (ViT) do? Initially, we mistakenly assumed it also knew the question. But we soon learned it works like a meticulous “analyst” that completes its job before the question ever arrives. Its task is to analyze the image in extreme detail, completely independent of the question. It examines every small image patch and generates a “map” for each, revealing how that patch relates to all others. The ViT’s attention thus answers the “what” question: it outputs detailed maps, but no importance scores. “But if it doesn’t know the question,” you asked, “how would it highlight relevant features? It could focus on irrelevant elements.” You were absolutely right. The ViT alone cannot determine what matters for the question; it merely catalogues all interesting structures in the image.
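And here is a minimal sketch of the “analyst” side, assuming the CLIP ViT-L/14-336 vision tower that LLaVA-1.5 uses (24 × 24 = 576 patches). Loading CLIP on its own is purely for illustration; the notebook reads these maps from the VLM’s own vision tower.

```python
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor

# Illustration only: LLaVA-1.5 uses this CLIP ViT-L/14-336 tower (24 x 24 = 576
# patches); the notebook works with the VLM's own vision tower instead.
name = "openai/clip-vit-large-patch14-336"
vision_tower = CLIPVisionModel.from_pretrained(name)
processor = CLIPImageProcessor.from_pretrained(name)

def vit_attention_maps(image, layer=-1):
    """One attention 'map' per patch: row i shows how patch i attends to every
    other patch, with no knowledge of any question."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        out = vision_tower(pixel_values, output_attentions=True)
    attn = out.attentions[layer][0].mean(dim=0)  # average heads -> (577, 577)
    return attn[1:, 1:]                          # drop the CLS token -> (576, 576)
```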

The Power of Collaboration

Finally, we understood how everything integrates. It’s a three-step process, like manager and analyst collaborating:

  1. The diligent analyst (ViT) first prepares a detailed image analysis report—containing a unique attention “map” for every patch.
  2. The clever manager (LLM) reads the question, consults this report, and selects which maps to use (and their weights).
  3. We, as external observers, then combine these selected maps according to the LLM’s weights to produce the final, clear attention heatmap.

The resulting visualization isn’t generated by any single component—it emerges from the symbiosis of both systems, harmonized by our interpretation.
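Put together, the combination step is little more than a weighted average. In this sketch, `scores` and `maps` are the outputs of the two hypothetical helpers above; they play the same roles as the notebook’s `llm_attn_matrix` and `vis_attn_matrix`, but they are not those exact variables.

```python
import torch

# `scores`: the LLM's importance per image token, shape (576,).
# `maps`:   the ViT's per-patch attention, shape (576, 576).
def combined_heatmap(scores, maps, grid_size=24):
    scores = scores / scores.sum()                # normalize the "manager's" weights
    heatmap = scores @ maps                       # weighted average of the 576 maps
    return heatmap.reshape(grid_size, grid_size)  # coarse 24 x 24 grid
```

Row i of `maps` describes where patch i looks, so weighting the rows by the LLM’s scores emphasizes the structure around the patches the “manager” cared about.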

End of Narration

If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do post them in the comments section below!

Frequently Asked Questions (FAQ)

Q: Why do we need two attention mechanisms (LLM’s and ViT’s)? A: Because they answer two distinct but necessary questions. LLM attention is “coarse,” identifying which patch matters for word generation. ViT attention is “fine,” providing spatial “maps” for each patch to reveal internal image structures.

Q: If the vision encoder doesn’t know my question, isn’t its attention potentially irrelevant? A: Correct. The ViT analyzes the image objectively, cataloguing all salient structures without regard to the question. The LLM then acts as a filter, selecting only the ViT outputs relevant to the question.

Q: Why not use only the LLM’s attention for plotting? Doesn’t it already know what’s important? A: LLM attention assigns scores to entire patches but lacks spatial resolution. Plotting it alone would color entire patches uniformly, producing low-detail visualizations. ViT maps provide the necessary granularity.

Q: So the vision encoder generates one attention map per image patch? How do we display so many maps? A: Exactly. For a 576-patch image, the ViT produces 576 maps. The final heatmap is a weighted average of all 576 maps, with weights derived from the LLM’s importance scores for each patch.
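As a rough sketch of that display step (again assuming the hypothetical helpers above rather than the notebook’s code), the combined 24 × 24 grid is upsampled back to image resolution and overlaid on the picture:

```python
import torch.nn.functional as F
import matplotlib.pyplot as plt

# `heatmap` is assumed to be the 24 x 24 tensor from the combination sketch above,
# and `image` a PIL image at its original resolution.
def show_overlay(image, heatmap, alpha=0.5):
    up = F.interpolate(heatmap[None, None].float(), size=image.size[::-1],  # (H, W)
                       mode="bilinear", align_corners=False)[0, 0]
    plt.imshow(image)
    plt.imshow(up.detach().cpu().numpy(), alpha=alpha, cmap="jet")  # heat overlay
    plt.axis("off")
    plt.show()
```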
