Prof. Freda Shi
February 13th, 2026 – 12:00-1:00pm, EC4-2101A
Do language models acquire symbol grounding in Harnad’s (1990) sense, that is, non-arbitrary, causally meaningful links between symbols and referents? To answer this question, we first introduce a controlled evaluation framework that assigns each concept two distinct tokens: one appearing in non-verbal scene descriptions and another in linguistic utterances. This “pseudo-multimodal” setup prevents trivial identity mappings and enables direct tests of grounding. Behaviorally, we find that models trained from scratch show a consistent surprisal reduction when the linguistic form is preceded by its matching scene token, relative to matched controls, an effect that co-occurrence statistics cannot explain. Mechanistically, saliency-flow and tuned-lens analyses converge on the finding that grounding concentrates in middle-layer computations and is implemented through a gather-and-aggregate (G&A) mechanism: earlier heads gather information from scene tokens, while later heads aggregate it to support the prediction of linguistic forms. The phenomenon replicates in visual dialogue data and across architectures with explicit memory (including Transformers and state-space models), but not in unidirectional LSTMs. Together, these results provide behavioral and mechanistic evidence that symbol grounding can emerge in autoregressive LMs, while delineating the architectural conditions under which it arises. Time permitting, I will also introduce a recent toolkit developed in our lab for analyzing general-purpose vision-language models.
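To make the behavioral test concrete, below is a minimal sketch of the matched-versus-control surprisal comparison, assuming a HuggingFace-style causal LM. The checkpoint ("gpt2"), the scene-token strings, and the example sentence are placeholders, not the study's actual vocabulary or from-scratch models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: the study trains models from scratch on data in
# which each concept has two distinct tokens, e.g. a scene token
# "<scene:dog>" and the linguistic form "dog". All names here are
# illustrative only.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_surprisal(prefix: str, target: str) -> float:
    """Surprisal (in nats) of `target` as the token that follows `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    target_id = tokenizer(target, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(prefix_ids).logits[0, -1]  # next-token logits
    return -torch.log_softmax(logits, dim=-1)[target_id].item()

# Grounding prediction: the linguistic form should be less surprising after
# its matching scene token than after a mismatched control scene token.
matched = next_token_surprisal("<scene:dog> the child saw a", " dog")
control = next_token_surprisal("<scene:cat> the child saw a", " dog")
print(f"surprisal reduction: {control - matched:.3f} nats")
```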
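Continuing the same sketch, one crude way to probe the “gather” side of the G&A mechanism is to read out how much raw attention each head places on the scene token from the position that predicts the linguistic form. The abstract's actual evidence comes from saliency-flow and tuned-lens analyses; this per-head attention readout is only a simplified stand-in, and the scene-token position below is hypothetical.

```python
# Reuses the tokenizer and model loaded above.
inputs = tokenizer("<scene:dog> the child saw a", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

scene_positions = [0]                    # hypothetical scene-token span
query = inputs.input_ids.shape[1] - 1    # last position, which predicts the form
for layer, attn in enumerate(out.attentions):  # each: (1, heads, seq, seq)
    mass = attn[0, :, query, scene_positions].sum(-1)  # per-head mass on scene span
    print(f"layer {layer:2d}: max head attention on scene span = {mass.max().item():.3f}")
```

If grounding works as described, heads in early-to-middle layers should show concentrated attention mass on the scene span, with later layers aggregating that information into the next-token prediction.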
