Foundations in Multimodal Mechanistic Interpretability (William Rudman, Brown University)
Abstract: Mechanistic interpretability has been instrumental in understanding Large Language Models, yet it remains underexplored in multimodal models, largely due to a lack of effective image-corruption methods needed for causal analysis. The first part of this talk introduces NOTICE, a novel corruption scheme for multimodal large language models (MLLMs) that enables causal mediation analysis in this setting.
Next, we examine the reasoning capabilities of MLLMs and find that they are shape-blind: vision encoders in MLLMs embed geometrically dissimilar objects into the same regions of their representation space. We construct a side-counting dataset of abstract shapes and show that current MLLMs achieve near-zero accuracy on a task that is trivial for humans.
Finally, we present ongoing work on VisualCounterfact, a dataset designed to investigate the relationship between counterfactual visual inputs and world knowledge. VisualCounterfact consists of tuples that alter specific visual properties (color, size, and texture) of common objects. For instance, given (banana, color, yellow), we create a counterfactual image (banana, color, purple) by modifying the object's pixels. Using VisualCounterfact, we locate a mechanism that reliably controls whether a model answers with the counterfactual property present in the image or retrieves the world-knowledge answer from its weights.
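As a rough illustration, the sketch below shows one way such a tuple could be represented in code. The field names, file path, and prompt template are assumptions made for exposition, not the actual VisualCounterfact schema or pipeline.

```python
# Illustrative sketch only: field names and structure are assumptions,
# not the actual VisualCounterfact schema described in the talk.
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterfactExample:
    """One VisualCounterfact-style tuple pairing world knowledge with an edited image."""
    obj: str               # the common object, e.g. "banana"
    relation: str          # the visual property being altered: "color", "size", or "texture"
    world_knowledge: str   # the property value the model stores in its weights, e.g. "yellow"
    counterfact: str       # the value rendered in the pixel-edited image, e.g. "purple"
    image_path: str        # hypothetical path to the counterfactual image


# Example instance mirroring the (banana, color, yellow) -> (banana, color, purple) case.
example = CounterfactExample(
    obj="banana",
    relation="color",
    world_knowledge="yellow",
    counterfact="purple",
    image_path="images/banana_purple.png",
)


def prompt(ex: CounterfactExample) -> str:
    """Build a question that can be answered either from the image or from world knowledge."""
    return f"What {ex.relation} is the {ex.obj} in the image?"


print(prompt(example))  # "What color is the banana in the image?"
```

Under this framing, a model that reads off the image should answer "purple", while one that falls back on its weights should answer "yellow"; the talk's mechanism controls which of these behaviors the model exhibits.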