Prof. Yuhao Chen
February 20, 2026, 12:00–1:00 pm, EC4-2101A
Current computer vision algorithms excel at recognizing static, rigid objects under controlled conditions but often struggle when faced with occlusion, breakage, or topology changes—for example, tracking per-bite portion changes while eating a salad. These challenges expose a fundamental limitation: many visual systems operate on single-view appearance without explicitly modeling geometry or physical scale, making their predictions unstable for reconstruction and unreliable for interaction.
In this talk, I present a progression of work that expands visual representation across temporal and spatial dimensions toward physically grounded object and scene models. I begin with monocular and video-based methods, where dense temporal tracking improves short-term consistency but remains limited in spatial association and in maintaining coherence across sparse temporal observations. Moving beyond temporal cues alone, my work incorporates multi-view geometric reasoning, showing how aggregating geometry across viewpoints improves downstream visual tasks such as tracking. Registering observations into a shared metric scale provides a unifying reference that links sparse temporal measurements into a consistent geometric framework across frames and viewpoints, enabling more reliable analysis of temporal change at both the object and scene levels. When objects undergo cutting or structural transformation, topology changes and interior exposure challenge surface-based assumptions. To address this, I develop interior-consistent 3D generation techniques that produce realistic interior visualizations while preserving geometric coherence during structural change. Up to this stage, the focus is on passively acquiring geometry and object trajectories grounded in physical space. I then extend these representations toward interaction, demonstrating how physically scaled geometric object modeling and large-scale synthetic supervision improve robotic grasp prediction and generalization. In parallel, I investigate learned latent representations in generative models, showing how implicitly structured priors can be examined and manipulated to better understand and control visual structure. This complementary line of work explores how structure emerges in data-driven representations alongside explicit geometric modeling.
Together, these contributions illustrate a geometry-centered expansion of visual representation—from single-view appearance to temporally consistent understanding, metric 3D geometry, topology-aware object modeling, and interaction—toward structured object and scene models that support reasoning and action in physical environments.
