Research

Our research explores core challenges in computer vision, multimodal artificial intelligence, and embodied intelligence, with a strong emphasis on building intelligent systems that perceive, reason, and act in the real world. This page introduces our major research directions and representative topics, which range from fundamental algorithmic advances to system-level integration for practical and scalable AI solutions.

Generative AI

We study modern generative models including diffusion models, GAN variants, and recent video and 3D/4D generative frameworks. Our research emphasizes controllable generation, temporal and spatial consistency, efficient inference, and safety. We aim to build practical generative pipelines that can reliably produce high-quality images, videos, and 3D/4D representations under real-world computational and deployment constraints.

Diffusion Models, Video Generation, 3D/4D Generation, Controllable Generation, Efficient Inference, Safe & Trustworthy Generation

Embodied AI & Robotics

Our embodied AI research bridges perception and action through planning, control, and interactive decision-making in physical environments. We investigate data-efficient learning methods for robotic manipulation and embodied interaction, focusing on how agents can acquire skills from limited data and adapt to new situations. Our goal is to build robust embodied agents that tightly integrate perception, reasoning, and action, enabling reliable operation in complex and unstructured real-world settings.

Vision-Language-Action (VLA), Embodied Agents, Robotic Manipulation, Planning & Control, Data-Efficient Learning, Interactive Decision-Making

Vision & Multimodal Learning

We develop robust visual representations and multimodal learning methods that combine visual, linguistic, and structured information. Our research covers vision–language models (VLMs), fine-grained recognition, and reliable perception under distribution shifts and challenging conditions. We also study training strategies and model architectures that improve generalization, robustness, and interpretability across diverse real-world environments.

Vision-Language Models (VLMs), Multimodal Representation, Fine-Grained Recognition, Robust & Trustworthy Perception

Selected Keywords

Diffusion Models, 3D/4D Generation, Vision-Language-Action (VLA), Multimodal Foundation Models, Embodied Agents, Robotic Manipulation, Data-Efficient & Trustworthy AI