
Yatai Ji

Ph.D. Student in MMLab, HKU

Biography

I am currently a Ph.D. student in MMLab at HKU, supervised by Prof. Ping Luo. During my master's studies, I was a member of the IIG group at Tsinghua University, supervised by Prof. Yujiu Yang. I received my bachelor's degree from the Department of Automation at Tsinghua University in 2021. My research interests lie in Multi-Modal Learning, including Vision-Language Pre-training, Large Multimodal Models, and Video Generation. Recently, I have been working on multimodal discrete diffusion models and 3D spatial reasoning models.


Selected Papers

  • Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

    • Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu.
    • arXiv [pdf] [code] [webpage]
  • From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

    • Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, Ping Luo.
    • arXiv [pdf] [code] [webpage]
  • Global and Local Semantic Completion Learning for Vision-Language Pre-training

    • Rongcheng Tu*, Yatai Ji*, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu.
    • TPAMI (CCF A) [pdf]
  • Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

    • Yatai Ji*, Jiacheng Zhang*, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian Ge, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo.
    • ICCV2025 (CCF A) [pdf] [code]
  • IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

    • Yatai Ji, Shilong Zhang, Jie Wu, Peize Sun, Weifeng Chen, Xuefeng Xiao, Sidi Yang, Yujiu Yang, Ping Luo.
    • ICLR2025 (CCF A) [pdf] [code]
  • Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

    • Weifeng Chen*, Yatai Ji*, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin.
    • arXiv [pdf] [code]
  • Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

    • Yatai Ji*, Rongcheng Tu*, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu.
    • CVPR2023 (CCF A) [pdf] [code]
  • MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

    • Yatai Ji*, Junjie Wang*, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang.
    • CVPR2023 (CCF A) [pdf] [code]
  • MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

    • Junjie Wang*, Yatai Ji*, Jiaqi Sun, Yujiu Yang, Tetsuya Sakai.
    • EMNLP2021 (CCF B) [pdf] [code]

Awards

  • 2023.7, Tencent Rhino-Bird Research Scholarship
  • 2024.6, Outstanding Master’s Thesis Award of Tsinghua University
  • 2024.9, Shenzhen Universiade International Scholarship

Internship

  • 2022~2023, AMAI, Department of Data Platform, Tencent
  • 2023~2024, AI Platform, Intelligence Creation Department, ByteDance
  • 2025, Nvidia Research, Nvidia
  • 2025, ARC Lab, Tencent
  • 2026, Kling, KuaiShou
  • 2026, mGenAI, Meta