[Profile photo of Deyu Zhou]

Deyu Zhou 周德宇

Email: dzhou861[at]connect.hkust-gz.edu.cn

I research interactive world models, driven by the lifelong goal of using AI for social good.

I'm always excited to discuss new ideas, especially the crazy ones. Don't hesitate to get in touch!

Ph.D. candidate @ HKUST-GZ, supervised by Prof. Harry SHUM and Prof. Lionel NI.

Selected Projects

Huawei · Face Blur Detection, Virtual Makeup (2019)

Tencent AI Lab · Emotion Classification, Neural Machine Translation (2020-2021)

Xiaobing · Multimodal Conversation, Audio-driven Talking Head Generation (2021-2022)

StepFun · Text-to-Video Generation, Autoregressive Video Generation (2024)

Mentors & Collaborators

I have been very fortunate to work with and learn from Dr. Chen Dong, Dr. Shuangzhi Wu, Dr. Zhaopeng Tu, Dr. Baoyuan Wang, Dr. Duomin Wang, Dr. Yu Deng, Dr. Quan Sun, Dr. Zheng Ge, Dr. Nan Duan and Dr. Xiangyu Zhang.

Research Highlights

Autoregressive Text-to-Image Generation · Open-sourced Foundation Model

NextStep-1 Technical Report

Technical Report

Autoregressive text-to-image model with continuous tokens.

30B Text-to-Video Generation · Open-sourced Foundation Model

Step-Video-T2V Technical Report

Technical Report

State-of-the-art text-to-video model with 30B parameters, capable of generating 204-frame videos through a novel architecture design.

[Model architecture diagram for MAGI, an autoregressive video generation framework.]
Autoregressive Video Generation · Novel Foundation Architecture

Taming Teacher Forcing for Masked Autoregressive Video Generation

CVPR 2025

A novel frame-level autoregressive video generation framework that combines masked and causal modeling through Complete Teacher Forcing, improving FVD by 23%.

[Visual results of TH-PAD, a talking head generation model with diffusion priors.]
Talking Head Generation · Diffusion Prior

TH-PAD: Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

ICCV 2023

We introduce a novel framework for one-shot audio-driven talking head generation. Unlike prior works, which require additional driving sources for deterministic, controlled synthesis, we sample all holistic lip-irrelevant facial motions (i.e., pose, expression, blink, gaze, etc.) to semantically match the input audio, while still maintaining photo-realism, accurate audio-lip synchronization, and overall naturalness.