Boyang Xiao


Email: xiao.boyang.816@s.kyushu-u.ac.jp
Room 1022, West II Building, 744 Motooka, Nishi-ku, Fukuoka, Japan



Ph.D. student
Graduate School and Faculty of Information Science and Electrical Engineering, Kyushu University, Japan


Self Introduction

I received my Bachelor's degree in Telecommunications Engineering and Management from Beijing University of Posts and Telecommunications and completed my Master's degree in Health Data Science at the University of Manchester. I am now a member of Suzuki Lab.
I developed an interest in deep learning and computer vision through coursework and projects during my studies, including image segmentation and generative modeling tasks.
I have also explored interdisciplinary research in brain-computer interfaces, where I was involved in EEG hardware design and neural signal processing.
Currently, my research focuses on multimodal video generation, with an emphasis on generating temporally coherent and semantically consistent content from textual inputs.


Educational Background

2019/09 ~ 2023/07 Bachelor's degree, Telecommunications Engineering and Management, Beijing University of Posts and Telecommunications, China
2023/09 ~ 2024/11 Master of Science, Health Data Science, The University of Manchester, United Kingdom
2025/04 ~ Now Ph.D. student, Machine Learning, Kyushu University, Japan


Work Experience

2021/09 ~ 2021/11 Intern, Beijing Xiaomi Mobile Software Co., Ltd., Beijing, China
2022/07 ~ 2022/10 Developer Intern, Telefonaktiebolaget L.M. Ericsson (China), Beijing, China
2024/10 ~ 2025/01 Research Intern, Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, China



Research Interests

Machine Learning, Generative Models, Video Generation, Large Language Models


Research Proposal

Recent advances in generative models have significantly improved the visual quality of text-to-video generation systems. Modern video diffusion models, combined with large language models for prompt planning, are capable of producing visually plausible short clips under existing evaluation metrics. However, despite these improvements, generating long, semantically consistent videos remains a fundamental challenge.
In preliminary experiments on prompt rewriting with large language model planners, we observe several critical limitations in current video generation pipelines. First, video models exhibit limited sensitivity to long textual contexts, often resulting in semantic drift and style inconsistency across frames. Second, existing evaluation metrics are largely insensitive to these issues, failing to capture semantic degradation over time. Third, current models lack effective mechanisms for maintaining cross-frame and cross-segment coherence, which severely restricts generation beyond short clips (e.g., to videos longer than 10 seconds) and leads to accumulated semantic errors and generation failures.
To address these challenges, this research proposes a framework for long-form video generation with improved semantic consistency. Specifically, we explore: (1) hierarchical generation strategies with inherited representations, augmented by cross-frame attention mechanisms to mitigate semantic loss across segments; (2) zero-shot integration of reference video elements to provide additional structural and semantic guidance; and (3) test-time generation strategies that dynamically refine outputs during inference.
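As a rough illustration of direction (1), the cross-frame attention idea can be sketched as follows. This is a minimal single-head NumPy sketch under simplified assumptions, not an implementation of any existing pipeline; all names and dimensions are hypothetical. Each frame's queries attend over keys/values pooled from both the current frame and an inherited anchor frame (e.g., the first frame of the segment), so later segments can reuse earlier semantics.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q, kv_current, kv_anchor):
    """Single-head attention where a frame's queries attend to its own
    tokens plus tokens inherited from an anchor frame.
    q: (T_q, d), kv_current: (T_c, d), kv_anchor: (T_a, d)."""
    kv = np.concatenate([kv_current, kv_anchor], axis=0)  # (T_c + T_a, d)
    d = q.shape[-1]
    attn = softmax(q @ kv.T / np.sqrt(d))                 # (T_q, T_c + T_a)
    return attn @ kv                                      # (T_q, d)

rng = np.random.default_rng(0)
d = 8
frame_t = rng.normal(size=(16, d))  # token features of the current frame
frame_0 = rng.normal(size=(16, d))  # token features of the anchor frame
out = cross_frame_attention(frame_t, frame_t, frame_0)
print(out.shape)  # (16, 8)
```

In this toy form, concatenating anchor-frame keys/values is what lets information flow across segment boundaries; a hierarchical variant would pass a compressed representation of each finished segment as the anchor for the next.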
Through these approaches, this work aims to improve the scalability of video generation systems toward longer durations, while enhancing alignment between textual intent and generated visual content. Ultimately, this research seeks to bridge the gap between short-form visual plausibility and long-form semantic coherence in multimodal generative models.



Publication

[1] Li, S., Xiao, B. & Xie, S. (2022). Animal recognition using Siamese network with two kinds of backbone networks. Proc. AIIIP 2022, 124562W. https://doi.org/10.1117/12.2659594 (Equal contribution)
[2] Xiao, B. (2025). A Comparison of LSTM and CNN Performance in EEG Motor Imagery with Application to Edge Computing Non-invasive Brain-computer Interface Possibilities. Proc. AIIIP 2024, 273-278. https://doi.org/10.1145/3707292.3707376
[3] Zhang, X., Kang, G., Xiao, B. & Zhan, J. (2025). Tensor databases empower AI for science: A case study on retrosynthetic analysis. BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 5(1), 100216. https://doi.org/10.1016/j.tbench.2025.100216



Suzuki Lab., ISEE, Kyushu University
Last modified: April 2026