Xuyang Cao

Senior Algorithm Engineer, Ph.D.

JD.com, Beijing, China
No. 18 Kechuang 11 Street, Tongzhou District, Beijing

Email:

newxuyangcao [at] gmail [dot] com

Biography

Dr. Xuyang Cao is a senior algorithm engineer at JD Health Inc., JD.com, specializing in MLLM-related research and applications. Over the past year, his primary focus has been multimodal deep learning algorithms, including medical MLLMs, audio-driven talking face generation, and brain-science-related research.

Since 2016, he has been engaged in research in computer vision, image processing, and pattern recognition. He completed his undergraduate and Ph.D. degrees at Beijing Jiaotong University. His Ph.D. research focused on semantic segmentation and semi-supervised learning, under the supervision of Professor Houjin Chen and Professor Yanfen Li. He also collaborated closely with Professor Yahui Peng during his doctoral studies.

Experience

Algorithm Engineer, JD Health Inc., JD.com	2022-Present
Ph.D. Candidate, School of Electronic and Information Engineering, Beijing Jiaotong University	2017-2022
Master Candidate, School of Electronic and Information Engineering, Beijing Jiaotong University	2016-2017
Bachelor of Engineering, School of Electronic and Information Engineering, Beijing Jiaotong University	2012-2016

Projects

Unified Medical Multimodal Foundation Model		2025.06-Present
Built CitrusV, an open-source unified medical multimodal model that achieved state-of-the-art performance at the time of release and serves as the foundation model for downstream medical applications. CitrusV is the first model in the industry to support both medical image understanding and detection/segmentation, enabling unified parsing of textual, imaging, laboratory, and other multimodal medical data. Demonstrated for the first time that a unified MLLM can surpass expert segmentation models, improving average performance by 112% and answering a longstanding open question. Introduced chain-of-thought (CoT) reasoning aligned with clinical diagnostic workflows, delivering multimodal reasoning that is explainable, verifiable, and localizable for medical decision support. Achieved state-of-the-art performance across 7 core task categories and 21 benchmark datasets. Core results were released at the annual JDD 2025 (link), together with open-source assets (Code, Technical Report).
Audio Driven Talking Face Generation		2024.01-2025.06
Developed an audio-driven talking face generation framework to support offline and online medical applications (DAU over 100k), including medical education and medical consultation. Proposed JoyVASA, a diffusion-based talking face generation model that separates dynamic expressions from static 3D identity cues in the facial representation, trains a diffusion transformer to generate identity-independent motion sequences directly from audio, and uses a dedicated renderer to synthesize high-quality animations. JoyVASA achieves real-time performance, with an RTF of 0.76, and runs 40x faster than existing diffusion-based models. The open-source projects [JoyVASA and JoyHallo] have earned 1,300+ GitHub stars.
The Brain Science Project - Brain-Computer Interface (BCI) - Brain Disease Diagnosis Based on Medical Imaging		2023.02-2024.01
Served as a key researcher in JD Health Research, focusing on brain-science-related research, including Brain-Computer Interface (BCI) and the early diagnosis of Alzheimer’s disease (AD) and Bipolar Disorder (BD) based on medical imaging. Research interests include explainable AI, medical multimodal fusion models, and BCI-based intelligent rehabilitation. Several papers and patents have been published, including multiple studies in brain imaging and EEG recognition that achieved state-of-the-art results. A project under the National Key Research and Development Program of China was approved (project: Virtual Reality Cognitive Rehabilitation Training System for the Elderly).
Research on Deep Learning based Breast Mass (Ultrasound) and Lung Mass (MRI) Segmentation		2018.09-2022.04
Conducted research on deep-learning-based semantic segmentation algorithms for breast ultrasound and lung MRI images, covering fully supervised learning, semi-supervised learning, and neural architecture search (NAS). Proposed a lightweight dilated densely connected network for 3D breast tumor segmentation; performance improved by over 5% compared with classical segmentation networks, with over 20 times fewer parameters. Designed an uncertainty-aware temporal-ensembling model for semi-supervised segmentation, which achieved 94.4% of the performance of supervised segmentation with only 1.1% labeled data. Proposed a NAS-based 3D medical image segmentation framework that achieved a 4.2% improvement over the state-of-the-art human-designed segmentation network. Related journal and conference papers have been published, with a total impact factor of 30+.
High Speed Train Gear Defect Detection Based on Computer Vision		2018.09-2022.04
As a core developer, designed a computer-vision-based solution for automatic detection and quantitative analysis of surface pitting on coupling gear components in train sets. Collected coupling gear surface images and applied image recognition algorithms to generate statistics, reports, and alerts for gears exceeding the defect threshold. Led the development of algorithms for detecting and segmenting gear surface wear areas, as well as the design and implementation of the software system on the Windows platform. Completed defect detection of 104 gear surfaces (internal and external) with defects larger than 1 mm within 3 minutes; published related papers and patents.
Geometric Parameters Measurement of an Overhead Line System		2018.09-2022.04
Provided a solution for measuring the geometric parameters of an overhead line system using scale factors and frame differences. Responsible for the preliminary algorithm simulation and participated in the hardware structure design. Two related patents have been granted.

Publications

G. Wang, Y. Li, Z. Zhou, S. An, X. Cao, Y. Jin, et al. PlgFormer: parallel extraction of local-global features for AD diagnosis on sMRI using a unified CNN-transformer architecture, Frontiers in Neurology 16 (2025): 1626922. [paper]
G. Wang, J. Zhao, X. Liu, Y. Liu, X. Cao, C. Li, Z. Liu, Q. Sun, et al. Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning. arXiv, 2025. [paper] [code] [project]
J. Fei, SH. Ding, J. Zhao, J. Luo, G. Wang, Y. Yao, X. Cao, et al. Chronic disease trajectory network and Monte Carlo simulation in Chinese population. European Heart Journal, Volume 46, Issue Supplement_1, November 2025, ehaf784.4556. [paper]
J. Fei, G. Wang, J. Zhao, J. Luo, C. Zhang, Z. Wu, M. Gao, X. Cao, et al. Leveraging e-commerce user behavior data for common chronic diseases prediction. European Heart Journal, Volume 46, Issue Supplement_1, November 2025, ehaf784.4438. [paper]
G. Wang, X. Cao, S. An, F. Fan, C. Zhang, J. Wang, F. Yu, Z. Wang. Multi-Dimension-Embedding-Aware Modality Fusion Transformer for Psychiatric Disorder Classification, ICIGP, 2025. [paper]
X. Cao, G. Wang, S. Shi, J. Zhao, Y. Yao, J. Fei, M. Gao. JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation. arXiv, 2024. [paper] [code] [project]
S. Shi, X. Cao, J. Zhao, G. Wang. JoyHallo: Digital human model for Mandarin. arXiv, 2024. [technical report] [code] [project]
Z. Gao, Y. Guo, G. Wang, X. Chen, X. Cao, C. Zhang, S. An, F. Xu. Robust deep learning from incomplete annotation for accurate lung nodule detection. Computers in Biology and Medicine, 2024, 173:108361. [paper]
X. Cao, H. Chen, Y. Li, Y. Peng, Y. Zhou, L. Cheng, T. Liu, D. Shen. Auto-DenseUnet: Searchable Neural Network Architecture for Tumor Segmentation in 3D Automated Breast Ultrasound. Medical Image Analysis, 2022, 82: 102589. [paper]
Y. Zhou, H. Chen, Y. Li, X. Cao, S. Wang, D. Shen. Cross-Model Attention-Guided Tumor Segmentation for 3D Automated Breast Ultrasound (ABUS) Images. IEEE Journal of Biomedical and Health Informatics, 2022, 26(1): 301-311. [paper]
X. Cao, H. Chen, Y. Li, Y. Peng, S. Wang, L. Cheng. Uncertainty Aware Temporal-Ensembling Model for Semi-supervised ABUS Mass Segmentation. IEEE Transactions on Medical Imaging, 2021, 40(1):431-443. [paper]
X. Cao, H. Chen, Y. Li, Y. Peng, S. Wang, L. Cheng. Dilated Densely Connected U-Net with Uncertainty Focus Loss for 3D ABUS Mass Segmentation. Computer Methods and Programs in Biomedicine, 2021, 209: 106313. [paper]
J. Li, H. Chen, Y. Li, Y. Peng, N. Cai, X. Cao. AMRSegNet: Adaptive Modality Recalibration Network for Lung Tumor Segmentation on Multi-Modal MR Images. Multimedia Tools and Applications, 2021, 80: 33779–33797. [paper]
X. Cao, H. Chen, Y. Li, Y. Peng, Y. Zhou, L. Cheng. Boundary Loss with Non-Euclidean Distance Constraint for ABUS Mass Segmentation. 2020 CISP-BMEI, Chengdu, China, 2020, pp: 645-650. [paper]
Y. Peng, X. Cao, H. Chen, Y. Li, J. Li, X. Wang. Preliminary Study on Noise and Artifact Reduction in Phase-Contrast CT Image of Tristructural-Isotropic Coated Fuel Particle (in Chinese). Acta Electronica Sinica, 2019, 47(2): 448-453. [paper]
C. Wang, F. Li, Y. Li, H. Chen, X. Cao. A Defect Status Detecting Method for External Gear in Railway. 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC), Chongqing, 2018, pp: 123-127. [paper]
Y. Li, X. Cao, H. Chen, L. Zhang, N. Yang. Defect Status Detection Method Based on Machine Vision for External Gear in Train (in Chinese). Journal of The China Railway Society, 2018, 40(12):33-41. [paper]
J. Wei, X. Cao, H. Chen, Y. Li. Research on benign and malignant masses classification in mammogram (in Chinese). Journal of Beijing Jiaotong University, 2017, 41(5): 73-. [paper]

Patents & Books

Y. Peng, W. Jiang, Z. Zhu, H. Yang, X. Cao, H. Chen. A Method of Measuring the Geometric Parameters of an Overhead Line System by Using Geometric Magnification and Monocular Vision. China, CN201810182553.1, 2018-11-13. [Link]

Y. Peng, C. Zhang, B. Zheng, J. Yin, X. Cao, H. Chen. A Method and a Device for Measuring the Geometric Parameters of an Overhead Line System by Using Scale Factors and Frame Differences. China, CN201710464403.5, 2017-06-19. [Link]

Y. Zhou, X. Cao. Neural Networks with TensorFlow 2, Apress, 2020. [translated] [Link]

Personal Qualifications

Research Interests:		MLLM, AIGC, Image Processing, Pattern Recognition, Computer Vision, Artificial Intelligence
Language Skills:		Chinese (Native) | English (Proficient in Listening, Speaking, Reading, Writing), IELTS Score: 7
Computer Skills:		Programming Languages: Python, C++ | Computer Vision: PyTorch, OpenCV | Others: Linux, Vim, LaTeX
Hobbies:		Personal blog with over 330,000 PV | Enjoy reading, hiking, and traveling