
Zirui "Colin" Wang

Ph.D. Student, UC Berkeley

zwcolin [at] eecs.berkeley [dot] edu

About

I am a first-year Ph.D. student in EECS at UC Berkeley, affiliated with Berkeley AI Research (BAIR) and the Sky Computing Lab. My research is advised by Prof. Joseph Gonzalez, Prof. Trevor Darrell, and Prof. Ion Stoica. I am interested in multimodal interactive intelligence. I also work as a Member of Technical Staff at Voio, Inc., where I co-lead multimodal post-training for vision-language models to understand and reason about volumetric radiology scans. I am an incoming research scientist intern at Meta Superintelligence Lab in summer 2026.

Previously, I obtained my B.S. in Data Science from the Halicioglu Data Science Institute (HDSI) and my B.A. in Cognitive Science from the Department of Cognitive Science at the University of California, San Diego (UCSD), where I was advised by Prof. Zhuowen Tu and Prof. Zhiting Hu on generative models in computer vision. I then obtained my M.S.E. in Computer Science at Princeton University, advised by Prof. Danqi Chen, where I worked on multimodal pre-training, reasoning, and evaluation. I am a Siebel Scholar, Class of 2025, in Computer Science.

Publications

The Wealth of Agents: From Monolithic to Polylithic Systems
Austin W. Hanjie, Carlos E. Jimenez, Zirui Wang, Karthik R. Narasimhan
Preprint, 2026

We introduce polylithic systems, where independent producers contribute specialized capabilities to a shared agentic system. Our experiments show that coordinated specialists from multiple producers outperform monolithic generalists, and even small producers that underperform in isolation can improve the overall system by 5–13%.

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
Zirui Wang*, Junyi Zhang*, Jiaxin Ge*, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, Xudong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
Preprint, 2026

We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback.

FrontierCS: Evolving Challenges for Evolving Intelligence
Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, Youran Sun, Wesley Zheng, Meiyuwang Zhang, Ruyi Ji, Xuechang Tu, Zihan Zheng, Zexing Chen, Kangyang Zhou, Zhaozi Wang, Jingbang Chen, Aleksandra Korolova, Peter Henderson, Pramod Viswanath, Vijay Ganesh, Saining Xie, Zhuang Liu, Dawn Song, Sewon Min, Ion Stoica, Joseph E. Gonzalez, Jingbo Shang, Alvin Cheung
Preprint, 2025

FrontierCS is a benchmark of unsolved, open-ended, verifiable, and diverse computer science challenges designed to evolve alongside AI capabilities, featuring problems like polyomino packing that remain difficult even for advanced models.

YOLO-Count: Differentiable Object Counting for Text-to-Image Generation
Guanning Zeng, Xiang Zhang, Zirui Wang, Haiyang Xu, Zeyuan Chen, Bingnan Li, Zhuowen Tu
International Conference on Computer Vision (ICCV), 2025

We propose YOLO-Count, a differentiable open-vocabulary object counting model that tackles both general counting challenges and enables precise quantity control for text-to-image (T2I) generation.

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
Neural Information Processing Systems (NeurIPS), 2024
NeurIPS Workshop on Multimodal Algorithmic Reasoning (Spotlight)
ECCV Workshop on Emergent Visual Abilities and Limits of Foundation Models

CharXiv reveals significant shortcomings in MLLMs’ chart understanding, showing a large performance gap between models and humans.

Improving Language Understanding from Screenshots
Preprint, 2024

We close the performance gap between screenshot language models and text-only language models on language understanding tasks with our PTP objective.

Language Models as Science Tutors
International Conference on Machine Learning (ICML), 2024

We propose TutorChat, a dataset of long synthetic dialogues about textbooks, and TutorEval, a question-answering benchmark consisting of questions about long chapters from STEM textbooks written by human experts.

TokenCompose: Grounding Diffusion with Token-level Supervision
Computer Vision and Pattern Recognition (CVPR), 2024

We introduce token-wise consistency terms between image content and object segmentation maps when training text-to-image models, enhancing multi-category instance composition and photorealism.

OmniControlNet: Dual-stage Integration for Conditional Image Generation
Yilin Wang*, Haiyang Xu*, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, Zhuowen Tu
Computer Vision and Pattern Recognition (CVPR), Workshop on Generative Models for Computer Vision, 2024

We provide a two-way integration for the widely adopted ControlNet method: we consolidate four external condition generation algorithms into a single dense image labeling method, and merge its individually trained image generation processes into a single model.

Language Models Meet World Models: Embodied Experiences Enhance Language Models
Neural Information Processing Systems (NeurIPS), 2023

We establish a framework that effectively and efficiently finetunes a language model with embodied experience while retaining its language modeling abilities.

On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning
International Conference on Learning Representations (ICLR), 2023

We investigate whether internal models learned by modern model-based RL algorithms can be leveraged to solve new, distinctly different tasks faster.

Services

📝 Conference Reviewer:

  • International Conference on Machine Learning (ICML): 2024, 2025, 2026
  • International Conference on Learning Representations (ICLR): 2024, 2025, 2026
  • Conference on Neural Information Processing Systems (NeurIPS): 2024 🏆*, 2025
  • Association for Computational Linguistics (ACL): 2025
  • Conference on Computer Vision and Pattern Recognition (CVPR): 2025
  • Association for the Advancement of Artificial Intelligence (AAAI): 2026

* Outstanding Reviewer

🎓 Alumni Interviewer:

  • Princeton University, Class of 2030: 2026

Teaching

Misc

  • 🏋️ I compete in powerlifting in the 82.5 kg class. The physique? Just what happens when you spend too much time under a barbell.