One Brain, Many Bodies: How HEX Is Breaking the Embodiment Barrier in Humanoid Robotics
HEX is a 2.4B-parameter open-source VLA model pretrained on 12 million frames from seven different humanoid robots — and it controls all of them with a single policy.
For years, the assumption in humanoid robotics has been simple: one robot, one brain. Train a policy on a Unitree G1, and it stays on the G1. Switch to a Unitree H1 or a Leju Kuavo, and you start from scratch. A research team from Beijing Innovation Center of Humanoid Robotics, Xi’an Jiaotong University, Peking University, and Nankai University has just challenged that assumption — and released everything under an open license.
Released on May 17, 2026, HEX (Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation) is a 2.4-billion-parameter vision-language-action framework pretrained on over 12 million frames from seven distinct humanoid embodiments. The result is a single model that can control the Unitree G1, Unitree H1, Leju Kuavo, Tienkung 2.0, Tienkung 3.0, Tienyi, and AgiBot — without retraining.
The Problem: Robots Don’t Share
Current VLA models for humanoids follow a predictable pattern. A team collects data on one robot platform, trains a large model, and achieves impressive results — on that platform. But the moment you swap the hardware, the policy collapses. Different robots have different joint configurations, different sensor setups, different proportions, and different action spaces. The policy has no vocabulary for a body it was never trained on.
This is the embodiment gap, and it is one of the quietest but most expensive problems in robotics. Every new humanoid platform requires its own data collection pipeline, its own training run, its own fine-tuning. The field keeps rebuilding the same wheel in slightly different sizes.
The Insight: Treat Body Parts as Slots
HEX’s central architectural move is to stop thinking about robots as monolithic bodies and start thinking about them as collections of body parts — arms, hands, legs, waist, head — that can be mapped into a shared representation space. The team calls this humanoid-aligned universal state representation. Instead of encoding “this is a Unitree G1 joint angle,” HEX encodes “this is a left-arm state across all humanoids.”
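To make the slot idea concrete, here is a minimal sketch of how flat state vectors from two differently built robots could be packed into the same body-part layout. The slot names, the fixed slot width of 8, the zero-padding with a validity mask, and both robot layouts are illustrative assumptions, not details taken from the HEX paper or code.

```python
from dataclasses import dataclass

# Slot order and per-slot width are illustrative assumptions,
# not values taken from the HEX paper.
SLOTS = ("left_arm", "right_arm", "left_hand", "right_hand", "waist", "head", "legs")
SLOT_WIDTH = 8


@dataclass
class SlotState:
    values: list  # zero-padded joint values, length SLOT_WIDTH
    mask: list    # 1 where a real joint value is present, else 0


def to_slots(state, layout):
    """Map a flat robot state vector into canonical body-part slots.

    `layout` maps a slot name to the (start, end) index range that this
    robot's state vector uses for that body part. Slots the robot lacks
    are left fully masked.
    """
    slots = {}
    for name in SLOTS:
        start, end = layout.get(name, (0, 0))
        joints = list(state[start:end])
        if len(joints) > SLOT_WIDTH:
            raise ValueError(f"{name} has {len(joints)} joints, max is {SLOT_WIDTH}")
        pad = SLOT_WIDTH - len(joints)
        slots[name] = SlotState(joints + [0.0] * pad, [1] * len(joints) + [0] * pad)
    return slots


# Two hypothetical robots with different joint counts and orderings.
robot_a_layout = {"left_arm": (0, 7), "right_arm": (7, 14), "waist": (14, 15)}
robot_b_layout = {"waist": (0, 3), "left_arm": (3, 7), "right_arm": (7, 11), "head": (11, 13)}

a = to_slots([0.1] * 15, robot_a_layout)
b = to_slots([0.2] * 13, robot_b_layout)
```

After this step both robots expose the same keys and the same per-slot shape, so a downstream network can treat "left arm" identically regardless of which robot produced it. The mask tells it which entries are real joints and which are padding.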
The mechanism is a Unified Proprioceptive Predictor (UPP) built on a Mixture-of-Experts architecture. The UPP organizes heterogeneous robot states into canonical body-part slots and predicts short-horizon future dynamics. A morphology-aware MoE inside the UPP learns which expert to activate for which robot body, enabling the same network to handle different kinematic chains without confusion. The routing analysis in the paper shows something elegant: early layers specialize by body part (stable across robots), while deeper layers specialize by task phase (shifting as the manipulation task progresses).
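For readers unfamiliar with Mixture-of-Experts routing, the sketch below shows generic top-k softmax gating in plain Python. It is not HEX's router: what the morphology-aware gate actually conditions on, how many experts it uses, and the value of k are not stated here, so the gate weights, the toy experts, and k=2 are all assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, gate_weights, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Each row of `gate_weights` scores one expert via a dot product with
    the token. Weights are renormalized over the selected experts.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, gate_weights, experts, k=2):
    """Combine the outputs of the selected experts, weighted by the gate."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(token, gate_weights, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy experts (assumption): each one simply scales its input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
gate_weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

routing = route_top_k([1.0, 0.0], gate_weights)
output = moe_forward([1.0, 0.0], gate_weights, experts)
```

Only the selected experts run for a given token, which is what lets one network hold specialized sub-networks for different bodies or task phases without paying for all of them on every step.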
The Architecture: Three Modules, One Goal
HEX is a hierarchical system with a clean separation of concerns:
1. VLM with History Query Cache. At the top, a Qwen3-VL backbone processes the current visual scene and natural language instruction. Rather than feeding a long sequence of past images into the model (which is computationally expensive), HEX uses lightweight history tokens that summarize recent semantic context. This preserves short-term memory without repeatedly encoding image stacks.
2. Unified Proprioceptive Predictor. The UPP takes the robot’s current proprioceptive state — joint angles, body pose, end-effector positions — and maps it into the shared body-part representation. It then predicts how that state will evolve over the next few timesteps. This predictive modeling is critical: by anticipating future body dynamics, HEX can generate actions that maintain balance and coordination rather than reacting to instability after it happens.
3. Action Expert with Flow Matching. The Action Expert fuses visual-language features with predicted proprioceptive dynamics through dual cross-attention and residual-gated fusion. It outputs high-level commands for the arms, hands, and waist. A flow-matching action head generates smooth, continuous action trajectories rather than discrete joint targets. The low-level leg control is handled by a separate reinforcement-learning-based whole-body controller that keeps the robot upright while the arms do the manipulation.
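The flow-matching step in the third module can be illustrated with a toy sampler. In flow matching, an action is generated by starting from noise and integrating a velocity field from t=0 to t=1. The sketch below uses plain Euler integration and replaces the learned, observation-conditioned velocity network with a closed-form stand-in that points at a known target, so `toy_velocity`, the step count, and the three-dimensional action are assumptions for illustration only and say nothing about HEX's actual head.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network.

    For straight-line probability paths, the field (target - x) / (1 - t)
    carries any point x to `target` at t = 1. A real action head would
    predict the velocity from visual, language, and proprioceptive features.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, steps=10, seed=0):
    """Integrate the velocity field from Gaussian noise (t=0) to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler step
    return x


target_action = [0.3, -0.5, 1.2]  # hypothetical arm command
action = sample_action(target_action)
```

Because the output is the end point of a continuous trajectory rather than a pick from a discrete set of bins, small changes in the conditioning produce small changes in the action, which is one reason this family of heads tends to give smooth motion.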
Training on 12M Frames From Seven Robots
The pretraining corpus is what makes HEX possible. The team aggregated data from four major open-source humanoid datasets, listed here by robot source with the dataset in parentheses:
- Tienkung series (EAI real-world tasks)
- Unitree G1 (Humanoid Everyday)
- Unitree H1 (Humanoid Everyday)
- Leju Kuavo (RoboCOIN)
- AgiBot-to-Unitree G1 (AgiBot World Colosseo + TrajBooster retargeting)
All told, over 12 million frames spanning diverse manipulation tasks: carrying boxes, pouring liquids, opening drawers, tidying tables, seasoning pots, and avoiding obstacles. The datasets differ in camera placement, action definitions, state dimensions, and even the number of joints — yet HEX integrates them into a single pretraining objective.
Real-World Results
The team evaluated HEX on eight real-world manipulation tasks against strong baselines including ACT, π0, π0.5, and SwitchVLA. On in-distribution tasks, HEX achieved the best balance of success rate, motion smoothness, and reactive execution. In long-horizon multi-stage tasks, it outperformed all baselines at every stage, with particularly strong gains in the final placement phase, which suggests better stability and fewer cascading errors.
The generalization tests are where HEX diverges from the pack. Under out-of-distribution shifts (fast human motion, human interference, visual distractors, object repositioning, lighting changes, and dynamic scene alterations), HEX maintained robust performance. In a pouring task with distractors, every baseline collapsed to 0% success, often mistaking a red plate for a pointing human hand, while HEX reached 53.3% success.
“Most VLA models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments.” — arXiv:2604.07993, May 2026
Open Source, Weights Included
HEX is not a paper-only release. The full stack is available now:
- Pretrained model weights (2.4B parameters) on Hugging Face
- Training and fine-tuning code on GitHub
- Eight real-world evaluation datasets for downstream fine-tuning
- Inference notebook with step-by-step examples
- Base VLM setup scripts (Qwen3-VL)
Inference latency on an RTX 4090 is 73.34 milliseconds per step — faster than π0.5, with higher task success. The model is designed to be practical for researchers who have real robots and need real results.
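As a quick sanity check on what that latency means in practice, the snippet below converts per-step latency into an upper bound on inference rate. It assumes one model call per control step with no other overhead. If the policy emits several actions per call, the effective control rate would be higher, and that detail is not stated here.

```python
def inference_rate_hz(latency_ms):
    """Upper bound on model calls per second for a given per-step latency."""
    return 1000.0 / latency_ms


rate = inference_rate_hz(73.34)
print(f"{rate:.1f} Hz")  # prints "13.6 Hz"
```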
Why This Changes the Game
Cross-embodiment learning has been a theoretical goal in robotics for years. HEX is one of the first systems to make it work at scale on full-sized humanoids, with publicly available weights and training code. The implications are direct:
Data sharing. Labs running Unitree G1s can now share training data with labs running H1s or Leju Kuavos, and everyone benefits. The data barrier between platforms begins to dissolve.
Faster iteration. Researchers can pretrain on the aggregated multi-robot corpus and fine-tune on their specific platform with far less data than training from scratch.
Platform independence. A policy trained on HEX’s cross-embodiment representation is less likely to break when transferred to a new humanoid with slightly different kinematics — because it was never trained to overfit to one body in the first place.
Key Takeaways
- HEX is a 2.4B-parameter open-source VLA model that controls seven different humanoid robots from a single pretrained policy using cross-embodiment training.
- The Unified Proprioceptive Predictor aligns heterogeneous robot states into shared body-part slots and uses a morphology-aware Mixture-of-Experts to handle different kinematic chains.
- Pretrained on 12M frames from Unitree G1/H1, Leju Kuavo, Tienkung series, and AgiBot, HEX achieves state-of-the-art results in seen, long-horizon, and out-of-distribution real-world manipulation tasks.
- Full weights, code, datasets, and inference examples are publicly available — inference runs at 73ms per step on an RTX 4090.