Recent Vision-Language-Action (VLA) models generalize across embodiments but struggle to quickly align with a new robot's action space when high-quality demonstrations are scarce, especially for bipedal humanoids. We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D dual-arm end-effector trajectories from real-world wheeled humanoids, (ii) retargets them in simulation to the Unitree G1 with a whole-body controller, trained via heuristic-enhanced harmonized online DAgger, that lifts low-dimensional trajectory references into feasible high-dimensional whole-body actions, and (iii) forms heterogeneous triplets that couple source vision/language with target humanoid-compatible actions to post-pre-train a VLA, followed by only 10 minutes of teleoperation data collection in the target humanoid domain. Deployed on the Unitree G1, our policy performs beyond-tabletop household tasks, enabling squatting, cross-height manipulation, and coordinated whole-body motion with markedly improved robustness and generalization. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance, reducing reliance on costly same-embodiment data while enhancing action-space understanding and zero-shot skill transfer.
The VLA after post-pre-training (PPT) demonstrates accelerated adaptation to the humanoid action space during the post-training phase.
Task: Picking a Mickey Mouse toy placed on the humanoid's right side, a configuration unseen in teleoperation data (which only used front placements).
Left: With post-pre-training, the VLA adapts to grasp from below.
Right: Without post-pre-training, the VLA mimics teleoperated motions, approaching from above.
Zero-Shot Execution of "Pass Water" Task.
Left: First-person view examples of the Pass Water task from the Agibot-World dataset.
Right: The post-pre-trained VLA successfully executes this task despite receiving no fine-tuning on real-world teleoperation data for this specific task.
The TrajBooster pipeline comprises three stages:
Real Trajectory Extraction: the process begins with extracting end-effector trajectories from source robots. Rather than directly mapping full-body motions to the target humanoid, TrajBooster uses the 6D poses of the dual-arm end-effectors as the goal, enabling a retargeting model in the Isaac Gym simulator to achieve whole-body motion retargeting by tracking this goal.
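For illustration, a minimal sketch of this extraction step, assuming the source robot logs joint angles and exposes a forward-kinematics routine; the `fk` and `joint_log` interfaces and the position-plus-axis-angle 6D convention are our assumptions, not the paper's released tooling:

```python
# Minimal sketch of 6D end-effector trajectory extraction (illustrative).
# `fk` and the joint-log layout are hypothetical; the theta ~ pi singularity
# of the axis-angle conversion is ignored for brevity.
import numpy as np

def rotmat_to_axis_angle(R: np.ndarray) -> np.ndarray:
    """Convert a 3x3 rotation matrix to an axis-angle 3-vector."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(angle))
    return angle * axis

def pose_to_6d(T: np.ndarray) -> np.ndarray:
    """Flatten a 4x4 homogeneous transform to [x, y, z, rx, ry, rz]."""
    return np.concatenate([T[:3, 3], rotmat_to_axis_angle(T[:3, :3])])

def extract_dual_arm_trajectory(joint_log: np.ndarray, fk) -> np.ndarray:
    """Map a (T, n_joints) joint log to a (T, 2, 6) dual-arm EE trajectory.

    `fk(q)` is assumed to return the left and right wrist transforms
    (two 4x4 matrices) for one joint configuration `q`.
    """
    return np.stack([np.stack([pose_to_6d(Tl), pose_to_6d(Tr)])
                     for Tl, Tr in (fk(q) for q in joint_log)])
```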
Retargeting in Simulation: the retargeting model is trained with our heuristic-enhanced harmonized online DAgger algorithm so that the target humanoid, Unitree G1, tracks these reference trajectories using whole-body control. Through this process, the humanoid learns to coordinate its joints so that its end-effectors follow the retargeted goals, effectively mapping low-dimensional reference signals into feasible high-dimensional actions. This stage generates a large volume of action data compatible with the morphology of the real-world target humanoid.
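The heuristic-enhanced harmonized variant is specific to our method and is not reproduced here, but the generic online-DAgger skeleton it builds on can be sketched as follows: roll out a mixture of expert and learner actions, aggregate expert labels on the visited states, and regress the policy onto the aggregate. `env`, `expert`, and `policy` are hypothetical placeholders; the observation would include the retargeted 6D end-effector references.

```python
# Simplified online-DAgger training loop (a sketch, not the paper's
# heuristic-enhanced harmonized variant). `env`, `expert`, and `policy`
# are hypothetical placeholders.
import torch
import torch.nn.functional as F

def online_dagger(env, expert, policy, optimizer, n_iters=100, beta0=0.9):
    dataset = []  # aggregated (observation, expert action) pairs
    for i in range(n_iters):
        beta = beta0 ** i  # probability of executing the expert's action
        obs, done = env.reset(), False
        while not done:
            with torch.no_grad():
                expert_act = expert(obs)       # privileged/heuristic teacher
                learner_act = policy(obs)
            dataset.append((obs, expert_act))  # always label with the expert
            act = expert_act if torch.rand(()).item() < beta else learner_act
            obs, done = env.step(act)
        for obs_i, target in dataset:          # supervised regression pass
            loss = F.mse_loss(policy(obs_i), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```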
Fine-tuning for the real humanoid: using this newly generated data, TrajBooster constructs heterogeneous triplets of the form <source vision, source language, target action>, which link perceptual inputs with humanoid-compatible behaviors. The resulting synthetic dataset is then used to post-pre-train an existing VLA model. With just 10 minutes of additional real-world teleoperation data <target vision, target language, target action>, the post-pre-trained VLA is fine-tuned and deployed on the Unitree G1 across a wide spectrum of whole-body manipulation tasks.
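A schematic of how such triplets could be assembled; the field names and episode layout are illustrative assumptions, not the released data format:

```python
# Sketch of assembling heterogeneous triplets for post-pre-training.
# Field names and the episode layout are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Triplet:
    source_vision: np.ndarray   # RGB frames from the wheeled-humanoid episode
    source_language: str        # task instruction from the source dataset
    target_action: np.ndarray   # retargeted G1 whole-body action sequence

def build_triplets(source_episodes, retargeted_actions):
    """Pair each source episode's vision/language with the action
    sequence produced by the simulation retargeting stage."""
    return [Triplet(ep["frames"], ep["instruction"], acts)
            for ep, acts in zip(source_episodes, retargeted_actions)]
```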
The failures may stem from insufficient depth-perception capability. Even with the addition of wrist cameras, the model still cannot accurately fold clothes. We believe another critical reason the wrist cameras are ineffective is that the wrist-camera views in our retargeted data do not match the camera-to-object distances and camera parameters of our actual robot setup.
This represents the two-stage training approach mentioned earlier, built upon the base model: the first stage performs post-pre-training with retargeted data, and the second stage fine-tunes on our teleoperated data for 3,000 steps. Compared to Scenario 4, the robot crouches much lower and spreads both arms wide.
This shows a single-stage fine-tuning approach on the base model, mixing retargeted data with real teleoperated data. Training was conducted for 1 epoch, totaling 60,000 steps. The left hand attempts to assist, but the crouch is insufficient. Both hands exhibit upward turning motions, showing folding-like behavior. However, due to incorrect depth estimation, the task ultimately fails.
Building upon Scenario 2, this version includes an additional 3,000 steps of training with real teleoperated data. Here, the right hand attempts to grasp the clothes.
This demonstrates the effect of direct fine-tuning without retargeted data, trained for 10,000 steps (since we previously found that 3,000 steps yielded a 0% success rate). The model shows virtually no tendency toward clothes folding, with both hands remaining closed. The arm trajectory in the first half of the video resembles reaching directly for objects, similar to our teleoperated grasping motions for the Mickey Mouse toy. However, since the model does not see a Mickey Mouse toy at the target location, it does not attempt a grasp (likely a consequence of freezing the backbone while fine-tuning GR00T N1.5).
Website modified from TWIST.