Generalizable Hierarchical Skill Learning via Object-Centric Representation

Northeastern University; Stanford University
*Indicates Equal Advising

Abstract

We present Generalizable Hierarchical Skill Learning (GSL), a novel framework for hierarchical policy learning that significantly improves policy generalization and sample efficiency in robot manipulation. One core idea of GSL is to use object-centric skills as an interface that bridges the high-level vision-language model and the low-level visual-motor policy. Specifically, GSL decomposes demonstrations into transferable and object-canonicalized skill primitives using foundation models, ensuring efficient low-level skill learning in the object frame. At test time, the skill–object pairs predicted by the high-level agent are fed to the low-level module, where the inferred canonical actions are mapped back to the world frame for execution. This structured yet flexible design leads to substantial improvements in sample efficiency and generalization of our method across unseen spatial arrangements, object appearances, and task compositions. In simulation, GSL trained with only 3 demonstrations per task outperforms baselines trained with 30 times more data by 15.5% on unseen tasks. In real-world experiments, GSL also surpasses the baseline trained with 10 times more data.

Method Overview

GSL Overview

GSL uses object-centric skills as an interface that bridges high-level vision-language models and low-level visual-motor policies.

Object-Centric Skill Learning

  • High-level agent: Decomposes demonstrations into transferable skill primitives using foundation models and selects skill-object pairs for task execution.
  • Low-level policy: Learns canonical skills in object-centric coordinate frames, enabling generalization across object poses and appearances.
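The object-frame idea in the two bullets above can be illustrated with rigid transforms. The following is a minimal sketch, not the authors' implementation: it assumes planar poses (x, y, heading) to stay short, and the helper names `make_pose`, `compose`, `invert`, `canonicalize`, and `to_world` are hypothetical. A demonstrated action is expressed relative to its target object, and the same canonical action is later mapped back to the world frame for an object at a new pose.

```python
import math


def make_pose(x, y, theta):
    """Build a 3x3 homogeneous transform for a planar pose (assumed representation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]]


def compose(a, b):
    """Matrix product a @ b for 3x3 transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def invert(t):
    """Invert a rigid transform: rotation R^T, translation -R^T p."""
    c, s, x, y = t[0][0], t[1][0], t[0][2], t[1][2]
    return [[c, s, -(c * x + s * y)], [-s, c, s * x - c * y], [0.0, 0.0, 1.0]]


def canonicalize(world_T_object, world_T_action):
    """Express a world-frame action in the object's frame."""
    return compose(invert(world_T_object), world_T_action)


def to_world(world_T_object, object_T_action):
    """Map an object-frame (canonical) action back to the world frame."""
    return compose(world_T_object, object_T_action)


# Demonstration: the action sits 0.1 m along the object's y-axis.
demo_object = make_pose(0.5, 0.2, 0.0)
demo_action = make_pose(0.5, 0.3, 0.0)
canonical_action = canonicalize(demo_object, demo_action)

# Test time: the same object appears at a new position, rotated by 90 degrees.
test_object = make_pose(0.1, -0.4, math.pi / 2)
world_action = to_world(test_object, canonical_action)
world_x, world_y = world_action[0][2], world_action[1][2]
```

Because the canonical action is stored relative to the object, moving or rotating the object changes only `world_T_object`; the canonical action itself is reused unchanged. A full manipulation setting would presumably use 3D poses in the same way.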

This structured yet flexible design lets the high-level agent convey semantic intent while the low-level policy performs localized control without reasoning over the entire scene.
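One way to picture this interface is as a small message type passed between the two levels. The sketch below is illustrative only: `SkillObjectPair`, `execute_plan`, the fixed-offset stand-in policies, and the scene dictionary are all hypothetical, and the mapping back to the world frame is reduced to a translation for brevity. It shows that each low-level skill is invoked with only its target object in view, while the sequence of pairs carries the task-level intent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vec2 = Tuple[float, float]  # hypothetical planar position, used for brevity


@dataclass(frozen=True)
class SkillObjectPair:
    """Message passed from the high-level agent to the low-level module."""
    skill: str          # e.g. "pick"
    target_object: str  # e.g. "mug"


def execute_plan(
    plan: List[SkillObjectPair],
    object_positions: Dict[str, Vec2],
    skill_policies: Dict[str, Callable[[], Vec2]],
) -> List[Tuple[str, Vec2]]:
    """Dispatch each pair to its skill and express the result in the world frame."""
    commands = []
    for step in plan:
        # The skill outputs an action relative to its target object only;
        # it never sees the other objects in the scene.
        dx, dy = skill_policies[step.skill]()
        ox, oy = object_positions[step.target_object]
        # Translation-only mapping to the world frame (rotation omitted here).
        commands.append((step.skill, (ox + dx, oy + dy)))
    return commands


# Stand-ins for learned low-level policies: fixed offsets in the object frame.
skill_policies = {"pick": lambda: (0.0, 0.0), "place": lambda: (0.0, 0.05)}

# A plan as the high-level agent might emit it for "put the mug on the plate".
plan = [SkillObjectPair("pick", "mug"), SkillObjectPair("place", "plate")]
scene = {"mug": (0.3, 0.1), "plate": (0.6, -0.2), "bowl": (0.0, 0.0)}
commands = execute_plan(plan, scene, skill_policies)
```

In this toy setup, a new task composition is just a different list of pairs, and a new spatial arrangement is just a different `scene`; the skill policies themselves are untouched.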

Contributions

  • We propose Generalizable Hierarchical Skill Learning (GSL), a novel framework for generalizable hierarchical policy learning.
  • We introduce an object-centric skill as the interface between high-level and low-level modules, providing a structured yet flexible communication channel.
  • We empirically show that GSL significantly improves generalization over strong baselines: trained with only 3 demonstrations per task, GSL outperforms baselines trained with 100 demonstrations by 15.5% on unseen tasks.

BibTeX

@misc{zhao2025generalizablehierarchicalskilllearning,
      title={Generalizable Hierarchical Skill Learning via Object-Centric Representation}, 
      author={Haibo Zhao and Yu Qi and Boce Hu and Yizhe Zhu and Ziyan Chen and Heng Tian and Xupeng Zhu and Owen Howell and Haojie Huang and Robin Walters and Dian Wang and Robert Platt},
      year={2025},
      eprint={2510.21121},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2510.21121}, 
}