GSL uses object-centric skills as the interface between high-level vision-language models and low-level visuomotor policies.
Object-Centric Skill Learning
- High-level agent: Decomposes demonstrations into transferable skill primitives using foundation models and selects skill-object pairs for task execution.
- Low-level policy: Learns canonical skills in object-centric coordinate frames, enabling generalization across object poses and appearances.
This structured yet flexible design lets the high-level agent convey semantic intent while the low-level policy performs localized control without reasoning over the entire scene.
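To make the idea of a skill-object interface concrete, the following is a minimal sketch and not the GSL implementation. It assumes planar (2D) object poses and a hand-written waypoint table in place of a learned policy. The names `Pose2D`, `SkillCall`, `CANONICAL_WAYPOINTS`, and `execute` are hypothetical and chosen only for illustration. The sketch shows why defining a skill in the object's coordinate frame lets the same skill apply unchanged when the object moves or rotates.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    """Planar pose: position (x, y) and heading theta in radians."""
    x: float
    y: float
    theta: float


@dataclass(frozen=True)
class SkillCall:
    """Hypothetical high-level output: a skill name paired with a target object."""
    skill: str
    target_object: str


def world_to_object(point, obj):
    """Express a world-frame point (x, y) in the object's frame."""
    dx, dy = point[0] - obj.x, point[1] - obj.y
    c, s = math.cos(obj.theta), math.sin(obj.theta)
    # Translate to the object's origin, then apply the inverse rotation.
    return (c * dx + s * dy, -s * dx + c * dy)


def object_to_world(point, obj):
    """Express an object-frame point (x, y) in the world frame."""
    c, s = math.cos(obj.theta), math.sin(obj.theta)
    return (obj.x + c * point[0] - s * point[1],
            obj.y + s * point[0] + c * point[1])


# Assumption: a fixed waypoint list stands in for a learned canonical skill.
# Waypoints are defined once, in the object's frame.
CANONICAL_WAYPOINTS = {
    # Approach along the object's x-axis, then reach its origin.
    "grasp": [(-0.10, 0.0), (0.0, 0.0)],
}


def execute(call, object_poses):
    """Map a skill-object pair to world-frame waypoints for the current scene."""
    obj = object_poses[call.target_object]
    return [object_to_world(p, obj) for p in CANONICAL_WAYPOINTS[call.skill]]


if __name__ == "__main__":
    scene = {"mug": Pose2D(1.0, 2.0, math.pi / 2)}
    for waypoint in execute(SkillCall("grasp", "mug"), scene):
        print(waypoint)
```

In this sketch the high-level side only emits a `SkillCall`, and the low-level side only needs the pose of the named object. Changing the mug's pose changes the world-frame waypoints but not the canonical skill itself.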
Contributions
- We propose Generalizable Hierarchical Skill Learning (GSL), a novel framework for generalizable hierarchical policy learning.
- We introduce an object-centric skill as the interface between high-level and low-level modules, providing a structured yet flexible communication channel.
- We empirically show that GSL significantly improves generalization over strong baselines: trained with only 3 demonstrations, GSL outperforms baselines trained with 100 demonstrations by 15.5% on unseen tasks.