In this work, we introduce OmniDexGrasp, a unified framework that achieves generalizable dexterous grasping guided solely by grasp demonstrations generated by foundation generative models. Without relying on robot data or additional training, OmniDexGrasp realizes omni-ability in functional grasping, covering six representative tasks (semantic grasping, region/point grasping, grasping in cluttered scenes, one-shot demonstration grasping, human–robot handover, and fragile object grasping) while supporting multi-modal inputs such as language, visual prompts, and demonstration images.
Unlike traditional methods that train a dedicated network to predict grasp poses, OmniDexGrasp leverages a foundation generative model (e.g., GPT-Image) to synthesize human grasp images and foundation visual models to convert them into executable dexterous robot actions. The framework integrates a human-image-to-robot-action transfer strategy, which reconstructs the 3D hand–object interaction from the generated image and retargets the human grasp to robot joint configurations, with a force-sensing adaptive grasping strategy that ensures stable and reliable execution. Moreover, our framework is naturally extensible to manipulation tasks.
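The end-to-end flow can be summarized in pseudocode. The sketch below is illustrative only: the interfaces (`generative_model`, `visual_model`, `robot`) and the `GraspAction` container are hypothetical placeholders, not names from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class GraspAction:
    joint_config: Any  # target joint angles for the dexterous hand (hypothetical container)
    wrist_pose: Any    # 6D wrist pose aligned to the real-world object


def omnidexgrasp(scene_image, instruction, generative_model, visual_model, robot) -> bool:
    """Illustrative end-to-end flow: generate -> transfer -> execute."""
    # (a) Foundation generative model: synthesize a human grasp image
    #     conditioned on the grasp instruction and the initial scene image.
    human_grasp_image = generative_model.generate(scene_image, instruction)

    # (b) Human-image-to-robot-action transfer (foundation visual models only):
    #     reconstruct the 3D hand-object interaction from the generated image,
    hand_pose, object_model = visual_model.reconstruct_hand_object(human_grasp_image)
    #     retarget the human grasp to the robot's dexterous hand,
    joint_config = robot.retarget(hand_pose)
    #     and align the grasp with the object's 6D pose observed in the real scene.
    object_pose = visual_model.estimate_object_pose(scene_image, object_model)
    action = GraspAction(joint_config, robot.align_wrist(joint_config, object_pose))

    # (c) Force-sensing adaptive execution closes the fingers under force feedback
    #     (a separate sketch of this loop follows the figure caption below).
    return robot.execute_with_force_feedback(action)
```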
Extensive experiments in both simulation and real-world settings demonstrate that foundation models can provide precise and semantically aligned guidance for dexterous grasping, achieving an average 87.9% success rate across six diverse tasks. As foundation models continue to advance, we envision future work extending OmniDexGrasp toward non-prehensile manipulation, further promoting the integration of foundation models into embodied intelligence.
 
(a) Using a foundation generative model, a human grasp image is generated based on the given grasp instruction and the initial scene image. (b) Relying solely on foundation visual models, the human-image-to-robot-action transfer module reconstructs the 3D hand–object interaction from the generated grasp image, retargets the human grasp to the robot’s dexterous hand, and aligns the grasp with the real-world object 6D pose to obtain an executable dexterous grasp action. (c) A force-sensing adaptive grasping strategy executes the grasp by dynamically adjusting finger motions according to force feedback, ensuring stable and reliable grasp execution.
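As a rough illustration of step (c), the loop below closes each finger until a fingertip force threshold is reached. The `hand` interface, the per-finger closing command, and the numeric thresholds are all assumptions made for this sketch, not APIs or values from the paper.

```python
import numpy as np

def force_sensing_grasp(hand, target_close, force_threshold=1.5, step=0.02, max_steps=200):
    """Illustrative force-feedback closing loop (one closing command per finger).

    `hand` is a hypothetical dexterous-hand interface exposing per-finger closing
    commands and fingertip force readings; threshold and step sizes are made up.
    """
    close = np.asarray(hand.get_finger_positions(), dtype=float)
    target = np.asarray(target_close, dtype=float)

    for _ in range(max_steps):
        forces = np.asarray(hand.read_fingertip_forces(), dtype=float)
        # Fingers below the contact threshold keep closing toward their targets;
        # fingers that already sense enough force hold still, so the grasp
        # tightens adaptively without crushing fragile objects.
        still_closing = forces < force_threshold
        close = np.where(still_closing, np.minimum(close + step, target), close)
        hand.set_finger_positions(close)

        if not still_closing.any():
            return True   # all fingers sense sufficient contact force: stable grasp
        if np.allclose(close, target):
            return False  # fully closed without enough contact (e.g., missed the object)
    return False          # ran out of iterations
```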
@article{wei2025omnidexgrasp,
  author    = {Yi-Lin Wei and Zhexi Luo and Yuhao Lin and Mu Lin and Zhizhao Liang and Shuoyu Chen and Wei-Shi Zheng},
  title     = {OmniDexGrasp: Generalizable Dexterous Grasping via Foundation Model and Force Feedback},
  journal   = {arXiv},
  year      = {2025},
} 