Dex4D: Task-Agnostic Point Track Policy for
Sim-to-Real Dexterous Manipulation

CMU
Equal Advising

Summary Video

Overview

Overview: an Anypose-to-Anypose point track policy learned in simulation, combined with video generation and 4D reconstruction as a high-level planner, enables generalizable deployment to diverse real-world tasks.

Task-Agnostic Point Track Policy

Task-Agnostic Sim-to-Real via Anypose-to-Anypose


We propose Anypose-to-Anypose (AP2AP), a task-agnostic sim-to-real learning formulation for dexterous manipulation. AP2AP abstracts manipulation as directly transforming an object from an arbitrary initial pose to an arbitrary target pose in 3D space, without assuming task-specific structure, predefined grasps, or motion primitives. Conditioned on point tracks, we train our AP2AP policy on over 3,000 objects in simulation, and directly deploy the learned policy to real-world dexterous manipulation tasks without any real-robot data collection or policy fine-tuning.
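As a concrete illustration, the sketch below shows one way an AP2AP training goal could be constructed in simulation: sample an arbitrary relative target pose and apply it to the current object points to obtain the goal point positions. The function name, sampling ranges, and library choices are illustrative assumptions, not the actual Dex4D training code.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def sample_ap2ap_goal(object_points):
    """object_points: (N, 3) points sampled on the object at its current (arbitrary) pose."""
    # Sample an arbitrary relative target pose: a uniformly random rotation plus a translation.
    rel_rot = R.random().as_matrix()                  # (3, 3) rotation matrix
    rel_trans = np.random.uniform(-0.2, 0.2, size=3)  # meters; illustrative range
    # The goal is simply the same object points expressed under the target pose,
    # so no task-specific structure, predefined grasp, or motion primitive is assumed.
    center = object_points.mean(axis=0)
    target_points = (object_points - center) @ rel_rot.T + center + rel_trans
    return target_points                              # (N, 3)
```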



Paired Point Encoding


Comparison of our Paired Point Encoding with other representations. Point features encoded with Paired Point Encoding preserve the correspondence and permutation invariance of the current and target object points, leading to better policy learning performance.
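A minimal sketch of how such a paired encoding could be implemented, assuming a shared per-point MLP followed by symmetric max pooling; layer sizes and names are illustrative, not the exact Dex4D architecture.

```python
import torch
import torch.nn as nn

class PairedPointEncoder(nn.Module):
    def __init__(self, hidden=256, out_dim=256):
        super().__init__()
        # Shared MLP applied to each (current, target) point pair, so the
        # current/target correspondence is kept inside every per-point feature.
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, cur_points, tgt_points):
        # cur_points, tgt_points: (B, N, 3), index-aligned across the two point sets.
        pairs = torch.cat([cur_points, tgt_points], dim=-1)   # (B, N, 6)
        per_point = self.mlp(pairs)                           # (B, N, out_dim)
        # Max pooling over points makes the encoding permutation-invariant.
        return per_point.max(dim=1).values                    # (B, out_dim)
```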


Teacher-Student Policy Learning


Overview of the Dex4D teacher and student network architectures. (a) We first learn a teacher policy via RL with privileged states and full points sampled over the whole object, leveraging our proposed Paired Point Encoding representation. (b) Given partial observations, i.e., robot proprioception, the last action, and masked paired points, we distill from the teacher into a transformer-based student action world model that jointly predicts actions and future robot states.
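The sketch below illustrates the shape of one student distillation step under this setup, assuming simple MSE objectives for the action-imitation and future-state-prediction terms; the function, observation packing, and loss weight are illustrative, not the exact Dex4D training code.

```python
import torch.nn.functional as F

def student_distill_step(student, partial_obs, teacher_action, future_robot_states,
                         optimizer, state_loss_weight=1.0):
    """partial_obs: dict of proprioception, last action, and masked paired points."""
    pred_action, pred_future_states = student(partial_obs)
    # (a) imitate the teacher's action from partial observations only
    action_loss = F.mse_loss(pred_action, teacher_action)
    # (b) jointly predict future robot states (the "action world model" head)
    state_loss = F.mse_loss(pred_future_states, future_robot_states)
    loss = action_loss + state_loss_weight * state_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```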

Video Planner and Deployment

Video Generation & 4D Reconstruction as High-Level Planner

"Put the broccoli đŸĨĻ on the plate đŸŊ."

Video Generation
â†˜ī¸
Relative Depth Estimation
âŦ‡ī¸
Point Tracking
â†™ī¸
Object-Centric Target Point Tracks (interactive viewer đŸ•šī¸)

Besides video generation, target object point tracks can also be obtained from diverse data sources, such as one-shot human video demonstrations or 3D point cloud forecasting models.
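Schematically, the planner chains these components as sketched below; the video, depth, and tracking models are injected as placeholder callables standing in for the off-the-shelf modules, and the lifting step assumes known camera intrinsics. This is a sketch of the data flow, not the actual Dex4D implementation.

```python
import numpy as np

def plan_target_point_tracks(first_frame, instruction, query_points_2d, intrinsics,
                             video_model, depth_model, point_tracker):
    """Chain video generation, relative depth, and point tracking into 3D target tracks.

    video_model, depth_model, and point_tracker are placeholder callables for the
    off-the-shelf models; query_points_2d is (N, 2) pixel coordinates on the object
    in the first frame; intrinsics is a dict with fx, fy, cx, cy.
    """
    video = video_model(first_frame, instruction)          # list of (H, W, 3) frames
    depths = [depth_model(frame) for frame in video]       # per-frame relative depth maps
    tracks_2d, _ = point_tracker(video, query_points_2d)   # (T, N, 2) pixel tracks
    tracks_3d = []
    for t, pts in enumerate(tracks_2d):
        u, v = pts[:, 0], pts[:, 1]
        z = depths[t][v.astype(int), u.astype(int)]        # depth at each tracked pixel
        x = (u - intrinsics["cx"]) / intrinsics["fx"] * z
        y = (v - intrinsics["cy"]) / intrinsics["fy"] * z
        tracks_3d.append(np.stack([x, y, z], axis=-1))
    tracks_3d = np.stack(tracks_3d)                        # (T, N, 3)
    # One simple way to make the tracks object-centric: subtract the initial object center.
    return tracks_3d - tracks_3d[0].mean(axis=0, keepdims=True)
```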


Real-Time Point Tracking

We also develop a real-time sparse point tracker on top of CoTracker for perception during test-time deployment. The tracker takes the first frame of the generated video and the initially tracked query points, and tracks the object points in real time during execution. The tracked points are directly consumed by our Dex4D policy for closed-loop control.
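The loop below is a hedged sketch of how such an online tracker could be driven, based on the publicly released CoTracker online interface via torch.hub; the hub entry point, argument names, and buffering scheme are assumptions and may differ from our actual tracker.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Hub entry point follows the public CoTracker release (an assumption here).
tracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_online").to(device)

def run_realtime_tracking(frame_stream, query_points):
    """frame_stream yields (H, W, 3) uint8 frames; query_points: (N, 3) rows of (t, x, y)."""
    queries = torch.as_tensor(query_points, dtype=torch.float32, device=device)[None]
    frames, initialized = [], False
    window = tracker.step * 2
    for i, frame in enumerate(frame_stream):
        frames.append(torch.as_tensor(frame, device=device).permute(2, 0, 1).float())
        del frames[:-window]                                  # keep only the recent window
        if len(frames) < window or (i + 1) % tracker.step != 0:
            continue                                          # wait for the next chunk of frames
        chunk = torch.stack(frames, dim=0)[None]              # (1, T, 3, H, W)
        if not initialized:
            tracker(video_chunk=chunk, is_first_step=True, queries=queries)
            initialized = True
            continue
        pred_tracks, pred_visibility = tracker(video_chunk=chunk)
        # Latest tracked points (and visibility) feed the Dex4D policy for closed-loop control.
        yield pred_tracks[0, -1], pred_visibility[0, -1]
```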

Real-World Demo

Autonomous, 2x speed
Object Mesh Not Known
Zero Real-Robot Demonstrations
Everything except the robot itself is unseen to the policy

Gallery




Comparison with Motion Planning Baseline

Task: LiftToy


Ours ✅
Baseline ❌

The baseline is unaware of the hand-object grasp state, so the object gradually slips out of the hand as the arm moves, due to the lack of feedback. In contrast, our method reacts with the hand, learning to adjust or regrasp the object and proceed with the task.

Task: Pour


Ours ✅
Baseline ❌

The baseline is highly vulnerable to sparse and noisy visible points, whereas our method, thanks to extensive simulation training, performs robustly even when fewer than 10 visible object points remain.



Generalization

Background & Camera
Distractors
Objects & Tasks
Perturbation

We demonstrate strong generalization to unseen object types and poses, backgrounds, camera views, task trajectories, and external disturbances.


BibTex

If you find our work useful in your research, please consider citing:


TODO

If you have any questions, please contact Yuxuan Kuang.