Dex4D: Task-Agnostic Point Track Policy for
Sim-to-Real Dexterous Manipulation

CMU
Equal Advising

Summary Video

Overview

Overview: an Anypose-to-Anypose point track policy learned in simulation, combined with video generation and 4D reconstruction as a high-level planner, enables generalizable deployment to diverse real-world tasks.

Task-Agnostic Point Track Policy

Task-Agnostic Sim-to-Real via Anypose-to-Anypose


We propose Anypose-to-Anypose (AP2AP), a task-agnostic sim-to-real learning formulation for dexterous manipulation. AP2AP abstracts manipulation as directly transforming an object from an arbitrary initial pose to an arbitrary target pose in 3D space, without assuming task-specific structure, predefined grasps, or motion primitives. Conditioned on point tracks, we train our AP2AP policy on over 3,000 objects in simulation, and directly deploy the learned policy to real-world dexterous manipulation tasks without any real-robot data collection or policy fine-tuning.
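As a concrete illustration, the sketch below shows one way an AP2AP training goal could be constructed in simulation: sample an arbitrary relative target pose and apply it to the current object points to obtain the goal point positions. The function name, sampling ranges, and library choices are illustrative assumptions, not the actual Dex4D training code.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def sample_ap2ap_goal(object_points):
    """object_points: (N, 3) points sampled on the object at its current (arbitrary) pose."""
    # Sample an arbitrary relative target pose: a uniformly random rotation plus a translation.
    rel_rot = R.random().as_matrix()                  # (3, 3) rotation matrix
    rel_trans = np.random.uniform(-0.2, 0.2, size=3)  # meters; illustrative range
    # The goal is simply the same object points expressed under the target pose,
    # so no task-specific structure, predefined grasp, or motion primitive is assumed.
    center = object_points.mean(axis=0)
    target_points = (object_points - center) @ rel_rot.T + center + rel_trans
    return target_points                              # (N, 3)
```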



Paired Point Encoding


Comparison of our Paired Point Encoding with other representations. Point features encoded with Paired Point Encoding preserve the correspondence and permutation invariance of the current and target object points, leading to better policy learning performance.
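A minimal sketch of how such a paired encoding could be implemented, assuming a shared per-point MLP followed by symmetric max pooling; layer sizes and names are illustrative, not the exact Dex4D architecture.

```python
import torch
import torch.nn as nn

class PairedPointEncoder(nn.Module):
    def __init__(self, hidden=256, out_dim=256):
        super().__init__()
        # Shared MLP applied to each (current, target) point pair, so the
        # current/target correspondence is kept inside every per-point feature.
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, cur_points, tgt_points):
        # cur_points, tgt_points: (B, N, 3), index-aligned across the two point sets.
        pairs = torch.cat([cur_points, tgt_points], dim=-1)   # (B, N, 6)
        per_point = self.mlp(pairs)                           # (B, N, out_dim)
        # Max pooling over points makes the encoding permutation-invariant.
        return per_point.max(dim=1).values                    # (B, out_dim)
```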


Teacher-Student Policy Learning


Overview of the Dex4D teacher and student network architectures. (a) We first learn a teacher policy via RL with privileged states and full points sampled over the whole object, leveraging our proposed Paired Point Encoding representation. (b) Given partial observations, i.e., robot proprioception, the last action, and masked paired points, we distill from the teacher into a transformer-based student action world model that jointly predicts actions and future robot states.
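The sketch below illustrates the shape of one student distillation step under this setup, assuming simple MSE objectives for the action-imitation and future-state-prediction terms; the function, observation packing, and loss weight are illustrative, not the exact Dex4D training code.

```python
import torch.nn.functional as F

def student_distill_step(student, partial_obs, teacher_action, future_robot_states,
                         optimizer, state_loss_weight=1.0):
    """partial_obs: dict of proprioception, last action, and masked paired points."""
    pred_action, pred_future_states = student(partial_obs)
    # (a) imitate the teacher's action from partial observations only
    action_loss = F.mse_loss(pred_action, teacher_action)
    # (b) jointly predict future robot states (the "action world model" head)
    state_loss = F.mse_loss(pred_future_states, future_robot_states)
    loss = action_loss + state_loss_weight * state_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```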

Video Planner and Deployment

Video Generation & 4D Reconstruction as High-Level Planner

"Put the broccoli đŸĨĻ on the plate đŸŊ."

Video Generation
â†˜ī¸
Relative Depth Estimation
âŦ‡ī¸
Point Tracking
â†™ī¸
Object-Centric Target Point Tracks (interactive viewer đŸ•šī¸)

Besides video generation, target object point tracks can also be obtained from diverse data sources, such as one-shot human video demonstrations or 3D point cloud forecasting models.
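Schematically, the planner chains these components as sketched below; the video, depth, and tracking models are injected as placeholder callables standing in for the off-the-shelf modules, and the lifting step assumes known camera intrinsics. This is a sketch of the data flow, not the actual Dex4D implementation.

```python
import numpy as np

def plan_target_point_tracks(first_frame, instruction, query_points_2d, intrinsics,
                             video_model, depth_model, point_tracker):
    """Chain video generation, relative depth, and point tracking into 3D target tracks.

    video_model, depth_model, and point_tracker are placeholder callables for the
    off-the-shelf models; query_points_2d is (N, 2) pixel coordinates on the object
    in the first frame; intrinsics is a dict with fx, fy, cx, cy.
    """
    video = video_model(first_frame, instruction)          # list of (H, W, 3) frames
    depths = [depth_model(frame) for frame in video]       # per-frame relative depth maps
    tracks_2d, _ = point_tracker(video, query_points_2d)   # (T, N, 2) pixel tracks
    tracks_3d = []
    for t, pts in enumerate(tracks_2d):
        u, v = pts[:, 0], pts[:, 1]
        z = depths[t][v.astype(int), u.astype(int)]        # depth at each tracked pixel
        x = (u - intrinsics["cx"]) / intrinsics["fx"] * z
        y = (v - intrinsics["cy"]) / intrinsics["fy"] * z
        tracks_3d.append(np.stack([x, y, z], axis=-1))
    tracks_3d = np.stack(tracks_3d)                        # (T, N, 3)
    # One simple way to make the tracks object-centric: subtract the initial object center.
    return tracks_3d - tracks_3d[0].mean(axis=0, keepdims=True)
```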


Real-Time Point Tracking

We also develop a real-time sparse point tracker on top of CoTracker for perception during test-time deployment. The tracker takes the first frame of the generated video and the initially tracked query points, and tracks the object points in real time during execution. The tracked points are directly consumed by our Dex4D policy for closed-loop control.
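The loop below is a hedged sketch of how such an online tracker could be driven, based on the publicly released CoTracker online interface via torch.hub; the hub entry point, argument names, and buffering scheme are assumptions and may differ from our actual tracker.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Hub entry point follows the public CoTracker release (an assumption here).
tracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_online").to(device)

def run_realtime_tracking(frame_stream, query_points):
    """frame_stream yields (H, W, 3) uint8 frames; query_points: (N, 3) rows of (t, x, y)."""
    queries = torch.as_tensor(query_points, dtype=torch.float32, device=device)[None]
    frames, initialized = [], False
    window = tracker.step * 2
    for i, frame in enumerate(frame_stream):
        frames.append(torch.as_tensor(frame, device=device).permute(2, 0, 1).float())
        del frames[:-window]                                  # keep only the recent window
        if len(frames) < window or (i + 1) % tracker.step != 0:
            continue                                          # wait for the next chunk of frames
        chunk = torch.stack(frames, dim=0)[None]              # (1, T, 3, H, W)
        if not initialized:
            tracker(video_chunk=chunk, is_first_step=True, queries=queries)
            initialized = True
            continue
        pred_tracks, pred_visibility = tracker(video_chunk=chunk)
        # Latest tracked points (and visibility) feed the Dex4D policy for closed-loop control.
        yield pred_tracks[0, -1], pred_visibility[0, -1]
```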

Real-World Demo

Autonomous, 2x speed
Object Mesh Not Known
Zero Real-Robot Demonstrations
Everything except the robot itself is unseen to the policy

Gallery




Comparison with Motion Planning Baseline

Task: LiftToy


Ours ✅
Baseline ❌

The baseline is unaware of the hand-object grasp state, so the object gradually slips out of the hand as the arm moves, due to the lack of feedback. In contrast, our method reacts with the hand, learning to adjust or regrasp the object and proceed with the task.

Task: Pour


Ours ✅
Baseline ❌

The baseline is highly vulnerable to sparse and noisy visible points, whereas our method, thanks to extensive simulation training, performs robustly even when fewer than 10 visible object points remain.



Generalization

Background & Camera
Distractors
Objects & Tasks
Perturbation

We demonstrate strong generalization to unseen object types and poses, backgrounds, camera views, task trajectories, and external disturbances.


BibTex

If you find our work useful in your research, please consider citing:


TODO

If you have any questions, please contact Yuxuan Kuang.