Paper Review - SIMPLER


Evaluating Real-World Robot Manipulation Policies in Simulation

Xuanlin Li*1, Kyle Hsu*2, Jiayuan Gu*1, Karl Pertsch†2,3, Oier Mees†3, Homer Rich Walke3, Chuyuan Fu4, Ishikaa Lunawat2, Isabel Sieh2, Sean Kirmani4, Sergey Levine3, Jiajun Wu2, Chelsea Finn2, Hao Su‡1, Quan Vuong‡4, Ted Xiao‡4

*Equal contribution †Core contributors ‡Equal advising

1UC San Diego, 2Stanford University, 3UC Berkeley, 4Google DeepMind

1. How should we evaluate policies trained on real robot data?

Train on real, evaluate in real

Generally, roboticists evaluate policies trained on real-world data in the real world. However, this approach has several problems:

  • People bump into cameras
  • Gripper gets stuck
  • Real-world evaluation is slow and tedious
  • Difficult to reproduce experiments

2. Potential SIMPLER way

SIMPLER stands for Simulated Manipulation Policy Evaluation for Real Robot Setups.

Train on real, evaluate in sim

3. How do real and SIMPLER performance correlate?

The paper demonstrates a strong correlation between policy performance in the real world and in SIMPLER environments.

4. Problem definition

  • NOT to obtain 1:1 reproduction of policies’ real-world behavior
  • To guide policy improvement decisions
  • Construct a simulator S with a strong correlation between relative performances in real and sim

5. Metrics

The paper proposes two metrics to quantify how well performance in sim tracks performance in real:

  • Pearson correlation coefficient (Pearson r)
  • Mean Maximum Rank Violation (MMRV)
Pearson r measures the linear correlation between real and simulated success rates. MMRV is introduced to overcome a limitation of Pearson r: it directly quantifies how badly simulation misranks policies relative to their real-world ordering.
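
Below is a minimal sketch of how the two metrics could be computed from per-policy success rates, assuming the MMRV definition in which each rank violation between a pair of policies is weighted by the gap in their real-world success rates; the success-rate values are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def mmrv(real, sim):
    """Mean Maximum Rank Violation between real and simulated success rates.

    For each policy i, find the worst pairing j where sim orders (i, j)
    differently than real, weighted by the real-world performance gap.
    """
    real, sim = np.asarray(real, dtype=float), np.asarray(sim, dtype=float)
    n = len(real)
    worst = np.zeros(n)
    for i in range(n):
        for j in range(n):
            misranked = (sim[i] < sim[j]) != (real[i] < real[j])
            worst[i] = max(worst[i], abs(real[i] - real[j]) * misranked)
    return worst.mean()

# Hypothetical success rates for four policies, in real and in SIMPLER.
real = [0.82, 0.55, 0.10, 0.67]
sim = [0.78, 0.62, 0.15, 0.58]

r, _ = pearsonr(real, sim)
print(f"Pearson r = {r:.3f}, MMRV = {mmrv(real, sim):.3f}")
```

In this toy example the simulator swaps the ranking of two policies, so MMRV becomes nonzero while Pearson r remains high, which is exactly the kind of ranking error MMRV is designed to expose.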

6. Challenges to building a real-to-sim evaluation system

6.1 Mitigating the Real-to-Sim Control Gap - SysID (System Identification)

  • Optimize the PD values (stiffness and damping factors) of the simulated arm controller
  • Replay the actions of a real demo trajectory in both sim and real and compare the resulting trajectories
  • Minimize the resulting loss function to mitigate the control gap (see the sketch below)
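
A rough sketch of this SysID loop is shown below. It assumes a hypothetical simulator interface; `sim_env`, `real_traj`, `set_controller_gains`, and `fit_gains` are illustrative names, not the authors' actual API. The idea: replay the logged real-robot actions in simulation under candidate stiffness/damping values and keep the gains that minimize the trajectory discrepancy.

```python
import numpy as np

def control_loss(sim_env, real_traj, stiffness, damping):
    """Replay the real trajectory's actions in sim with candidate PD gains and
    accumulate the distance between simulated and logged end-effector poses.
    `sim_env` / `real_traj` are hypothetical stand-ins for a simulation
    environment and a recorded real-robot trajectory."""
    sim_env.set_controller_gains(stiffness=stiffness, damping=damping)
    sim_env.reset(qpos=real_traj["initial_qpos"])
    loss = 0.0
    for action, real_ee_pose in zip(real_traj["actions"], real_traj["ee_poses"]):
        sim_ee_pose = sim_env.step(action)
        loss += np.linalg.norm(sim_ee_pose - real_ee_pose)
    return loss / len(real_traj["actions"])

def fit_gains(sim_env, real_traj):
    """Coarse grid search over stiffness/damping; the paper optimizes these
    controller parameters, but the search strategy here is only illustrative."""
    candidates = [(kp, kd) for kp in np.logspace(1, 3, 5) for kd in np.logspace(0, 2, 5)]
    return min(candidates, key=lambda g: control_loss(sim_env, real_traj, *g))
```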

6.2 Mitigating the Real-to-Sim Visual Gap

The goal is to match the simulator's visuals to those of the real-world environment with only modest manual effort, using two techniques: green screening and texture matching.

Green screening replaces the simulated background with a real-world background image; texture matching replaces object (and robot) textures with textures taken from real-world images.
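
As an illustration of the green-screening idea, here is a minimal mask-based compositing sketch; this is a simplification for clarity, not the authors' exact pipeline.

```python
import numpy as np

def green_screen_composite(sim_rgb, sim_fg_mask, real_background):
    """Overlay the simulated foreground (robot arm and task objects) onto a
    photo of the real workstation, using the simulator's segmentation mask.

    sim_rgb:         (H, W, 3) rendered simulation frame
    sim_fg_mask:     (H, W) boolean mask, True where foreground is rendered
    real_background: (H, W, 3) real-world background image, same resolution
    """
    return np.where(sim_fg_mask[..., None], sim_rgb, real_background)
```

Because the background is a static photo, this compositing assumes a fixed camera and drops shadows, which the conclusion lists among the limitations of green screening.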

7. Simulation setup

Two manipulation setups are used for different tasks.

  • Setup 1: Google Robot
  • Setup 2: WidowX robot (BridgeData V2 dataset)

8. Investigations for simulation evaluations

The paper investigates the following key questions:

  • Relative performances in sim and real
  • Sensitivity to various visual distribution shifts
  • Sensitivity to control and visual gaps
  • Sensitivity to physical property gaps
  • Do the results extend to a different physics simulator?

9. Experiment setup

They evaluate several open-source robot policies on the simulation setups described above. For the Google Robot, four versions of robot arm and gripper colors are used. The evaluated policies are:

  • RT-1 (Begin)
  • RT-1 (15%)
  • RT-1 (Converged)
  • RT-1-X
  • RT-2-X
  • Octo-Base
  • Octo-Small

10. Results

SIMPLER can be used to evaluate diverse sets of rigid-body tasks (non-articulated / articulated objects, tabletop / non-tabletop tasks, shorter / longer horizon tasks), with many intra-task variations (e.g., different object combinations; different object / robot positions and orientations), for each of two robot embodiments (Google Robot and WidowX).

10.1 Evaluating and comparing policies

Policy performances evaluated in SIMPLER show a strong correlation with those in the real world (reflected in low MMRV and high Pearson r).

10.2 Analyzing and predicting policy behaviors under distribution shifts

SIMPLER can be used to analyze policies' fine-grained behaviors, such as their robustness to common distribution shifts in lighting, backgrounds, camera poses, distractor objects, and table textures.

11. Ablations

  • The control (SysID) loss tracks the Mean Maximum Rank Violation: a larger control gap yields worse rank preservation.
  • The real-to-sim success gap is minimized when all visual aspects of the experiments match.
  • Varying the physical properties of the objects leaves the correlation intact, with ≤ 15% impact on success rates.
  • The real-to-sim performance correlation carries over to a different physics simulator.

12. Conclusion

  • SIMPLER appears to be a good proxy for real-world policy evaluation
  • Limitations
    • No manipulation tasks with soft (deformable) objects
    • No tasks with high motion dynamics
    • Green screening
      • Requires fixed cameras
      • Loses shadows and fine visual details
    • Manual effort in creating simulation evaluation environments is still high

13. BibTex

To cite this paper:

@article{li24simpler,
  title={Evaluating Real-World Robot Manipulation Policies in Simulation},
  author={Xuanlin Li and Kyle Hsu and Jiayuan Gu and Karl Pertsch and Oier Mees and Homer Rich Walke and Chuyuan Fu and Ishikaa Lunawat and Isabel Sieh and Sean Kirmani and Sergey Levine and Jiajun Wu and Chelsea Finn and Hao Su and Quan Vuong and Ted Xiao},
  journal={arXiv preprint arXiv:2405.05941},
  year={2024},
}