SwitchVLA

Execution-Aware Task Switching for Vision-Language-Action Models

Meng Li1,*, Zhen Zhao1,*, Zhengping Che1,*, Fei Liao1,
Kun Wu1, Zhiyuan Xu1, Pei Ren1, Zhao Jin1, Ning Liu1, Jian Tang1,†
*Equal contribution †Corresponding author
1Beijing Innovation Center of Humanoid Robotics


We introduce SwitchVLA, a unified execution-aware framework for Vision-Language-Action (VLA) robots. By conditioning action generation on both task language and fine-grained execution signals, SwitchVLA enables seamless transitions across forward, rollback, and advance behaviors without relying on additional demonstration data, modular planners, or handcrafted switching logic. It thereby offers a more generalizable solution for instruction-conditioned control, unifying diverse switching behaviors within a single policy framework.


(Task switch evaluation videos of SwitchVLA)

The SwitchVLA Policy

SwitchVLA establishes a unified architecture for robust and instruction-consistent task execution. The architecture consists of two core components: (i) the Visual-Language-Contact (VLC) Embedding Module, which encodes visual, language, and contact cues into unified representations; and (ii) the Conditional Execution Expert, which decodes behavior-aware actions conditioned on the current multimodal embedding.
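To make the data flow between the two components concrete, here is a minimal interface sketch in plain Python. It is not the SwitchVLA implementation: the class and function names, the concatenation used as the "embedding", and the single linear layer standing in for the learned Conditional Execution Expert are all assumptions chosen only to show how visual, language, and contact inputs could be fused and then decoded into an action vector.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class VLCInput:
    """Inputs for one control step (all field names are hypothetical)."""
    visual: Sequence[float]    # stand-in for image features
    language: Sequence[float]  # stand-in for instruction features
    contact: Sequence[float]   # stand-in for contact cues, e.g. [1.0] while grasping


def vlc_embed(x: VLCInput) -> List[float]:
    """Toy VLC embedding: concatenate the three modalities into one vector."""
    return [*x.visual, *x.language, *x.contact]


def execution_expert(embedding: Sequence[float],
                     weights: Sequence[Sequence[float]]) -> List[float]:
    """Toy decoder: one linear layer mapping the embedding to an action vector."""
    if any(len(row) != len(embedding) for row in weights):
        raise ValueError("each weight row must match the embedding length")
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


step = VLCInput(visual=[0.5, 0.25], language=[1.0, 0.0], contact=[1.0])
embedding = vlc_embed(step)
# Two hypothetical action dimensions; the weights are arbitrary placeholders.
weights = [[1.0, 0.0, 0.0, 0.0, 2.0],
           [0.0, 1.0, 1.0, 0.0, 0.0]]
action = execution_expert(embedding, weights)
```

The point of the sketch is only that the contact cue enters the same embedding as vision and language, so the decoded action can depend on the execution state as well as on the instruction.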

The real-robot dataset was collected via human teleoperation on two dual-arm Franka Emika Panda workstations. For each trajectory, recordings include data from two wrist-mounted cameras, one third-person RGB camera, and the robots’ proprioceptive state sensors. The simulation dataset is drawn from the GOAL task suite of the LIBERO benchmark.
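One possible way to represent a recorded trajectory with the streams listed above is sketched below. The camera names, field names, and the use of raw bytes for images are hypothetical; the actual storage format of the dataset is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical camera identifiers: two wrist-mounted cameras plus one third-person view.
CAMERA_NAMES = ("wrist_left", "wrist_right", "third_person")


@dataclass
class Frame:
    images: Dict[str, bytes]  # camera name -> encoded RGB image
    proprio: List[float]      # proprioceptive state of the robot at this step


@dataclass
class Trajectory:
    instruction: str                      # language instruction for the demonstration
    frames: List[Frame] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if any frame lacks one of the expected camera streams."""
        for i, frame in enumerate(self.frames):
            missing = set(CAMERA_NAMES) - set(frame.images)
            if missing:
                raise ValueError(f"frame {i} is missing cameras: {sorted(missing)}")


good = Trajectory(
    instruction="pick up lemon place on plate",
    frames=[Frame(images={name: b"" for name in CAMERA_NAMES}, proprio=[0.0] * 7)],
)
good.validate()  # passes silently when all three camera streams are present
```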

Experiments

Task Switch Evaluations on Robot Platforms

Task switching occurs when the execution of Task A is interrupted upon receipt of a new instruction for Task B. Based on this, we perform two types of task switching experiments: pairwise evaluation and long sequence evaluation. To evaluate the model’s capabilities comprehensively, in the pairwise experiments we send the new instruction at different execution phases: early (pre-contact), mid (in-contact), and late (post-action). We compare SwitchVLA against three representative prior manipulation policies evaluated on real robots: MT-ACT, Diffusion Policy (DP), and π0.
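The three switching phases can be illustrated on a recorded contact trace. The sketch below is an offline simplification, not the evaluation code: it assumes a per-step boolean contact signal, treats every step after the first contact has ended as "late", and picks the middle step of the requested phase as the moment the Task B instruction replaces Task A. All function names are hypothetical.

```python
from typing import List, Sequence


def label_phases(contact: Sequence[bool]) -> List[str]:
    """Label each step 'early' (pre-contact), 'mid' (in-contact), or 'late' (after contact)."""
    labels = []
    seen_contact = False
    for in_contact in contact:
        if in_contact:
            seen_contact = True
            labels.append("mid")
        else:
            labels.append("late" if seen_contact else "early")
    return labels


def instruction_schedule(contact: Sequence[bool], task_a: str, task_b: str,
                         switch_phase: str) -> List[str]:
    """Per-step instructions: task_a until the middle step of switch_phase, then task_b."""
    labels = label_phases(contact)
    steps = [i for i, label in enumerate(labels) if label == switch_phase]
    if not steps:
        raise ValueError(f"phase {switch_phase!r} does not occur in this trace")
    switch_at = steps[len(steps) // 2]
    return [task_a if i < switch_at else task_b for i in range(len(labels))]


contact_trace = [False, False, True, True, False]
phases = label_phases(contact_trace)
mid_switch = instruction_schedule(contact_trace, "A", "B", "mid")
```

In a closed-loop rollout the contact trace is not known in advance, so a real harness would have to decide when to send the new instruction from the live robot state; the sketch only shows how the early, mid, and late conditions differ.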



SwitchVLA Long Sequence Task Switching

The following videos showcase the SwitchVLA long sequence switching performance on real-world and simulation platforms.

Real-World Workstation 1

pick up lemon place on plate → pick up cookie place on plate → pick up coke bottle place on plate → pick up lemon place on plate → pick up coke bottle place on plate → slide open upper cabinet

Real-World Workstation 2

pick up red gum place on plate → pick up sandwich place on plate → push plate to customer → pick up tissue place on plate → pick up sandwich place on plate → pick up red gum place on plate

Simulation

put the cream cheese in the bowl → put the wine bottle on top of the cabinet → put the bowl on the stove → put the bowl on the plate → put the bowl on top of the cabinet → turn on the stove




Pairwise Task Switching Comparisons with State-of-the-Art Models

In real-world experiments, we evaluate SwitchVLA against MT-ACT, Diffusion Policy (DP), and π0. The π0 baseline is a re-implementation based on the original paper.

For simulation, we compare SwitchVLA with π0 and OpenVLA-OFT.

Workstation 1

Mid-Switch: pick up coke bottle place on plate → pick up lemon place on plate

SwitchVLA
Diffusion Policy (DP)
MT-ACT

Late-Switch: pick up cookie place on plate → pick up lemon place on plate

SwitchVLA
Diffusion Policy (DP)
π0

Mid-Switch: slide open upper cabinet → pick up coke bottle place on plate

SwitchVLA
Diffusion Policy (DP)
π0

Early-Switch: pick up lemon place on plate → pick up coke bottle place on plate

SwitchVLA
Diffusion Policy (DP)
MT-ACT



Workstation 2

Mid-Switch: pick up sandwich place on plate → push plate to customer

SwitchVLA
Diffusion Policy (DP)
MT-ACT

Early-Switch: pick up tissue place on plate → pick up red gum place on plate

SwitchVLA
Diffusion Policy (DP)
MT-ACT

Mid-Switch: push plate to customer → pick up tissue place on plate

SwitchVLA
π0
MT-ACT

Late-Switch: pick up red gum place on plate → push plate to customer

SwitchVLA
Diffusion Policy (DP)
MT-ACT



Simulation

Mid-Switch: put the bowl on the plate → turn on the stove

SwitchVLA
OpenVLA-OFT
π0

Late-Switch: put the bowl on top of the cabinet → turn on the stove

SwitchVLA
OpenVLA-OFT
π0

Mid-Switch: put the wine bottle on top of the cabinet → turn on the stove

SwitchVLA
OpenVLA-OFT
π0

Mid-Switch: put the bowl on top of the cabinet → turn on the stove

SwitchVLA
OpenVLA-OFT
π0

Late-Switch: put the bowl on the plate → turn on the stove

SwitchVLA
OpenVLA-OFT
π0