Execution pipeline of a Vision-Language-Action (VLA) system under realistic deployment conditions.
Comparison between conventional VLA evaluation and our latency-aware execution model.
During inference, the simulator advances through env.step calls under a synchronous or asynchronous execution protocol, where the number of steps \(N\) is determined by the simulated inference latency \( \Delta t^{\text{sim}}_{\text{inf}} \).
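As a minimal sketch of this stepping rule: if one inference call takes \( \Delta t^{\text{sim}}_{\text{inf}} \) of simulated time while the environment runs at a fixed control period, the simulator must advance \(N\) control steps before the new action chunk becomes available. The function and variable names below are illustrative assumptions, not the benchmark's API.

```python
import math

def latency_steps(dt_inf: float, dt_ctrl: float) -> int:
    """Number of control steps elapsed during one inference call.

    dt_inf:  simulated inference latency (seconds), assumed name
    dt_ctrl: control period of the environment (seconds), assumed name
    """
    return math.ceil(dt_inf / dt_ctrl)

# e.g. 250 ms of inference latency at a 20 ms control period
# means the environment steps 13 times before the next chunk arrives.
N = latency_steps(0.25, 0.02)
```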
Comparison of synchronous and asynchronous policy execution under inference latency. Each row illustrates action chunks produced at successive policy steps, while shaded regions indicate control timesteps. (a) Synchronous execution blocks policy updates during inference, causing the robot to reuse the last available action chunk and exhibit stop-and-go behavior. (b) Asynchronous execution decouples inference from execution by continuously consuming actions from an execution buffer, allowing new action chunks to replace future commands without stalling execution. This visualization highlights how inference latency induces temporal misalignment and affects trajectory continuity under different execution paradigms.
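The two execution paradigms in the figure can be sketched as a toy, non-threaded timeline simulation. This is an illustration under assumed names (`chunks` is a list of action chunks emitted at successive policy steps, `lat` the inference latency in control steps), not the benchmark's implementation: the synchronous variant holds the stale last command while inference blocks, while the asynchronous variant keeps consuming a buffer and lets each new chunk overwrite the not-yet-executed commands.

```python
from collections import deque

def synchronous_rollout(chunks, lat, horizon):
    """Execution blocks during inference: between chunks, the last
    command is held for `lat` control steps (stop-and-go behavior)."""
    executed = []
    for chunk in chunks:
        if executed:
            executed.extend([executed[-1]] * lat)  # stall: reuse stale action
        executed.extend(chunk)
    return executed[:horizon]

def asynchronous_rollout(chunks, lat, horizon):
    """Inference overlaps execution: buffered actions keep the robot
    moving, and each new chunk replaces the unexecuted buffer tail."""
    buffer, executed = deque(chunks[0]), []
    for chunk in chunks[1:]:
        for _ in range(lat):          # consume the buffer while inferring
            executed.append(buffer.popleft())
        buffer = deque(chunk)         # new chunk overwrites future commands
    executed.extend(buffer)
    return executed[:horizon]
```

Running both on the same chunk stream makes the difference visible: the synchronous timeline contains repeated (held) actions at every chunk boundary, while the asynchronous timeline is gap-free but skips commands that were superseded mid-chunk.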
Overview of LIBERO-LQ.
1. Inference Latency
The delay between observing the environment and producing an action. Inference latency directly determines how long high-level commands are held while the robot continues executing, fundamentally shaping downstream motion behavior.

2. End-to-End Task Completion Time

The total wall-clock time required to complete a task, including both inference and execution.

3. Smoothness (Jerk)

Abrupt changes in motion, quantified via end-effector jerk. Latency often induces stop-and-go behavior and discontinuities that are invisible to success rate but manifest as high jerk, leading to unstable or unsafe motions.

4. Stability (EE Oscillation)

Short-horizon oscillations of the end-effector position during task execution. Even successful rollouts may exhibit oscillatory behavior under delayed policy updates, revealing control instability caused by inference latency.

5. Trajectory Efficiency

The total path length required to complete a task. Longer or unnecessarily winding trajectories indicate inefficiencies that success metrics alone fail to penalize.

6. Joint Rotation Efficiency

The cumulative amount of joint motion executed over a trajectory. Excessive joint rotation, especially back-and-forth motion, signals inefficient execution that increases wear, energy consumption, and execution time.

7. Energy Efficiency

Trajectory-level energy consumption estimated from joint torques and velocities, reported in normalized form. Latency-induced behaviors can significantly increase energy usage even when task success is unchanged, exposing inefficiencies hidden by conventional evaluation.

We evaluate representative Vision-Language-Action (VLA) models under identical task settings while varying inference latency and execution protocols. LIBERO-LQ reveals systematic differences in execution behavior that are obscured under conventional success-only evaluation.
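As a rough illustration, the trajectory-level metrics defined above could be computed from a recorded rollout with standard finite differences. The array names (`ee_pos` as a (T, 3) end-effector trajectory, `q` as (T, J) joint angles, `tau`/`qdot` as torques and velocities sampled at period `dt`) and the exact formulas are assumptions for this sketch, not the benchmark's definitions.

```python
import numpy as np

def path_length(ee_pos):
    """Trajectory efficiency: total Euclidean path length of the EE."""
    return float(np.linalg.norm(np.diff(ee_pos, axis=0), axis=1).sum())

def mean_jerk(ee_pos, dt):
    """Smoothness: mean magnitude of the third finite difference / dt^3."""
    jerk = np.diff(ee_pos, n=3, axis=0) / dt**3
    return float(np.linalg.norm(jerk, axis=1).mean())

def ee_oscillation(ee_pos):
    """Stability: count of per-axis velocity sign reversals (direction flips)."""
    v = np.diff(ee_pos, axis=0)
    return int((np.sign(v[1:]) * np.sign(v[:-1]) < 0).sum())

def joint_rotation(q):
    """Joint-rotation efficiency: cumulative absolute joint motion."""
    return float(np.abs(np.diff(q, axis=0)).sum())

def energy(tau, qdot, dt):
    """Energy: integral of |torque . velocity| over the trajectory."""
    return float(np.abs((tau * qdot).sum(axis=1)).sum() * dt)
```

A straight-line trajectory scores zero jerk and zero oscillation under these definitions, while a latency-induced zigzag of the same endpoints scores higher on every metric, which is exactly the gap that success rate alone cannot see.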
Latency Exposes Hidden Failure Modes
As inference latency increases, task success rates alone fail to capture degradation in execution quality. Even when success rates remain comparable, delayed policy updates introduce stop-and-go behavior, oscillations, and inefficient motion patterns.
Related metrics: Smoothness (Jerk), Stability (EE Oscillation)
Synchronous vs. Asynchronous Execution
Under identical simulated inference latency, synchronous execution blocks policy updates and reuses stale actions, leading to discontinuous trajectories. In contrast, asynchronous execution maintains continuous motion by decoupling inference from execution, resulting in lower jerk and reduced oscillatory behavior.
Related metrics: Smoothness (Jerk), Stability (EE Oscillation), End-to-End Time
Quality and Efficiency Are Not Correlated with Success
We observe cases where policies achieve similar task success rates but differ significantly in trajectory efficiency, joint rotation, and energy consumption. These discrepancies highlight inefficiencies that remain invisible under traditional evaluation protocols.
Related metrics: Trajectory Efficiency, Joint Rotation Efficiency, Energy Efficiency
Metric-Driven Evaluation Enables Fine-Grained Comparison
By jointly analyzing latency, execution protocols, and trajectory-level metrics, LIBERO-LQ enables fine-grained and deployment-relevant comparison between VLA models. This reveals trade-offs between robustness, efficiency, and energy consumption that are otherwise masked by binary success metrics.
Summary
LIBERO-LQ demonstrates that inference latency fundamentally reshapes execution behavior. Evaluating VLA models without accounting for latency and execution quality can lead to misleading conclusions about real-world performance.