LIBERO-LQ
A Latency- and Quality-Aware Benchmark for Vision-Language-Action Models
Anonymous Authors
Abstract.
Vision-Language-Action (VLA) models have shown strong performance on robotic manipulation tasks through large-scale multimodal pre-training. However, existing benchmarks typically assume negligible inference latency and rely primarily on binary success metrics, limiting their ability to evaluate deployment-relevant execution behavior. We introduce LIBERO-LQ, a Latency- and Quality-aware benchmark for evaluating VLA models under realistic execution constraints. LIBERO-LQ decouples policy inference from environment stepping, enabling simulated execution to progress during non-negligible inference latency and supporting systematic evaluation of synchronous and asynchronous execution paradigms under identical latency conditions. Beyond task success and completion time, LIBERO-LQ incorporates a suite of trajectory-level, non-functional metrics that capture execution quality, including smoothness, stability, efficiency, and energy consumption. Through experiments on representative VLA models, we show that latency-aware and quality-oriented evaluation exposes substantial differences in execution robustness and trajectory quality that are obscured under conventional synchronous evaluation.
Why VLA Latency Matters

Execution pipeline of a Vision-Language-Action (VLA) system under realistic deployment conditions.

High-frequency sensing and low-level control operate continuously, while VLA policy inference runs at a much lower rate and updates position targets asynchronously. During inference latency, the robot continues executing previously issued commands via the low-level controller, resulting in decoupled policy computation and control execution. This mismatch is typically ignored in existing evaluation protocols.
Latency
1. Latency-Aware Simulation.
Comparison between existing benchmarks and LIBERO-LQ

Comparison between conventional VLA evaluation and our latency-aware execution model.

Left: Traditional evaluation tightly couples inference and environment stepping, implicitly assuming zero simulated inference latency \( \Delta t^{\text{sim}}_{\text{inf}} = 0 \). Right: Our approach decouples inference from environment stepping by explicitly injecting simulated inference latency. During inference, the simulator advances via repeated env.step calls using a synchronous or asynchronous execution protocol, where the number of steps \(N\) is determined by \( \Delta t^{\text{sim}}_{\text{inf}} \).
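The relation between the injected latency and the step count \(N\) can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the control period `control_dt_s`, the round-up rule, and the `CountingEnv` stand-in are assumptions introduced here, since the text only states that \(N\) is determined by \( \Delta t^{\text{sim}}_{\text{inf}} \).

```python
import math


def steps_during_inference(latency_s: float, control_dt_s: float) -> int:
    """Number of simulator steps N that elapse during one inference call.

    Assumption: N = ceil(latency / control period), so any nonzero latency
    costs at least one control step. The small epsilon guards against
    floating-point ratios such as 7.000000000000001.
    """
    if latency_s < 0 or control_dt_s <= 0:
        raise ValueError("latency must be >= 0 and control period must be > 0")
    return max(0, math.ceil(latency_s / control_dt_s - 1e-9))


class CountingEnv:
    """Hypothetical stand-in for a simulator; it only counts step calls."""

    def __init__(self) -> None:
        self.t = 0

    def step(self, action):
        self.t += 1
        return self.t


def advance_during_inference(env, held_action, latency_s, control_dt_s) -> int:
    """Step the environment N times while the policy is 'thinking'."""
    n = steps_during_inference(latency_s, control_dt_s)
    for _ in range(n):
        env.step(held_action)  # which action is applied depends on the protocol
    return n
```

With a 50 ms control period, a 120 ms simulated latency advances the simulator by three steps under this rounding rule, while zero latency reproduces the conventional coupled evaluation.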
2. Policy Execution Protocol.

Comparison of synchronous and asynchronous policy execution under inference latency. Each row illustrates action chunks produced at successive policy steps, while shaded regions indicate control timesteps. (a) Synchronous execution blocks policy updates during inference, causing the robot to reuse the last available action chunk and exhibit stop-and-go behavior. (b) Asynchronous execution decouples inference from execution by continuously consuming actions from an execution buffer, allowing new action chunks to replace future commands without stalling execution. This visualization highlights how inference latency induces temporal misalignment and affects trajectory continuity under different execution paradigms.
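The difference between the two protocols can be illustrated with a toy timeline. The sketch below is an assumption-laden simplification rather than the benchmark's code: latency is expressed in control steps, each inference returns a chunk of `chunk_len` actions, and in the asynchronous case the first `latency_steps` actions of a new chunk are discarded as stale because they were planned for steps that have already elapsed. A step is labeled "hold" when no fresh action is available and the last command is reused.

```python
from collections import deque


def run_sync(total_steps: int, chunk_len: int, latency_steps: int) -> list:
    """Synchronous: execution waits for inference, then plays the whole chunk."""
    log = []
    while len(log) < total_steps:
        log.extend(["hold"] * latency_steps)  # inference blocks policy updates
        log.extend(["act"] * chunk_len)       # fresh chunk is executed in full
    return log[:total_steps]


def run_async(total_steps: int, chunk_len: int, latency_steps: int) -> list:
    """Asynchronous: inference runs in the background of buffered execution."""
    buffer = deque()           # execution buffer of pending action indices
    countdown = latency_steps  # control steps until the in-flight inference ends
    log = []
    for _ in range(total_steps):
        if countdown <= 0:
            # New chunk arrives and replaces all future commands; its first
            # latency_steps actions are stale and skipped (assumption).
            buffer = deque(range(latency_steps, chunk_len))
            countdown = latency_steps  # next inference starts immediately
        if buffer:
            buffer.popleft()
            log.append("act")
        else:
            log.append("hold")
        countdown -= 1
    return log
```

For `total_steps=20`, `chunk_len=8`, and `latency_steps=3`, the synchronous timeline contains six "hold" steps (one stall per inference), whereas the asynchronous timeline holds only during the initial three warm-up steps.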

Quality.

Overview of LIBERO-LQ.

Existing robotic manipulation benchmarks primarily evaluate policies using functional metrics such as binary task success. While effective for measuring task completion, such metrics provide limited insight into execution behavior and deployment-relevant performance. To address this limitation, we include several non-functional metrics, which characterize execution quality, efficiency, and robustness.

1. Inference Latency

The delay between observing the environment and producing an action. Inference latency directly determines how long high-level commands are held while the robot continues executing, fundamentally shaping downstream motion behavior.

2. End-to-End Task Completion Time

The total wall-clock time required to complete a task, including both inference and execution.

3. Smoothness (Jerk)

Abrupt changes in motion, quantified via end-effector jerk. Latency often induces stop-and-go behavior and discontinuities that are invisible to success rate but manifest as high jerk, leading to unstable or unsafe motions.
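One common way to quantify jerk from sampled end-effector positions is a third-order finite difference. The sketch below assumes uniformly sampled 3D positions and reports the mean jerk magnitude; the function name and the choice of aggregate (mean of the Euclidean norm) are illustrative assumptions, as the exact formulation used by the benchmark is not specified here.

```python
import math


def mean_jerk_magnitude(positions, dt):
    """Mean Euclidean norm of jerk over a trajectory.

    positions: list of (x, y, z) samples taken at a fixed period dt (seconds).
    """
    if len(positions) < 4:
        raise ValueError("need at least 4 samples for a third difference")
    norms = []
    for k in range(len(positions) - 3):
        p0, p1, p2, p3 = positions[k:k + 4]
        # Third-order forward difference approximates d^3 p / dt^3.
        jerk = [(d - 3 * c + 3 * b - a) / dt ** 3
                for a, b, c, d in zip(p0, p1, p2, p3)]
        norms.append(math.sqrt(sum(j * j for j in jerk)))
    return sum(norms) / len(norms)
```

A constant-velocity trajectory yields zero under this measure, while a stop-and-go trajectory produces large values at every start and stop.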

4. Stability (EE Oscillation)

Short-horizon oscillations of the end-effector position during task execution. Even successful rollouts may exhibit oscillatory behavior under delayed policy updates, revealing control instability caused by inference latency.
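The text does not give a formula for end-effector oscillation, so the following is only one plausible proxy: the fraction of consecutive displacement pairs that reverse direction (negative dot product). The name `reversal_fraction` and the metric itself are assumptions made for illustration.

```python
def reversal_fraction(positions):
    """Fraction of consecutive end-effector displacements that reverse direction.

    positions: list of (x, y, z) samples. Returns a value in [0, 1].
    """
    # Per-step displacement vectors between successive samples.
    steps = [tuple(b - a for a, b in zip(p, q))
             for p, q in zip(positions, positions[1:])]
    pairs = list(zip(steps, steps[1:]))
    if not pairs:
        return 0.0
    # A negative dot product means the motion turned back on itself.
    reversals = sum(1 for u, v in pairs
                    if sum(a * b for a, b in zip(u, v)) < 0)
    return reversals / len(pairs)
```

A straight-line reach scores 0.0 and a pure back-and-forth zigzag scores 1.0; real rollouts fall in between.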

5. Trajectory Efficiency

The total path length required to complete a task. Longer or unnecessarily winding trajectories indicate inefficiencies that success metrics alone fail to penalize.
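Path length can be computed by summing the Euclidean distances between successive samples. The sketch assumes the path is measured on end-effector positions, which the text does not state explicitly.

```python
import math


def path_length(positions):
    """Sum of straight-line distances between consecutive position samples."""
    return sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))
```

For the waypoints (0, 0, 0), (3, 4, 0), (3, 4, 12) the result is 17.0, the sum of segment lengths 5 and 12.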

6. Joint Rotation Efficiency

The cumulative amount of joint motion executed over a trajectory. Excessive joint rotation, especially back-and-forth motion, signals inefficient execution that increases wear, energy consumption, and execution time.
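A straightforward reading of cumulative joint motion is the sum of absolute joint-angle changes over all joints and timesteps. The sketch below adopts that reading as an assumption; joint angles are taken to be in radians.

```python
def cumulative_joint_rotation(joint_angles):
    """Total absolute joint displacement over a trajectory.

    joint_angles: list of per-timestep tuples, one angle (rad) per joint.
    """
    total = 0.0
    for q_prev, q_next in zip(joint_angles, joint_angles[1:]):
        # Back-and-forth motion accumulates even if the net change is zero.
        total += sum(abs(b - a) for a, b in zip(q_prev, q_next))
    return total
```

Because absolute values are summed, a joint that swings out by 0.5 rad and returns contributes 1.0 rad, even though its net displacement is zero.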

7. Energy Efficiency

Trajectory-level energy consumption estimated from joint torques and velocities, reported in normalized form. Latency-induced behaviors can significantly increase energy usage even when task success is unchanged, exposing inefficiencies hidden by conventional evaluation.
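A simple estimate integrates the absolute mechanical power \(|\tau_j \dot{q}_j|\) over joints and time. The sketch below makes two assumptions not fixed by the text: absolute power is used (no credit for regenerative braking), and normalization, if desired, divides by a caller-supplied `reference` energy such as a baseline rollout.

```python
def trajectory_energy(torques, velocities, dt, reference=None):
    """Approximate energy as the sum over time and joints of |tau * qdot| * dt.

    torques, velocities: lists of per-timestep tuples (one entry per joint).
    reference: optional positive baseline energy used for normalization.
    """
    if len(torques) != len(velocities):
        raise ValueError("torques and velocities must have the same length")
    energy = 0.0
    for tau_t, qd_t in zip(torques, velocities):
        # Absolute mechanical power summed over joints, integrated over dt.
        energy += sum(abs(tau * qd) for tau, qd in zip(tau_t, qd_t)) * dt
    return energy / reference if reference else energy
```

Passing the energy of a zero-latency rollout as `reference` would express each result as a ratio to that baseline, which is one way to obtain a normalized value.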
Experimental Results.

We evaluate representative Vision-Language-Action (VLA) models under identical task settings while varying inference latency and execution protocols. LIBERO-LQ reveals systematic differences in execution behavior that are obscured under conventional success-only evaluation.

Latency Exposes Hidden Failure Modes

As inference latency increases, task success rates alone fail to capture degradation in execution quality. Even when success rates remain comparable, delayed policy updates introduce stop-and-go behavior, oscillations, and inefficient motion patterns.

Related metrics: Smoothness (Jerk), Stability (EE Oscillation)

Synchronous vs. Asynchronous Execution

Under identical simulated inference latency, synchronous execution blocks policy updates and reuses stale actions, leading to discontinuous trajectories. In contrast, asynchronous execution maintains continuous motion by decoupling inference from execution, resulting in lower jerk and reduced oscillatory behavior.

Related metrics: Smoothness (Jerk), Stability (EE Oscillation), End-to-End Time

Similar Success Rates Can Hide Differences in Quality and Efficiency

We observe cases where policies achieve similar task success rates but differ significantly in trajectory efficiency, joint rotation, and energy consumption. These discrepancies highlight inefficiencies that remain invisible under traditional evaluation protocols.

Related metrics: Trajectory Efficiency, Joint Rotation Efficiency, Energy Efficiency

Metric-Driven Evaluation Enables Fine-Grained Comparison

By jointly analyzing latency, execution protocols, and trajectory-level metrics, LIBERO-LQ enables fine-grained and deployment-relevant comparison between VLA models. This reveals trade-offs between robustness, efficiency, and energy consumption that are otherwise masked by binary success metrics.

Summary

LIBERO-LQ demonstrates that inference latency fundamentally reshapes execution behavior. Evaluating VLA models without accounting for latency and execution quality can lead to misleading conclusions about real-world performance.