A Vision-Language Model-based Language-Action Cycle for
Self-Improving Robotic Manipulation
Anonymous Authors
LACY (Language-Action CYcle) enables bidirectional grounding between language and action for robotic manipulation through a self-improving loop of three tasks: language-to-action (L2A), action-to-language explanation (A2L), and language-consistency verification (L2C).
We present LACY (Language-Action Cycle), a unified vision-language framework that enables bidirectional grounding between language and action for robotic manipulation.
Unlike prior works that focus solely on language-to-action (L2A) learning, LACY jointly trains three complementary tasks—L2A, action-to-language explanation (A2L), and language-consistency verification (L2C)—within a single model.
This design forms a self-improving loop that autonomously generates and filters training data through the L2A→A2L→L2C cycle, significantly enhancing data efficiency and generalization.
Experiments on both simulation and real-world pick-and-place tasks show that LACY improves manipulation success rates by over 56%, demonstrating robust and scalable language-action understanding.
Simulation Demo.
Absolute and relative spatial reasoning: The robot understands both the absolute location of the target object and its relative spatial relationship to the reference object, successfully executing the pick-and-place task.
Real-world Demo.
Relative spatial reasoning: The robot understands the spatial relationship between the target object and the reference object and successfully executes the pick-and-place task.
Absolute spatial reasoning: The robot understands the absolute spatial semantics of the target object on the workspace and successfully executes the pick-and-place task.
Approach.
Overview of the LACY framework: A single vision-language model jointly performs three tasks—L2A, A2L, and L2C—forming a self-improving cycle where actions and language refine each other through consistency verification.
LACY (Language-Action Cycle) unifies three synergistic tasks—language-to-action generation (L2A), action-to-language explanation (A2L), and language-consistency verification (L2C)—within a single vision-language model.
The framework operates as a closed-loop system, where L2A produces actions from language, A2L reconstructs a linguistic description from those actions, and L2C verifies whether the two language descriptions remain semantically consistent.
This bidirectional cycle enables self-improving data generation: the model autonomously produces new training triplets and selectively retains only high-confidence samples through the L2C filter.
By combining multi-task fine-tuning, chain-of-thought grounding, and confidence-based data selection, LACY continually enhances its manipulation policy without additional human supervision.
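The cycle described above can be sketched as a simple data-generation loop. The three model calls below are hypothetical stubs standing in for prompts to the same fine-tuned vision-language model; the action format, threshold value, and function names are illustrative assumptions, not the paper's exact interface.

```python
def l2a(instruction, image):
    """Language-to-action: predict pick/place coordinates (stub)."""
    return {"pick": (0.30, 0.10), "place": (0.55, 0.25)}

def a2l(action, image):
    """Action-to-language: describe the observed action in words (stub)."""
    return "pick the red block and place it to the right of the bowl"

def l2c(original, reconstructed, image):
    """Language-consistency check: return a confidence in [0, 1] (stub)."""
    return 0.9 if "red block" in reconstructed else 0.2

def self_improve_step(instruction, image, threshold=0.8):
    """One L2A -> A2L -> L2C cycle; keep the triplet only if it passes the filter."""
    action = l2a(instruction, image)                   # generate an action
    description = a2l(action, image)                   # explain it back in language
    confidence = l2c(instruction, description, image)  # verify semantic consistency
    if confidence >= threshold:
        return (instruction, action, description)      # high-confidence training sample
    return None                                        # discard uncertain sample

sample = self_improve_step("move the red block right of the bowl", image=None)
```

Each retained triplet can be added back to the fine-tuning set, so the policy improves without new human annotations.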
1. Action to Language.
Language to Action, Action to Language: While conventional VLAs learn only the language-to-action (L2A) mapping, LACY additionally trains the action-to-language (A2L) inverse mapping, enabling the model to generate and explain actions bidirectionally.
Traditional vision-language-action (VLA) models are trained solely in a language-to-action (L2A) direction—mapping task instructions and visual inputs to corresponding robot actions.
While this paradigm enables imitation from language, it lacks the ability to reason about why an action is appropriate, limiting generalization and interpretability.
To address this, we introduce the action-to-language (A2L) counterpart, which learns to describe observed actions in natural language given the same visual context.
This inverse mapping encourages the model to form richer bidirectional representations, allowing it to both act and explain.
When combined with the L2C verifier, these two tasks form the foundation of LACY's self-improving language-action cycle.
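One way to train both directions in a single model is to serialize L2A and A2L examples into a shared prompt/target format and mix them in each batch. The templates and field names below are assumptions for illustration, not the paper's actual schema.

```python
def make_l2a_example(instruction, action):
    """Forward direction: instruction in, action tokens out."""
    return {"prompt": f"Instruction: {instruction}\nPredict the action:",
            "target": f"pick{action['pick']} place{action['place']}"}

def make_a2l_example(action, instruction):
    """Inverse direction: action tokens in, description out."""
    return {"prompt": f"Action: pick{action['pick']} place{action['place']}\nDescribe it:",
            "target": instruction}

action = {"pick": (0.30, 0.10), "place": (0.55, 0.25)}
batch = [make_l2a_example("put the red block in the bowl", action),
         make_a2l_example(action, "put the red block in the bowl")]
```

Because both tasks share one tokenizer and one set of weights, gradients from the A2L direction regularize the L2A policy and vice versa.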
2. Language Consistency Verification.
Language Consistency Verification (L2C): The L2C module verifies whether the generated action and reconstructed language remain consistent with the original instruction, enabling self-supervised feedback for continual improvement.
The language-consistency verification (L2C) module completes the language-action cycle by verifying whether the model's predicted action truly reflects the intent of the original language instruction.
After generating an action from a command and explaining it back in language, L2C checks if the two descriptions convey the same meaning under the same visual context.
This consistency feedback allows the model to evaluate its own outputs and identify uncertain cases without human supervision.
By learning from these self-assessments, LACY continuously refines its understanding of how language and action correspond, leading to more reliable and generalizable behavior.
L2C Pipeline: The L2C verifier assigns a confidence score to each language pair, allowing LACY to keep only reliable samples during self-supervised learning.
To make self-verification reliable, the L2C module also estimates how confident the model is in its consistency judgment.
Instead of a simple yes-or-no decision, LACY converts the model's internal output into a smooth confidence score that reflects how strongly it believes two descriptions match.
This confidence is then used to filter uncertain or noisy samples—only high-quality, consistent pairs are added back to the training data.
Through this selective feedback, LACY improves efficiently while avoiding the accumulation of its own errors.
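A minimal sketch of this soft scoring, assuming the verifier exposes logits for a "yes" (consistent) and a "no" (inconsistent) token: a softmax over the two logits yields a smooth confidence, and only samples above a threshold are fed back into training. The threshold value here is an illustrative assumption.

```python
import math

def consistency_confidence(yes_logit, no_logit):
    """Softmax probability of the 'yes' (consistent) token."""
    m = max(yes_logit, no_logit)        # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def keep_sample(yes_logit, no_logit, threshold=0.8):
    """Retain a self-generated sample only if the verifier is confident enough."""
    return consistency_confidence(yes_logit, no_logit) >= threshold
```

Because the score is continuous rather than binary, the threshold can be tuned to trade data quantity against label quality.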
Failure Cases.
Failure Cases: In most failures, the model selects the correct object, but the predicted picking location lies at the object's boundary, so the object tends to fall outside the graspable range.