Vision-Language-Action (VLA) models integrate visual perception, natural language understanding, and embodied control into a unified framework, enabling end-to-end task execution from multimodal instructions. While such models have demonstrated impressive generalization across tasks and environments, their direct outputs, typically discrete action tokens or waypoint sequences, frequently overlook key physical constraints such as trajectory feasibility, collision avoidance, and dynamic consistency. This limitation hinders deployment in safety-critical and dynamic real-world settings. Integrating motion planning into VLA systems offers a principled solution: embedding geometric and dynamic constraints into the control pipeline transforms high-level semantic goals into safe, smooth, and executable trajectories. This work examines representative integration strategies alongside the trade-offs between discrete tokenized outputs and continuous control policies. Applications are then analyzed, highlighting performance gains in generalization, safety, and execution efficiency. A discussion of current challenges, such as the balance between planning speed and precision and generalization across embodiments, is followed by prospective research directions, including continuous action prediction with hierarchical control, low-resource edge deployment, and multi-robot collaborative planning. The study underscores motion planning as a critical enabler for reliable, adaptable, and scalable embodied intelligence.
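As a minimal, illustrative sketch of how such a planning layer might refine raw VLA outputs (assuming 2D waypoints decoded from action tokens, a circular obstacle model, and made-up parameter values; the function and names are hypothetical and not drawn from any specific surveyed system), the following Python snippet applies a few gradient steps that trade off path smoothness against obstacle clearance while keeping the start pose and semantic goal fixed:

```python
import numpy as np

def smooth_waypoints(waypoints, obstacles, clearance=0.05,
                     smooth_weight=0.1, iters=200, step=0.01):
    """Refine coarse VLA waypoints into a smoother, collision-aware path.

    waypoints: (N, 2) array of 2D positions decoded from VLA action tokens.
    obstacles: list of (center, radius) circles to keep `clearance` away from.
    (Illustrative sketch only; parameters and obstacle model are assumptions.)
    """
    path = np.array(waypoints, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(path)
        # Smoothness term: pull each interior point toward its neighbors.
        grad[1:-1] += smooth_weight * (2 * path[1:-1] - path[:-2] - path[2:])
        # Obstacle term: push points away from circles they get too close to.
        for center, radius in obstacles:
            diff = path - np.asarray(center)
            dist = np.linalg.norm(diff, axis=1, keepdims=True)
            too_close = dist < (radius + clearance)
            grad -= np.where(too_close, diff / np.maximum(dist, 1e-6), 0.0)
        # Keep the endpoints (start pose and semantic goal) fixed.
        grad[0] = 0.0
        grad[-1] = 0.0
        path -= step * grad
    return path

# Example: three coarse waypoints from a hypothetical VLA head, one obstacle.
raw = [(0.0, 0.0), (0.5, 0.05), (1.0, 0.0)]
refined = smooth_waypoints(raw, obstacles=[((0.5, 0.0), 0.1)])
print(refined)
```

In a full system this post-processing role would be played by a proper motion planner or trajectory optimizer with kinematic and dynamic constraints; the sketch only conveys the division of labor between the semantic VLA output and the geometry-aware refinement stage.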