LLaPa is a vision-language model (VLM) framework designed for multimodal procedural planning. It can generate executable action sequences based on textual task descriptions and visual environment ...
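
Below is a minimal usage sketch of how such a multimodal planner might be invoked, assuming a Hugging Face-style interface. The checkpoint name, prompt format, and expected output structure are assumptions for illustration, not the documented LLaPa API.

```python
# Hypothetical usage sketch: the model ID, prompt format, and output parsing
# below are assumptions, not the documented LLaPa interface.
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "your-org/llapa"  # placeholder checkpoint name (assumption)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

# Inputs: a textual task description plus an image of the current environment.
task = "Make a cup of pour-over coffee."
image = Image.open("kitchen_scene.jpg")

inputs = processor(text=task, images=image, return_tensors="pt").to("cuda")

# Generate an action sequence conditioned on both the task text and the image.
output_ids = model.generate(**inputs, max_new_tokens=256)
plan = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# The decoded text is expected to read as an ordered list of executable steps
# (the exact output format is an assumption).
print(plan)
```

In this sketch, the processor fuses the task description and the environment image into a single multimodal prompt, and the decoded output is treated as the step-by-step plan; the real framework may expose its own planner-specific entry points instead.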