A better way would be a VLA (vision-language-action model) as opposed to a VLM (vision-language model). VLAs are meant to output actions, whereas VLMs are for general use. See https://cognitivedrone.github.io/