Look at the table of supported modalities. It can take image/video/text/actions as input and produce image/video/text/actions as output.
That just raises more questions. What kind of "observation or action" image does the input generate? What is an action output if it's not text?