It can handle image and audio inputs, but it can't produce either as output: it's purely a text-output model.
Yeah you're right. Also, you're Simon :)