
adrian_b, yesterday at 11:15 AM

Many open-weights LLMs accept images as well as text as input.

Besides those, there are a few smaller open-weights models dedicated to OCR tasks, for instance DeepSeek-OCR-2 and IBM granite-vision-4.1-4b (they can be found on huggingface.co).

The dedicated vision models can run on much cheaper hardware, including smartphones, than the big models that process images in addition to text.

Similarly, besides the bigger multimodal models that accept audio, images or text as input, there are smaller open-weights models dedicated to speech recognition, e.g. Xiaomi MiMo-V2.5-ASR and IBM granite-speech-4.1-2b.