
woodson | 05/04/2025 | 0 replies

Oh, that’s pretty slow. Have you tried quantization (e.g., int8 or int8_float32)? In my experience that can noticeably speed up CT2 execution; a sketch of what that looks like is below.
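
For reference, a minimal sketch using the CTranslate2 Python API with a quantized compute type (the model directory and tokenizer path are hypothetical placeholders; it assumes a model already converted with one of the ct2 converters):

    import ctranslate2
    import sentencepiece as spm

    # compute_type="int8" quantizes weights to int8; "int8_float32"
    # keeps activations in float32, which can be more accurate on some CPUs.
    translator = ctranslate2.Translator(
        "ct2_model_dir",        # hypothetical path to the converted model
        device="cpu",
        compute_type="int8",    # or "int8_float32"
    )

    # Hypothetical SentencePiece tokenizer shipped alongside the model.
    sp = spm.SentencePieceProcessor(model_file="ct2_model_dir/source.spm")

    tokens = sp.encode("Hello, world!", out_type=str)
    result = translator.translate_batch([tokens])
    print(sp.decode(result[0].hypotheses[0]))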

Personally, I haven’t had much luck with small-ish decoder-only models (i.e., typical LLMs) for translation. Sure, GPT-4 etc. work extremely well, but not so much local models capable of running on small form-factor devices. Perhaps I should revisit that.