
woodson | 05/04/2025 | 0 replies

Oh, that’s pretty slow. Have you tried quantization (e.g., int8 or int8_float32)? In my experience that can noticeably speed up CT2 execution; a sketch of what that looks like is below.
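
For reference, a minimal sketch using the CTranslate2 Python API with a quantized compute type (the model directory and tokenizer path are hypothetical placeholders; it assumes a model already converted with one of the ct2 converters):

    import ctranslate2
    import sentencepiece as spm

    # compute_type="int8" quantizes weights to int8; "int8_float32"
    # keeps activations in float32, which can be more accurate on some CPUs.
    translator = ctranslate2.Translator(
        "ct2_model_dir",        # hypothetical path to the converted model
        device="cpu",
        compute_type="int8",    # or "int8_float32"
    )

    # Hypothetical SentencePiece tokenizer shipped alongside the model.
    sp = spm.SentencePieceProcessor(model_file="ct2_model_dir/source.spm")

    tokens = sp.encode("Hello, world!", out_type=str)
    result = translator.translate_batch([tokens])
    print(sp.decode(result[0].hypotheses[0]))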

Personally, I haven’t had much luck with small-ish decoder-only models (i.e., typical LLMs) for translation. Sure, GPT-4 etc. work extremely well, but not so much local models capable of running on small form-factor devices. Perhaps I should revisit that.