Ey7NFZ3P0nzAe • today at 5:42 AM

In my case, I was also running an ASR model and a TTS model, so it was a bit much for my RTX 3090. I opted to offload about 5 layers to the CPU while adding GPU-only speculative decoding with their 0.8B model as the draft model.

Working well so far.
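
For anyone unfamiliar with the speculative decoding part, here is a rough sketch of the idea, assuming greedy decoding. Everything in it is hypothetical and for illustration only: `toy_target` and `toy_draft` are stand-in functions, not real models or any library's API. The small draft model proposes a few tokens cheaply, the big model checks them, and the output ends up identical to what the big model would have produced on its own.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the large model generates one token per call."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes up to k tokens, target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes a short run of tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks each proposed token. A real engine scores the
        #    whole proposal in one batched forward pass; this loop only mimics
        #    the accept/reject logic.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own choice
            if guess != expected:
                break  # first mismatch: discard the rest of the proposal
        tokens.extend(accepted)
    return tokens


def toy_target(prefix: List[int]) -> int:
    # Stand-in for the large model (arbitrary deterministic rule).
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    # Stand-in for the small model: agrees with the target most of the time.
    guess = toy_target(prefix)
    return (guess + 1) % 50 if len(prefix) % 3 == 0 else guess
```

The speedup in practice comes from the draft being much cheaper per token than the target, which is why keeping the draft entirely on the GPU makes sense even when some of the main model's layers sit on the CPU.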