gaeld • today at 12:53 PM

Thanks a lot! Much appreciated.

To answer your questions:

- yes, we rewrite the whole model code (while keeping the same logic) in CUDA/HIP and assembly, in order to optimize by hand for each GPU type. It's quite tedious for sure, but I guess this is the price to pay to get these kinds of results.

- the batching question is a great one. In agentic systems, there is probably a trade-off between sequential thinking/iterations and parallel exploration of multiple solutions. Also, there could just be multiple independent tasks running in parallel, depending on the use case.

We plan to support a small amount of batching, but it quickly becomes a trade-off against speed. You pick whichever matters more for your use case, I guess.

Also to consider: because we answer requests much faster, we can process a lot of them without needing large batch sizes, and scaling across multiple nodes is possible.

- open sourcing: maybe, maybe not. I'm still undecided on this. We are a small startup and I'm told that giving our IP away might be shooting ourselves in the foot. On the other hand, I think it could be of great benefit to the community and to us... we'll see
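
To make the batching vs speed point a bit more concrete, here is a rough back-of-the-envelope sketch. The `throughput` helper and all the numbers are made up purely for illustration (they are not measurements of our system); it only shows that a small batch served quickly can match the throughput of a large batch served slowly, while each request waits far less.

```python
def throughput(batch_size: int, latency_s: float) -> float:
    """Requests completed per second when one batch of `batch_size`
    requests takes `latency_s` seconds end to end."""
    return batch_size / latency_s


# Hypothetical numbers, chosen only to illustrate the trade-off.
large_batch = throughput(batch_size=32, latency_s=8.0)  # big batch, slow per request
small_batch = throughput(batch_size=2, latency_s=0.5)   # tiny batch, fast per request

print(f"large batch: {large_batch:.1f} req/s, each request waits 8.0 s")
print(f"small batch: {small_batch:.1f} req/s, each request waits 0.5 s")
```

With these assumed values both setups complete the same number of requests per second, but the small-batch one returns each answer 16x sooner, which matters for sequential agentic loops.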
