
lopuhin · yesterday at 6:08 PM

For that you only need high throughput, which is much easier to achieve than low latency thanks to batching -- assuming the log lines or chunks can be processed independently. You can check the TensorRT-LLM benchmarks (https://nvidia.github.io/TensorRT-LLM/developer-guide/perf-o...), or try running vLLM on a card you have access to.
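
As a rough illustration, here is a minimal sketch of offline batched inference with vLLM; the model name, prompt template, and sampling parameters are placeholders, not anything taken from the benchmarks above:

    # Minimal sketch: offline batched inference with vLLM.
    # Model name and prompt template are illustrative placeholders.
    from vllm import LLM, SamplingParams

    log_lines = ["<log line 1>", "<log line 2>"]  # independent items to process
    prompts = [f"Classify this log line: {line}" for line in log_lines]

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any supported model
    params = SamplingParams(temperature=0.0, max_tokens=64)

    # vLLM schedules all prompts together (continuous batching), so aggregate
    # throughput stays high even though no single request finishes quickly.
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.outputs[0].text)

The point is that because each line is an independent request, the engine can pack them into large batches and keep the GPU saturated, so throughput scales well even though per-request latency doesn't improve.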