
lopuhin · yesterday at 6:08 PM

For that you only need high throughput, which is much easier to achieve than low latency thanks to batching -- assuming the log lines or chunks can be processed independently. You can check the TensorRT-LLM benchmarks (https://nvidia.github.io/TensorRT-LLM/developer-guide/perf-o...), or try running vLLM on a card you have access to.
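
As a rough illustration, here is a minimal sketch of offline batched inference with vLLM; the model name, prompt template, and sampling parameters are placeholders, not anything taken from the benchmarks above:

    # Minimal sketch: offline batched inference with vLLM.
    # Model name and prompt template are illustrative placeholders.
    from vllm import LLM, SamplingParams

    log_lines = ["<log line 1>", "<log line 2>"]  # independent items to process
    prompts = [f"Classify this log line: {line}" for line in log_lines]

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any supported model
    params = SamplingParams(temperature=0.0, max_tokens=64)

    # vLLM schedules all prompts together (continuous batching), so aggregate
    # throughput stays high even though no single request finishes quickly.
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.outputs[0].text)

The point is that because each line is an independent request, the engine can pack them into large batches and keep the GPU saturated, so throughput scales well even though per-request latency doesn't improve.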