Hacker News

ThePhysicist · today at 1:05 PM · 2 replies

This is really cool! I'm trying to find a way to accelerate LLM inference for PII detection, where speed is really necessary since we want to process millions of log lines per minute. I'm wondering how fast we could get e.g. Llama 3.1 to run on a conventional NVIDIA card? 10k tokens per second would be fantastic, but even 1k would be very useful.


Replies

lopuhin · today at 6:08 PM

For that you only need high throughput, which is much easier to achieve than low latency, thanks to batching -- assuming the log lines or chunks can be processed independently. You can check the TensorRT-LLM benchmarks (https://nvidia.github.io/TensorRT-LLM/developer-guide/perf-o...), or try running vLLM on a card you have access to.
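For reference, a minimal sketch of what batched offline inference with vLLM looks like -- the model name, redaction prompt, and sampling settings here are illustrative assumptions, not recommendations:

```python
# Minimal sketch of batched offline inference with vLLM.
# Model name, prompt wording, and sampling settings are assumptions.
from vllm import LLM, SamplingParams

# Deterministic, short outputs keep generation cheap; most of the work is
# prompt processing, which batches well across many log lines.
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

log_lines = [
    "2024-05-01 12:00:01 user=jane.doe@example.com ip=203.0.113.7 action=login",
    "2024-05-01 12:00:02 user=bob ip=198.51.100.23 action=download file=report.pdf",
]

# vLLM schedules and batches these prompts internally; throughput scales
# with batch size until the GPU is saturated.
prompts = [
    f"Redact all PII in the following log line, otherwise return it unchanged:\n{line}"
    for line in log_lines
]

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text.strip())
```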

freakynit · today at 1:35 PM

PII redaction is a really good use-case.

Also, "10k tokens per second would be fantastic" might not be even remotely sufficient if you want to "process millions of log lines per minute".

Assuming a single log line is just 100 tokens and you process 2 million lines per minute, you need 100 * 2,000,000 / 60 ≈ 3.3 million tokens per second of processing speed :)
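A back-of-the-envelope version of that arithmetic, with the 2 million lines/minute, 100 tokens/line, and 10k tokens/s-per-GPU figures taken as assumptions from the comments above:

```python
# Back-of-the-envelope throughput estimate; all inputs are assumptions
# from the thread, not measurements.
lines_per_minute = 2_000_000
tokens_per_line = 100

required_tokens_per_second = lines_per_minute * tokens_per_line / 60
print(f"{required_tokens_per_second:,.0f} tokens/s")  # ~3,333,333 tokens/s

# At a hypothetical 10k tokens/s per GPU, that implies hundreds of GPUs.
per_gpu_tokens_per_second = 10_000
print(f"~{required_tokens_per_second / per_gpu_tokens_per_second:.0f} GPUs")  # ~333
```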
