Open access for the next 5 hours (8GiB model, running on an RTX 3090), or until the server crashes or this spot instance gets taken away :) =>
https://ofo1j9j6qh20a8-80.proxy.runpod.net
./build/bin/llama-server \
-m ../Bonsai-8B.gguf \
-ngl 999 \
--flash-attn on \
--host 0.0.0.0 \
--port 80 \
--ctx-size 65500 \
--batch-size 512 \
--ubatch-size 512 \
--parallel 5 \
--cont-batching \
--threads 8 \
--threads-batch 8 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--log-colors on
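The `--cache-type-k/v q4_0` flags quantize the KV cache (q4_0 packs 32 elements into an 18-byte block, i.e. ~4.5 bits each instead of 16). A back-of-envelope sketch of what that saves at this context size, assuming Llama-3-8B-style geometry — 32 layers, 8 KV heads via GQA, head dim 128 — which is an assumption here; check the actual GGUF metadata for the real values:

```shell
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim elements.
# Assumed geometry: 32 layers, 8 KV heads, head dim 128 (verify in the GGUF).
awk 'BEGIN {
  ctx   = 65500
  elems = 2 * 32 * 8 * 128       # elements per token
  f16   = ctx * elems * 2        # f16: 2 bytes per element
  q4_0  = ctx * elems * 18 / 32  # q4_0: 18-byte block per 32 elements
  printf "f16 : %.2f GiB\nq4_0: %.2f GiB\n", f16 / 2^30, q4_0 / 2^30
}'
```

So quantizing both caches cuts the full-context KV footprint to a bit over a quarter of its f16 size, at some quality cost.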
The server can handle 5 parallel requests, each capped at around `13K` tokens. A few benchmarks I ran:
1. Input: 700 tokens, ttft: ~0 seconds, output: 1822 tokens at ~190 t/s
2. Input: 6400+ tokens, ttft: ~2 seconds, output: 2012 tokens at ~135 t/s
VRAM usage was consistently at ~4GiB.
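The ~13K per-request cap is just the total context divided among the slots — llama-server splits `--ctx-size` evenly across its `--parallel` slots:

```shell
# Each of the 5 slots gets an equal share of the 65500-token context.
echo $((65500 / 5))   # -> 13100 tokens per slot
```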
I genuinely love talking to these models
https://ofo1j9j6qh20a8-80.proxy.runpod.net/#/chat/5554e479-0...
I'm contemplating whether I should drive or walk to the car wash (I just thought of that one HN post) and this is what it said after a few back-and-forths:
- Drive to the car (5 minutes), then park and wash.
- If you have a car wash nearby, you can walk there (2 minutes) and do the washing before driving to your car.
- If you're in a car wash location, drive to it and wash there.
Technically the last point was fine, but I like the creativity.
Update: this has been evicted by runpod as it was on spot.
Thank you! I am impressed by the speed of it.
That was really impressive. It produced https://pastebin.com/PmJmTLJN pretty much instantly. (Very weak models can't do this.)
Kind sir, may I say thanks to you for doing so! I really appreciate it :D
Better keep the KV cache in full precision