
freakynit | today at 5:49 AM | 7 replies

Open access for the next 5 hours (8GiB model, running on an RTX 3090), or until the server crashes or this spot instance gets taken away :) =>

https://ofo1j9j6qh20a8-80.proxy.runpod.net

  ./build/bin/llama-server \
   -m ../Bonsai-8B.gguf \
   -ngl 999 \
   --flash-attn on \
   --host 0.0.0.0 \
   --port 80 \
   --ctx-size 65500 \
   --batch-size 512 \
   --ubatch-size 512 \
   --parallel 5 \
   --cont-batching \
   --threads 8 \
   --threads-batch 8 \
   --cache-type-k q4_0 \
   --cache-type-v q4_0 \
   --log-colors on
The server can serve 5 parallel requests, with each request capped at around `13K` tokens...
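
The ~13K per-request cap follows from the flags above: llama-server divides the total context window evenly across its parallel slots. A minimal arithmetic sketch:

```python
# llama-server splits --ctx-size evenly across --parallel slots,
# so each slot gets ctx_size / parallel tokens of context.
ctx_size = 65500   # --ctx-size
parallel = 5       # --parallel

per_slot = ctx_size // parallel
print(per_slot)  # 13100, i.e. the ~13K-token cap per request
```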

A few benchmarks I ran:

1. Input: 700 tokens, ttfs: ~0 seconds, output: 1822 tokens at ~190 t/s

2. Input: 6400+ tokens, ttfs: ~2 seconds, output: 2012 tokens at ~135 t/s

VRAM usage was consistently at ~4GiB.
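For a sense of what the q4_0 cache flags buy here, a rough KV-cache size estimate. The model dimensions below are assumptions (Llama-8B-style: 32 layers, 8 KV heads, head dim 128); Bonsai-8B's actual config may differ.

```python
# Rough KV-cache size estimate. Dims are ASSUMED (Llama-8B-style);
# Bonsai-8B's actual config may differ.
n_layers, n_kv_heads, head_dim = 32, 8, 128
n_ctx = 65500  # --ctx-size

def kv_cache_gib(bytes_per_elem):
    # K and V caches: one element per layer, KV head, head dim, and position
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * bytes_per_elem / 2**30

print(f"f16:  {kv_cache_gib(2.0):.1f} GiB")    # ~8.0 GiB
print(f"q4_0: {kv_cache_gib(18/32):.1f} GiB")  # ~2.2 GiB (18 bytes per 32-elem q4_0 block)
```

Under those assumed dims, quantizing the cache to q4_0 saves several GiB at full context, which is consistent with the whole setup fitting comfortably in ~4GiB.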


Replies

ggerganov | today at 6:53 AM

Better keep the KV cache in full precision
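Following this suggestion would mean dropping the `--cache-type-k q4_0 --cache-type-v q4_0` flags from the command above, since f16 is llama.cpp's default cache precision (it can also be set explicitly). A sketch of the changed flags only:

```shell
# Keep the KV cache in full (f16) precision, per the suggestion above.
# Either omit the --cache-type-* flags entirely (f16 is the default),
# or set them explicitly:
./build/bin/llama-server \
 -m ../Bonsai-8B.gguf \
 --cache-type-k f16 \
 --cache-type-v f16
 # ...other flags as in the original command
```

The trade-off: an f16 cache uses roughly 3-4x the memory of q4_0 at the same context size, so it may require lowering `--ctx-size` to fit.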

ramon156 | today at 9:38 AM

I genuinely love talking to these models

https://ofo1j9j6qh20a8-80.proxy.runpod.net/#/chat/5554e479-0...

I'm contemplating whether I should drive or walk to the car wash (I just thought of that one HN post) and this is what it said after a few back-and-forths:

- Drive to the car (5 minutes), then park and wash.

- If you have a car wash nearby, you can walk there (2 minutes) and do the washing before driving to your car.

- If you're in a car wash location, drive to it and wash there.

Technically the last point was fine, but I like the creativity.

freakynit | today at 10:48 AM

Update: this has been evicted by RunPod, as it was a spot instance.

TRCat | today at 6:48 AM

Thank you! I am impressed by the speed of it.

logicallee | today at 5:58 AM

That was really impressive. https://pastebin.com/PmJmTLJN pretty much instantly. (Very weak models can't do this.)

Imustaskforhelp | today at 8:39 AM

Kind sir, may I say thanks to you for doing so! I really appreciate it :D
