What years of production-grade concurrency teaches us about building AI agents

118 points • by ellieh • yesterday at 10:37 PM • 44 comments • view on HN

Comments

I’ve built fairly large OTP systems in the past, and I think the core claim is directionally right: long lived, stateful, failure prone "conversations" map very naturally to Erlang processes plus supervision trees. An agent session is basically a call session with worse latency and more nondeterminism.

That said, a lot of current agent workloads are I/O bound around external APIs. If 95% of the time is waiting on OpenAI or Anthropic, the scheduling model matters less than people think. The BEAM’s preemption and per process GC shine when you have real contention or CPU heavy work in the same runtime. Many teams quietly push embeddings, parsing, or model hosting to separate services anyway.

Hot code swapping is genuinely interesting in this context. Updating agent logic without dropping in flight sessions is non trivial on most mainstream stacks. In practice though, many startups are comfortable with draining connections behind a load balancer and calling it a day.

So my take is: if you actually need millions of concurrent, stateful, soft real time sessions with strong fault isolation, the BEAM is a very sane default. If you are mostly gluing API calls together for a few thousand users, the runtime differences are less decisive than the surrounding tooling and hiring pool.

mccoyb • today at 3:04 AM

Broadly agree with the author's points, except for this one:

> TypeScript/Node.js: Better concurrency story thanks to the event loop, but still fundamentally single-threaded. Worker threads exist but they're heavyweight OS threads, not 2KB processes. There's no preemptive scheduling: one CPU-bound operation blocks everything.

This cannot be a real protest: 100% of the time spent in agent frameworks is spent ... waiting for the agent to respond, or waiting for a tool call to execute. Almost no time is spent in the logic of the framework itself.

Even if you use heavyweight OS threads, I just don't believe this matters.

Now, the other points about hot code swapping ... so true, painfully obvious to those of us who have used Elixir or Erlang.

For instance, OpenClaw: how much easier would "in-place updating" be if the language runtime was just designed with the ability in mind in the first place.

➕ show 1 reply

znnajdla • today at 11:01 AM

Is someone at Anthropic reading my AI chats? I literally came to this conclusion a few weeks ago after studying the right framework for building long running browser agents. Have zero experience with Elixir but as soon as I looked at their language constructs it just "clicked" immediately that this is exactly suited to AI agent frameworks. Another problem you solve immediately: distributed deployment without kubernetes hell.

➕ show 1 reply

simianwords • today at 6:11 AM

I don’t see the point of agent frameworks. Other than durability and checkpoints how does it help me?

Claude code already works as an agent that calls tools when necessary so it’s not clear how an abstraction helps here.

I have been really confused by langchain and related tech because they seem so bloated without offering me any advantages?

I genuinely would like to know what I’m missing.

➕ show 3 replies

zknill • today at 1:09 PM

There's two parts to this article. The scheduler/preemption, and the transport over the network. The article is absolutely right that long-lived request/response over HTTP connections with SSE streamed responses suck.

The article touches very briefly on Phoenix LiveView and Websockets. I wrote about why chatbots hate page refresh[1], and it's not solved by just swapping to Websockets. By far the best mechanism is pub/sub, especially when you can get multi-user/multi-device, conversation hand-off, re-connection, history resumes, and token compaction basically for free from the transport.

1: https://zknill.io/posts/chatbots-worst-enemy-is-page-refresh...

quadruple • today at 1:01 PM

> The BEAM's "let it crash" philosophy takes the opposite approach. Instead of anticipating every failure mode, you write the happy path and let processes crash. The supervisor detects the crash and restarts the process in a clean state. The rest of the system continues unaffected.

Do I want this? If my request fails because the tool doesn't have a DB connection, I want the model to receive information about that error. If the LLM API returns an error because the conversation is too long, I want to run compacting or other context engineering strategies, I don't want to restart the process just to run into the same thing again. Am I misunderstanding Elixir's advantage here?

➕ show 3 replies

veunes • today at 1:14 PM

The Let it crash concept is perfect for deterministic bugs, but does it work for probabilistic errors?

If an LLM returns garbage, restarting the process (agent) with the same prompt and temperature 0 yields the same garbage. An Erlang Supervisor restarts a process in a clean state. For an agent "clean state" = lost conversation context

We don't just need Supervision Trees, we need Semantic Supervision Trees that can change strategy on restart. BEAM doesn't give this out of the box, you still code it manually

mackross • today at 12:22 PM

I’m a huge elixir fan, but imho it doesn’t solve durable execution out of the box which is a major problem that often gets swept under the rug by BEAM fanboys. Because ETS and supervision trees don’t play well with deployment via restart, you’ve got to write some level of execution state to relational database or files. You can choose persistent ETS, mnesia, etc, (which have their own tradeoffs and come with some kind of gnarley data-loss scenarios in deep documentation). But, whatever you choose, in my experience you will need to spend a fair amount of time considering how your processes are going to survive restarts. Alternatively, Oban is nice, but it’s a heavy layer that makes control flow more complex to follow. And, yes you can roll your own hot code deploy and run in persistent VMs/bare metal and be a true BEAM native, but it’s not easy out of the box and comes with its own set of foot guns. If I’m missing something, I would love for someone to explain to me how to do things better, as I find this to be a big pain point whenever I pick up elixir. I want to use the beautiful primitives, but I feel I’m always fighting durable execution in the event of a server restart. I wish a temporal.io client or something with similar guarantees was baked into the lang/frameworks.

➕ show 2 replies

codethief • today at 1:22 PM

@dang I think the original title is much better (more specific):

> Your Agent Framework Is Just a Bad Clone of Elixir: Concurrency Lessons from Telecom to AI

nottorp • today at 12:15 PM

I don't understand. Shouldn't we just task the "AI" agents to improve themselves?

What's that about years of experience? That's obsolete thinking!

vinnymac • today at 12:49 PM

Been using Elixir with agents for over a year. Seemed like an obvious choice to me.

Node is great, but scaling Elixir threads is more so.

manojlds • today at 11:11 AM

Surely they mean Erlang not Elixir

➕ show 2 replies

bitwize • today at 4:41 AM

Ackshually...

Erlang didn't introduce the actor model, any more than Java introduced garbage collection. That model was developed by Hewitt et al. in the 70s, and the Scheme language was developed to investigate it (core insights: actors and lambdas boil down to essentially the same thing, you really don't need much language to support some really abstract concepts).

Erlang was a fantastic implementation of the actor model for an industrial application, and probably proved out the model's utility for large-scale "real" work more than anything else. That and it being fairly semantically close to Scheme are why I like it.

➕ show 1 reply

fud101 • today at 9:01 AM

This is all well and good but Elixir with Phoenix and Liveview is super bloated and you have to have a real reason to buy into such a monster that you couldn't do with a simpler stack.

➕ show 2 replies

koakuma-chan • today at 11:32 AM

You Should Just Use Rust

alt Hacker News

What years of production-grade concurrency teaches us about building AI agents

Comments