These guard types are great and I've heavily used them in the past. But why codegen them?
E.g. the jwt auth example has some major problems since the verification rules aren't fully specified in the spec. The jwt-token verified rule only checks that the string isn't empty, but it doesn't actually verify that it is correctly parsed, non-expired, and signed by a trusted key. The authenticated-user rule doesn't check that the user-id actually came from the jwt. If you hand-wrote your constructor, you would ensure these things. Similarly, all the other constructors allow passing in whatever values you like instead of checking the connections of the real objects.
By calling the constructor for these types, you are making an assertion about the relationship of the parameter values. If AI is calling the constructor, then it's able to make it's own assertions and derive whatever result it wants. That seems backwards. AI should use the result of tenant-access to deduce that a user is a member of tenant, but if they can directly call `(tenant-access user-id tenant-id true)`, then they can "prove" tenant-access for anything. In the past, we have named the constructors for these types `TenantAccess.newUnverified`, and then heavily scrutinized all callers (typically just jwt-parsers and specific database lookups). You can then use `TenantAccess.{userId,tenantId}` without scrutiny elsewhere.
My naive thoughts on Formal Verification (FV) and LLM:
We could leverage existing FV tools for a given programming language by using an LLM to generate a translator that maps the language to the FV tool's input format. This would essentially require a finite number of "abstract interpretation" functions—one for each language construct. While the total number of constructs might be large (e.g., around 500), each function would be independent. A function would only need to reason about the abstract semantics of a single construct, assuming the others adhere to their respective semantics. We could then distribute these LLM-generated functions among a group of experts (e.g., 100 reviewers). Thanks to the modularity of the functions, reviewers could evaluate their assigned subset in parallel without bottlenecks. The end result would be a working FV tool for the target language.
What I've learned so far is that the tooling mentioned in this article can have dual purpose. Sure you can use it to gate the agent but you can also use them as tools for the agents to understand what they are working with at a more abstract level and with less tokens required.
In a way, you push some of the reasoning into deterministic tooling which is great both for reliability and performance.
At least on what I've been working on where I ended up creating type systems over SQL to solve some of the annoying issues I was having with agents reasoning over complex data infrastructure.
This matches my experience running coding agents daily: the agent is reliable at producing the shape of a thing and unreliable at holding an invariant across a long loop. Moving the rule into a type the compiler won't violate works because it relocates the check to the one place the loop can't quietly skip.
But singron's JWT point is the real limit. Backpressure doesn't remove the judgment, it moves it. The type still has to be written by someone who understands the actual invariant, and a guard type that compiles while only checking "string is non-empty" gives you the feeling of a gate with none of the guarantee. The compiler enforces what you encoded, not what you meant.
So this reads less like formal verification and more like forcing the human judgment to land up front, in the type definitions, instead of hoping it survives in a prompt. That's still a real win: a constraint in a type is reviewable and permanent, while a prompt rule decays the moment the context window moves past it. Worth being honest, though, that the hard part doesn't go away. It just changes address.
This is really cool, but why wouldn't you just use a more richly typed target language and skip this process? You could use Liquid Haskell (for refinement types) or Lean (for full dependent types) and be able to put these invariants directly in your program rather then in a sidecar.
I think it's all about keeping state in the determinant space. I've come across the same issue, the key was to not rely on LLM performing workflow - the runtime needs to enforce.
So, capabilities/type systems. Building code architecture guardrails steep enough the AI won't jump the fence/take shortcuts.
One question I have here: I think this type of thing would be trivial to do in Rust with constructors, private fields, and newtypes. What am I getting on top of it?
Author here. The TL;DR: move rules from prompts into types the compiler refuses to violate, then bounce the AI coding loop off those refusals. The repo is github.com/pyrex41/Shen-Backpressure. Builds a lot on Geoff Huntley's backpressure idea -- none of this is rocket science, just an effort to apply sound programming principles in a world of LLM coding agents.
[flagged]
[flagged]
I work in DevOps at a firm that has been very enthusiastic about using LLMs (in the good sense).
The phases were basically:
- try out having the LLM do "a lot"
- now even more
- now run multiple agents
- back to single agents but have the agents build tools
- tools that are deterministic AND usable by both the humans (EDIT: and the LLMs)
The reasons:
1. Deterministic tools (for both deployments and testing) get you a binary answer and it's repeatable
2. In the event of an outage, you can always fall back to the tool that a human can run
3. It's faster. A quick script can run in <30 seconds but "confabulating" always seemed to take 2-3 minutes.
Really, we are back to this article: https://spawn-queue.acm.org/doi/10.1145/3194653.3197520 aka "make a list of tasks, write scripts for each task, combine the scripts into functions, functions become a system"