sarangk90 • today at 2:42 PM • 2 replies

Author here. A couple of things worth clarifying.

The 3.1/5.0 score in the Frontiers paper is a user satisfaction rating on feature completeness. Researchers were asked how well the system met all of their needs, including features that simply didn't exist yet at that point. It's a product maturity signal, not an accuracy or reliability number. The paper is also about a year old and the system has moved on significantly since.

On hallucinations, I'd push back on the framing a bit. The fact that we monitor for hallucinations isn't an admission that the system is hallucinating undetected. It's the opposite. Every sentence in the response is linked back to the exact page and verbatim quote from the source document, so a researcher can verify any claim in one click. We also run faithfulness scoring on live traffic every single day using RAGAS, so if the system starts drifting we catch it fast, not at some quarterly review.
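
To make the verbatim-quote idea concrete, here is a minimal sketch of such a check. It is an illustration under stated assumptions, not the production code: `Citation`, `normalize`, and `unsupported_citations` are hypothetical names, and the sample pages and quotes are made up. The RAGAS faithfulness scoring is a separate step and is not shown.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    page: int   # 1-based page number in the source document (assumed convention)
    quote: str  # text claimed to appear verbatim on that page


def normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not break matching."""
    return " ".join(text.split())


def unsupported_citations(pages: dict[int, str], citations: list[Citation]) -> list[Citation]:
    """Return the citations whose quote is not found verbatim on the cited page."""
    failures = []
    for c in citations:
        # A missing page is treated as empty text, so its citations always fail.
        page_text = normalize(pages.get(c.page, ""))
        # An empty quote is rejected explicitly; "" would otherwise match any page.
        if not c.quote.strip() or normalize(c.quote) not in page_text:
            failures.append(c)
    return failures


# Made-up example data, for illustration only.
pages = {
    1: "The compound was stable at 25 C\nfor 12 months.",
    2: "No degradation products were observed.",
}
citations = [
    Citation(page=1, quote="stable at 25 C for 12 months"),  # matches across a line break
    Citation(page=2, quote="degradation was significant"),   # not on page 2
    Citation(page=3, quote="No degradation products"),       # page 3 does not exist
]
bad = unsupported_citations(pages, citations)
```

A check like this only confirms that the quote exists where the citation says it does. Whether the quote actually supports the sentence it is attached to is a different question, which is where a faithfulness score and human review come in.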

And for the regulatory document drafting use case, every output is explicitly reviewed and approved by a qualified scientist before it goes anywhere. The system drafts, the human decides. That's not incidental; it's a design constraint baked into the architecture.

No LLM eliminates hallucination entirely. That's just the reality of the technology right now. So the engineering question becomes: how do you make it as unlikely as possible, and when it does happen, how fast do you catch it? That's what the retrieval pipeline, the reflection agent, the citations, and the daily evals are all doing. It's not a perfect answer, but it's a serious one.

Replies

andrew_lettuce • today at 4:53 PM

It seems weird to ask users about feature completeness, especially regarding a new system. By definition, if you are hitting a valuable use case you WON'T be feature complete, and this all assumes users can even determine functional boundaries or useful features. I would have expected better from an organization that positions itself as an expert at guiding software development, but I guess they're a consultancy first and foremost.

cocoa19 • today at 5:09 PM

> On hallucinations, I'd push back on the framing a bit.

> It's not a perfect answer, but it's a serious one.

Thanks Claude!
