The most important piece of information is in the linked Frontiers article:
However, the overall capability of the chatbot to fully meet user needs received a lower average score (3.1/5.0), highlighting the need for further improvements.
Also, there is still the problem of hallucinations, as we see in the "Evaluation" paragraph: Live traffic evaluations are essential for monitoring system behavior, identifying potential issues like hallucinations in production, and understanding performance on diverse live queries.
These are quite devastating results. This is a system for scientific research on medicines, and mediocrity and hallucinations will kill people. It would be interesting to know how much money was flushed down the toilet with these experts.
Author here. A couple of things worth clarifying.
The 3.1/5.0 score in the Frontiers paper is a user satisfaction rating on feature completeness. Researchers were asked how well the system met all of their needs, including features that simply didn't exist yet at that point. It's a product maturity signal, not an accuracy or reliability number. The paper is also about a year old and the system has moved on significantly since.
On hallucinations, I'd push back on the framing a bit. The fact that we monitor for hallucinations isn't an admission that the system is hallucinating undetected. It's the opposite. Every sentence in the response is linked back to the exact page and verbatim quote from the source document, so a researcher can verify any claim in one click. We also run faithfulness scoring on live traffic every single day using RAGAS, so if the system starts drifting we catch it fast, not at some quarterly review.
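As a rough illustration of what "linked back to the exact page and verbatim quote" makes possible, here is a hypothetical sketch of an automated check that every cited quote really appears on the page it points to. It is not the system's actual code: the `Citation` type, the `pages` lookup, and the whitespace/case normalization are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record attached to one sentence of a response."""
    doc_id: str
    page: int   # page number within the source document
    quote: str  # text claimed to appear verbatim on that page


def normalize(text: str) -> str:
    # Collapse whitespace and ignore case so PDF line breaks do not
    # cause false mismatches (an assumption, not a known system detail).
    return " ".join(text.split()).casefold()


def unverified_citations(citations, pages):
    """Return the citations whose quote is not found on the cited page.

    `pages` maps (doc_id, page) -> extracted page text.
    """
    failed = []
    for c in citations:
        page_text = pages.get((c.doc_id, c.page))
        quote = normalize(c.quote)
        # An empty quote or a missing page can never count as verified.
        if not quote or page_text is None or quote not in normalize(page_text):
            failed.append(c)
    return failed
```

A check like this only confirms that the quote exists where the citation says it does. Whether the quote actually supports the sentence is a separate question, which is what faithfulness scoring and human review are for.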
And for the regulatory document drafting use case, every output is explicitly reviewed and approved by a qualified scientist before it goes anywhere. The system drafts, the human decides. That's not incidental; it's a design constraint baked into the architecture.
No LLM eliminates hallucination entirely. That's just the reality of the technology right now. So the engineering question becomes: how do you make it as unlikely as possible, and when it does happen, how fast do you catch it? That's what the retrieval pipeline, the reflection agent, the citations, and the daily evals are all doing. It's not a perfect answer, but it's a serious one.
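To make "how fast do you catch it" concrete, here is a minimal sketch of a daily drift alert over faithfulness scores. It assumes the per-day scores (values between 0 and 1) have already been computed by some evaluator such as RAGAS; the function name, the 7-day baseline window, and the 0.05 threshold are hypothetical choices for illustration, not the system's real settings.

```python
from statistics import mean


def faithfulness_alert(daily_scores, baseline_days=7, max_drop=0.05):
    """Return True if the latest daily score dropped too far below baseline.

    daily_scores: chronological list of daily mean faithfulness scores,
                  assumed to be produced elsewhere (for example by RAGAS).
    baseline_days: how many preceding days form the baseline (assumed value).
    max_drop: largest tolerated drop below the baseline mean (assumed value).
    """
    if len(daily_scores) < baseline_days + 1:
        # Not enough history to form a baseline yet.
        return False
    baseline = mean(daily_scores[-(baseline_days + 1):-1])
    return baseline - daily_scores[-1] > max_drop
```

Run once per day after the evaluation job, a check like this flags a regression within a day of it appearing.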