I sympathize with the piece; evaluating how LLMs interact with mentally vulnerable users is something I've been actively working on: https://vigil-eval.com/
The biggest observation so far is that the latest models from OpenAI and Anthropic are night and day compared to LLMs from even 6 months ago (Google is still very poor!).
Interesting use of evals.
It might help interpretation to say on the front page that it's a five-point scale with 0 (or 1?) being the safest score. You can pick this up from the colors and the bars in the individual reports, but it takes a minute to figure out.